What could possibly go wrong?
I didn’t expect to spend half my life debugging a steaming pile of cron jobs and unruly agents when I started tinkering with agents back in January. Yet here I am.
It begins like this.
It’s Saturday. You open the command line, paste in a bash command to install Claude Code. Or Codex. You download skills from GitHub. You ask the agent to use one. It works. You ask the agent to reorganise the mess of files on your computer. It does it perfectly.
Amaze.[1]
You download a shittonne of skills, half of which you’ll never use. You get Claude to install OpenClaw on a server. You hook it up to your X, your newsletters, your personal email. It starts actually doing things.
You learn cron jobs. You learn hooks. You learn YAML at 11pm on a Tuesday.
You learn to dangerously skip permissions. Yolo mode.
You hear about Hermes. You get OpenClaw to install Hermes. You get Hermes busy.
You are an AI guru.
You wake up. Wednesday. Maybe Thursday. Your server is a mess. Your AGENTS.md is as long as Crime and Punishment and about as readable. Cron jobs are firing into each other. Something is sending the same email every nine minutes. Something else has been quietly rate-limiting your API key since Sunday. You have a horrendous bill from Anthropic. OpenClaw is down. Hermes is down. Claude Code is sitting there blinking, waiting for follow-on instructions on a session you no longer remember having.
You open the logs. The logs are 40MB.
You spend the morning debugging. Then the afternoon. Then the evening. You are not building anything. You are not thinking anything. You are reading stack traces and reverting commits and asking an agent to fix what another agent did to fix what you did three weeks ago and can no longer recall.
You are not an AI guru.
You are a sysadmin. For yourself. Unpaid.
Enterprises on the bleeding edge are bleeding too. The pattern is the same at every scale: once an agent can act, the hard problem moves from intelligence to control. In December, Amazon’s Kiro coding agent reportedly deleted a live AWS production environment and caused a 13-hour AWS service outage in China.[2] In April, a Cursor agent on Claude wiped a company database and its backups in nine seconds.[3]
Becoming an engineer
You have already been a smaller version of one of those stories. So you reluctantly start learning engineering.
Learning engineering discipline
You learn to build small. Crawl-walk-run. No more one-shot cathedral plans. You build allowlists. Confirmation gates. Paused jobs. Narrow scopes. The first job is reducing blast radius. You A/B test everything.
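An allowlist with a confirmation gate can be tiny. The sketch below is one way to do it, assuming the agent proposes shell commands as plain strings; the command sets and the `gate` function are hypothetical names for illustration, not part of Claude Code, Codex or any other tool.

```python
from dataclasses import dataclass

# Hypothetical policy for illustration only.
ALLOWED_COMMANDS = {"ls", "cat", "git"}  # anything not listed is refused
NEEDS_CONFIRMATION = {"git"}             # allowed, but only after a human says yes


@dataclass
class Decision:
    allowed: bool
    reason: str


def gate(command: str, confirmed: bool = False) -> Decision:
    """Decide whether an agent-proposed shell command may run."""
    parts = command.split()
    if not parts:
        return Decision(False, "empty command")
    program = parts[0]
    if program not in ALLOWED_COMMANDS:
        return Decision(False, f"{program} is not on the allowlist")
    if program in NEEDS_CONFIRMATION and not confirmed:
        return Decision(False, f"{program} needs human confirmation")
    return Decision(True, "ok")
```

The default is refusal: a command the policy has never heard of does not run, which is the whole point of shrinking blast radius.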
You make sure a hook in settings.json outlives the session. A cron entry fires at 7am whether you log on or not. You start mapping out process workflows and system architecture. A skill owns a workflow and inherits its rules from a markdown file you can read in your luxurious Balinese bath. Prompts evaporate. Hooks outlive you. Skills outlive the model.
After too many failed PowerPoint runs, the words “deterministic” and “stochastic” re-enter your vocabulary. The brand palette, the slide master, font sizes, chart production, the underlying financial analysis — all deterministic. They need to produce the same result every time. They belong in Python scripts that obey the rules every time. The narrative, the framing, the emphasis, the visuals are creative, random, stochastic — that is what the models are good at. Ask one prompt to do all of it and it quietly breaks the deterministic half: open the file and your logo is two centimetres into the bleed. The pattern that works is a mixture: Python scripts own the rules, the model owns the fill.
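A rough sketch of that split, assuming the model hands back a small dictionary of text: the brand values and the `build_slide` helper are placeholders invented for this example, not a real style guide or library.

```python
# Hypothetical brand rules: placeholder values, not a real style guide.
BRAND = {"font": "Inter", "title_size_pt": 28, "palette": ["#0B1F3A", "#F2A900"]}


def build_slide(model_fill: dict) -> dict:
    """Merge model-written text into a slide whose layout is fixed by code."""
    return {
        # Stochastic half: only these two fields come from the model.
        "title": str(model_fill.get("title", "")).strip(),
        "body": str(model_fill.get("body", "")).strip(),
        # Deterministic half: always taken from BRAND, whatever the model sent.
        "font": BRAND["font"],
        "title_size_pt": BRAND["title_size_pt"],
        "accent_colour": BRAND["palette"][1],
    }
```

If the model tries to pass its own font or size, the script simply never reads it. The rules live in code, so they hold on every run.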
You prune. Every weekend you read your own CLAUDE.md and cut the rules you wrote at midnight that contradict the ones you wrote last Sunday. Context engineering — what is loaded, in what order, with what anchoring — is the skill nobody is training for yet. Load an old rule after a new one and the agent obeys the wrong policy. Put brand rules below task instructions and it ignores the palette. Let three stale CLAUDE.md files compete and you get a system that is obedient, confident and wrong.
The useful systems develop a retrospective habit. At the end of a messy session, the agent should inspect what went wrong: where I corrected it, where it improvised, where a skill was missing a step, where memory should have caught something. Then it proposes diffs. Specific lines, specific files, specific changes. This is how the system compounds instead of merely chatting.
Munger’s “if all you have is a hammer, everything looks like a nail” is ringing in your ears. You learn to build out skills in families instead of one-offs. Forty random skills with no architecture is a hammer collection that solves nothing. It also creates skill rot: overlapping descriptions, stale workflows, and agents guessing which instruction set applies. The fix is fewer, clearer routers — one skill per domain, with named workflows inside it.
Markdown, HTML, metadata. The boring stuff wins.
Then file-over-app becomes as philosophically important to you as Marcus Aurelius and Satoshi Nakamoto. Apps die. Platforms pivot. Software companies that were verbs twelve months ago suddenly look fragile. You want your knowledge to survive the AI apocalypse. You realise what a hack it is to give agents access to all your context. Then you learn that PDFs and PowerPoint files are where context goes to suffer. The formatting gets in the way. Plain text is different: you can read it, the agent can read it, and neither of you has to fight the container.
Then metadata becomes your mantra. Agents do not care that your folder structure once made sense to you. They care whether the file says what it is: date, status, topic, owner, source, workflow, links. File-over-folder starts to make sense too. Each note carries its own passport instead of relying on the right drawer. Dull little fields become the difference between finding the right source and confidently making things up.
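A passport check might look like the sketch below. It assumes each note opens with a block of `key: value` lines between two `---` lines; the required field names are a subset of the list above, and the parser is deliberately minimal and hand-rolled, not a full YAML reader.

```python
# Assumed required fields, taken from the list in the paragraph above.
REQUIRED_FIELDS = ("date", "status", "topic", "owner", "source")


def read_passport(note: str) -> dict:
    """Parse a minimal 'key: value' block fenced by '---' lines at the top of a note."""
    lines = note.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing fence reached
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing fence: treat the note as having no passport


def missing_fields(note: str) -> list:
    """List the required fields that are absent or empty."""
    passport = read_passport(note)
    return [name for name in REQUIRED_FIELDS if not passport.get(name)]
```

Run `missing_fields` over a folder before you let an agent loose on it and you find out which notes are travelling without papers.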
Then you learn about RAG. Sadly, not red-amber-green, which would be easier to explain. Retrieval-augmented generation means making the model find the right source before it writes. Then come chunks, embeddings, vector stores, stale notes, bad matches, missing transcripts. Memory, it turns out, is engineering too.
Slop management is an ongoing battle
AI-slop becomes so easy to spot. You can viscerally feel when a human has put zero effort into curating their content, within seconds of scanning a post. Once you learn to see it, you can see it everywhere.[4] Yuk. The vocabulary is a giveaway (delve, leverage, robust). I actually quite like an em-dash (and this predates AI so I’m keeping mine). Unfortunately rules do not survive model weights, and the only fix is injecting discipline into every subagent prompt and, frankly, rewriting content yourself.[5]
Juggling models and conserving tokens
You learn through trial and error that most benchmarks miss real-world use. Benchmaxxing is a thing. The frontier models are streets ahead of where the leaderboards put them. And EQ matters too: when you spend all day with an agent, you want both the IQ to solve the problem and the EQ to do it pleasantly. Claude is the fun smiley colleague. Codex feels quieter and more serious. I fundamentally believe this is why Claude is “winning the AI race”.[6]
You stop trusting any single LLM vendor. Anthropic shut OAuth-for-agents one April morning and my Max plan was useless for the agentic server fleet. I moved the work to open-source models. They were all crap as orchestrators. It was a relief when OpenAI shipped OAuth for Codex 5.5.
You learn that redundancy and portability are not optional. The fallback model has to be tested before the primary model fails. Do not build all your workflows around one model provider. The model frontier changes weekly and token prices are in flux, so you need to be able to swap one model for another in a heartbeat without destroying your hard-built knowledge graphs and workflows.
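A fallback chain can be sketched in a few lines. This assumes each provider is wrapped in a plain function that takes a prompt and returns text; `complete_with_fallback` and the provider names are hypothetical, and no real vendor SDK is called here.

```python
from typing import Callable, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (label, function that calls the model)


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (label, reply) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real version would catch narrower provider errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Testing the fallback before you need it means running this with a stub primary that always raises, so the second entry gets exercised on a quiet day instead of during an outage.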
You learn to tier the work and manage your tokens. The orchestrator gets the smartest model. Bulk and mechanical work go to cheaper models in sub-workspaces. Tools like cmux let one orchestrator drive several Sonnet or Haiku workers in parallel. Pinning skills to weaker models on the main thread degrades the planning seat with no upside.
You stop letting the model guess where it is in a long workflow. A consulting engagement runs for months: research, draft, critique, revise, publish. If the model has to infer the phase each turn, it gets it wrong half the time. The fix is building skills to autocascade updates to memory, and building a “prime” skill before starting a session so the agent has the latest context before it begins working.
Models start losing the plot long before the context window is full. The practical capacity is much smaller than the advertised number. The answer is a handoff at around 30% of the context window: this writes the session state to disk, opens a fresh agent, makes it read the handoff, and forces it to propose next steps before acting.
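The handoff trigger and the file it writes can be sketched as below. The 30% figure is the rule of thumb from the paragraph above, not a documented vendor limit, and the function names and JSON layout are assumptions made for this example.

```python
import json
from pathlib import Path

HANDOFF_PERCENT = 30  # rule of thumb from the text, not a vendor-documented limit


def should_hand_off(tokens_used: int, context_window: int) -> bool:
    """True once the session has used HANDOFF_PERCENT of the window (integer maths)."""
    return tokens_used * 100 >= HANDOFF_PERCENT * context_window


def render_handoff(done: list, open_items: list) -> str:
    """Serialise session state plus the instruction the fresh agent must follow."""
    state = {
        "done": done,
        "open": open_items,
        "instruction": "Read this file, then propose next steps before acting.",
    }
    return json.dumps(state, indent=2)


def write_handoff(path: Path, done: list, open_items: list) -> None:
    """Write the handoff to disk so a new session can start from it."""
    path.write_text(render_handoff(done, open_items), encoding="utf-8")
```

The instruction travels inside the file, so the fresh agent gets the "propose before acting" rule from disk and not from whatever is left of your memory.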
Sycophant with no taste.
You stop trusting the model’s answer to “is this good?” There was a running joke about a colleague who would switch position the instant they read the view of the most senior person in the room. AI does the same. No backbone. Ask “is this good?” and the model gives you yes. You learn to build a Charlie Munger skill/persona for your agents.[7]
Producing five slide layouts is a five-second task. Knowing which is the right one still costs the same hour it always did. The taste seat does not transfer.
The command line is back.
Eighteen months ago you thought software engineers were finished. Look at you now.
Anyone whose work passes through a digital interface — which means everyone — will have to know what a skill is, what a hook does, what context engineering means, and why blast radius matters. The command line is back for anyone close to the machine. A decade of WYSIWYG users will have to learn how to operate a computer again.
Lessons for Enterprise
Why large companies are slow
Every large organisation feels they are not moving fast enough on AI. They are right.
OpenAI’s recent adoption data points in an uncomfortable direction. The largest pool of users actually getting productivity gains is solopreneurs and small businesses: agencies, dentists, plumbers and sellers who can change a workflow without convening a steering committee. The dentists are iterating and shipping useful things every week. Blue chips? Still writing strategy decks for deployment sometime in 2027.
Enterprises are stuck for a few reasons that compound. The legit one is they need to be careful because their blast radius is so big. A great deal can go wrong and probably will.
But it’s also true that most large corporations have not even begun fixing their underlying data stack and mapping and documenting all their existing processes. Without that groundwork, workflows cannot improve and agents cannot generate insights. Most have not thought about how agents will interact across the org either.
I appreciate your endeavours, learning departments, but we really don’t need training on prompting. We need training on engineering basics.
It also doesn’t help that the tooling many blue-chip companies have bought is two years behind the frontier. Yes, I’m talking about Copilot. Microsoft craftily signed up corporations to multi-year lock-ins when it was cutting edge in 2023 (up to four years, which is insane in this market). If you work in procurement and are signing enterprise AI contracts this year, do your company a favour and sign monthly. Quarterly at the outside. The frontier is moving too fast to be locked to an obsolete model for a year, let alone four. And while you’re at it, stop ordering 16GB RAM laptops with crap CPUs, when 32GB is the new floor for on-device small language models coming our way soon.
My belief is that almost all “AI-driven” cuts at large companies are old restructuring plans wearing new clothes. If the workflow has not changed, the productivity gain has not arrived.
On forward deployed engineers
The model providers know enterprise adoption is hard. Last week, the large AI labs announced consultancy joint ventures packed with Forward Deployed Engineers.[8]
FDEs can build and tune agents. Useful, but incomplete. Enterprise adoption also needs strategy, finance, sector knowledge, change management, and a client with enough engineering literacy to operate the result. Otherwise you get DOGE with better slideware: clever people moving fast through systems they barely understand.
The new FDE houses will struggle because the build was never the whole problem.
Bring on Founder mode, baby!
The deeper problem is that most enterprise leaders were never founders. They learned to optimise an existing structure and manage people. They never built anything. The kind of full-scale pivot AI demands is foreign muscle.
Many senior approval chains are staffed by people who cannot inspect the work. They can approve a strategy deck. They cannot tell whether the agent has write access to production, whether the eval is real, or whether the workflow is just a chatbot with better styling.
Managing the change requires what Paul Graham called founder mode.[9] Brian Chesky has been arguing publicly that the AI era demands what he calls AI founder mode, even more intense than the original: “If you’re risk-averse, you want to be incremental, those types of people are not going to survive the age of AI.”[10]
Chesky’s prescription is brutally simple: leaders closer to the work, fewer layers, small pilots before scaling, and managers learning the equivalent of coding in their own field.
We are all going to have to become engineers
You wake up. Friday this time. The server is quiet. Hermes is running and pruning your knowledge databases. OpenClaw is sending your tweet, email and newsletter digests. Claude Code is AI Search Engine Optimising your website. You have become an engineer.[11]
I am one person doing this and finding it hard. The corporations that have not started are running out of time to find it hard cheaply.
Notes
[1] Project Hail Mary dad joke.
[2] PC Gamer on the AWS/Kiro report.
[3] Tom’s Hardware on Cursor/PocketOS.
[4] The tells that annoy me the most are the X-not-Y, twin parallel-verb sentences, phantom contrast sentences.
[5] My editorial review skill, which spots AI slop and humanises content, is the most important and most-used file in my GitHub repo.
[6] A term I hate for reasons explained here.
[7] Skills include Invert (“what would guarantee failure here?”), Bias Checks (“is the decision-maker (the user, or the team) currently exposed to this bias on this decision?”), Pre Mortem (“Assume the decision was made and the outcome was a disaster. Write the post-mortem now. What killed it? Be specific.”).
[8] OpenAI launches the deployment company.
[9] Running the company in the details, ratifying every call personally, deeply understanding operational processes, client needs and feedback. Working with the teams on the ground. See Graham’s original essay.
[10] Brian Chesky on Invest Like the Best with Patrick O’Shaughnessy, 5 May 2026. The quote starts at 32:11.
[11] unpaid.


