We are all going to have to become engineers

What tinkering with agents at home taught me about deploying them at work

Barnaby avatar

Five months on the AI hype cycle.

What could possibly go wrong?

I didn’t expect to spend half my life debugging a steaming pile of cron jobs and unruly agents when I started tinkering with agentic tools back in January. Yet here I am.

It begins like this.

With trepidation, you install Codex. Or Claude Code. You figure out the command line. You download skills from GitHub. You ask the agent to use one. It works.

Amaze.[1]

You download a shittonne of skills, half of which you’ll never use. You get Claude to install OpenClaw on a server. You hook it up to your X, your newsletters, your personal email. It starts actually doing things.

You learn cron jobs. You learn hooks. You learn YAML at 11pm on a Tuesday.

You learn to dangerously skip permissions. Yolo mode.

You hear about Hermes. You get OpenClaw to install Hermes. You get Hermes busy.

You are an AI guru.

You wake up. Tuesday. Maybe Wednesday. Your server is a mess. Your AGENT.md is as long as Crime and Punishment and about as readable. Cron jobs are firing into each other. Something is sending the same email every nine minutes. Something else has been quietly rate-limiting your API key since Sunday. You have a horrendous bill from Anthropic. OpenClaw is down. Hermes is down. Claude Code is sitting there blinking, waiting for follow-up instructions in a session you no longer remember having.

You open the logs. The logs are 40MB.

You spend the morning debugging. Then the afternoon. Then the evening. You are not building anything. You are not thinking anything. You are reading stack traces and reverting commits and asking an agent to fix what another agent did to fix what you did three weeks ago and can no longer recall.

You are not an AI guru.

You are a sysadmin. For yourself. Unpaid.

It’s not just you. Enterprises on the bleeding edge are bleeding too. In December an Amazon coding agent deleted a live AWS production environment and took mainland China down for thirteen hours. In April a Cursor agent on Claude wiped a company database and its backups in nine seconds.

Becoming an engineer

So you start engineering. You have already been a smaller version of one of those stories. Eighteen months ago you thought software engineers were finished. Look at you.

Fences first.

So you build small. Allowlists. Confirmation gates. Paused jobs. Narrow scopes. The first job is not making the agent smarter. It is making the blast radius smaller. You A/B test everything.
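As a rough illustration of an allowlist plus a confirmation gate, here is a minimal Python sketch. The command sets, the `gate` function and the `confirm` callback are hypothetical names made up for this example; they do not come from any particular agent runtime.

```python
import shlex

# Hypothetical policy, for illustration only.
ALLOWED = {"ls", "cat", "grep"}      # read-only commands that run unprompted
NEEDS_CONFIRMATION = {"rm", "mv"}    # destructive commands that need a human yes
SHELL_META = set(";|&`$<>")          # chaining or redirection: refuse outright


def gate(command, confirm=lambda cmd: False):
    """Return "run" or "blocked" for a shell command an agent proposes."""
    if SHELL_META & set(command):
        return "blocked"
    try:
        parts = shlex.split(command)
    except ValueError:               # unbalanced quotes and similar
        return "blocked"
    if not parts:
        return "blocked"
    program = parts[0]
    if program in ALLOWED:
        return "run"
    if program in NEEDS_CONFIRMATION and confirm(command):
        return "run"
    return "blocked"                 # default deny: anything unlisted is refused
```

The design choice that matters is the last line: anything the policy has not named is refused, so a new tool has to be added on purpose.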

You make sure a hook in settings.json outlives the session. A cron entry fires at 7am whether you log on or not. You start mapping out process workflows and system architecture. A skill owns a workflow and inherits its rules from a markdown file you can read in your luxurious Balinese bath. Prompts evaporate. Hooks outlive you. Skills outlive the model.

You stop expecting one model to do two jobs at once. The brand palette is fixed. The slide master is fixed. The copy is creative. Ask one prompt to do all three and it quietly breaks the half that looks fine until you open the file in PowerPoint and your logo is two centimetres into the bleed. A skill enforces the rules; the model fills the content; the two never trade jobs. Same shape applies to type-signed code, vault notes with frontmatter, and corporate memos with house formatting.
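One way to picture the split between rules and content is the sketch below, assuming a slide is just a dictionary. `LOCKED`, `MODEL_FIELDS` and `build_slide` are hypothetical names, and the palette and master values are placeholders.

```python
# Hypothetical house rules; the values are placeholders.
LOCKED = {"palette": ("#112233", "#FFFFFF"), "slide_master": "house-16x9"}
MODEL_FIELDS = {"title", "body"}     # the only fields the model may fill


def build_slide(model_output):
    """Merge model-written copy into a slide spec whose layout rules are fixed."""
    unexpected = set(model_output) - MODEL_FIELDS
    if unexpected:
        raise ValueError(f"model tried to set locked or unknown fields: {sorted(unexpected)}")
    missing = MODEL_FIELDS - set(model_output)
    if missing:
        raise ValueError(f"model left out: {sorted(missing)}")
    return {**LOCKED, **model_output}
```

The model never sees a way to touch the palette or the master. It hands back copy, and the code that owns the rules assembles the slide.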

You prune. Every weekend you read your own CLAUDE.md and cut the rules you wrote at midnight that contradict the ones you wrote last Sunday. Context engineering — what is loaded, in what order, with what anchoring — is the skill nobody is training for yet. I appreciate your endeavours, learning departments, but we really don’t need training on prompting.

Slop is structural.

You stop trusting the output by sight or assuming it is fine based on the first paragraph. AI slop is easy to spot. Once you can see it, you can see it everywhere. The tells that annoy me the most are the X-not-Y construction, twin parallel-verb sentences and phantom-contrast sentences. Yuk. The regex-able vocabulary (delve, leverage, robust) is the small part. I actually quite like an em-dash (and this predates AI so I’m keeping mine). Unfortunately rules do not survive model weights, and the only fix is threading write-time discipline into every subagent prompt. The editorial canon I run my subagents against is the most important and most-used file in my agents repo.
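For the small, regex-able part, a check might look like the sketch below. The word stems come from the list above; the `NOT_BUT` pattern is my own crude stand-in for one contrast construction, and `slop_report` is a hypothetical helper. It will not catch the structural tells.

```python
import re

# Stems for the vocabulary named above; \w* picks up "delving", "leveraged", "robustly".
SLOP_WORDS = re.compile(r"\b(?:delv|leverag|robust)\w*", re.IGNORECASE)

# Crude heuristic for "not just X but Y" inside a single sentence.
NOT_BUT = re.compile(r"\bnot (?:just|only|merely) [^.!?]{1,80}? but\b", re.IGNORECASE)


def slop_report(text):
    """Return the flagged vocabulary (in order of appearance) and a contrast count."""
    return {
        "vocabulary": [m.group(0).lower() for m in SLOP_WORDS.finditer(text)],
        "not_but": len(NOT_BUT.findall(text)),
    }
```

A report like this is useful as a lint step after generation. It does not replace the write-time rules in the subagent prompt.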

You stop trusting any single vendor. Anthropic shut OAuth-for-agents one April morning and my Max plan was useless for the agentic server fleet. I moved the work to open-source models. They were all crap. It was a relief when OpenAI shipped OAuth for Codex 5.5.

The lesson: Architect for model portability. Never rely on one vendor. Model routing as config the runtime reads at startup. Hosted models for general work and an open-weight option kept warm for the day a sensitive workload needs to stay on-prem.
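A sketch of routing as config, assuming the runtime reads a small JSON document at startup. The route names, the model identifiers and both functions are hypothetical placeholders, not real model names or a real runtime API.

```python
import json

# Hypothetical config; in practice this would live in a file the runtime reads at startup.
EXAMPLE_CONFIG = """
{
  "general": "hosted-frontier-model",
  "bulk": "hosted-small-model",
  "sensitive": "local-open-weight-model"
}
"""

REQUIRED_ROUTES = {"general", "sensitive"}


def load_routes(raw):
    """Parse the routing table and fail fast if a required route is missing."""
    routes = json.loads(raw)
    missing = REQUIRED_ROUTES - routes.keys()
    if missing:
        raise ValueError(f"routing config is missing: {sorted(missing)}")
    return routes


def pick_model(routes, task_kind):
    """Look up the model for a task kind, falling back to the general route."""
    return routes.get(task_kind, routes["general"])
```

Swapping vendors then means editing one file and restarting, with no prompt or skill rewritten.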

You learn to build out skills in families instead of one-offs. Forty random skills with no architecture is a hammer collection that solves nothing. You build named stacks, shared modes, common file layouts, one per discipline. Munger’s “if all you have is a hammer, everything looks like a nail” is ringing in your ears.

You learn to tier the work and manage your tokens. The orchestrator gets the smartest model. Bulk and mechanical work go to cheaper models in sub-workspaces. Tools like cmux let one orchestrator drive several Sonnet or Haiku workers in parallel. Pinning skills to weaker models on the main thread degrades the planning seat with no upside.
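The tiering can be sketched in a few lines. `call_model` is a stub standing in for whatever client you use, and the model names are placeholders; the part being shown is the shape of the fan-out, with planning on one tier and bulk work on another.

```python
from concurrent.futures import ThreadPoolExecutor


def call_model(model, prompt):
    """Stub for a real model client; it only echoes its inputs."""
    return f"[{model}] {prompt}"


def run_tiered(task, chunks, orchestrator="smart-model", worker="cheap-model", max_workers=4):
    """Plan and merge on the orchestrator; fan the chunks out to cheaper workers."""
    plan = call_model(orchestrator, f"plan: {task}")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = list(pool.map(lambda chunk: call_model(worker, chunk), chunks))
    summary = call_model(orchestrator, f"merge {len(results)} results for: {task}")
    return plan, results, summary
```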

You stop letting the model guess where it is in a long workflow. A consulting engagement runs for months: research, draft, critique, revise, publish. If the model has to infer the phase each turn, it gets it wrong half the time. The fix is mode as a first-class concept. A skill like /bdeals reads its mode (gtm, analyse, fieldwork, report) every turn and refuses to act outside it. Phase-blindness is the root cause of half the workflow failures I have seen.
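Here is a minimal sketch of mode as a first-class concept. The four mode names are the ones mentioned above; the action names, the mapping and the `act` function are hypothetical, and I am assuming the current mode is stored somewhere as a single word.

```python
from enum import Enum


class Mode(Enum):
    GTM = "gtm"
    ANALYSE = "analyse"
    FIELDWORK = "fieldwork"
    REPORT = "report"


# Hypothetical mapping from each mode to the actions it permits.
ALLOWED_ACTIONS = {
    Mode.GTM: {"draft_pitch"},
    Mode.ANALYSE: {"run_analysis"},
    Mode.FIELDWORK: {"log_interview"},
    Mode.REPORT: {"draft_report"},
}


def act(stored_mode, action):
    """Read the stored mode on every call and refuse actions outside it."""
    mode = Mode(stored_mode.strip().lower())   # raises ValueError on an unknown mode
    if action not in ALLOWED_ACTIONS[mode]:
        raise PermissionError(f"{action!r} is not allowed in mode {mode.value!r}")
    return f"{mode.value}: {action}"
```

The model never infers the phase. The phase is read from state each turn, and a wrong guess becomes a loud error.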

The taste seat stays human.

You stop trusting the model’s answer to “is this good?” There was a running joke in my old firm about a colleague who would switch position the instant he read the most senior person in the room. AI does the same. Ask “is this good?” and the model gives you yes. Producing five slide layouts is a five-second task. Knowing which is the right one still costs the same hour it always did. The taste seat does not transfer.

Meanwhile, who is actually achieving productivity gains?

Every large organisation feels they are not moving fast enough on AI. They are right.

OpenAI’s May 2026 usage data shows that of US users doing entrepreneurial work with ChatGPT, only about 5% are tech startups. The rest are plumbers, agency owners, dentists, Etsy sellers. The places where one person can tinker with AI tooling without causing global havoc. The dentists are shipping useful things every week while blue chips are writing strategy decks for future deployment sometime in 2027. My belief is that the headcount cuts at large companies attributed to AI are mostly excuses for legacy fat reduction.

With great power comes great responsibility

Enterprises are stuck for a few reasons that compound. The legit one is they need to be careful because their blast radius is so big. A great deal can go wrong and probably will.

But it’s also true that most large corporations have not even begun fixing their underlying data stack. Without that groundwork, workflows cannot improve and agents cannot generate insights. Most have not thought about how agents will interact across the org either.

It also doesn’t help that the tooling many blue chips have bought is two years behind the frontier. Microsoft craftily signed up corporations to multi-year Copilot lock-ins (up to four mother-ducking years!). If you work in procurement and are signing enterprise AI contracts this year, do your company a favour and sign monthly. Quarterly at the outside. The frontier is moving too fast to be locked to an obsolete model for a year, let alone four.

Bring on Founder mode, baby!

The deeper problem is structural. Brian Chesky has been arguing publicly that founder mode — running the company in the details, ratifying every call personally — is the only viable operating mode in the age of AI. “If you’re risk-averse, you want to be incremental, those types of people are not going to survive the age of AI.”

I think very quickly every enterprise will need to redesign itself from scratch. Most cannot pivot. They have an incumbent business to protect, and a managerial class trained to defend it. Most of those managers have never built anything themselves. Most of the leaders approving AI cheques or espousing the need to accelerate AI have never opened a terminal. That is a big problem.

FDEs — half the answer

Forward Deployed Engineers, the buzziest role in enterprise AI right now, are half the answer.

FDEs are software engineers who build and tune the agents in the middle of the journey. The CEO / CFO conversation sits at the front and is led by sector and functional experts. Change management sits at the back. FDEs can do neither. The shape that actually delivers transformation is strategy and finance at the top, FDEs in the middle, change management at the back. The AI-lab-backed consultancies announced last fortnight are FDE shops with a consulting badge slapped on. Missing both ends, locked to one model family, charging consulting day rates. I think they will struggle.

A great deal can still go wrong, and probably will. Every multi-year Copilot lock-in. Every PDF-first workflow an agent cannot read. Every model-vendor monoculture. Every CIO who cannot tell you what a hook is. Every AI transformation that begins with a strategy slide and ends with a chatbot. Every procurement team still ordering 16GB laptops when the new floor is 64GB. Every CLAUDE.md no one has pruned in six months.

We are all going to have to become engineers

You wake up. Friday this time. The server is quiet. Hermes is running and pruning my knowledge databases. OpenClaw is sending my tweet, email and newsletter digests. Claude Code is doing what Claude Code is for. You have become an engineer.[2]

The CLI is back.

Anyone whose work passes through a digital interface — which means everyone — will have to know what a skill is, what a hook does, what context engineering means, and why blast radius matters. The CLI is back as the primary interface. A decade of WYSIWYG users will have to learn how to operate a computer again.

IQ vs EQ across frontier and open-weight models.

And you learn that benchmarks miss the point. You spend all day talking to an agent — you want one with EQ and IQ both. Claude Opus is popular because it is smart and has the best EQ. Codex has more raw IQ but less personality — fine for unattended overnight runs, less ideal as a daily driver. I have tried Qwen, Kimi, GLM, Minimax, Gemma. Each looks fine on a leaderboard until you have to live with it. On real tasks, the gap between Claude/Codex and the rest is wider than benchmark numbers imply — see aiiq.org for one quantification. Pay the upgrade. Knowing which model to use when is itself an engineering skill.

I am one person doing this and finding it hard. The corporations that have not started are running out of time to find it hard cheaply.


  1. Project Hail Mary dad joke. 

  2. Unpaid.