Everyone's talking about building AI agents. Nobody talks about what happens the week after you deploy one.
I've built and deployed a lot of them at this point. For BUILD+SHIP, for clients, for internal tools that nobody sees. And every single time, the build is the easy bit. The part that catches people out is everything that comes after.
Here's the stuff I wish someone had told me before I deployed my first one into a real workflow.
Your prompt will break in production
You spent three days refining the prompt. It works beautifully in testing. Every output is clean, structured, exactly what you wanted.
Then a real user sends something weird. Or the upstream data comes back in a slightly different shape. Or someone pastes a spreadsheet into the input field.
The agent doesn't fail gracefully. It either hallucinates something confident and wrong, or it just stops.
This happens to everyone. The gap between "works in testing" and "works with real data from real people" is wider than you think. I now build agents expecting them to fail on roughly 15% of real inputs, and I put logging in place from day one so I can see exactly what those failures look like.
You can't fix what you can't see.
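The logging doesn't need to be elaborate. A minimal sketch, assuming a Python agent wrapped in a single callable (the function names and the `agent_runs.jsonl` path are placeholders, not anything from a real deployment): record every run, success or failure, as one JSON line you can grep later.

```python
import json
import time
import traceback

LOG_PATH = "agent_runs.jsonl"  # hypothetical log location

def run_with_logging(agent_fn, raw_input):
    """Call the agent and append one JSON record per run, success or failure."""
    record = {"ts": time.time(), "input": raw_input}
    try:
        output = agent_fn(raw_input)
        record.update({"status": "ok", "output": output})
        return output
    except Exception:
        # Keep the full traceback so weird real-world inputs are diagnosable.
        record.update({"status": "error", "trace": traceback.format_exc()})
        raise
    finally:
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps(record) + "\n")
```

When roughly 15% of real inputs fail, the error records in that file are your test suite for the next prompt revision.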
Nobody will tell you when it goes wrong
This one stings.
One of my agents started producing subtly wrong output and nobody said anything. The team just worked around it. They noticed something was off, quietly started double-checking the outputs, and didn't bother flagging it because "that's just how it is."
Took me three weeks to notice, and only because I happened to look at the raw logs.
Build a feedback loop in from the start. Something as simple as a thumbs up / thumbs down button. Or a Slack message to you when the agent touches a specific type of input. You need humans in the loop. Not to do the work, but to tell you when the machine's getting it wrong.
The cost will surprise you
If you're using an API-based model (GPT-4, Claude, whatever), the cost per run feels negligible in testing. You're running 20 calls a day, it's a few pennies.
Then you deploy it and it runs 800 times a day. And the token count on real data is 3x what your test data was. And suddenly you're looking at a bill that didn't make it into anyone's budget conversation.
I now do cost-per-run calculations before I build anything. Not after. If the maths doesn't work at 10x volume, I rethink the design.
Also worth knowing: most workflows don't need GPT-4. I've moved a lot of tasks to cheaper, faster models and the output quality drop is negligible for structured tasks. The expensive model is for judgement calls. Everything else, use the cheaper one.
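The cost-per-run maths is simple enough to do in a few lines before you build. A sketch with made-up numbers (the token counts and per-million-token prices below are illustrative only; check your provider's current pricing):

```python
def monthly_cost(runs_per_day, in_tokens, out_tokens,
                 price_in_per_mtok, price_out_per_mtok, days=30):
    """Rough monthly API spend: tokens per run times the provider's
    per-million-token prices, in whatever currency those prices use."""
    per_run = (in_tokens * price_in_per_mtok
               + out_tokens * price_out_per_mtok) / 1_000_000
    return runs_per_day * days * per_run

# Illustrative: 800 runs/day, 3,000 input + 800 output tokens per run,
# priced at 3.00 / 15.00 per million tokens. Check the 10x figure too.
base = monthly_cost(800, 3000, 800, 3.00, 15.00)
at_scale = monthly_cost(8000, 3000, 800, 3.00, 15.00)
print(f"{base:,.2f}/month now, {at_scale:,.2f}/month at 10x volume")
```

If the 10x number makes you wince, that's the moment to move the structured parts of the task to a cheaper model, not after the invoice lands.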
The handoff documentation doesn't exist
You built it. You understand it. You know why the prompt is structured the way it is, why it calls that particular API, why there's a fallback in step 3.
Nobody else does.
I shipped an agent for a client last year that I was pretty proud of. Clear logic, good test coverage, worked well. Three months later a new engineer joined, and nobody on the team could explain to him how it worked or what to do when it broke. They rang me.

Write the documentation before you hand it over. Not a novel. One page: what the agent does, what inputs it expects, what a good output looks like, and the three most common failure modes. That's it.
It will drift
Models get updated. APIs change. The data structure upstream shifts slightly. The business process the agent was built around evolves.
Agents are not fire-and-forget. They need maintaining like any other piece of code. More than code, actually, because the failure modes are less obvious. They tend to degrade slowly rather than break cleanly.
I set a calendar reminder to review every agent I've built, once a month. Just a quick check: is it still running? Are the outputs still right? Has anything upstream changed? Takes 20 minutes. Saves a lot of pain.
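That 20-minute check can be partly scripted. A sketch, assuming the agent writes a JSONL run log with `ts` and `status` fields per record (an assumed format, not a standard one), answering the first two questions: is it still running, and are runs still succeeding?

```python
import json
import time

def health_check(log_path, max_silence_hours=24, max_error_rate=0.15):
    """Skim a JSONL run log and return a list of issues worth a human look.
    Assumes each line has a 'ts' (epoch seconds) and a 'status' field."""
    records = [json.loads(line) for line in open(log_path)]
    if not records:
        return ["no runs logged at all"]
    issues = []
    last = max(r["ts"] for r in records)
    if time.time() - last > max_silence_hours * 3600:
        issues.append(f"silent for {(time.time() - last) / 3600:.0f}h")
    errors = sum(1 for r in records if r.get("status") != "ok")
    if errors / len(records) > max_error_rate:
        issues.append(f"error rate {errors}/{len(records)}")
    return issues
```

The third question, "has anything upstream changed?", still needs a human looking at actual outputs; the script just tells you where to look first.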
None of this is a reason not to deploy agents. The value is real. An agent I built for a client in January is saving someone about 6 hours a week on a reporting process they hated. That's worth the maintenance overhead by a long way.
But go in with clear eyes. The build is the fun bit. The discipline is everything that comes after.