How do you maintain AI agents so they don’t break after 90 days?
AI agent maintenance means treating agents like production software: version your prompts and tools, monitor every run, write contract tests for APIs and data formats, and schedule regular model and dependency reviews. Most “it broke” incidents come from silent upstream changes, not your logic. Build for observability and rollback from day one.
Why this matters for ops + eng leaders at 10–200 person teams
If you’re running agents in customer-facing workflows or internal ops, a “small” failure doesn’t stay small.
- A lead enrichment agent silently changes a field name. Your routing rules fail. SDRs stop getting leads.
- A browser automation breaks after a UI update. Now invoices don’t get reconciled. Month-end slips.
- A model update shifts output format just enough to break downstream parsing. Your CRM fills with garbage.
At this company size, you don’t have a dedicated SRE team watching automations. The automation either runs… or it becomes one more thing your best people babysit.
At launch, most agents run clean. But after 90 days, reality shows up: APIs change, data drifts, auth expires, vendors tweak their UIs, and edge cases accumulate. Maintenance becomes the hidden tax.
Actionable steps: a practical AI agent maintenance system
You don’t need a big process. You need a repeatable one.
1) Put your agent on a “software contract”
Define what “correct” looks like, in writing.
- Inputs: required fields, allowed formats, max payload size
- Outputs: schema, required keys, error codes, confidence signals
- Side effects: which systems it may write to, and which fields it’s allowed to modify
If your agent outputs JSON, enforce it. If it outputs free text, wrap it with a parser + validator and treat validation failures as first-class errors.
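A minimal sketch of the parser + validator approach, assuming a hypothetical contract with three required fields (`lead_id`, `score`, `route_to` are illustrative names, not a real schema):

```python
import json

# Hypothetical output contract: required keys and their expected types.
REQUIRED_FIELDS = {"lead_id": str, "score": float, "route_to": str}

class OutputValidationError(Exception):
    """Raised when the agent's output violates the contract."""

def parse_agent_output(raw: str) -> dict:
    """Parse agent output and treat any validation failure as a first-class error."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputValidationError(f"not valid JSON: {exc}") from exc
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise OutputValidationError(f"missing required field: {field}")
        if not isinstance(data[field], expected_type):
            raise OutputValidationError(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(data[field]).__name__}"
            )
    return data
```

The point is that a malformed output never reaches downstream systems silently; it surfaces as a typed error you can count, alert on, and route to a fallback.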
2) Add observability before you add more capabilities
Minimum viable observability for agents:
- A run ID for every execution
- Structured logs: tool calls, inputs, outputs, latency, token usage
- Failure reasons: validation failed, tool 429, auth expired, selector missing, etc.
- Alerting on error rate and “stuck” runs
If you can’t answer “what changed?” within 10 minutes, you don’t have an agent. You have a liability.
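The first two items above fit in a few lines. A sketch of run IDs plus structured JSON log lines (field names here are illustrative, not a standard):

```python
import json
import time
import uuid

def new_run_id() -> str:
    """Unique ID attached to every execution so all its logs can be correlated."""
    return uuid.uuid4().hex

def log_event(run_id: str, event: str, **fields) -> str:
    """Emit one structured JSON log line per tool call, error, or state change."""
    record = {"run_id": run_id, "ts": time.time(), "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line

# Example: one line per tool call, with the details you'll need at 2 a.m.
run_id = new_run_id()
log_event(run_id, "tool_call", tool="crm_lookup", latency_ms=840, status="ok")
log_event(run_id, "failure", reason="auth_expired", tool="invoice_export")
```

Because every line is JSON and carries the run ID, "what happened on run X?" is a one-line grep instead of an archaeology project.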
3) Version everything that can change
Pin versions where possible. Track the rest.
- Prompt templates and system instructions
- Tool schemas
- Model name + parameters
- Dependency versions
- Integration config (field mappings, IDs, selectors)
Treat prompt changes like code changes. PR it. Review it. Ship it.
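One way to make "version everything" concrete: pin the changeable pieces in a single config object and log a fingerprint of it with every run. This is a sketch; the field names and version labels are placeholders for whatever you actually track.

```python
import hashlib
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AgentConfig:
    """Everything that can change the agent's behavior, pinned together."""
    prompt_version: str          # e.g. a git tag or hash of the prompt template
    model: str                   # exact model name, never "latest"
    temperature: float
    tool_schema_version: str
    field_mappings_version: str  # CRM field mappings, selectors, IDs

def config_fingerprint(cfg: AgentConfig) -> str:
    """Stable short hash logged with every run, so 'what changed?' has an answer."""
    canonical = repr(sorted(asdict(cfg).items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

When a run misbehaves, comparing its fingerprint to the last known-good run immediately tells you whether your config changed or the world did.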
4) Write the tests that catch 90-day failures
You want tests that fail when the world changes, not only when your code changes.
Recommended test types:
- Contract tests (APIs): call the upstream API in a sandbox and assert key fields still exist.
- Golden-run tests (LLM): keep a small set of representative inputs and verify the output still validates.
- Data-format tests: verify the shape of inbound data from your warehouse/CRM/export hasn’t drifted.
- Browser smoke tests: run one canary flow daily and alert on selector/step failure.
If you only test on deploy, you’ll miss the breakage that happens on Tuesday because someone else shipped.
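A golden-run test can be as small as this sketch. `run_agent`, `GOLDEN_INPUTS`, and the required fields are hypothetical stand-ins for your real agent entry point and contract; the structure (fixed inputs, contract assertion, scheduled run) is what matters.

```python
# Fixed, representative inputs frozen at the time the agent last worked.
GOLDEN_INPUTS = [
    {"company": "Acme Corp", "domain": "acme.example"},
    {"company": "Globex", "domain": "globex.example"},
]

def run_agent(payload: dict) -> dict:
    # Stand-in: the real function would call the model and tools.
    return {"lead_id": f"L-{payload['domain']}", "score": 0.5, "route_to": "default"}

def validate(output: dict) -> bool:
    """Assert the output still matches the contract, not exact wording."""
    required = {"lead_id": str, "score": float, "route_to": str}
    return all(isinstance(output.get(k), t) for k, t in required.items())

def test_golden_runs():
    for payload in GOLDEN_INPUTS:
        output = run_agent(payload)
        assert validate(output), f"contract violated for {payload}"
```

Run this on a schedule (daily cron, not just CI on deploy) so a model or API change surfaces within hours, not at month-end.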
5) Build rollback + “safe mode” paths
Rollbacks are what make maintenance cheap.
- Keep the last known-good prompt/tool config
- Support a “no-write” mode for CRMs and billing systems
- On repeated failure, switch to a fallback: queue for human review or a simpler deterministic flow
A reliable agent sometimes says: “I can’t safely do this right now.”
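The degradation path above can be sketched as a small state machine. The threshold and mode names are assumptions to illustrate the shape, not recommended values:

```python
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"        # full read/write
    NO_WRITE = "no_write"    # read and reason, but queue writes for review
    HUMAN_REVIEW = "human"   # stop and escalate

FAILURE_THRESHOLD = 3  # assumed: consecutive failures before full escalation

def next_mode(consecutive_failures: int) -> Mode:
    """Degrade gracefully instead of retrying writes into a CRM forever."""
    if consecutive_failures == 0:
        return Mode.NORMAL
    if consecutive_failures < FAILURE_THRESHOLD:
        return Mode.NO_WRITE
    return Mode.HUMAN_REVIEW
```

The design choice worth copying: the agent never goes straight from "healthy" to "off." It passes through a no-write mode where it keeps producing output you can inspect, without touching systems of record.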
6) Assign ownership and schedule maintenance
Create a lightweight cadence:
- Weekly (15 minutes): check error rate, top failure modes, alerts, and “unknown unknowns”
- Monthly (60 minutes): review upstream API changes, vendor UI changes, auth/permissions drift
- Quarterly (90 minutes): re-run eval set, reassess model choice, refactor brittle steps
Also: make one person the owner. Not a committee. A name.
A maintenance checklist you can steal
| Area | What breaks | What to do | Frequency |
|---|---|---|---|
| API integrations | Field removed/renamed, auth expiry | Contract tests + alert on schema changes | Daily/weekly |
| Browser automation | UI change, selector drift | Canary run + resilient selectors | Daily |
| LLM outputs | Format drift, refusal drift | Schema validation + golden tests | Weekly |
| Data inputs | Missing fields, new nulls | Data validation + upstream checks | Weekly |
| Costs | Token spikes, runaway loops | Budgets + rate limits + anomaly alerts | Weekly |
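The cost row deserves one concrete guard, because runaway loops are the failure mode that shows up on an invoice. A budget-guard sketch; `MAX_TOKENS_PER_RUN` and `MAX_STEPS` are assumed limits you'd tune, not vendor defaults:

```python
# Assumed per-run limits; tune to your workload.
MAX_TOKENS_PER_RUN = 50_000
MAX_STEPS = 25

class BudgetExceeded(Exception):
    """Raised when a run blows past its token or step budget."""

class RunBudget:
    def __init__(self):
        self.tokens = 0
        self.steps = 0

    def charge(self, tokens: int) -> None:
        """Call after every model step; raises instead of letting a loop run away."""
        self.tokens += tokens
        self.steps += 1
        if self.tokens > MAX_TOKENS_PER_RUN:
            raise BudgetExceeded(f"token budget exceeded: {self.tokens}")
        if self.steps > MAX_STEPS:
            raise BudgetExceeded(f"step budget exceeded: {self.steps}")
```

Wire the raised exception into the same failure path as validation errors, so a runaway run degrades to safe mode like any other fault.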
What most teams get wrong
They optimize for “it worked once”
A demo is a controlled environment. Production is adversarial.
If your agent only works when:
- the data is perfect,
- the UI doesn’t change,
- the model behaves exactly the same,
…then it’s not automation. It’s a fragile script with a chatbot bolted on.
They treat prompts like copy instead of code
Prompt edits feel harmless, so teams change them casually. That’s how you get silent regressions.
If the prompt determines a downstream write into HubSpot, Salesforce, or a database, it is code. No exceptions.
They add more tools instead of adding reliability
More tools increase the surface area for failure: more APIs, more auth, more rate limits, more timeouts.
A boring, observable agent beats a magical one that breaks.
Bottom line
“Set it and forget it” was always a lie.
If you want agents that survive past the first quarter, build them like software: contracts, tests, monitoring, and rollbacks. Then maintenance is predictable and cheap.
If you want a second set of eyes on your agent architecture, observability, and maintenance plan, book a call and I’ll tell you what I’d fix first: https://calendar.app.google/fvvhoEcfBzupGyC27
Sources
- https://www.reddit.com/r/automation/comments/1ja2hxi/what_are_the_biggest_challenges_in_ai_automation/
- https://www.reddit.com/r/n8n/comments/1mg0z79/is_anyone_else_tired_of_ai_agents_that_dont/
- https://www.reddit.com/r/AI_Agents/comments/1r6t1vc/ive_been_running_ai_agents_247_for_3_months_here/
- https://www.reddit.com/r/automation/comments/1nphndt/how_are_you_automating_repetitive_browser_tasks/
- https://www.reddit.com/r/AI_Agents/comments/1ovk0lx/can_we_talk_about_why_90_of_ai_agents_still_fail/
- https://news.ycombinator.com/item?id=47039354
- https://www.anthropic.com/engineering/building-effective-agents
I reply to all emails if you want to chat.