AI agents in production: a 2026 checklist
Over twelve months we shipped nine agents for retail, fintech and logistics. Here is what breaks most often, which stack to pick, and why a GPT wrapper isn't an agent.
Creastra Digest
- An agent = tools + memory + checklists; a prompt alone is not an agent
- You spend 80% of effort on guardrails and observability, not the model
- Quality measurement: a 50-case golden set, refreshed each sprint
As of early 2026 we run nine agents in production. One triages support tickets at a bank, another writes marketplace product copy, a third reconciles warehouse inventory in real time. None of them is a GPT wrapped in a system prompt: each is a full service with its own lifecycle.
1. How an agent differs from a chatbot
An agent solves a task; a chatbot holds a conversation. Solving a task takes three things: tools (search, code, database), memory (short-term and long-term) and an explicit checklist of success criteria. Skip any one of those and you have a demo, not a service.
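To make the three ingredients concrete, here is a minimal sketch in Python. The `Agent` class, its field names and the `run` method are illustrative assumptions, not the code behind any of the nine agents. It only shows that a task counts as done when every checklist item passes, not when the model stops talking.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    """Hypothetical skeleton: tools + memory + success checklist."""
    tools: dict[str, Callable[[str], str]]
    checklist: list[Callable[[str], bool]]
    short_term: list[str] = field(default_factory=list)      # log for the current task
    long_term: dict[str, str] = field(default_factory=dict)  # facts kept across tasks

    def run(self, tool_name: str, task: str) -> tuple[str, bool]:
        # Call one tool and record the step in short-term memory.
        result = self.tools[tool_name](task)
        self.short_term.append(f"{tool_name}({task!r}) -> {result!r}")
        # An empty checklist never counts as success.
        done = bool(self.checklist) and all(check(result) for check in self.checklist)
        return result, done


agent = Agent(
    tools={"lookup": lambda query: query.upper()},  # stand-in for a real tool
    checklist=[lambda r: len(r) > 0, lambda r: r.isupper()],
)
result, done = agent.run("lookup", "order 42")
```

An agent built without a checklist reports `done` as `False` on every run, which is the "demo, not a service" case in code form.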
2. The stack that survived production
- Models: GPT-5 + Claude Opus 4.6 + local Qwen 3 — each on its own task class
- Orchestration: LangGraph for long-running processes, Inngest for cron agents
- Memory: pgvector for semantics, Redis for episodes, Postgres for facts
- Observability: Langfuse + our own grader agent
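"Each on its own task class" can be kept honest with an explicit routing table. The sketch below is an assumption throughout: the task-class names, the model identifier strings and the pairing of task to model are invented for illustration and do not describe our actual assignment or any provider's real model IDs.

```python
# Hypothetical routing table: task class -> model label.
# Both the class names and the pairings are placeholders.
ROUTES: dict[str, str] = {
    "triage": "gpt-5",
    "copywriting": "claude-opus-4.6",
    "reconciliation": "qwen-3-local",
}


def pick_model(task_class: str) -> str:
    """Return the model label for a task class, failing loudly on unknown classes."""
    try:
        return ROUTES[task_class]
    except KeyError:
        raise ValueError(f"no model routed for task class {task_class!r}") from None
```

Raising on an unknown task class, rather than falling back to a default model, keeps a new workload from silently landing on a model nobody evaluated for it.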
3. What breaks most often
Not the model. Frontier models drift by single-digit percentages per quarter. What breaks is everything around them: integrations (an API changed shape), policies (a new rule landed but the prompt wasn't updated) and user expectations (yesterday the task was "describe", today it is "sell").
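For the integration case, one cheap guardrail is to check the shape of every tool response before the agent sees it. The following is a minimal sketch using only the standard library. The field names in `EXPECTED_FIELDS` and the `shape_errors` helper are hypothetical, loosely modelled on a warehouse inventory payload.

```python
from typing import Any

# Hypothetical contract for an inventory API response.
EXPECTED_FIELDS: dict[str, type] = {"sku": str, "quantity": int, "warehouse": str}


def shape_errors(payload: dict[str, Any], expected: dict[str, type]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the shape matches."""
    errors = []
    for name, expected_type in expected.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            actual = type(payload[name]).__name__
            errors.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    return errors


# The upstream API renamed "quantity" to "qty": caught before the agent acts on it.
problems = shape_errors({"sku": "A1", "qty": 3, "warehouse": "W2"}, EXPECTED_FIELDS)
```

A non-empty result is a reason to stop the run and alert, instead of letting the model improvise around a missing field.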
4. How to measure quality
Take 50 real inputs from production (not synthetic ones). Label a gold answer for each. Run the agent against them daily and score it on two axes: correctness and completeness. Refresh the golden set every sprint: that is the quality contract you offer the client.
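The daily run can be sketched as a small harness. The scoring definitions here are an assumption, since the two axes can be defined in several ways: each gold answer is treated as a set of required facts, correctness is the share of the agent's facts that appear in the gold set, and completeness is the share of gold facts the agent produced. `GoldenCase`, `score` and `run_golden_set` are hypothetical names, and `agent` is any callable that maps an input to a set of facts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str
    gold_facts: frozenset[str]


def score(predicted: set[str], gold: frozenset[str]) -> tuple[float, float]:
    """Return (correctness, completeness) for one case under the assumed definitions."""
    hits = len(predicted & gold)
    correctness = hits / len(predicted) if predicted else 0.0
    completeness = hits / len(gold) if gold else 1.0
    return correctness, completeness


def run_golden_set(cases: list[GoldenCase], agent: Callable[[str], set[str]]) -> dict[str, float]:
    """Average both axes over the whole golden set."""
    if not cases:
        raise ValueError("golden set is empty")
    scores = [score(agent(case.input_text), case.gold_facts) for case in cases]
    return {
        "correctness": sum(s[0] for s in scores) / len(scores),
        "completeness": sum(s[1] for s in scores) / len(scores),
    }


# Two toy cases with a stub agent; a real set would hold the 50 labelled inputs.
cases = [
    GoldenCase("c1", "ticket one", frozenset({"a", "b"})),
    GoldenCase("c2", "ticket two", frozenset({"c"})),
]
stub_answers = {"ticket one": {"a"}, "ticket two": {"c", "d"}}
report = run_golden_set(cases, lambda text: stub_answers[text])
```

Tracking the two averages separately matters because they fail differently: an agent that says too little loses completeness, and one that pads its answers loses correctness.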