The Demo-to-Production Gap
Every AI agent demo looks magical. The agent plans, reasons, uses tools, and delivers perfect results in 30 seconds. Then you deploy it to production and it hallucinates API calls, loops infinitely, and costs $47 per request.
I've helped multiple clients cross this gap. Here's what works.
Pattern 1: Supervisor Architecture
The most reliable pattern for production agents. A supervisor LLM orchestrates specialized workers:
```python
class SupervisorAgent:
    def __init__(self):
        self.workers = {
            'research': ResearchWorker(),
            'code': CodeWorker(),
            'review': ReviewWorker(),
        }
        self.max_steps = 10
        self.budget_limit = 0.50  # Max cost per request (USD)

    async def execute(self, task: str):
        plan = await self.plan(task)
        results = []
        cost = 0.0
        for step in plan.steps[:self.max_steps]:
            if cost > self.budget_limit:
                return self.graceful_degradation(results)
            worker = self.workers[step.worker]
            result = await worker.execute(step.instruction)
            cost += result.cost
            results.append(result)
            # Supervisor evaluates: continue, retry, or stop?
            evaluation = await self.evaluate(step, result)
            if evaluation.action == 'stop':
                break
            elif evaluation.action == 'retry':
                result = await worker.execute(step.instruction, feedback=evaluation.feedback)
                cost += result.cost
                results[-1] = result  # Replace the failed attempt, and count its cost too
        return self.synthesize(results)
```

Key decisions:
- Hard step limit: agents that can loop forever, will
- Cost budget: prevents runaway API charges
- Graceful degradation: return partial results when the budget is exceeded
- Evaluation after each step: catch errors early
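The graceful_degradation call can be as simple as returning whatever the finished steps produced, explicitly flagged as incomplete. A minimal sketch (the PartialResult type is illustrative, not part of the snippet above):

```python
from dataclasses import dataclass, field

# Illustrative helper: the PartialResult name is an assumption,
# not part of the SupervisorAgent API above.
@dataclass
class PartialResult:
    complete: bool
    outputs: list = field(default_factory=list)

def graceful_degradation(results):
    # Return the work already paid for, flagged as incomplete,
    # instead of discarding it with an error.
    return PartialResult(complete=False, outputs=list(results))
```

Callers can then render the partial answer with a "budget exceeded" notice instead of a bare failure.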
Pattern 2: Human-in-the-Loop Checkpoints
For high-stakes workflows (financial decisions, infrastructure changes), insert human approval gates:
```python
class ApprovalGate:
    async def check(self, action, context):
        risk = self.assess_risk(action)
        if risk == 'low':
            return True  # Auto-approve
        elif risk == 'medium':
            # Async approval via Slack/email notification
            approval = await self.request_approval(
                channel='#ai-actions',
                message=f"Agent wants to: {action.description}\nContext: {context}",
                timeout_minutes=30,
            )
            return approval.approved
        else:
            # High risk: always require a human
            return await self.require_human_takeover(action, context)
```

I wrote more about human oversight patterns in the context of EU AI Act compliance; the principles apply to any production agent.
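The assess_risk call is deliberately left abstract above; in practice a rule-based classifier is a reasonable starting point before reaching for anything fancier. A sketch under that assumption (the keyword sets and the string-based signature are illustrative, not part of the ApprovalGate API):

```python
# Illustrative rule-based risk classifier; the categories below are
# assumptions, not part of the ApprovalGate shown above.
DESTRUCTIVE_VERBS = {'delete', 'drop', 'transfer', 'deploy', 'terminate'}
SENSITIVE_TARGETS = {'production', 'billing', 'database'}

def assess_risk(action_description: str) -> str:
    words = set(action_description.lower().split())
    # Destructive verb AND sensitive target: require a human.
    if words & DESTRUCTIVE_VERBS and words & SENSITIVE_TARGETS:
        return 'high'
    # Either one alone: async approval is enough.
    if words & DESTRUCTIVE_VERBS or words & SENSITIVE_TARGETS:
        return 'medium'
    return 'low'
```

The important design choice is to default upward: anything the rules don't recognize should land in a higher bucket, never auto-approve.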
Pattern 3: Tool Sandboxing
Never let an agent execute tools with production credentials directly:
```python
import re

class SandboxedToolExecutor:
    def __init__(self):
        self.allowed_tools = {'search', 'read_file', 'calculate'}
        self.blocked_patterns = [
            r'rm\s+-rf',
            r'DROP\s+TABLE',
            r'DELETE\s+FROM',
        ]

    async def execute(self, tool_name, params):
        if tool_name not in self.allowed_tools:
            raise ToolNotAllowed(f"{tool_name} is not permitted")
        # Check for dangerous patterns in params
        for pattern in self.blocked_patterns:
            if re.search(pattern, str(params), re.IGNORECASE):
                raise DangerousOperation(f"Blocked pattern: {pattern}")
        # Execute in isolated environment
        result = await self.sandbox.run(tool_name, params, timeout=30)
        return result
```

Pattern 4: Observability First
You can't debug agents without traces. Use OpenTelemetry to trace every LLM call, tool invocation, and decision:
```python
from opentelemetry import trace

tracer = trace.get_tracer("ai-agent")

async def agent_step(self, instruction):
    with tracer.start_as_current_span("agent_step") as span:
        span.set_attribute("instruction", instruction)
        # LLM call
        with tracer.start_as_current_span("llm_call"):
            response = await self.llm.generate(instruction)
            span.set_attribute("tokens_used", response.usage.total)
            span.set_attribute("cost", response.cost)
        # Tool execution
        if response.tool_calls:
            with tracer.start_as_current_span("tool_execution"):
                for tool_call in response.tool_calls:
                    result = await self.execute_tool(tool_call)
                    span.set_attribute(f"tool.{tool_call.name}.result", str(result)[:500])
        return response
```

For Kubernetes-based deployments, I cover the monitoring stack in detail at Kubernetes Recipes; the same Prometheus + Grafana patterns work for agent observability.
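None of these spans go anywhere until a tracer provider and exporter are wired up at process start. A minimal setup sketch, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages and a collector at the (illustrative) default local endpoint:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Name the service so agent traces are filterable in your backend.
provider = TracerProvider(resource=Resource.create({"service.name": "ai-agent"}))
# Batch spans and ship them to a local OTLP collector (endpoint is illustrative).
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```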
The Reliability Checklist
Before deploying any AI agent to production:
- ✅ Step limit (max iterations before forced stop)
- ✅ Cost budget (max spend per request)
- ✅ Timeout per tool call (30s default)
- ✅ Human approval gates for high-risk actions
- ✅ Tool sandboxing (allowlist, not blocklist)
- ✅ Full observability (traces, costs, latency)
- ✅ Graceful degradation (partial results > errors)
- ✅ Rate limiting (per user and global)
- ✅ Fallback to simpler logic when the agent fails
- ✅ Automated testing with deterministic scenarios
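Most of these items are a few dozen lines each. Per-user rate limiting, for example, can start as an in-memory token bucket (a sketch; for multi-replica deployments you'd back it with Redis or enforce it at the gateway):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-user token bucket: `rate` requests/second, bursting up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        # user_id -> (tokens remaining, timestamp of last check)
        self.buckets = defaultdict(lambda: (capacity, time.monotonic()))

    def allow(self, user_id: str) -> bool:
        tokens, last = self.buckets[user_id]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[user_id] = (tokens, now)
            return False
        self.buckets[user_id] = (tokens - 1, now)
        return True
```

A second global bucket (one shared key) covers the "and global" half of the checklist item.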
AI agents in production are 20% prompt engineering and 80% systems engineering. Treat them like any other distributed system: with circuit breakers, retries, and monitoring.
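To make that last point concrete, a minimal circuit breaker for the LLM call path might look like this (the threshold and cooldown values are illustrative):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; probes again after `cooldown` seconds."""
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let a probe request through once the cooldown has passed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

Wrap each LLM and tool call with record_success/record_failure; when allow() returns False, skip the agent entirely and use the simpler fallback logic from the checklist.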
