The Black Box Problem
Your AI feature is slow. Users complain about wrong answers. Costs spiked 3x last week. But you can't explain why, because LLM calls are opaque: you know the input and the output, but nothing about what happened in between.
This is the observability gap that kills AI products.
OpenTelemetry for LLMs
OpenTelemetry (OTel) is the de facto standard for distributed tracing. It works for LLM applications too, with some AI-specific instrumentation on top.
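Attribute naming is worth deciding up front. OTel's GenAI semantic conventions (still experimental at the time of writing) define standard names such as `gen_ai.system` and `gen_ai.request.model`; if you prefer custom `llm.*` attributes as in the snippets below, a small mapping keeps dashboards portable later. A sketch, with the exact `gen_ai.*` names worth double-checking against the semconv version you pin:

```python
# Map custom llm.* attribute names to OTel GenAI semantic-convention names.
# The gen_ai.* names below follow the (experimental) semconv draft and may
# change between releases -- verify against the current spec.
LLM_TO_SEMCONV = {
    "llm.provider": "gen_ai.system",
    "llm.model": "gen_ai.request.model",
    "llm.input_tokens": "gen_ai.usage.input_tokens",
    "llm.output_tokens": "gen_ai.usage.output_tokens",
}

def to_semconv(attributes: dict) -> dict:
    """Rename known keys, passing unknown ones (e.g. llm.cost_usd) through."""
    return {LLM_TO_SEMCONV.get(k, k): v for k, v in attributes.items()}
```

For example, `to_semconv({"llm.model": "gpt-4o", "llm.cost_usd": 0.01})` yields `{"gen_ai.request.model": "gpt-4o", "llm.cost_usd": 0.01}`.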
Basic LLM Tracing
```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-service")

class TracedLLMClient:
    def __init__(self, client):
        self.client = client

    async def generate(self, messages, model="gpt-4o"):
        with tracer.start_as_current_span("llm.generate") as span:
            span.set_attribute("llm.model", model)
            span.set_attribute("llm.provider", "openai")
            span.set_attribute("llm.input_tokens", self._estimate_tokens(messages))

            start = time.monotonic()
            response = await self.client.chat.completions.create(
                model=model,
                messages=messages,
            )
            latency = time.monotonic() - start

            # Record metrics
            span.set_attribute("llm.output_tokens", response.usage.completion_tokens)
            span.set_attribute("llm.total_tokens", response.usage.total_tokens)
            span.set_attribute("llm.latency_ms", int(latency * 1000))
            span.set_attribute("llm.cost_usd", self._calculate_cost(model, response.usage))
            span.set_attribute("llm.finish_reason", response.choices[0].finish_reason)

            return response
```
Agent Workflow Tracing
For multi-step agent workflows, nest spans to see the full execution tree:
```python
async def agent_workflow(self, user_query):
    with tracer.start_as_current_span("agent.workflow") as root:
        root.set_attribute("user.query", user_query[:200])

        # Step 1: Planning
        with tracer.start_as_current_span("agent.plan"):
            plan = await self.planner.create_plan(user_query)

        # Step 2: Execute steps, collecting each result for synthesis
        results = []
        for i, step in enumerate(plan.steps):
            with tracer.start_as_current_span(f"agent.step.{i}") as step_span:
                step_span.set_attribute("step.type", step.type)
                step_span.set_attribute("step.tool", step.tool_name)

                if step.type == "llm_call":
                    result = await self.traced_llm.generate(step.messages)
                elif step.type == "tool_call":
                    with tracer.start_as_current_span("tool.execute"):
                        result = await self.tools.execute(step.tool_name, step.params)
                else:
                    continue  # skip unknown step types
                results.append(result)

        # Step 3: Synthesize
        with tracer.start_as_current_span("agent.synthesize"):
            final = await self.synthesize(results)

        root.set_attribute("agent.total_steps", len(plan.steps))
        root.set_attribute("agent.total_cost", self.accumulated_cost)
        return final
```
In Jaeger or Grafana Tempo, this renders as a trace tree showing exactly where time and tokens were spent.
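The workflow above attaches `self.accumulated_cost` to the root span but never shows where that number comes from. One minimal sketch (the `CostTracker` name and its wiring are my assumption, not part of the original code) is a per-request accumulator that each traced LLM call feeds right after it computes its cost:

```python
class CostTracker:
    """Accumulates per-call USD costs for one workflow run (hypothetical helper)."""

    def __init__(self):
        self.total = 0.0
        self.by_model = {}

    def record(self, model: str, cost_usd: float) -> None:
        # Called once per LLM call, right after the cost is calculated
        self.total += cost_usd
        self.by_model[model] = self.by_model.get(model, 0.0) + cost_usd

tracker = CostTracker()
tracker.record("gpt-4o", 0.0075)
tracker.record("gpt-4o-mini", 0.0002)
tracker.record("gpt-4o", 0.0030)
# tracker.total is now ~0.0107, tracker.by_model["gpt-4o"] ~0.0105
```

Scoping one tracker per request (rather than per process) keeps the root-span attribute meaningful: it answers "what did this user interaction cost?"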
Key Metrics to Track
Cost Tracking
```python
# Cost per model in USD (rates as of 2026; update regularly)
COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-opus-4": {"input": 0.015, "output": 0.075},
    "claude-sonnet-4": {"input": 0.003, "output": 0.015},
}

def calculate_cost(model, usage):
    # Unknown models fall back to zero cost rather than raising
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0, "output": 0})
    input_cost = (usage.prompt_tokens / 1000) * rates["input"]
    output_cost = (usage.completion_tokens / 1000) * rates["output"]
    return round(input_cost + output_cost, 6)

# e.g. 1,000 prompt + 500 completion tokens on gpt-4o -> 0.0025 + 0.005 = $0.0075
```
Quality Signals
```python
# Track response quality indicators on the span (inside the traced generate call)
span.set_attribute("quality.has_code", "```" in response_text)  # fenced code present?
span.set_attribute("quality.response_length", len(response_text))
span.set_attribute("quality.contains_apology", "sorry" in response_text.lower())
span.set_attribute("quality.finish_reason", finish_reason)  # 'stop' vs 'length' (truncated)
```
Grafana Dashboard
For the monitoring stack setup, I use the same Prometheus + Grafana patterns I detail at Kubernetes Recipes. Key panels:
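As a starting point for those panels, the queries below sketch the PromQL I would reach for first. The metric names (`llm_calls_total`, `llm_cost_usd_total`, `llm_latency_ms_bucket`) are assumptions; they depend entirely on how your collector and exporter name the series:

```python
# Panel -> PromQL query. Metric and label names are assumptions: adjust to
# whatever your OTel collector / Prometheus exporter actually emits.
PANEL_QUERIES = {
    "llm_calls_per_min": "sum(rate(llm_calls_total[1m])) * 60",
    "cost_24h_usd": "sum(increase(llm_cost_usd_total[24h]))",
    "p95_latency_ms": (
        "histogram_quantile(0.95, "
        "sum(rate(llm_latency_ms_bucket[5m])) by (le))"
    ),
    "cost_by_model": "sum(rate(llm_cost_usd_total[1h])) by (model)",
}
```

Keeping the queries in code (or provisioned dashboard JSON) rather than hand-edited panels makes the dashboard reviewable and reproducible across environments.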
Row 1: Overview
- Total LLM calls/min
- Total cost (rolling 24h)
- P95 latency
- Error rate
Row 2: Cost Breakdown
- Cost by model (pie chart)
- Cost by feature (bar chart)
- Cost trend (7-day line)
- Projected monthly cost
Row 3: Performance
- Latency by model (heatmap)
- Token usage distribution
- Cache hit rate
- Rate limit events
Row 4: Quality
- Finish reason distribution
- Average response length trend
- User feedback scores
- Retry rate
Automating with Ansible
For teams deploying the OTel collector across multiple environments, I use Ansible playbooks to standardize the setup. The infrastructure-as-code approach I teach at Ansible Pilot applies directly: treat your observability stack like any other managed service:
```yaml
- name: Deploy OTel collector for AI services
  hosts: ai_services
  roles:
    - role: otel-collector
      vars:
        exporters:
          - type: otlp
            endpoint: tempo.monitoring:4317
          - type: prometheus
            endpoint: "0.0.0.0:8889"
        processors:
          - type: batch
            timeout: 5s
          - type: attributes
            actions:
              - key: environment
                value: production
                action: upsert
```
The ROI of AI Observability
Without observability:
- Debugging a bad AI response: 2-4 hours (reproduce, guess, retry)
- Finding cost spikes: manual bill review, days later
- Explaining AI behavior to stakeholders: impossible
With observability:
- Debugging: 5 minutes (trace the exact request)
- Cost spikes: real-time alerts
- Explainability: "here's the trace showing every step the AI took"
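The "real-time alerts" point deserves a concrete shape. Prometheus alerting rules are the production answer, but the core logic is just a threshold check against a rolling baseline; a sketch in a few lines of Python, where everything (including the 3x factor) is illustrative:

```python
from collections import deque

class CostSpikeDetector:
    """Flags when one interval's spend runs well above the trailing baseline (illustrative)."""

    def __init__(self, window: int = 60, spike_factor: float = 3.0):
        self.costs = deque(maxlen=window)  # rolling window of per-interval costs
        self.spike_factor = spike_factor

    def observe(self, interval_cost: float) -> bool:
        """Record one interval's cost; return True if it looks like a spike."""
        baseline = sum(self.costs) / len(self.costs) if self.costs else None
        self.costs.append(interval_cost)
        return baseline is not None and interval_cost > self.spike_factor * baseline

detector = CostSpikeDetector(window=10)
for cost in [1.0, 1.1, 0.9, 1.0]:
    detector.observe(cost)  # steady spend: returns False each time
detector.observe(5.0)       # 5x the ~1.0 baseline: returns True
```

The same check works whether the input is dollars per minute scraped from your metrics or costs summed per trace.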
Instrument from day one. The cost of adding observability later is 10x the cost of building it in.
