Why AI Observability Is Different
Traditional observability (RED metrics, golden signals) doesn't capture AI-specific performance:
| Traditional Metrics | AI Metrics |
|---|---|
| Request latency | Time-to-first-token (TTFT) |
| Error rate | Hallucination rate |
| Throughput (req/s) | Token throughput (tok/s) |
| CPU/memory | GPU utilization, KV-cache usage |
| Uptime | Model accuracy drift |
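To make the first row of the table concrete, here is a minimal sketch of how TTFT and inter-token latency could be measured on the client side of a streamed response. The function name `measure_stream` and the injectable `clock` parameter are illustrative assumptions, not part of vLLM or any client library.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], List[float]]:
    """Return (ttft_seconds, inter_token_gaps) for one streamed response.

    `tokens` is any iterable that yields tokens as they arrive.
    TTFT is None if the stream produced no tokens.
    """
    start = clock()
    ttft = None
    gaps = []
    last = start
    for _ in tokens:
        now = clock()
        if ttft is None:
            # First token: time since the request was issued.
            ttft = now - start
        else:
            # Later tokens: time since the previous token.
            gaps.append(now - last)
        last = now
    return ttft, gaps
```

Request latency alone would hide the difference between a slow first token and slow decoding, which is why the two are tracked separately.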
The AI Observability Stack
┌─────────────────────────────────────────────┐
│             Grafana Dashboards              │
└──────────────────┬──────────────────────────┘
                   │
┌──────────────────▼──────────────────────────┐
│             Prometheus + Thanos             │
└──────┬───────────┼───────────┬──────────────┘
       │           │           │
┌──────▼───┐  ┌────▼────┐  ┌───▼──────┐
│   vLLM   │  │  DCGM   │  │  Custom  │
│ /metrics │  │Exporter │  │ Exporter │
└──────────┘  └─────────┘  └──────────┘

Key Metrics to Track
1. Inference Performance
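The percentile rules in this section rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside cumulative histogram buckets. The sketch below mirrors that idea in plain Python for positive-valued buckets such as latencies; it is a simplified illustration, not Prometheus's actual implementation, and the bucket data in any usage is made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count), sorted by
    upper bound, with at least one finite bucket and a final +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket
```

Because of the interpolation, the result is only as precise as the bucket boundaries allow, so a P99 that lands in a wide bucket is a rough estimate.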
# Prometheus rules for AI inference
# Note: depending on the vLLM version, these metrics may be exposed with a
# `vllm:` prefix (e.g. vllm:time_to_first_token_seconds); match the names
# to your /metrics output.
groups:
  - name: ai-inference
    rules:
      # Time to first token (TTFT), aggregated across pods per bucket
      - record: ai:ttft_seconds:p50
        expr: histogram_quantile(0.50, sum by (le) (rate(vllm_time_to_first_token_seconds_bucket[5m])))
      - record: ai:ttft_seconds:p99
        expr: histogram_quantile(0.99, sum by (le) (rate(vllm_time_to_first_token_seconds_bucket[5m])))
      # Inter-token latency
      - record: ai:itl_seconds:p50
        expr: histogram_quantile(0.50, sum by (le) (rate(vllm_time_per_output_token_seconds_bucket[5m])))
      # Token throughput
      - record: ai:tokens_per_second
        expr: sum(rate(vllm_generation_tokens_total[1m]))
      # Request throughput
      - record: ai:requests_per_second
        expr: sum(rate(vllm_request_success_total[1m]))

2. GPU Health
      # GPU utilization
      - record: ai:gpu_utilization:avg
        expr: avg(DCGM_FI_DEV_GPU_UTIL)
      # GPU memory used (DCGM reports framebuffer usage in MiB)
      - record: ai:gpu_memory_used_bytes
        expr: DCGM_FI_DEV_FB_USED * 1024 * 1024
      # GPU temperature
      - record: ai:gpu_temperature_celsius
        expr: DCGM_FI_DEV_GPU_TEMP
      # GPU power usage
      - record: ai:gpu_power_watts
        expr: DCGM_FI_DEV_POWER_USAGE

3. KV-Cache and Memory
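For context on why KV-cache utilization deserves its own metrics, the following sketch estimates how much memory each cached token needs in a standard transformer decoder. The model dimensions (32 layers, 8 KV heads, head size 128, 2-byte values) and the 16 GiB cache budget are illustrative assumptions, not figures from any specific deployment.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV-cache needed for one token.

    The factor of 2 covers one key vector and one value vector
    per layer and per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(cache_budget_bytes, bytes_per_token):
    """How many tokens fit in the cache across all active sequences."""
    return cache_budget_bytes // bytes_per_token


# Illustrative (assumed) model shape and cache budget.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
capacity = max_cached_tokens(16 * 1024**3, per_token)
```

With these assumed numbers each token costs 128 KiB, so long prompts and many concurrent sequences can fill the cache well before GPU compute is saturated.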
      # KV-cache utilization
      - record: ai:kv_cache_usage_percent
        expr: vllm_gpu_cache_usage_perc * 100
      # Pending requests (queue depth)
      - record: ai:pending_requests
        expr: sum(vllm_num_requests_waiting)
      # Active sequences
      - record: ai:active_sequences
        expr: sum(vllm_num_requests_running)

Alerting Rules
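Each alert in this section pairs a threshold expression with a `for:` duration, so a brief spike does not page anyone. The sketch below simulates that behaviour over a series of rule evaluations: the condition must hold continuously for the whole duration before the alert moves from pending to firing. The function name and sample data are hypothetical.

```python
from typing import List, Tuple


def alert_states(
    samples: List[Tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> List[str]:
    """Return the alert state after each evaluation.

    `samples` holds (timestamp_seconds, value) pairs, one per rule
    evaluation, in time order.
    """
    states = []
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                # Condition just became true: start the pending timer.
                active_since = ts
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
            states.append("inactive")
    return states
```

For example, a P99 TTFT that exceeds 5s at three consecutive one-minute evaluations would fire a `for: 2m` alert on the third, while a single dip below the threshold restarts the wait.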
groups:
  - name: ai-inference-alerts
    rules:
      # TTFT degradation
      - alert: HighTimeToFirstToken
        expr: ai:ttft_seconds:p99 > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "P99 TTFT exceeds 5s (current: {{ $value }}s)"
          runbook: "https://wiki.internal/runbooks/ai-ttft-high"
      # GPU memory exhaustion
      - alert: GPUMemoryNearFull
        expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} memory at {{ $value | humanizePercentage }}"
      # Model serving errors: failures as a share of all requests
      - alert: HighInferenceErrorRate
        expr: |
          sum(rate(vllm_request_failure_total[5m])) /
          (
            sum(rate(vllm_request_success_total[5m])) +
            sum(rate(vllm_request_failure_total[5m]))
          ) > 0.05
        for: 3m
        labels:
          severity: critical
      # KV-cache pressure
      - alert: KVCachePressure
        expr: ai:kv_cache_usage_percent > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KV-cache at {{ $value }}%; requests may be rejected"
      # GPU thermal throttling
      - alert: GPUThermalThrottle
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 2m
        labels:
          severity: warning

Grafana Dashboard
Panel Definitions
{
  "panels": [
    {
      "title": "Token Throughput",
      "type": "timeseries",
      "targets": [{"expr": "sum(rate(vllm_generation_tokens_total[1m]))"}]
    },
    {
      "title": "TTFT (P50 / P99)",
      "type": "timeseries",
      "targets": [
        {"expr": "ai:ttft_seconds:p50", "legendFormat": "P50"},
        {"expr": "ai:ttft_seconds:p99", "legendFormat": "P99"}
      ]
    },
    {
      "title": "GPU Utilization per Pod",
      "type": "heatmap",
      "targets": [{"expr": "DCGM_FI_DEV_GPU_UTIL"}]
    },
    {
      "title": "KV-Cache Usage",
      "type": "gauge",
      "targets": [{"expr": "avg(vllm_gpu_cache_usage_perc) * 100"}],
      "thresholds": [60, 80, 90]
    },
    {
      "title": "Cost per 1M Tokens",
      "type": "stat",
      "targets": [{"expr": "(sum(node_gpu_cost_per_hour) / 3600) / sum(rate(vllm_generation_tokens_total[1h])) * 1000000"}]
    }
  ]
}

Cost Tracking
Per-Request Cost Calculation
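# Custom metrics exporter
from prometheus_client import Histogram

# Labeled by model so cost can be broken down per model in PromQL
token_cost = Histogram(
    'ai_request_cost_dollars',
    'Cost per inference request in dollars',
    labelnames=['model'],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
)

# Calculate based on GPU-hour cost and tokens generated
GPU_COST_PER_HOUR = 3.25  # A100 spot price
GPU_COST_PER_SECOND = GPU_COST_PER_HOUR / 3600


def track_request_cost(duration_seconds, tokens_generated, model='default'):
    """Record one request's GPU cost and return its cost per token.

    Assumes the request occupied the GPU alone for its whole duration;
    with batched serving this overestimates the per-request cost.
    """
    cost = duration_seconds * GPU_COST_PER_SECOND
    token_cost.labels(model=model).observe(cost)
    return cost / tokens_generated if tokens_generated else 0.0

Cost Dashboard Queries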
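The cost-per-token figures in this section reduce to simple arithmetic: convert the hourly GPU price to a per-second price, then divide by tokens generated per second. A minimal sketch follows; the $3.25 hourly price is the one used in the exporter above, while the 2,500 tok/s throughput is an assumed figure for illustration.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollars spent per one million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    # Both rates must share a time unit, so convert $/hour to $/second.
    cost_per_second = gpu_cost_per_hour / 3600
    return cost_per_second / tokens_per_second * 1_000_000


# Assumed throughput of 2,500 tok/s on a single $3.25/hour GPU.
example = cost_per_million_tokens(3.25, 2500)
```

Under these assumptions the result is roughly $0.36 per million tokens. Skipping the hour-to-second conversion would inflate it by a factor of 3600, which is an easy mistake to make when the cost is a per-hour value and the token rate is per second.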
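# Cost per 1M tokens (PromQL)
# node_gpu_cost_per_hour is assumed to be a gauge in dollars per GPU-hour
(
  (sum(node_gpu_cost_per_hour) / 3600) /
  sum(rate(vllm_generation_tokens_total[1h]))
) * 1000000
# Daily inference spend (average hourly cost over the last day, times 24 hours)
sum(avg_over_time(node_gpu_cost_per_hour[24h])) * 24
# Cost by model (dollars per second; multiply by 3600 for dollars per hour)
sum by (model) (rate(ai_request_cost_dollars_sum[1h]))

OpenTelemetry Integration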
Auto-Instrumentation for AI Services
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: ai-instrumentation
spec:
  python:
    env:
      - name: OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
        value: ""
  env:
    - name: OTEL_TRACES_EXPORTER
      value: otlp
    - name: OTEL_METRICS_EXPORTER
      value: prometheus
    # 4317 is the default OTLP/gRPC port; use 4318 if the exporter
    # speaks OTLP over HTTP
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: http://otel-collector:4317

Trace AI Request Flow
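One way to obtain a per-stage breakdown like the one in this section is to time each stage of a request. The sketch below uses only the standard library as a stand-in for a real tracing SDK; `RequestTrace` and its `stage` method are hypothetical names, not part of OpenTelemetry or vLLM.

```python
import time
from contextlib import contextmanager


class RequestTrace:
    """Collects (stage_name, duration_seconds) pairs for one request."""

    def __init__(self, name, clock=time.perf_counter):
        self.name = name
        self.clock = clock
        self.stages = []

    @contextmanager
    def stage(self, stage_name):
        """Time the enclosed block and record it as a child stage."""
        start = self.clock()
        try:
            yield
        finally:
            # Record the duration even if the stage raised an exception.
            self.stages.append((stage_name, self.clock() - start))

    def total(self):
        """Sum of all recorded stage durations, in seconds."""
        return sum(duration for _, duration in self.stages)
```

A caller would wrap each step, for example `with trace.stage("prefill"): ...`, and export the recorded stages as child spans of the request. In a real deployment the tracing SDK handles the timing and context propagation instead.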
Client → API Gateway → Load Balancer → vLLM Pod
  │                                        │
  │  ┌─────────────────────────────────┐   │
  │  │ Span: inference_request         │   │
  │  │ ├── tokenize (2ms)              │   │
  │  │ ├── prefill (150ms)             │   │
  │  │ ├── decode (800ms, 200 tokens)  │   │
  │  │ └── detokenize (1ms)            │   │
  │  └─────────────────────────────────┘   │
  │                                        │
  └────────── Total: 953ms ────────────────┘

Production Monitoring Checklist
- TTFT, ITL, throughput dashboards
- GPU utilization, memory, temperature
- KV-cache pressure alerts
- Cost per token tracking
- Error rate and request queue depth
- Model accuracy/quality monitoring (LLM judge)
- Trace sampling for request debugging
- SLO definition (P99 TTFT under 3s, error rate under 1%)
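The SLO in the last item can be checked mechanically. The sketch below evaluates a window of requests against the two targets named in the checklist (P99 TTFT under 3s, error rate under 1%) and reports how much of the error budget has been used. The function names and the nearest-rank percentile method are assumptions made for illustration.

```python
def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slo_report(ttft_seconds, total_requests, failed_requests,
               ttft_slo=3.0, error_slo=0.01):
    """Compare one window of traffic against the TTFT and error-rate SLOs."""
    error_rate = failed_requests / total_requests
    # The error budget is the number of failures the SLO tolerates.
    error_budget = error_slo * total_requests
    ttft_p99 = p99(ttft_seconds)
    return {
        "p99_ttft": ttft_p99,
        "ttft_ok": ttft_p99 < ttft_slo,
        "error_rate": error_rate,
        "error_ok": error_rate < error_slo,
        "error_budget_used": failed_requests / error_budget,
    }
```

Tracking the budget fraction rather than a simple pass/fail makes it easier to see how close a service is to breaching the SLO before it actually does.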