
Model Observability: Monitoring LLM Performance in Production

Luca Berton 1 min read
#llm#observability#monitoring#opentelemetry#production

📊 You Can’t Manage What You Can’t Measure

LLMs in production need different monitoring than traditional services. Response quality can degrade silently — the service is “up” but answers are wrong. Here’s how to catch problems before users do.

The Four Pillars of LLM Observability

1. Performance Metrics

from prometheus_client import Histogram, Counter, Gauge

# Track latency by operation
llm_latency = Histogram(
    'llm_request_duration_seconds',
    'LLM request latency',
    ['model', 'operation'],  # operation: generate, embed, rerank
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Track token throughput
tokens_processed = Counter(
    'llm_tokens_total',
    'Total tokens processed',
    ['model', 'direction']  # direction: input, output
)

# Track queue depth
queue_depth = Gauge(
    'llm_queue_depth',
    'Number of requests waiting',
    ['model']
)
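Those `buckets` are cumulative: each bucket counts every observation at or below its upper bound, plus an implicit `+Inf` bucket, which is the shape `histogram_quantile` expects. A small pure-Python sketch of that semantics (the `bucket_counts` helper is illustrative, not part of `prometheus_client`):

```python
BUCKETS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

def bucket_counts(latencies, buckets=BUCKETS):
    # Mirror Prometheus histogram semantics: each 'le' bucket counts
    # all observations <= its upper bound (cumulative, not disjoint).
    counts = {le: 0 for le in list(buckets) + [float("inf")]}
    for v in latencies:
        for le in counts:
            if v <= le:
                counts[le] += 1
    return counts

counts = bucket_counts([0.05, 0.3, 7.0])
```

A 0.3s request therefore increments the 0.5, 1.0, 2.5, 5.0, 10.0 and `+Inf` buckets, not just one of them.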

2. Quality Metrics

async def evaluate_response(query, response, context=None):
    """Score a single response. check_faithfulness, check_relevance and
    toxicity_classifier are your evaluator backends (LLM judge, NLI
    model, toxicity classifier, etc.)."""
    metrics = {}
    
    # Faithfulness: does the response stick to the provided context?
    if context:
        metrics["faithfulness"] = await check_faithfulness(response, context)
    
    # Relevance: does the response actually answer the question?
    metrics["relevance"] = await check_relevance(query, response)
    
    # Toxicity: is the response safe?
    metrics["toxicity_score"] = await toxicity_classifier(response)
    
    # Length: unreasonably short/long responses indicate problems
    metrics["response_length"] = len(response.split())
    
    return metrics
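The helpers above are typically model-based scorers. As an illustration of the interface only, here is a deliberately crude lexical stand-in for `check_relevance` (a real implementation would use embedding similarity or an LLM judge; `overlap_relevance` and its stopword list are hypothetical):

```python
def overlap_relevance(query: str, response: str) -> float:
    # Crude stand-in for check_relevance(): fraction of query content
    # words that reappear in the response. Production systems should use
    # embedding cosine similarity or an LLM judge instead.
    stop = {"the", "a", "an", "is", "are", "of", "to", "in", "what", "how"}
    q = {w for w in query.lower().split() if w not in stop}
    if not q:
        return 0.0
    r = set(response.lower().split())
    return len(q & r) / len(q)
```

Whatever scorer you use, emit the result as a Prometheus metric so quality regressions show up on the same dashboards as latency.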

3. Cost Tracking

from prometheus_client import Counter, Gauge

MODEL_COSTS = {
    "granite-34b": {"input": 0.0008, "output": 0.0024},  # per 1K tokens
    "granite-8b": {"input": 0.0002, "output": 0.0006},
}

cost_counter = Counter(
    'llm_cost_dollars_total', 'Cumulative LLM spend in dollars', ['model']
)
daily_budget_gauge = Gauge(
    'llm_daily_budget_remaining_dollars', 'Remaining daily budget', ['model']
)

def track_cost(model, input_tokens, output_tokens):
    costs = MODEL_COSTS[model]
    total = (input_tokens * costs["input"] + output_tokens * costs["output"]) / 1000
    
    cost_counter.labels(model=model).inc(total)
    # get_remaining_budget() is your budget bookkeeping (DB lookup, etc.)
    daily_budget_gauge.labels(model=model).set(get_remaining_budget(model))
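A quick sanity check of the per-1K formula: 1,000 input tokens plus 500 output tokens on granite-34b should cost $0.0008 + $0.0012 = $0.0020 (rates from the table above; `request_cost` is just the formula pulled out for illustration):

```python
def request_cost(input_tokens, output_tokens, in_rate=0.0008, out_rate=0.0024):
    # Rates are dollars per 1K tokens (granite-34b values from the table).
    return (input_tokens * in_rate + output_tokens * out_rate) / 1000

cost = request_cost(1000, 500)  # $0.0008 input + $0.0012 output
```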

4. Drift Detection

# Compare response distributions over time
def detect_drift(current_window, baseline_window):
    metrics = {
        "avg_response_length": compare_distributions(
            current_window.response_lengths,
            baseline_window.response_lengths
        ),
        "avg_latency": compare_distributions(
            current_window.latencies,
            baseline_window.latencies
        ),
        "refusal_rate": current_window.refusal_count / current_window.total,
    }
    
    if metrics["refusal_rate"] > 0.1:  # More than 10% refusals
        alert("High refusal rate detected")
    
    return metrics
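`compare_distributions` can be anything from a mean-shift check to a proper two-sample test. A minimal sketch using the Kolmogorov–Smirnov statistic (the maximum gap between the two empirical CDFs), assuming the windows expose plain lists of samples:

```python
def ks_statistic(sample_a, sample_b):
    # Max vertical distance between the two empirical CDFs:
    # 0.0 = identical distributions, 1.0 = fully disjoint.
    a, b = sorted(sample_a), sorted(sample_b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        while i < na and a[i] == x:
            i += 1
        while j < nb and b[j] == x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d
```

Alert when the statistic on response lengths or latencies crosses a threshold you calibrate against historical windows.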

OpenTelemetry Integration

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("llm-service")

async def generate(prompt, model="granite-34b"):
    # Open the span with a context manager rather than decorating the
    # coroutine, so the span stays open across the await.
    with tracer.start_as_current_span("llm_generate") as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt_tokens", count_tokens(prompt))
        
        # llm_client is your inference backend (vLLM, OpenAI-compatible, ...)
        response = await llm_client.generate(prompt, model=model)
        
        span.set_attribute("llm.completion_tokens", response.usage.output_tokens)
        span.set_attribute("llm.total_cost", calculate_cost(response.usage))
        
        return response

Alerting Rules

groups:
- name: llm-alerts
  rules:
  - alert: LLMHighLatency
    expr: histogram_quantile(0.95, sum by (le) (rate(llm_request_duration_seconds_bucket[5m]))) > 5
    for: 5m
    annotations:
      summary: "LLM P95 latency above 5 seconds"
      
  - alert: LLMHighErrorRate
    expr: rate(llm_errors_total[5m]) / rate(llm_requests_total[5m]) > 0.05
    for: 3m
    annotations:
      summary: "LLM error rate above 5%"
    
  - alert: LLMBudgetExceeded
    expr: llm_daily_cost_total > 500
    annotations:
      summary: "Daily LLM spend exceeded $500"

Key Takeaways

  1. Monitor quality, not just uptime — an LLM can be “up” and still give wrong answers
  2. Track costs per-request — LLM costs can spiral without visibility
  3. Detect drift — model behavior changes over time
  4. Trace end-to-end — from user query through RAG pipeline to response
  5. Alert on business metrics — refusal rate, user satisfaction, task completion

Need LLM observability for your AI platform? I help teams build comprehensive monitoring for production AI. Get in touch.


Luca Berton

AI & Cloud Advisor with 18+ years experience. Author of 8 technical books, creator of Ansible Pilot, and instructor at CopyPasteLearn Academy. Speaker at KubeCon EU & Red Hat Summit 2026.
