
Observability-Driven Development

Design applications with observability from the start. OpenTelemetry instrumentation patterns, trace-based testing, and production debugging workflows.

Luca Berton
· 1 min read

Observability-driven development means instrumenting your code for production visibility before you ship it, not after the first outage. OpenTelemetry has made this practical by providing a single standard for traces, metrics, and logs.

The OTel Standard

OpenTelemetry provides vendor-neutral APIs and SDKs for:

  • Traces: follow a request across services
  • Metrics: counters, histograms, gauges
  • Logs: structured log events correlated with traces

The key insight: instrument once, export anywhere. Switch from Jaeger to Datadog to Grafana Tempo by changing exporter or Collector configuration, not application code.

Instrumenting a Python Service

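from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Initialize tracing: the exporter must be attached to the provider through a
# span processor, otherwise spans never leave the process.
# insecure=True assumes a plaintext gRPC Collector endpoint inside the cluster.
span_exporter = OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(span_exporter))
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer(__name__)

# Initialize metrics: register a MeterProvider that exports periodically over OTLP
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests"
)

# Auto-instrument FastAPI (instrument() is an instance method; call it before
# the FastAPI app is created)
FastAPIInstrumentor().instrument()

# Custom spans for business logic
@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    # ... business logic
    request_counter.add(1, {"endpoint": "/orders"})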

The Collector Architecture

The OpenTelemetry Collector sits between your applications and backends:

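# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true  # assumes Tempo accepts plaintext gRPC inside the cluster
  loki:  # contrib-only exporter; check that your Collector version still ships it
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    # memory_limiter goes first so it can refuse data before batching buffers it
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]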
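
To make the batch settings above less abstract, here is a toy model of size-or-timeout batching in Python. It is a sketch of the idea only, not the Collector's implementation: the `BatchBuffer` class, its method names, and the explicit `now` argument are hypothetical, and the timeout here is measured from the first buffered item for simplicity.

```python
from typing import Callable, List


class BatchBuffer:
    """Toy model of size-or-timeout batching (hypothetical, for illustration)."""

    def __init__(self, send: Callable[[List[str]], None],
                 send_batch_size: int = 1024, timeout_s: float = 5.0):
        self._send = send
        self._size = send_batch_size
        self._timeout = timeout_s
        self._items: List[str] = []
        self._first_item_at = 0.0

    def add(self, item: str, now: float) -> None:
        if not self._items:
            # Start the timeout clock when the first item enters an empty buffer
            self._first_item_at = now
        self._items.append(item)
        if len(self._items) >= self._size:
            self.flush()  # size threshold reached

    def tick(self, now: float) -> None:
        """Called periodically; sends a partial batch once the timeout has elapsed."""
        if self._items and now - self._first_item_at >= self._timeout:
            self.flush()

    def flush(self) -> None:
        if self._items:
            self._send(self._items)
            self._items = []


# Usage: a batch size of 3 and a 5 second timeout, with time passed in explicitly
sent: List[List[str]] = []
buf = BatchBuffer(sent.append, send_batch_size=3, timeout_s=5.0)
for i in range(4):
    buf.add(f"span-{i}", now=0.0)  # the third add triggers a size-based flush
buf.tick(now=4.0)  # timeout not reached, span-3 stays buffered
buf.tick(now=5.0)  # timeout reached, the partial batch is sent
```

Larger batches mean fewer export calls but more data held in memory, which is one reason to pair `batch` with `memory_limiter`.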

Kubernetes-Native Observability

On Kubernetes, deploy the OTel Collector as a DaemonSet for node-level collection and a Deployment for cluster-level aggregation. The OpenTelemetry Operator automates this: it manages Collector instances in either mode from a custom resource.

For platform engineering teams, observability is a platform capability. Provide developers with auto-instrumentation (the Operator can inject it into pods) and, where needed, Collector sidecars, so they get traces without writing instrumentation code.

Common Pitfalls

Too many custom spans: every span has overhead. Instrument service boundaries and business-critical paths, not every function call.

Missing context propagation: if trace context does not propagate across service boundaries, you get disconnected traces. Use W3C Trace Context headers.
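
For a sense of what actually travels between services, here is a minimal standard-library sketch of the W3C `traceparent` header. In practice the OpenTelemetry propagators handle this for you. The names `TraceContext`, `parse_traceparent`, and `outgoing_headers` are hypothetical, and the sketch only accepts version `00` of the header.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str   # 32 hex chars, shared by every span in the trace
    parent_id: str  # 16 hex chars, the span ID of the caller
    sampled: bool   # lowest bit of the flags byte


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is not valid version 00."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # This sketch handles version 00 only; all-zero IDs are invalid
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def outgoing_headers(ctx: TraceContext, new_span_id: Optional[str] = None) -> dict:
    """Headers for a downstream call: same trace ID, this service's span as parent."""
    span_id = new_span_id or secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if ctx.sampled else "00"
    return {"traceparent": f"00-{ctx.trace_id}-{span_id}-{flags}"}


# Usage: continue an incoming trace on an outgoing request
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
headers = outgoing_headers(ctx, new_span_id="b7ad6b7169203331")
```

If a service drops this header, or starts a fresh trace ID instead of reusing the incoming one, the backend sees two unrelated traces.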

Ignoring cardinality: high-cardinality labels on metrics (user IDs, request IDs) will blow up your metrics storage. Use traces for high-cardinality data.
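
A quick back-of-the-envelope calculation shows why. In the worst case, one metric produces a time series for every combination of label values, so the count is the product of the per-label cardinalities. The label names and counts below are made-up illustrative numbers, and `worst_case_series` is a hypothetical helper.

```python
from math import prod
from typing import Dict


def worst_case_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request counter
bounded = {"method": 5, "status_code": 10, "endpoint": 20}
with_user_id = {**bounded, "user_id": 100_000}

print(worst_case_series(bounded))       # 1000
print(worst_case_series(with_user_id))  # 100000000
```

One unbounded label turns a thousand series into a hundred million. The same user ID costs almost nothing as a span attribute, because traces are stored per request rather than per label combination.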

The Grafana Stack

My recommended open-source observability backend:

  • Grafana: dashboards and alerting
  • Tempo: distributed traces
  • Mimir: Prometheus-compatible metrics (long-term)
  • Loki: log aggregation

This stack handles everything from edge monitoring to AI workload observability at enterprise scale.

Instrument before you ship. Debug in production with confidence, not with panic.
