Observability at enterprise scale is not "install Prometheus and add Grafana." It is a data engineering problem: millions of time series, terabytes of logs per day, distributed traces across hundreds of services, and alerting that does not wake people up for false positives.
The LGTM+ Stack
The industry-standard open-source observability stack:
    ┌─────────────────────────────────────────────┐
    │                   Grafana                   │
    │   Dashboards │ Alerting │ Explore │ SLOs    │
    ├──────┬───────┬──────────┬───────────────────┤
    │ Loki │ Mimir │  Tempo   │     Pyroscope     │
    │ Logs │Metrics│  Traces  │     Profiles      │
    ├──────┴───────┴──────────┴───────────────────┤
    │           OpenTelemetry Collector           │
    │     Receive → Process → Export → Sample     │
    ├─────────────────────────────────────────────┤
    │            Kubernetes Cluster(s)            │
    │   Pods → OTel SDK / Auto-instrumentation    │
    └─────────────────────────────────────────────┘

- Mimir: horizontally scalable Prometheus (replaces standalone Prometheus at scale)
- Loki: log aggregation indexed by labels (not full-text)
- Tempo: distributed tracing backend
- Pyroscope: continuous profiling
- Grafana: unified visualization, alerting, and exploration
Why Not Datadog/New Relic?
At enterprise scale (50+ clusters, 10M+ active series, 5+ TB of logs per day), commercial observability costs $500K to $2M+ per year. The LGTM stack on Kubernetes costs 10-20% of that in infrastructure, with full data ownership.
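To make the comparison concrete, the arithmetic can be worked through in a few lines. This is only a sketch using the figures quoted above; the function name `self_hosted_range` is hypothetical, and the result covers infrastructure only.

```python
def self_hosted_range(commercial_low, commercial_high, pct_low=10, pct_high=20):
    """Return the (low, high) yearly infrastructure cost in dollars.

    Takes the self-hosted stack as pct_low..pct_high percent of the
    commercial bill, i.e. the 10-20% figure quoted in the text.
    """
    return (commercial_low * pct_low // 100, commercial_high * pct_high // 100)


low, high = self_hosted_range(500_000, 2_000_000)
print(f"${low:,} to ${high:,} per year")  # $50,000 to $400,000 per year
```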
OpenTelemetry: The Collection Layer
OpenTelemetry (OTel) is the CNCF standard for instrumentation. One SDK, three signal types (metrics, logs, and traces):
    # OpenTelemetry Collector configuration
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: otel-collector-config
    data:
      config.yaml: |
        receivers:
          otlp:
            protocols:
              grpc:
                endpoint: 0.0.0.0:4317
              http:
                endpoint: 0.0.0.0:4318
          prometheus:
            config:
              scrape_configs:
                - job_name: 'kubernetes-pods'
                  kubernetes_sd_configs:
                    - role: pod
        processors:
          batch:
            timeout: 5s
            send_batch_size: 8192
          memory_limiter:
            check_interval: 1s  # required by the memory_limiter processor
            limit_mib: 1024
            spike_limit_mib: 256
          tail_sampling:
            decision_wait: 10s
            policies:
              - name: errors-only
                type: status_code
                status_code:
                  status_codes: [ERROR]
              - name: slow-requests
                type: latency
                latency:
                  threshold_ms: 1000
              # Baseline sample of the remaining (normal) traces
              - name: baseline-sample
                type: probabilistic
                probabilistic:
                  sampling_percentage: 5
        exporters:
          # Mimir and Loki ingest OTLP over HTTP. The ports below assume
          # their default HTTP listen ports; adjust to your deployment.
          otlphttp/mimir:
            endpoint: "http://mimir-distributor:8080/otlp"
          otlphttp/loki:
            endpoint: "http://loki-distributor:3100/otlp"
          otlp/tempo:
            endpoint: "tempo-distributor:4317"
            tls:
              insecure: true  # assumes plaintext traffic inside the cluster
        service:
          pipelines:
            metrics:
              receivers: [otlp, prometheus]
              processors: [memory_limiter, batch]
              exporters: [otlphttp/mimir]
            logs:
              receivers: [otlp]
              processors: [memory_limiter, batch]
              exporters: [otlphttp/loki]
            traces:
              receivers: [otlp]
              processors: [memory_limiter, tail_sampling, batch]
              exporters: [otlp/tempo]

Tail sampling is critical: keep every error trace and every slow request in full, and sample normal traces at 1-5%. This reduces storage costs by 80%+ with minimal visibility loss.
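The sampling policy described above can be sketched in plain Python to show why it saves so much storage. This is an illustration of the decision logic only, not the Collector's implementation; the `Trace` class, the function names, and the traffic mix (1% errors, 2% slow requests) are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(trace, latency_threshold_ms=1000, baseline_rate=0.05, rng=random.random):
    """Decide whether to store a finished trace.

    The decision runs after the whole trace is known, which is what
    distinguishes tail sampling from head sampling.
    """
    if trace.has_error:
        return True  # always keep errors
    if trace.duration_ms > latency_threshold_ms:
        return True  # always keep slow requests
    return rng() < baseline_rate  # keep a small random share of the rest


def expected_keep_fraction(error_share, slow_share, baseline_rate):
    """Expected fraction of traces stored, assuming error and slow traces do not overlap."""
    normal_share = 1.0 - error_share - slow_share
    return error_share + slow_share + normal_share * baseline_rate


# Hypothetical traffic mix: 1% errors, 2% slow requests, 5% baseline sampling.
kept = expected_keep_fraction(0.01, 0.02, 0.05)
print(f"stored: {kept:.2%}, dropped: {1 - kept:.2%}")
```

With that assumed mix, under 8% of traces are stored, which is in line with the 80%+ saving quoted above. The real figure depends entirely on how many of your traces are errors or slow.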
Sizing Guide
| Cluster Size | Active Series | Log Volume | Mimir | Loki | Storage/Month |
|---|---|---|---|---|---|
| Small (under 50 nodes) | under 500K | under 100 GB/day | 3 replicas | 3 replicas | 2 TB |
| Medium (50-200 nodes) | 500K-2M | 100-500 GB/day | 6 replicas | 6 replicas | 10 TB |
| Large (200-1000 nodes) | 2M-10M | 500 GB-2 TB/day | 12+ replicas | 12+ replicas | 50 TB |
| XL (1000+ nodes) | 10M+ | 2+ TB/day | 24+ replicas | 24+ replicas | 200+ TB |
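For capacity-planning scripts, the table can be encoded as a small lookup. This is a sketch under stated assumptions: the names `SIZING_TIERS` and `suggest_tier` are hypothetical, the overlapping boundaries (200 and 1000 nodes) are resolved by placing them in the larger tier, and the "12+" and "24+" replica counts are treated as minimums.

```python
# (exclusive upper bound on nodes, tier, Mimir replicas, Loki replicas, storage TB/month)
SIZING_TIERS = [
    (50, "Small", 3, 3, 2),
    (200, "Medium", 6, 6, 10),
    (1000, "Large", 12, 12, 50),
    (float("inf"), "XL", 24, 24, 200),
]


def suggest_tier(node_count):
    """Map a cluster's node count to the starting point from the sizing table."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    for upper, tier, mimir, loki, storage_tb in SIZING_TIERS:
        if node_count < upper:
            return {
                "tier": tier,
                "mimir_replicas": mimir,       # minimum for Large and XL
                "loki_replicas": loki,         # minimum for Large and XL
                "storage_tb_per_month": storage_tb,
            }


print(suggest_tier(120))
```

Node count is only a proxy. If active series or log volume put you in a higher row than node count does, size for the higher row.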
AI-Powered Alerting
Traditional threshold alerts generate noise. AI-powered anomaly detection reduces false positives:
- Grafana Machine Learning: anomaly detection and forecasting for Prometheus metrics (available in Grafana Cloud)
- Prophet integration: time-series forecasting for capacity planning
- Adaptive thresholds: baselines adjust for day-of-week and time-of-day patterns
- Alert correlation: group related alerts into incidents automatically
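To illustrate the adaptive-threshold idea, here is a minimal sketch that learns a separate baseline for each (weekday, hour) slot and flags values that stray too far from it. It is a generic statistical approach, not how Grafana's tooling works internally; the function names and the choice of three standard deviations are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def build_baseline(samples):
    """Group (datetime, value) samples by (weekday, hour) and summarize each slot."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_anomalous(baseline, ts, value, k=3.0, min_stdev=1e-9):
    """True if value is more than k standard deviations from its slot's mean."""
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False  # no history for this slot, so stay quiet
    mu, sigma = baseline[slot]
    return abs(value - mu) > k * max(sigma, min_stdev)


# Four weeks of history for the same weekday and hour (values are made up).
start = datetime(2024, 1, 1, 9, 30)
history = [(start + timedelta(weeks=i), v) for i, v in enumerate([100, 102, 98, 101])]
baseline = build_baseline(history)

next_week = start + timedelta(weeks=4)
print(is_anomalous(baseline, next_week, 150))  # True
print(is_anomalous(baseline, next_week, 101))  # False
```

A fixed threshold would need to be set high enough to survive the busiest hour of the week. Per-slot baselines let quiet periods alert on much smaller deviations.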
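    # Prometheus-style alerting rule (the format the Mimir ruler loads)
    groups:
      - name: ai-anomaly-detection
        rules:
          - alert: AnomalousLatencySpike
            # Fires when current p95 latency exceeds 1.5x the value that a linear
            # fit over the last hour predicts for 10 minutes (600 s) from now.
            # Assumes a recording rule named http_request_duration_seconds:p95.
            expr: |
              (
                http_request_duration_seconds:p95
                >
                predict_linear(http_request_duration_seconds:p95[1h], 600)
                * 1.5
              )
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Latency anomaly detected on {{ $labels.service }}"

SLO-Based Alerting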
Move from "is this metric above a threshold?" to "are we meeting our SLOs?"
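    # Error budget-based alerting for a 99.9% availability SLO
    - alert: ErrorBudgetBurnRate
      # Any response that is not 2xx counts against the budget.
      expr: |
        (
          1 - (
            sum(rate(http_requests_total{code=~"2.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )
        ) > (1 - 0.999) * 14.4
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Error budget burning 14.4x faster than sustainable"

This fires when the error rate over the past hour is more than 14.4 times the rate a 99.9% SLO can sustain. At that pace, one hour consumes 2% of a 30-day error budget and the whole budget is gone in about 50 hours. That is a much more actionable signal than "error rate above 1%."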
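The 14.4 multiplier is easier to reason about with the arithmetic written out. The sketch below assumes a 30-day SLO window; the function names are hypothetical.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget


def hours_to_exhaustion(rate, window_days=30):
    """Hours until the whole budget is gone if this burn rate continues."""
    return window_days * 24 / rate


def budget_consumed(rate, hours, window_days=30):
    """Fraction of the window's budget spent after burning at `rate` for `hours`."""
    return rate * hours / (window_days * 24)


rate = burn_rate(error_rate=0.0144, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")                               # burn rate: 14.4x
print(f"budget gone in: {hours_to_exhaustion(14.4):.1f} h")    # budget gone in: 50.0 h
print(f"spent in one hour: {budget_consumed(14.4, 1):.1%}")    # spent in one hour: 2.0%
```

In other words, the alert threshold `(1 - 0.999) * 14.4` corresponds to a 1.44% error rate, and sustaining it for a single hour spends 2% of the month's budget.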
About the Author
I am Luca Berton, AI and Cloud Advisor. I build observability platforms for enterprises running Kubernetes at scale.