You cannot fix what you cannot see. This tutorial walks you through building production-grade Grafana dashboards from scratch.
Prerequisites
- Grafana 11+ running (Helm: helm install grafana grafana/grafana)
- Prometheus as a data source
- Kubernetes cluster exposing metrics (the queries below use cAdvisor, kube-state-metrics, and node_exporter metrics)
Creating Your First Dashboard
Step 1: Add Prometheus Data Source
Navigate to Connections > Data Sources > Add data source > Prometheus. Set the URL to http://prometheus-server:9090 and click Save & Test.
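If you would rather script this step than click through the UI, the same data source can be registered through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request and does not send it. The Grafana address http://localhost:3000 and the token value are placeholders, and build_datasource_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_datasource_request(grafana_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the API call that registers Prometheus."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://prometheus-server:9090",  # URL from the step above
        "access": "proxy",  # Grafana's backend queries Prometheus, not the browser
        "isDefault": True,
    }
    return urllib.request.Request(
        url=f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder address and token: substitute your own before sending.
request = build_datasource_request("http://localhost:3000", "REPLACE_WITH_TOKEN")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Keeping the request construction separate from the send makes it easy to inspect the payload before touching a live Grafana instance.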
Step 2: Create a Dashboard
Click Dashboards > New Dashboard > Add visualization. Select Prometheus as the data source.
Step 3: Build Key Panels
CPU Usage Panel:
# PromQL query
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
Set visualization to Time series, title to "CPU Usage by Pod".
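To make the rate(...[5m]) part of the CPU query concrete, here is a minimal Python sketch of what a per-second rate over a counter means. It is a simplification: simple_rate is a hypothetical helper, the sample values are made up, and Prometheus additionally extrapolates to the window boundaries, which this sketch does not do.

```python
def simple_rate(samples):
    """Per-second increase of a counter across (timestamp, value) samples.

    Simplified on purpose: no extrapolation to the window edges.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, so it was reset (for example, a container
            # restart). Count from zero up to the new value.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Made-up CPU-seconds counter scraped every 60 s over a 5-minute window.
cpu_seconds = [(0, 100.0), (60, 130.0), (120, 160.0),
               (180, 190.0), (240, 220.0), (300, 250.0)]
print(simple_rate(cpu_seconds))  # 150 CPU-seconds over 300 s -> 0.5
```

A result of 0.5 means the container used half a CPU core on average over the window, which is why the panel reads naturally in cores.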
Memory Usage Panel:
sum(container_memory_usage_bytes{namespace="production"}) by (pod) / 1024 / 1024
HTTP Request Rate:
sum(rate(http_requests_total[5m])) by (status)
Variables (Template Variables)
Variables make dashboards reusable across namespaces, clusters, and services.
# Add variable: namespace
Type: Query
Data source: Prometheus
Query: label_values(kube_pod_info, namespace)
Then use $namespace in your queries:
sum(rate(container_cpu_usage_seconds_total{namespace="$namespace"}[5m])) by (pod)
Annotations
Mark deployments, incidents, and maintenance windows on your graphs:
# Deployment annotation query
changes(kube_deployment_status_observed_generation{namespace="$namespace"}[5m]) > 0
Alerting
Create alerts directly from panels:
- Edit panel > Alert tab > Create alert rule
- Set condition: WHEN avg() OF query(A) IS ABOVE 0.8
- Set evaluation interval: 1m
- Configure a contact point (Slack, PagerDuty, email)
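As a rough mental model of the condition above, the following Python sketch reduces the values returned by the query to their average and compares it with the 0.8 threshold. The helper name condition_met and the sample values are hypothetical. Grafana evaluates the real rule on the server at each evaluation interval.

```python
from statistics import fmean


def condition_met(samples, threshold=0.8):
    """True when the average of the query's values is above the threshold."""
    return bool(samples) and fmean(samples) > threshold


# Made-up utilisation values returned by query A in two evaluations.
sustained_load = [0.70, 0.85, 0.95]
single_spike = [0.50, 0.95, 0.60]

print(condition_met(sustained_load))  # average is above 0.8
print(condition_met(single_spike))    # one spike does not lift the average
```

Because the rule uses avg(), a single spike above 0.8 does not fire the alert on its own. Swap the reducer for max() if any breach should alert.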
Dashboard Design Patterns
The RED Dashboard (Services)
| Row | Panels |
|---|---|
| Request Rate | sum(rate(http_requests_total[5m])) |
| Error Rate | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) |
| Duration P99 | histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) |
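The Duration P99 row relies on histogram_quantile, which estimates a quantile from cumulative bucket counts. The Python sketch below illustrates the linear-interpolation idea behind it. It is a simplified model rather than Prometheus's implementation: histogram_quantile_sketch is a hypothetical name, the bucket values are invented, and bounds are assumed to be positive.

```python
import math


def histogram_quantile_sketch(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs with positive bounds,
    including at least one finite bucket and the (math.inf, total) bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Index of the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound, so report that bound.
        return buckets[-2][0]
    lower, below = buckets[index - 1] if index > 0 else (0.0, 0.0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Invented request-duration buckets: (le in seconds, cumulative count).
latency_buckets = [(0.1, 100), (0.5, 180), (1.0, 198), (math.inf, 200)]
p95 = histogram_quantile_sketch(0.95, latency_buckets)
print(round(p95, 3))
```

Here the 95th percentile lands in the 0.5 to 1.0 second bucket and is interpolated to about 0.78 seconds. The estimate can never be more precise than the bucket boundaries, so choose boundaries close to the latencies you care about.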
The USE Dashboard (Infrastructure)
| Row | Panels |
|---|---|
| CPU Utilization | 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) |
| Memory Used (bytes) | node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes |
| Disk I/O Utilization | rate(node_disk_io_time_seconds_total[5m]) |
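The arithmetic behind the CPU and memory rows can be checked by hand. The Python sketch below uses made-up numbers and hypothetical helper names to show why the CPU query subtracts the average idle fraction from 1, and what the memory subtraction yields.

```python
from statistics import fmean


def cpu_utilization(idle_fractions):
    """1 - avg(idle): share of CPU time spent on anything other than idling."""
    return 1.0 - fmean(idle_fractions)


def memory_used_bytes(total_bytes, available_bytes):
    """MemTotal - MemAvailable: memory that cannot be handed to new workloads."""
    return total_bytes - available_bytes


# Made-up per-core idle fractions, as rate(...{mode="idle"}[5m]) would return.
idle = [0.75, 0.5, 1.0, 0.25]
print(cpu_utilization(idle))  # 1 - 0.625 = 0.375

GIB = 1024 ** 3
print(memory_used_bytes(16 * GIB, 10 * GIB) / GIB)  # 6.0 GiB in use
```

Set the CPU panel's unit to a 0 to 1 percentage and the memory panel's unit to bytes so Grafana scales the axis for you instead of dividing inside the query.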
Dashboard as Code (Provisioning)
# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: Production
    type: file
    options:
      path: /var/lib/grafana/dashboards
Export dashboards as JSON and store in Git for version control.
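One way to keep exported dashboards diff-friendly in Git is to write them with stable key ordering and indentation. The Python sketch below does that for a pared-down dashboard. The uid, the title, and the save_dashboard helper are hypothetical, and the example writes to a temporary directory where a real setup would use the provider's path. In practice you would start from JSON exported from the Grafana UI, not a hand-written dict.

```python
import json
import tempfile
from pathlib import Path


def save_dashboard(dashboard: dict, directory: Path) -> Path:
    """Write a dashboard as stable, diff-friendly JSON named after its uid."""
    target = directory / f"{dashboard['uid']}.json"
    # Sorted keys and fixed indentation keep Git diffs small and readable.
    text = json.dumps(dashboard, indent=2, sort_keys=True) + "\n"
    target.write_text(text, encoding="utf-8")
    return target


# Pared-down stand-in for an exported dashboard (hypothetical uid and title).
dashboard = {
    "uid": "production-overview",
    "title": "Production Overview",
    "panels": [
        {
            "type": "timeseries",
            "title": "CPU Usage by Pod",
            "targets": [
                {
                    "refId": "A",
                    "expr": 'sum(rate(container_cpu_usage_seconds_total'
                            '{namespace="production"}[5m])) by (pod)',
                }
            ],
        }
    ],
}

# A temporary directory stands in for /var/lib/grafana/dashboards.
with tempfile.TemporaryDirectory() as tmp:
    path = save_dashboard(dashboard, Path(tmp))
    restored = json.loads(path.read_text(encoding="utf-8"))
    print(path.name)
```

Naming the file after the uid keeps one file per dashboard, so a rename of the title shows up as a one-line change, not as a deleted and re-added file.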
Tips and Tricks
- Use variables for every label you might filter on (namespace, service, cluster)
- Set meaningful Y-axis units (bytes, percent, requests/sec)
- Use the Repeat feature to auto-generate panels per variable value
- Use Library panels to share common panels across dashboards
- Export dashboards to JSON and commit them to Git for version control
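To see how variables and the Repeat feature work together, the Python sketch below mimics the idea: one panel definition containing $namespace is expanded into one panel per selected value. This is only a mental model. Grafana performs the substitution itself, and repeat_panels and the namespace values are hypothetical.

```python
from string import Template

# The query from the variables section, with $namespace left as a placeholder.
CPU_QUERY = Template(
    'sum(rate(container_cpu_usage_seconds_total{namespace="$namespace"}[5m])) by (pod)'
)


def repeat_panels(title, query, namespaces):
    """Expand one panel template into one panel per namespace value."""
    return [
        {"title": f"{title} ({ns})", "expr": query.substitute(namespace=ns)}
        for ns in namespaces
    ]


panels = repeat_panels("CPU Usage by Pod", CPU_QUERY, ["production", "staging"])
for panel in panels:
    print(panel["title"], "->", panel["expr"])
```

The payoff is that adding a namespace requires no dashboard edits: the variable query picks it up and the repeated panel appears on its own.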