A quick reference for PromQL, the Prometheus query language. Bookmark this page.
Basic Queries
# Instant vector (current value)
up
node_cpu_seconds_total
container_memory_usage_bytes
# Label filtering
up{job="kubernetes-nodes"}
node_cpu_seconds_total{mode="idle", instance="node1:9100"}
http_requests_total{method=~"GET|POST"} # Regex match
http_requests_total{status!="200"} # Not equal
http_requests_total{path=~"/api/.*"} # Regex
Range Vectors and Rates
# Range vector (values over time)
http_requests_total[5m] # Last 5 minutes of samples
# Rate (per-second increase of counter)
rate(http_requests_total[5m])
# Increase (total increase over period)
increase(http_requests_total[1h])
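To make the arithmetic behind rate() and increase() concrete, here is a minimal Python sketch under simplifying assumptions: the samples are (timestamp, value) pairs already selected by the range, and the elapsed time is measured between the first and last sample. Real Prometheus also extrapolates to the window boundaries, so its results differ slightly. The function names and sample data are illustrative only, not part of any Prometheus library.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_increase(samples: List[Sample]) -> float:
    """Total counter increase across the samples, correcting for resets.

    Simplified: no extrapolation to the window boundaries.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr < prev:
            # Counter went backwards: treat it as a reset to zero,
            # so the whole current value counts as new increase.
            total += curr
        else:
            total += curr - prev
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return simple_increase(samples) / elapsed


# Hypothetical scrape data: one sample every 15 s, with a restart after t=30.
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
print(simple_increase(samples))  # 100.0
print(simple_rate(samples))      # increase divided by 60 seconds
```

The reset at t=45 is why raw counter values are misleading on their own: the value drops from 160 to 10, but the increase keeps accumulating.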
# irate (instant rate, based on the last two samples)
irate(http_requests_total[5m])
Aggregation
# Sum across all instances
sum(rate(http_requests_total[5m]))
# Sum by label
sum by (method) (rate(http_requests_total[5m]))
sum by (namespace, pod) (container_memory_usage_bytes)
# Without (exclude label)
sum without (instance) (rate(http_requests_total[5m]))
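The difference between by and without can be sketched in Python as two ways of building the grouping key. This is an illustration under simple assumptions (each series is a label dictionary plus one value, and the helper names are made up for this example), not how Prometheus implements aggregation internally.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Labels = Dict[str, str]
Series = Tuple[Labels, float]  # (label set, current value)
GroupKey = FrozenSet[Tuple[str, str]]


def sum_by(series: List[Series], keep: List[str]) -> Dict[GroupKey, float]:
    """Group on the listed labels only; every other label is dropped."""
    groups: Dict[GroupKey, float] = defaultdict(float)
    for labels, value in series:
        key = frozenset((k, labels[k]) for k in keep if k in labels)
        groups[key] += value
    return dict(groups)


def sum_without(series: List[Series], drop: List[str]) -> Dict[GroupKey, float]:
    """Group on all labels except the listed ones."""
    groups: Dict[GroupKey, float] = defaultdict(float)
    for labels, value in series:
        key = frozenset((k, v) for k, v in labels.items() if k not in drop)
        groups[key] += value
    return dict(groups)


# Hypothetical per-instance request rates.
series = [
    ({"method": "GET", "instance": "a"}, 2.0),
    ({"method": "GET", "instance": "b"}, 3.0),
    ({"method": "POST", "instance": "a"}, 1.0),
]
print(sum_by(series, ["method"]))
print(sum_without(series, ["instance"]))
```

With only two labels present, both calls produce the same groups. They diverge as soon as a series carries extra labels: by discards them, without keeps them.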
# Other aggregations
avg(node_load1)
min(node_filesystem_avail_bytes)
max(container_cpu_usage_seconds_total)
count(up == 1)
topk(5, rate(http_requests_total[5m]))
bottomk(3, node_filesystem_avail_bytes)
quantile(0.95, http_request_duration_seconds)
Common Patterns
# CPU usage per pod, as a percentage of one core (100 = one full core)
sum by (pod) (rate(container_cpu_usage_seconds_total[5m])) * 100
# Memory usage percentage per node
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage percentage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
# HTTP error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# Request latency P99
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Request latency P95 by endpoint
histogram_quantile(0.95, sum by (le, path) (rate(http_request_duration_seconds_bucket[5m])))
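histogram_quantile estimates a percentile from cumulative le buckets by assuming observations are spread evenly inside each bucket. The following Python sketch shows that interpolation in a simplified form. It assumes 0 < q <= 1, at least two buckets, a final +Inf bucket, and a non-zero total; the function name and bucket data are hypothetical.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count or rate)


def simple_histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets by linear interpolation."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    below = buckets[idx - 1][1] if idx > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical latency buckets in seconds.
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 99.0), (math.inf, 100.0)]
print(simple_histogram_quantile(0.95, buckets))   # between 0.5 and 1.0
print(simple_histogram_quantile(0.995, buckets))  # 1.0
```

The second call shows a practical limit: once the requested quantile lands in the +Inf bucket, the result is capped at the largest finite bucket bound, so bucket boundaries should cover the latencies you care about.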
# Pod restart count
sum by (pod, namespace) (kube_pod_container_status_restarts_total)
Operators
# Arithmetic
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
container_cpu_usage_seconds_total / 1e9
# Comparison (filter)
up == 0 # Only down targets
node_load1 > 4 # Overloaded nodes
http_requests_total > 1000
# Logical
up == 1 and on(instance) node_load1 > 2
# Group operations
sum by (namespace) (kube_pod_info) > 50 # Namespaces with more than 50 pods
Functions
# Math
abs(delta(temperature[1h]))
ceil(value)
floor(value)
round(value, 0.5)
# Time
time() # Current Unix timestamp
timestamp(up) # Timestamp of last sample
day_of_week() # 0=Sunday
hour() # 0-23
# Label manipulation
label_replace(up, "short_instance", "$1", "instance", "(.*):.*")
label_join(up, "full", "-", "job", "instance")
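The label_replace call above can be read as "match the whole source label against the regex, then build the new label from the capture groups". Here is a rough Python equivalent. It is a simplification: it handles only $1, $2, ... references and uses Python's re module instead of the RE2 engine Prometheus uses. The function name is invented for this example.

```python
import re
from typing import Dict


def simple_label_replace(labels: Dict[str, str], dst: str, replacement: str,
                         src: str, regex: str) -> Dict[str, str]:
    """Return a copy of labels with dst set from a full regex match on src."""
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        # No match: the series is returned unchanged.
        return dict(labels)
    # Substitute $N with the text of capture group N.
    value = re.sub(r"\$(\d+)",
                   lambda m: match.group(int(m.group(1))) or "",
                   replacement)
    result = dict(labels)
    result[dst] = value
    return result


labels = {"job": "node", "instance": "node1:9100"}
print(simple_label_replace(labels, "short_instance", "$1", "instance", "(.*):.*"))
# short_instance is "node1"
```

Because the regex must match the entire label value, an instance without a port (no colon) does not match and the series passes through without a short_instance label.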
# Absent (alerting on missing metrics)
absent(up{job="critical-service"})
absent_over_time(up{job="api"}[5m])
# Changes and resets
changes(process_start_time_seconds[1h])
resets(http_requests_total[1h])
# Predictions
predict_linear(node_filesystem_avail_bytes[6h], 24*3600) # Predict 24h ahead
Alerting Patterns
# Service down
up == 0
# High error rate (over 5% of requests, averaged over the last 5 minutes)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
# Disk will fill in 24 hours
predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0
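The disk alert above relies on predict_linear fitting a straight line through the samples and extending it into the future. A minimal Python sketch of that idea follows, using ordinary least squares. One stated simplification: it extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the query evaluation time. The function name and data are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, gauge value)


def simple_predict_linear(samples: List[Sample], seconds_ahead: float) -> float:
    """Fit a least-squares line and extrapolate past the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = cov / var
    intercept = mean_v - slope * mean_t
    target_t = samples[-1][0] + seconds_ahead
    return intercept + slope * target_t


# Hypothetical disk: 10 GB free, losing 1 GB per hour, sampled hourly for 6 h.
samples = [(h * 3600.0, 10e9 - h * 1e9) for h in range(7)]
predicted = simple_predict_linear(samples, 24 * 3600)
print(predicted < 0)  # True: the alert condition would hold
```

A negative prediction means the fitted line crosses zero free bytes within the next 24 hours, which is exactly what the `< 0` comparison in the alert checks.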
# Pod CrashLooping
increase(kube_pod_container_status_restarts_total[1h]) > 5
# High memory usage
container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
Tips and Tricks
- Use rate() for counters; never use raw counter values
- Use a rate window of at least 5m for stable graphs
- Use sum by (label) to aggregate meaningfully
- Use histogram_quantile for latency percentiles
- Test queries in Grafana Explore before building dashboards