Running a GPU cluster without monitoring is flying blind. You need to know which GPUs are idle, which jobs are memory-bound, and whether your expensive H100s are actually being utilized.
I have set up monitoring stacks for Slurm clusters that surface the metrics teams actually need. Here is the practical approach.
The Monitoring Stack
NVIDIA DCGM → dcgm-exporter → Prometheus → Grafana
Slurm → slurm-exporter → Prometheus → Grafana
Node → node-exporter → Prometheus → Grafana
Three exporters feed Prometheus. Grafana provides the dashboards.
NVIDIA DCGM Exporter
DCGM (Data Center GPU Manager) exposes detailed GPU metrics. The exporter runs on each compute node:
# Install DCGM
dnf install -y datacenter-gpu-manager
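# Run the exporter in a container
# (NVIDIA's example adds SYS_ADMIN so the DCGM_FI_PROF_* profiling metrics can be collected)
docker run -d --gpus all --rm \
  --cap-add SYS_ADMIN \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
# Or run a natively installed binary
dcgm-exporter --address :9400
Key metrics exposed: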
DCGM_FI_DEV_GPU_UTIL # GPU utilization (%)
DCGM_FI_DEV_FB_USED # GPU memory used (MiB)
DCGM_FI_DEV_FB_FREE # GPU memory free (MiB)
DCGM_FI_DEV_GPU_TEMP # GPU temperature (C)
DCGM_FI_DEV_POWER_USAGE # Power consumption (W)
DCGM_FI_DEV_PCIE_TX_THROUGHPUT # PCIe TX bandwidth
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL # NVLink bandwidth
DCGM_FI_PROF_SM_ACTIVE # Streaming multiprocessor activity (ratio, 0-1)
DCGM_FI_PROF_SM_OCCUPANCY # SM occupancy (ratio, 0-1)
Custom DCGM Metrics
Create a custom CSV to select specific counters, and pass its path to the exporter with -f (--collectors):
# /etc/dcgm-exporter/custom-counters.csv
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization
DCGM_FI_DEV_FB_USED, gauge, GPU memory used (MiB)
DCGM_FI_DEV_POWER_USAGE, gauge, Power usage (W)
DCGM_FI_PROF_SM_ACTIVE, gauge, SM active ratio
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Tensor core active ratio
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, gauge, NVLink bandwidth
The Tensor Core active ratio is particularly useful: it tells you whether your training job is actually using the hardware efficiently or wasting cycles on data loading.
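To make the Tensor Core check concrete, here is a minimal Python sketch that reads exporter output in the Prometheus text format and lists GPUs whose tensor-active ratio is low. The sample text, the gpu label, and the helper names parse_samples and low_tensor_gpus are assumptions for illustration. Real exporter output carries more labels, and a production parser should use a proper client library rather than these regular expressions.

```python
import re

# Hypothetical excerpt of dcgm-exporter /metrics output (values are made up).
SAMPLE = """\
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Tensor core active ratio
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",Hostname="gpu-node01"} 0.62
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="1",Hostname="gpu-node01"} 0.08
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="gpu-node01"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="gpu-node01"} 95
"""

# metric_name{label="value",...} number
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (metric_name, labels_dict, value) for each labelled sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # label-less or malformed lines are ignored in this sketch
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def low_tensor_gpus(text, threshold=0.30):
    """List (Hostname, gpu, ratio) for GPUs whose tensor-active ratio is below threshold."""
    return [
        (labels.get("Hostname", "?"), labels.get("gpu", "?"), value)
        for name, labels, value in parse_samples(text)
        if name == "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE" and value < threshold
    ]


if __name__ == "__main__":
    for host, gpu, ratio in low_tensor_gpus(SAMPLE):
        print(f"{host} GPU {gpu}: tensor active {ratio:.0%}")
```

In the sample, GPU 1 reports 95% utilization but only 8% tensor activity, which is exactly the case that a plain utilization graph hides.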
Slurm Exporter
The prometheus-slurm-exporter exposes scheduler metrics:
# Install
go install github.com/vpenso/prometheus-slurm-exporter@latest
# Run on the Slurm controller
prometheus-slurm-exporter --listen-address=:9341
Metrics include:
slurm_queue_pending # Pending jobs by partition
slurm_queue_running # Running jobs by partition
slurm_nodes_alloc # Allocated nodes
slurm_nodes_idle # Idle nodes
slurm_cpus_total # Total CPUs
slurm_gpus_total # Total GPUs
slurm_gpus_alloc # Allocated GPUs
slurm_scheduler_backfill_cycle_seconds # Backfill scheduler performance
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: 'dcgm'
    static_configs:
      - targets:
          - 'gpu-node01:9400'
          - 'gpu-node02:9400'
          - 'gpu-node03:9400'
    # Or use service discovery
    file_sd_configs:
      - files: ['/etc/prometheus/gpu-nodes.json']
  - job_name: 'slurm'
    static_configs:
      - targets: ['slurmctl01:9341']
  - job_name: 'node'
    static_configs:
      - targets:
          - 'gpu-node01:9100'
          - 'gpu-node02:9100'
For large clusters, use file-based service discovery so you can update targets without restarting Prometheus.
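As a rough illustration of file-based service discovery, the Python sketch below builds a target file in the JSON shape Prometheus expects for file_sd_configs (a list of groups, each with targets and optional labels) and writes it atomically. The host naming pattern, the cluster label, and the helper names build_file_sd and write_atomically are assumptions for illustration; in practice the host list would come from your inventory or from Slurm.

```python
import json
import os
import tempfile


def build_file_sd(hosts, port, labels=None):
    """Build a single file_sd target group for the given hosts and exporter port."""
    group = {"targets": [f"{host}:{port}" for host in hosts]}
    if labels:
        group["labels"] = dict(labels)
    return [group]


def write_atomically(path, payload):
    """Write JSON to a temp file in the target directory, then rename it into place.

    The rename means Prometheus never sees a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle, indent=2)
        handle.write("\n")
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; make it readable by the Prometheus user
    os.replace(tmp_path, path)


if __name__ == "__main__":
    # Hypothetical node names following the gpu-nodeNN pattern used above.
    hosts = [f"gpu-node{i:02d}" for i in range(1, 4)]
    groups = build_file_sd(hosts, 9400, labels={"cluster": "example"})
    print(json.dumps(groups, indent=2))
    # write_atomically("/etc/prometheus/gpu-nodes.json", groups)
```

Running this from a cron job or a Slurm node-state hook keeps the target list current as nodes are added or drained.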
Essential Grafana Dashboards
Cluster Overview Dashboard
# GPU Utilization Heatmap
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
# Total GPU Memory Usage
sum(DCGM_FI_DEV_FB_USED) / sum(DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100
# Cluster GPU Allocation Rate
slurm_gpus_alloc / slurm_gpus_total * 100
# Job Queue Depth
slurm_queue_pending
Per-Job GPU Efficiency
# Average GPU utilization per node
avg by (instance) (DCGM_FI_DEV_GPU_UTIL{job="dcgm"})
# Tensor Core utilization (are you using the expensive hardware?)
avg(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE) * 100
# GPU memory waste (allocated but unused)
1 - (avg(DCGM_FI_DEV_FB_USED) / avg(DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))
Power and Thermal
# Total cluster power draw
sum(DCGM_FI_DEV_POWER_USAGE)
# GPU temperature distribution (95th percentile across GPUs)
quantile(0.95, DCGM_FI_DEV_GPU_TEMP)
# Power efficiency proxy (SM activity per watt; not a true FLOPS-per-watt figure)
avg(DCGM_FI_PROF_SM_ACTIVE) / avg(DCGM_FI_DEV_POWER_USAGE)
Alerting Rules
# prometheus-rules.yml
groups:
  - name: gpu-cluster
    rules:
      - alert: GPUUtilizationLow
        expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) < 20
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "GPU utilization below 20% on {{ $labels.Hostname }}"
      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature {{ $value }}C on {{ $labels.Hostname }}"
      - alert: QueueBacklog
        expr: slurm_queue_pending > 100
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} pending jobs in Slurm queue"
Integration with OpenTelemetry
For organizations using OpenTelemetry, you can bridge DCGM metrics into your existing observability stack:
# otel-collector config
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: dcgm
          static_configs:
            - targets: ['localhost:9400']
exporters:
  otlp:
    endpoint: otel-collector.monitoring:4317
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
Automating Deployment
Deploy the monitoring stack across all nodes with Ansible:
- name: Deploy DCGM exporter
  hosts: gpu_nodes
  tasks:
    - name: Install DCGM
      dnf:
        name: datacenter-gpu-manager
        state: present
    - name: Start DCGM exporter
      systemd:
        name: dcgm-exporter
        state: started
        enabled: yes
Key Metrics to Watch
From experience, these are the numbers that matter most:
- GPU utilization below 50%: your jobs are likely data-loading bound
- Tensor Core activity below 30%: you are paying for Tensor Cores but not using them
- Queue wait time above 4 hours: you need more capacity or better scheduling policies
- GPU allocation above 90%: you will hit contention soon, plan capacity
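The four thresholds above can be turned into a simple report. The Python sketch below is one way to do that, assuming you already have cluster-wide averages from Prometheus. The ClusterSnapshot fields and the findings function are hypothetical names, and queue wait time is assumed to come from somewhere other than the exporters shown in this post.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Point-in-time cluster averages; field names are hypothetical."""
    gpu_util_pct: float       # e.g. avg(DCGM_FI_DEV_GPU_UTIL)
    tensor_active_pct: float  # e.g. avg(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE) * 100
    queue_wait_hours: float   # assumed to be measured outside these exporters
    gpu_alloc_pct: float      # e.g. slurm_gpus_alloc / slurm_gpus_total * 100


def findings(snapshot):
    """Return one message for each threshold from the list that is crossed."""
    messages = []
    if snapshot.gpu_util_pct < 50:
        messages.append("GPU utilization below 50%: jobs are likely data-loading bound")
    if snapshot.tensor_active_pct < 30:
        messages.append("Tensor Core activity below 30%: Tensor Cores are paid for but mostly idle")
    if snapshot.queue_wait_hours > 4:
        messages.append("Queue wait above 4 hours: add capacity or revisit scheduling policies")
    if snapshot.gpu_alloc_pct > 90:
        messages.append("GPU allocation above 90%: contention is close, plan capacity")
    return messages


if __name__ == "__main__":
    # Made-up numbers purely to exercise the checks.
    snapshot = ClusterSnapshot(
        gpu_util_pct=42.0,
        tensor_active_pct=35.0,
        queue_wait_hours=5.5,
        gpu_alloc_pct=88.0,
    )
    for message in findings(snapshot):
        print(message)
```

The thresholds are starting points from experience rather than hard limits, so treat them as defaults to tune per cluster.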
For help designing your monitoring infrastructure, visit my services page or check the GPU Cost Calculator to plan capacity.