Monitoring and Observability for RHEL AI Workloads
Deploying an AI model to production is only half the battle. The real challenge begins when your model starts making decisions at scale. Models drift, performance degrades silently, and users experience subtle degradation before metrics turn red.
This article covers building comprehensive monitoring for RHEL AI models using Prometheus and Grafana, ensuring you catch problems before they cost you.
The AI Monitoring Challenge
Unlike traditional applications, AI systems face unique observability challenges:
- Model Drift: Model accuracy degrades as real-world data diverges from training data
- Silent Failures: The model returns plausible-but-wrong answers without raising any error
- Resource Utilization: GPU memory, latency, and throughput vary with model load
- Fairness Issues: Systematic bias in predictions across demographic groups
- Cold Start Problems: Performance drops with new features or data distributions
Effective monitoring catches these issues at their source.
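To make the first bullet concrete, drift can be expressed as a single number: how far the distribution of recent predictions has moved from a baseline. The sketch below uses KL divergence, the same measure the drift detector in Step 5 publishes. The synthetic Beta-distributed scores, the bin count, and the add-one smoothing are illustrative assumptions, not values from a real model.

```python
import math
import random

def histogram(values, bins=10):
    """Bucket values in [0, 1] into a probability distribution (add-one smoothing)."""
    counts = [1.0] * bins  # start at 1 so no bin is empty and log() stays finite
    for v in values:
        index = min(int(v * bins), bins - 1)  # a value of exactly 1.0 goes in the last bin
        counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
baseline = [rng.betavariate(2, 5) for _ in range(5000)]  # stand-in for training-time scores
similar = [rng.betavariate(2, 5) for _ in range(5000)]   # production traffic, same distribution
shifted = [rng.betavariate(5, 2) for _ in range(5000)]   # production traffic after a shift

base_dist = histogram(baseline)
same_score = kl_divergence(histogram(similar), base_dist)
drift_score = kl_divergence(histogram(shifted), base_dist)
print(f"similar traffic: {same_score:.3f}")
print(f"shifted traffic: {drift_score:.3f}")
```

Traffic drawn from the baseline distribution scores close to zero, while the shifted traffic scores well above the 0.5 threshold used later in this article. Where to set that threshold for a real model depends on the bin count and on how noisy the recent sample is.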
Architecture Overview
Here’s a production-ready monitoring stack for RHEL AI:
flowchart TB
subgraph Models["RHEL AI Models"]
vLLM["vLLM Serve"]
Granite["Fine-tuned<br/>Granite"]
Pipeline["Inference<br/>Pipeline"]
end
subgraph Prometheus["Prometheus Server"]
GPU["GPU metrics"]
ModelMetrics["Model metrics"]
AppMetrics["App metrics"]
SysMetrics["System metrics"]
end
subgraph Grafana["Grafana Dashboards"]
Perf["Real-time performance"]
Drift["Model drift detection"]
Heatmaps["Resource heatmaps"]
Alerts["Alert summary"]
end
subgraph AlertMgr["AlertManager & Automation"]
Slack["Slack/Teams"]
Remediation["Auto-remediation"]
OnCall["On-call escalation"]
end
Models -->|Prometheus Exporters| Prometheus
Prometheus --> Grafana
Grafana --> AlertMgr
Step 1: Install Prometheus and Grafana
On RHEL 9:
# Install Prometheus
sudo dnf install -y prometheus
# Install Grafana (the package is named grafana; it provides the grafana-server service)
sudo dnf install -y grafana
# Start services
sudo systemctl start prometheus grafana-server
sudo systemctl enable prometheus grafana-server
# Verify
curl 'http://localhost:9090/api/v1/query?query=up'
Step 2: Configure Prometheus for RHEL AI
Edit the Prometheus configuration:
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'rhel-ai-prod'
    environment: 'production'
# Load the alert rule files created in Steps 3 and 6
rule_files:
  - '/etc/prometheus/rules/*.yml'
scrape_configs:
  # GPU Metrics
  - job_name: 'nvidia-smi-metrics'
    static_configs:
      - targets: ['localhost:9400']
  # vLLM Inference Metrics
  - job_name: 'vllm-inference'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
  # Model Drift Detector
  - job_name: 'model-drift-detector'
    static_configs:
      - targets: ['localhost:8001']
    scrape_interval: 60s
  # System Metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
  # Custom Application Metrics
  - job_name: 'rhel-ai-app'
    static_configs:
      - targets: ['localhost:9200']
Reload Prometheus:
sudo systemctl reload prometheus
Step 3: Key Metrics to Monitor
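Before reading individual metrics, it helps to confirm that every scrape job in the configuration is actually reporting. The sketch below runs the same `up` query as the curl check in Step 1 through the Prometheus HTTP API and lists the targets that are down. The base URL and the sample response are assumptions for illustration; the helper names `query_prometheus` and `down_targets` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: same host and port as the curl check

def query_prometheus(expr, base_url=PROMETHEUS_URL, timeout=5):
    """Run an instant query against the Prometheus HTTP API and return the parsed JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)

def down_targets(payload):
    """Return sorted (job, instance) pairs whose `up` sample is 0."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # instant vectors carry [timestamp, "value"]
        if float(value) == 0.0:
            labels = series["metric"]
            down.append((labels.get("job", "?"), labels.get("instance", "?")))
    return sorted(down)

# Hand-written sample shaped like an instant-query response, so the parser
# can be exercised without a running Prometheus server.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "vllm-inference",
                        "instance": "localhost:8000"}, "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "model-drift-detector",
                        "instance": "localhost:8001"}, "value": [1700000000.0, "0"]},
        ],
    },
}
print(down_targets(sample))
```

Against a live server, `down_targets(query_prometheus("up"))` performs the same check. A job that is missing from the result entirely, rather than reported as 0, usually points to a configuration that was never loaded.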
GPU Metrics
Monitor GPU health and utilization:
# GPU Memory Utilization
gpu_memory_used_bytes / gpu_memory_total_bytes
# GPU Temperature (alert if > 80°C)
gpu_temperature_celsius
# GPU Compute Utilization
gpu_utilization_percent
Create a Prometheus alert:
# /etc/prometheus/rules/gpu_alerts.yml
groups:
  - name: gpu_alerts
    rules:
      - alert: GPUHighTemperature
        expr: gpu_temperature_celsius > 80
        for: 5m
        annotations:
          summary: "GPU temperature critical"
      - alert: GPUMemoryPressure
        expr: (gpu_memory_used_bytes / gpu_memory_total_bytes) > 0.9
        for: 2m
        annotations:
          summary: "GPU memory usage > 90%"
Model Inference Metrics
Track model performance in real-time:
# P95 Latency (alert if > 100ms)
histogram_quantile(0.95, rate(vllm_request_duration_seconds_bucket[5m]))
# Throughput (requests/second)
rate(vllm_requests_total[5m])
# Model Accuracy on Recent Data
rate(model_correct_predictions_total[1h]) / rate(model_total_predictions_total[1h])
# Mean Latency by Model (histogram sum divided by histogram count)
sum(rate(vllm_request_duration_seconds_sum[5m])) by (model) /
sum(rate(vllm_request_duration_seconds_count[5m])) by (model)
Model Drift Detection
Monitor whether predictions are degrading:
# MMLU Score Drift (foundation benchmark)
mmlu_benchmark_score
# Prediction Confidence Trending Down
avg(prediction_confidence) < 0.75
# False Positive Rate Trending Up (for classification models)
rate(false_positives_total[1h]) / rate(model_total_predictions_total[1h])
# Latency Drift (slower inference = potential issues); deriv() because this metric is a gauge
deriv(inference_latency_p95_seconds[1h])
Step 4: Build Grafana Dashboards
Dashboard 1: Real-Time Inference Performance
{
"dashboard": {
"title": "RHEL AI Real-Time Inference",
"panels": [
{
"title": "Requests Per Second",
"targets": [{"expr": "rate(vllm_requests_total[5m])"}],
"type": "graph"
},
{
"title": "P50/P95/P99 Latency (ms)",
"targets": [
{"expr": "histogram_quantile(0.50, rate(vllm_request_duration_seconds_bucket[5m])) * 1000", "legendFormat": "P50"},
{"expr": "histogram_quantile(0.95, rate(vllm_request_duration_seconds_bucket[5m])) * 1000", "legendFormat": "P95"},
{"expr": "histogram_quantile(0.99, rate(vllm_request_duration_seconds_bucket[5m])) * 1000", "legendFormat": "P99"}
],
"type": "graph",
"thresholds": [
{"value": 80, "color": "yellow", "label": "P95 Warning"},
{"value": 150, "color": "red", "label": "P95 Critical"}
]
},
{
"title": "GPU Utilization",
"targets": [
{"expr": "(gpu_memory_used_bytes / gpu_memory_total_bytes) * 100", "legendFormat": "Memory %"},
{"expr": "gpu_utilization_percent", "legendFormat": "Compute %"}
],
"type": "graph"
},
{
"title": "Error Rate",
"targets": [{"expr": "rate(vllm_request_errors_total[5m]) / rate(vllm_requests_total[5m])"}],
"type": "stat",
"thresholds": [0, 0.001, 0.01]
}
]
}
}
Dashboard 2: Model Drift Detection
{
"dashboard": {
"title": "Model Drift Detection",
"panels": [
{
"title": "Prediction Confidence Trend",
"targets": [
{"expr": "avg(prediction_confidence) by (model)"}
],
"type": "graph",
"alert": "If confidence drops > 5% in 24h"
},
{
"title": "Latency Trend (Detection of Model Slowdown)",
"targets": [
{"expr": "inference_latency_p95_seconds"}
],
"type": "graph"
},
{
"title": "Accuracy on Validation Set (Hourly)",
"targets": [
{"expr": "model_validation_accuracy"}
],
"type": "graph",
"thresholds": [
{"value": 0.92, "color": "red", "label": "Below SLO"}
]
},
{
"title": "Data Distribution Shift Detection",
"targets": [
{"expr": "kolmogorov_smirnov_test_p_value"}
],
"type": "stat",
"thresholds": [
{"value": 0.05, "color": "red", "label": "Significant Drift"}
]
}
]
}
}
Step 5: Set Up Automated Drift Detection
Create a Python service that monitors model drift:
# /opt/rhel-ai/drift_detector.py
import time

import numpy as np
from prometheus_client import start_http_server, Gauge
from scipy.stats import entropy

# Metrics
drift_score = Gauge('model_drift_score', 'KL divergence from training distribution')
accuracy_metric = Gauge('model_accuracy', 'Current model accuracy')
# Registered for the dashboards; nothing in this script sets it yet
latency_p95 = Gauge('inference_latency_p95_seconds', 'P95 inference latency')

def to_distribution(values, bins=50, eps=1e-10):
    """Histogram values in [0, 1] into a probability distribution"""
    counts = np.histogram(values, bins=bins, range=(0, 1))[0].astype(float)
    counts += eps  # empty baseline bins would otherwise make the KL divergence infinite
    return counts / counts.sum()

def detect_drift(recent_predictions, baseline_predictions):
    """Detect model drift using KL divergence"""
    recent_dist = to_distribution(recent_predictions)
    baseline_dist = to_distribution(baseline_predictions)
    return entropy(recent_dist, baseline_dist)

def calculate_accuracy(predictions, ground_truth):
    """Calculate accuracy on recent data"""
    return np.mean(predictions == ground_truth)

if __name__ == '__main__':
    start_http_server(8001)
    # Load baseline
    baseline_preds = np.load('/opt/rhel-ai/baseline_predictions.npy')
    while True:
        try:
            # Load recent predictions from model inference logs
            recent_preds = np.load('/tmp/recent_predictions.npy')
            ground_truth = np.load('/tmp/ground_truth.npy')
        except (OSError, ValueError) as exc:
            # Keep the last published values if the files are missing or unreadable
            print(f"Could not load recent data: {exc}")
            time.sleep(300)
            continue
        # Calculate metrics
        drift = detect_drift(recent_preds, baseline_preds)
        accuracy = calculate_accuracy(recent_preds, ground_truth)
        # Update Prometheus metrics
        drift_score.set(drift)
        accuracy_metric.set(accuracy)
        # Alert if drift > threshold
        if drift > 0.5:
            print(f"⚠️ HIGH DRIFT DETECTED: {drift:.3f}")
        time.sleep(300)  # Check every 5 minutes
Run the drift detector:
# Create systemd service
sudo tee /etc/systemd/system/rhel-ai-drift-detector.service > /dev/null <<EOF
[Unit]
Description=RHEL AI Model Drift Detector
After=network.target
[Service]
Type=simple
User=rhel-ai
ExecStart=/usr/bin/python3 /opt/rhel-ai/drift_detector.py
Restart=always
StandardOutput=journal
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl start rhel-ai-drift-detector
sudo systemctl enable rhel-ai-drift-detector
Step 6: Set Up Alerting
Define alerts for production SLOs:
# /etc/prometheus/rules/rhel-ai-slos.yml
groups:
  - name: rhel_ai_slos
    rules:
      # Inference Performance
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.95, rate(vllm_request_duration_seconds_bucket[5m])) > 0.08
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency {{ $value | humanizeDuration }}"
      # Model Accuracy
      - alert: ModelAccuracyDegraded
        expr: rate(model_correct_predictions_total[1h]) / rate(model_total_predictions_total[1h]) < 0.92
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy below SLO: {{ $value | humanizePercentage }}"
      # Drift Detection
      - alert: SignificantModelDrift
        expr: model_drift_score > 0.5
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Model drift detected: {{ $value | humanize }}"
      # System Health
      - alert: GPUOutOfMemory
        expr: (gpu_memory_used_bytes / gpu_memory_total_bytes) > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory critical on {{ $labels.instance }}"
Dashboard Views to Check Daily
- Morning Standup: Check overnight accuracy, drift scores, and error rates
- During Peak Hours: Monitor latency and throughput SLOs
- Weekly Review: Analyze trends, identify slow degradation
- Monthly Analysis: Calculate cost per inference, ROI on GPU investment
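The monthly cost-per-inference figure is simple arithmetic: total GPU spend for the period divided by the number of requests served. A minimal sketch follows; the hourly price, GPU count, and request volume are hypothetical inputs chosen only to show the calculation, and `cost_per_inference` is an illustrative helper name.

```python
def cost_per_inference(gpu_hourly_cost, gpu_count, hours, total_requests):
    """Return the GPU cost attributed to a single request over the period."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    total_gpu_cost = gpu_hourly_cost * gpu_count * hours
    return total_gpu_cost / total_requests

# Hypothetical month: 4 GPUs at 2.50 per hour, running for 720 hours,
# serving 36 million requests.
monthly_cost = cost_per_inference(
    gpu_hourly_cost=2.50,
    gpu_count=4,
    hours=720,
    total_requests=36_000_000,
)
print(f"cost per inference: {monthly_cost:.6f}")
```

The request count can be read from Prometheus with an expression such as `increase(vllm_requests_total[30d])`, provided the server retains at least 30 days of data.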
Recommended SLOs (Chapter 6)
Following Practical RHEL AI, set SLOs that map directly to measurable SLIs:
| Metric | Target | Warning | Critical |
|---|---|---|---|
| P95 Latency | ≤80ms | 100ms | 150ms |
| Model Accuracy | ≥92% | 91% | 88% |
| Uptime | 99.9% | 99.5% | 99% |
| Error Rate | <0.1% | 0.5% | 1% |
| GPU Util | 60-80% | >90% | >95% |
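The table can be turned into a small lookup that classifies a measured value. This is a sketch under stated assumptions: a value at or beyond a Warning or Critical number counts as crossing it, values between the target and the warning level are reported as `off_target`, and the GPU utilization row is left out because its target is a range. The `Slo` class, the metric keys, and the status names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slo:
    target: float
    warning: float
    critical: float
    higher_is_worse: bool  # True for latency and error rate, False for accuracy and uptime

# Thresholds copied from the table above (latency in ms, the rest in percent)
SLOS = {
    "p95_latency_ms": Slo(80, 100, 150, higher_is_worse=True),
    "model_accuracy_pct": Slo(92, 91, 88, higher_is_worse=False),
    "uptime_pct": Slo(99.9, 99.5, 99.0, higher_is_worse=False),
    "error_rate_pct": Slo(0.1, 0.5, 1.0, higher_is_worse=True),
}

def classify(metric, value):
    """Return 'ok', 'off_target', 'warning', or 'critical' for a measured value."""
    slo = SLOS[metric]
    # Flip the sign for lower-is-worse metrics so one set of comparisons covers both cases
    sign = 1 if slo.higher_is_worse else -1
    v = sign * value
    if v >= sign * slo.critical:
        return "critical"
    if v >= sign * slo.warning:
        return "warning"
    if v <= sign * slo.target:
        return "ok"
    return "off_target"

for name, measured in [("p95_latency_ms", 120), ("model_accuracy_pct", 93)]:
    print(name, measured, classify(name, measured))
```

Keeping the thresholds in one structure like this makes it easier to check that alert rules and dashboard thresholds agree with the published SLOs.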
Key metrics from the book:
- GPU thermals and cgroup pressure
- vLLM latency buckets
- MMLU drift scores for proactive alerts
Next Steps
With comprehensive monitoring in place:
- Define alerting strategies for your specific domain
- Build runbooks for responding to common alerts
- Automate remediation (scale GPU, trigger retraining, rollback)
- Track business metrics alongside technical metrics
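A runbook and a first remediation step can begin as a plain lookup from alert name to response. The sketch below assumes Alertmanager is configured with a webhook receiver, whose JSON payload carries an `alerts` list with a `status` and `labels` for each alert. The alert names come from the rules in this article; the action names and the mapping itself are hypothetical placeholders to adapt to your environment.

```python
# Hypothetical mapping from alert name to a first-response action
RUNBOOK = {
    "HighInferenceLatency": "scale_gpu",
    "GPUOutOfMemory": "scale_gpu",
    "SignificantModelDrift": "trigger_retraining",
    "ModelAccuracyDegraded": "rollback_model",
}
DEFAULT_ACTION = "page_on_call"  # anything without a runbook entry goes to a human

def plan_actions(payload):
    """Return de-duplicated (alertname, action) pairs for the firing alerts in a payload."""
    planned = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no remediation
        name = alert.get("labels", {}).get("alertname", "unknown")
        step = (name, RUNBOOK.get(name, DEFAULT_ACTION))
        if step not in planned:
            planned.append(step)
    return planned

# Hand-written payload shaped like an Alertmanager webhook notification
sample_payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "SignificantModelDrift"}},
        {"status": "resolved", "labels": {"alertname": "HighInferenceLatency"}},
        {"status": "firing", "labels": {"alertname": "GPUHighTemperature"}},
    ],
}
for alert_name, action in plan_actions(sample_payload):
    print(f"{alert_name}: {action}")
```

Returning a plan instead of executing it keeps the decision logic testable. Whether an action such as a rollback should run automatically or wait for approval is a policy choice for each team.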
Resources
- Prometheus Documentation
- Grafana Dashboard Templates
- Model Monitoring Best Practices
- vLLM Metrics Guide
The next article will explore AI governance, security, and compliance—ensuring your RHEL AI deployment meets enterprise standards.