Deploying an AI model to production is only half the battle. The real challenge begins when your model starts making decisions at scale. Models drift, performance degrades silently, and users experience subtle degradation before metrics turn red.
This article covers building comprehensive monitoring for RHEL AI models using Prometheus and Grafana, ensuring you catch problems before they cost you.
Unlike traditional applications, AI systems face unique observability challenges: models drift away from their training distribution, output quality degrades without anything crashing, and GPU resources saturate long before request errors appear. Effective monitoring catches these issues at their source.
Here’s a production-ready monitoring stack for RHEL AI:

```mermaid
flowchart TB
    subgraph Models["RHEL AI Models"]
        vLLM["vLLM Serve"]
        Granite["Fine-tuned<br/>Granite"]
        Pipeline["Inference<br/>Pipeline"]
    end
    subgraph Prometheus["Prometheus Server"]
        GPU["GPU metrics"]
        ModelMetrics["Model metrics"]
        AppMetrics["App metrics"]
        SysMetrics["System metrics"]
    end
    subgraph Grafana["Grafana Dashboards"]
        Perf["Real-time performance"]
        Drift["Model drift detection"]
        Heatmaps["Resource heatmaps"]
        Alerts["Alert summary"]
    end
    subgraph AlertMgr["AlertManager & Automation"]
        Slack["Slack/Teams"]
        Remediation["Auto-remediation"]
        OnCall["On-call escalation"]
    end
    Models -->|Prometheus Exporters| Prometheus
    Prometheus --> Grafana
    Grafana --> AlertMgr
```

On RHEL 9, install Prometheus and Grafana:
```bash
# Install Prometheus (requires EPEL or another repo that provides the package)
sudo dnf install -y prometheus

# Install Grafana (the RHEL 9 package is "grafana"; the service is grafana-server)
sudo dnf install -y grafana

# Start services
sudo systemctl start prometheus grafana-server
sudo systemctl enable prometheus grafana-server

# Verify Prometheus answers queries
curl 'http://localhost:9090/api/v1/query?query=up'
```

Edit the Prometheus configuration:
```yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'rhel-ai-prod'
    environment: 'production'

scrape_configs:
  # GPU Metrics
  - job_name: 'nvidia-smi-metrics'
    static_configs:
      - targets: ['localhost:9400']

  # vLLM Inference Metrics
  - job_name: 'vllm-inference'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'

  # Model Drift Detector
  - job_name: 'model-drift-detector'
    static_configs:
      - targets: ['localhost:8001']
    scrape_interval: 60s

  # System Metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  # Custom Application Metrics
  - job_name: 'rhel-ai-app'
    static_configs:
      - targets: ['localhost:9200']
```

Reload Prometheus:
```bash
sudo systemctl reload prometheus
```

Monitor GPU health and utilization:
```promql
# GPU Memory Utilization
gpu_memory_used_bytes / gpu_memory_total_bytes

# GPU Temperature (alert if > 80°C)
gpu_temperature_celsius

# GPU Compute Utilization
gpu_utilization_percent
```
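These series aren't built in; something must export them on port 9400, the `nvidia-smi-metrics` target configured above. In production that is typically NVIDIA's DCGM exporter. As a sketch of the mechanics, here is a minimal `prometheus_client` exporter where `read_gpu_stats()` is a placeholder (it returns fixed sample values; a real implementation would parse `nvidia-smi` or call NVML):

```python
# Minimal sketch of an exporter that publishes the GPU series used above.
# read_gpu_stats() is a placeholder; in production use NVIDIA's DCGM exporter.
import time
from prometheus_client import start_http_server, Gauge

gpu_temperature = Gauge('gpu_temperature_celsius', 'GPU temperature in Celsius')
gpu_mem_used = Gauge('gpu_memory_used_bytes', 'GPU memory in use')
gpu_mem_total = Gauge('gpu_memory_total_bytes', 'Total GPU memory')
gpu_util = Gauge('gpu_utilization_percent', 'GPU compute utilization')

def read_gpu_stats():
    """Placeholder: in reality, parse `nvidia-smi --query-gpu=...` or use NVML."""
    return {'temp_c': 61.0, 'mem_used': 30 * 2**30,
            'mem_total': 80 * 2**30, 'util_pct': 72.0}

def update_metrics():
    stats = read_gpu_stats()
    gpu_temperature.set(stats['temp_c'])
    gpu_mem_used.set(stats['mem_used'])
    gpu_mem_total.set(stats['mem_total'])
    gpu_util.set(stats['util_pct'])

def main():
    start_http_server(9400)  # matches the 'nvidia-smi-metrics' scrape target
    while True:
        update_metrics()
        time.sleep(15)
```

Call `main()` to serve the metrics; Prometheus scrapes them every 15 seconds per the config above.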
Create Prometheus alerts for GPU temperature and memory pressure:

```yaml
# /etc/prometheus/rules/gpu_alerts.yml
groups:
  - name: gpu_alerts
    rules:
      - alert: GPUHighTemperature
        expr: gpu_temperature_celsius > 80
        for: 5m
        annotations:
          summary: "GPU temperature critical"

      - alert: GPUMemoryPressure
        expr: (gpu_memory_used_bytes / gpu_memory_total_bytes) > 0.9
        for: 2m
        annotations:
          summary: "GPU memory usage > 90%"
```

Track model performance in real time:
```promql
# P95 Latency (alert if > 100ms)
histogram_quantile(0.95, rate(vllm_request_duration_seconds_bucket[5m]))

# Throughput (requests/second)
rate(vllm_requests_total[5m])

# Model Accuracy on Recent Data
rate(model_correct_predictions_total[1h]) / rate(model_total_predictions_total[1h])

# Average Latency by Model (histogram sum / histogram count)
sum(rate(vllm_request_duration_seconds_sum[5m])) by (model) /
  sum(rate(vllm_request_duration_seconds_count[5m])) by (model)
```
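`histogram_quantile()` estimates a percentile by finding the first cumulative bucket that covers the requested rank, then interpolating linearly within it. A pure-Python sketch of that calculation (an illustration, not Prometheus's actual code) is handy for sanity-checking bucket layouts:

```python
# How histogram_quantile() estimates a percentile from cumulative buckets:
# find the first bucket whose cumulative count covers the rank, then
# interpolate linearly between that bucket's bounds.
import math

def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs ending with (inf, total)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # rank falls in +Inf: return last finite bound
            in_bucket = count - prev_count
            frac = (rank - prev_count) / in_bucket if in_bucket else 0.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count

# 100 requests: 60 finished under 50ms, 90 under 100ms, all under 200ms
buckets = [(0.05, 60), (0.10, 90), (0.20, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)  # 0.15s: halfway into the 100-200ms bucket
```

The takeaway: accuracy depends entirely on bucket boundaries, so place buckets densely around your SLO thresholds (80ms, 100ms, 150ms).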
Monitor whether predictions are degrading:

```promql
# MMLU Score Drift (foundation benchmark)
mmlu_benchmark_score

# Prediction Confidence Trending Down
avg(prediction_confidence) < 0.75

# False Positive Rate Trending Up (for classification models)
rate(false_positives_total[1h]) / rate(total_predictions[1h])

# Latency Drift (slower inference = potential issues; deriv() works on gauges)
deriv(inference_latency_p95_seconds[1h])
```

A Grafana dashboard for real-time inference:

```json
{
  "dashboard": {
    "title": "RHEL AI Real-Time Inference",
    "panels": [
      {
        "title": "Requests Per Second",
        "targets": [{"expr": "rate(vllm_requests_total[5m])"}],
        "type": "graph"
      },
      {
        "title": "P50/P95/P99 Latency (ms)",
        "targets": [
          {"expr": "histogram_quantile(0.50, rate(vllm_request_duration_seconds_bucket[5m])) * 1000", "legendFormat": "P50"},
          {"expr": "histogram_quantile(0.95, rate(vllm_request_duration_seconds_bucket[5m])) * 1000", "legendFormat": "P95"},
          {"expr": "histogram_quantile(0.99, rate(vllm_request_duration_seconds_bucket[5m])) * 1000", "legendFormat": "P99"}
        ],
        "type": "graph",
        "thresholds": [
          {"value": 80, "color": "yellow", "label": "P95 Warning"},
          {"value": 150, "color": "red", "label": "P95 Critical"}
        ]
      },
      {
        "title": "GPU Utilization",
        "targets": [
          {"expr": "(gpu_memory_used_bytes / gpu_memory_total_bytes) * 100", "legendFormat": "Memory %"},
          {"expr": "gpu_utilization_percent", "legendFormat": "Compute %"}
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [{"expr": "rate(vllm_request_errors_total[5m]) / rate(vllm_requests_total[5m])"}],
        "type": "stat",
        "thresholds": [0, 0.001, 0.01]
      }
    ]
  }
}
```

A second dashboard focuses on model drift:

```json
{
  "dashboard": {
    "title": "Model Drift Detection",
    "panels": [
      {
        "title": "Prediction Confidence Trend",
        "targets": [
          {"expr": "avg(prediction_confidence) by (model)"}
        ],
        "type": "graph",
        "alert": "If confidence drops > 5% in 24h"
      },
      {
        "title": "Latency Trend (Detection of Model Slowdown)",
        "targets": [
          {"expr": "deriv(inference_latency_p95_seconds[5m])"}
        ],
        "type": "graph"
      },
      {
        "title": "Accuracy on Validation Set (Hourly)",
        "targets": [
          {"expr": "model_validation_accuracy"}
        ],
        "type": "graph",
        "thresholds": [
          {"value": 0.92, "color": "red", "label": "Below SLO"}
        ]
      },
      {
        "title": "Data Distribution Shift Detection",
        "targets": [
          {"expr": "kolmogorov_smirnov_test_p_value"}
        ],
        "type": "stat",
        "thresholds": [
          {"value": 0.05, "color": "red", "label": "Significant Drift"}
        ]
      }
    ]
  }
}
```
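The `kolmogorov_smirnov_test_p_value` series in the last panel is this article's naming convention, not a standard exporter: it assumes a service that runs a two-sample KS test between recent inputs and the training-time baseline and publishes the p-value. The statistic itself is one SciPy call; a sketch with synthetic data:

```python
# Two-sample Kolmogorov-Smirnov test: a p-value below 0.05 indicates the
# recent distribution differs significantly from the training baseline.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
baseline = rng.normal(0.0, 1.0, size=5000)        # training-time feature distribution
recent_ok = rng.normal(0.0, 1.0, size=5000)       # same distribution: no drift
recent_shifted = rng.normal(0.5, 1.0, size=5000)  # mean shift: drifted inputs

p_ok = ks_2samp(baseline, recent_ok).pvalue          # large: no evidence of drift
p_drift = ks_2samp(baseline, recent_shifted).pvalue  # tiny: significant drift
```

Export the p-value as a gauge on the drift detector's `/metrics` endpoint and the dashboard's 0.05 threshold will flag significant shift.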
The drift metrics above come from a small Python service:

```python
# /opt/rhel-ai/drift_detector.py
import time

import numpy as np
from prometheus_client import start_http_server, Gauge
from scipy.stats import entropy

# Prometheus metrics served on :8001 (the model-drift-detector scrape target)
drift_score = Gauge('model_drift_score', 'KL divergence from training distribution')
accuracy_metric = Gauge('model_accuracy', 'Current model accuracy')
latency_p95 = Gauge('inference_latency_p95_seconds', 'P95 inference latency')


def detect_drift(recent_predictions, baseline_predictions):
    """Detect model drift as KL divergence between prediction distributions."""
    eps = 1e-10  # smoothing: empty bins would otherwise make the divergence infinite
    recent_dist = np.histogram(recent_predictions, bins=50, range=(0, 1))[0] + eps
    recent_dist = recent_dist / recent_dist.sum()
    baseline_dist = np.histogram(baseline_predictions, bins=50, range=(0, 1))[0] + eps
    baseline_dist = baseline_dist / baseline_dist.sum()
    return entropy(recent_dist, baseline_dist)  # KL(recent || baseline)


def calculate_accuracy(predictions, ground_truth):
    """Calculate accuracy on recent labeled data."""
    return np.mean(predictions == ground_truth)


if __name__ == '__main__':
    start_http_server(8001)

    # Baseline prediction distribution captured at training time
    baseline_preds = np.load('/opt/rhel-ai/baseline_predictions.npy')

    while True:
        # Recent predictions and labels exported from the inference logs
        recent_preds = np.load('/tmp/recent_predictions.npy')
        ground_truth = np.load('/tmp/ground_truth.npy')

        drift = detect_drift(recent_preds, baseline_preds)
        accuracy = calculate_accuracy(recent_preds, ground_truth)

        # Publish to Prometheus
        drift_score.set(drift)
        accuracy_metric.set(accuracy)

        if drift > 0.5:
            print(f"⚠️ HIGH DRIFT DETECTED: {drift:.3f}")

        time.sleep(300)  # check every 5 minutes
```
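Before trusting the 0.5 alert threshold, it helps to see the score's scale on synthetic data. This standalone re-implementation of the same KL computation shows identical distributions scoring near zero and a clear shift scoring far above the threshold (the Beta distributions are illustrative stand-ins for confidence scores):

```python
# Sanity check for the KL-divergence drift score: same-distribution data
# should score near 0, a clearly shifted distribution well above 0.5.
import numpy as np
from scipy.stats import entropy

def kl_drift(recent, baseline, bins=50):
    eps = 1e-10  # smoothing so empty bins don't make the divergence infinite
    r = np.histogram(recent, bins=bins, range=(0, 1))[0] + eps
    b = np.histogram(baseline, bins=bins, range=(0, 1))[0] + eps
    return entropy(r / r.sum(), b / b.sum())

rng = np.random.default_rng(0)
baseline = rng.beta(2, 5, size=20000)  # stand-in for training-time scores
same = rng.beta(2, 5, size=20000)      # no drift
shifted = rng.beta(5, 2, size=20000)   # confidence distribution flipped

low = kl_drift(same, baseline)      # small: well under 0.5
high = kl_drift(shifted, baseline)  # large: far over 0.5
```

If your real score distributions are narrower or wider, recalibrate the threshold the same way before turning on the alert.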
Run the detector continuously by installing it as a systemd service:

```bash
# Create the systemd unit
sudo tee /etc/systemd/system/rhel-ai-drift-detector.service > /dev/null <<EOF
[Unit]
Description=RHEL AI Model Drift Detector
After=network.target

[Service]
Type=simple
User=rhel-ai
ExecStart=/usr/bin/python3 /opt/rhel-ai/drift_detector.py
Restart=always
StandardOutput=journal

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start rhel-ai-drift-detector
sudo systemctl enable rhel-ai-drift-detector
```

Define alerts for production SLOs:
```yaml
# /etc/prometheus/rules/rhel-ai-slos.yml
groups:
  - name: rhel_ai_slos
    rules:
      # Inference Performance
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.95, rate(vllm_request_duration_seconds_bucket[5m])) > 0.08
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency {{ $value | humanizeDuration }}"

      # Model Accuracy
      - alert: ModelAccuracyDegraded
        expr: rate(model_correct_predictions_total[1h]) / rate(model_total_predictions_total[1h]) < 0.92
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy below SLO: {{ $value | humanizePercentage }}"

      # Drift Detection
      - alert: SignificantModelDrift
        expr: model_drift_score > 0.5
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Model drift detected: {{ $value | humanize }}"

      # System Health
      - alert: GPUOutOfMemory
        expr: (gpu_memory_used_bytes / gpu_memory_total_bytes) > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory critical on {{ $labels.instance }}"
```

As defined in *Practical RHEL AI*, define SLOs that map to SLIs:
| Metric | Target | Warning | Critical |
|---|---|---|---|
| P95 Latency | ≤ 80 ms | 100 ms | 150 ms |
| Model Accuracy | ≥ 92% | 91% | 88% |
| Uptime | 99.9% | 99.5% | 99% |
| Error Rate | < 0.1% | 0.5% | 1% |
| GPU Utilization | 60-80% | > 90% | > 95% |
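The table maps directly to code. A small sketch (the `classify` helper and its threshold bands are this article's illustration of the table above, not a standard library):

```python
# Classify measured SLIs against the SLO table above.
# Each entry: (warning threshold, critical threshold, direction of badness).
SLOS = {
    'p95_latency_ms': (100,   150,  'above'),  # target <= 80 ms
    'model_accuracy': (0.91,  0.88, 'below'),  # target >= 92%
    'error_rate':     (0.005, 0.01, 'above'),  # target < 0.1%
}

def classify(metric, value):
    """Return 'ok', 'warning', or 'critical' for a measured SLI value."""
    warn, crit, direction = SLOS[metric]
    if direction == 'above':          # higher values are worse
        if value >= crit:
            return 'critical'
        return 'warning' if value >= warn else 'ok'
    else:                             # 'below': lower values are worse
        if value <= crit:
            return 'critical'
        return 'warning' if value <= warn else 'ok'
```

For example, a measured P95 of 87 ms is within the warning band's headroom (`'ok'`), while accuracy of 90% sits between the warning and critical thresholds (`'warning'`).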
*Practical RHEL AI* covers each of these metrics in depth. With comprehensive monitoring in place, you can catch drift, latency regressions, and GPU pressure before they reach users.
The next article will explore AI governance, security, and compliance—ensuring your RHEL AI deployment meets enterprise standards.
Want complete monitoring coverage for your AI infrastructure? *Practical RHEL AI* includes comprehensive observability guidance, giving you the monitoring tools to meet enterprise SLOs and detect issues before they impact users.
Learn More → | Buy on Amazon →