You Can't Improve What You Can't Measure
An edge AI model in production without monitoring is a ticking time bomb. Models drift. Hardware degrades. Cameras get dirty. You need to know when performance drops before the business does.
The Metrics That Matter
Inference Metrics
```python
from prometheus_client import Histogram, Counter, Gauge, start_http_server

# Core inference metrics
INFERENCE_LATENCY = Histogram(
    'inference_latency_seconds',
    'Inference latency in seconds',
    buckets=[0.005, 0.01, 0.015, 0.02, 0.05, 0.1, 0.5]
)

INFERENCE_COUNT = Counter(
    'inference_total',
    'Total inferences processed',
    ['model_version', 'result']
)

CONFIDENCE_SCORE = Histogram(
    'prediction_confidence',
    'Model confidence scores',
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
)

MODEL_VERSION = Gauge(
    'model_version_info',
    'Currently loaded model version',
    ['version']
)

# Usage in inference loop
@INFERENCE_LATENCY.time()
def run_inference(image):
    prediction = model.predict(image)
    INFERENCE_COUNT.labels(
        model_version='3.2',
        result=prediction.label
    ).inc()
    CONFIDENCE_SCORE.observe(prediction.confidence)
    return prediction

# Expose metrics on :9090/metrics
start_http_server(9090)
```

Hardware Metrics
```python
import re
import subprocess

GPU_TEMP = Gauge('gpu_temperature_celsius', 'GPU temperature')
GPU_UTIL = Gauge('gpu_utilization_percent', 'GPU utilization')
MEMORY_USED = Gauge('gpu_memory_used_bytes', 'GPU memory used')

def collect_jetson_metrics():
    """Collect Jetson hardware metrics from a single tegrastats sample."""
    # tegrastats streams forever, so read one line and terminate it
    proc = subprocess.Popen(
        ['tegrastats', '--interval', '1000'],
        stdout=subprocess.PIPE, text=True
    )
    line = proc.stdout.readline()
    proc.terminate()

    # Typical line: "RAM 2240/3964MB ... GR3D_FREQ 12% ... GPU@38.5C"
    # (exact fields vary across Jetson models and JetPack versions)
    gpu_temp = float(re.search(r'GPU@([\d.]+)C', line, re.IGNORECASE).group(1))
    gpu_util = float(re.search(r'GR3D_FREQ (\d+)%', line).group(1))
    ram_mb = int(re.search(r'RAM (\d+)/\d+MB', line).group(1))

    GPU_TEMP.set(gpu_temp)
    GPU_UTIL.set(gpu_util)
    # tegrastats reports unified RAM; Jetson GPUs share it with the CPU
    MEMORY_USED.set(ram_mb * 1024 * 1024)
```
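`collect_jetson_metrics()` takes a single sample, so something needs to call it on a schedule. A minimal sketch using a daemon thread; the 30-second period is an assumption, picked to match the scrape interval in the Prometheus config below:

```python
import threading
import time

def start_hardware_collector(period_s=30):
    """Sample hardware metrics on a fixed period without blocking inference."""
    def loop():
        while True:
            try:
                collect_jetson_metrics()
            except Exception:
                # A transient tegrastats hiccup shouldn't kill the collector
                pass
            time.sleep(period_s)
    threading.Thread(target=loop, daemon=True).start()

start_hardware_collector()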
```

Data Quality Metrics
```python
import cv2

IMAGE_BRIGHTNESS = Histogram(
    'input_image_brightness',
    'Average brightness of input images',
    buckets=[20, 40, 60, 80, 100, 120, 140, 160, 180, 200]
)

IMAGE_BLUR = Histogram(
    'input_image_blur_score',
    'Laplacian variance (blur detection)',
    buckets=[10, 50, 100, 200, 500, 1000]
)

def check_image_quality(image):
    brightness = image.mean()
    blur = cv2.Laplacian(image, cv2.CV_64F).var()
    IMAGE_BRIGHTNESS.observe(brightness)
    IMAGE_BLUR.observe(blur)
    if brightness < 30:
        alert("Camera may be obstructed or lighting failed")
    if blur < 50:
        alert("Camera may be out of focus")
```
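`alert()` is left undefined above because it depends on your alerting stack. One option is to keep everything in the Prometheus pipeline: log locally and increment a labeled counter that an alert rule can fire on. A sketch; the `data_quality_alerts_total` metric and `reason` label are illustrative, not part of the original code:

```python
import logging
from prometheus_client import Counter

QUALITY_ALERTS = Counter(
    'data_quality_alerts_total',
    'Data quality alerts raised on-device',
    ['reason']
)

def alert(message, reason='image_quality'):
    """Log locally and expose a counter Prometheus can alert on."""
    logging.warning('ALERT: %s', message)
    QUALITY_ALERTS.labels(reason=reason).inc()
```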
alert("Camera may be out of focus")Prometheus Configuration
```yaml
# prometheus.yml - Central Prometheus scraping edge devices
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'edge-ai-fleet'
    file_sd_configs:
      - files:
          - /etc/prometheus/edge_targets/*.json
        refresh_interval: 5m

  # Or use DNS service discovery if devices register
  - job_name: 'edge-ai-dns'
    dns_sd_configs:
      - names:
          - '_metrics._tcp.edge.internal'
        type: SRV
        refresh_interval: 60s

# Alert rules
rule_files:
  - /etc/prometheus/rules/edge_ai_alerts.yml
```
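Each file under /etc/prometheus/edge_targets/ holds a JSON list of objects with `targets` and `labels` keys, which is the format Prometheus file_sd expects. Below is a hypothetical provisioning helper that appends a device in that format; `register_edge_device` and the `site` label are illustrative:

```python
import json

def register_edge_device(instance, site,
                         path='/etc/prometheus/edge_targets/fleet.json'):
    """Append a device to the file_sd target list Prometheus watches."""
    try:
        with open(path) as f:
            entries = json.load(f)
    except FileNotFoundError:
        entries = []
    entries.append({
        'targets': [f'{instance}:9090'],  # metrics port from start_http_server
        'labels': {'site': site}
    })
    with open(path, 'w') as f:
        json.dump(entries, f, indent=2)

# register_edge_device('edge-cam-042.internal', site='plant-3')
```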
Alert Rules
```yaml
# edge_ai_alerts.yml
groups:
  - name: edge_ai
    rules:
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.95, rate(inference_latency_seconds_bucket[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High inference latency on {{ $labels.instance }}"
          description: "P95 latency is {{ $value }}s (threshold: 50ms)"

      - alert: LowConfidenceSpike
        expr: rate(prediction_confidence_bucket{le="0.5"}[15m]) / rate(prediction_confidence_count[15m]) > 0.1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Model confidence dropping on {{ $labels.instance }}"
          description: ">10% of predictions below 0.5 confidence; possible model drift"

      - alert: GPUOverheating
        expr: gpu_temperature_celsius > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU overheating on {{ $labels.instance }}"

      - alert: DeviceUnreachable
        expr: up{job="edge-ai-fleet"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
summary: "Edge device {{ $labels.instance }} unreachable"Model Accuracy Drift Detection
The most insidious edge AI failure is the model slowly getting worse: new product variants, lighting changes, and seasonal differences shift the inputs away from the training data.
```python
# Track accuracy against known-good reference images
VALIDATION_ACCURACY = Gauge(
    'model_validation_accuracy',
    'Accuracy on reference validation set'
)

def periodic_validation():
    """Run every hour against reference images."""
    correct = 0
    total = 0
    for image, expected in REFERENCE_SET:
        prediction = model.predict(image)
        if prediction.label == expected:
            correct += 1
        total += 1
    accuracy = correct / total
    VALIDATION_ACCURACY.set(accuracy)
    if accuracy < 0.95:
        alert(f"Model accuracy dropped to {accuracy:.1%}, retrain needed")
```

Grafana Dashboard
Key panels for the edge AI fleet dashboard (a fleet-status query sketch follows the list):
- Fleet Overview: map or table showing all devices, status, and model version
- Inference Performance: latency p50/p95/p99 per device, and aggregated
- Model Confidence Distribution: histogram showing confidence spread (drift shows up as a left shift)
- Hardware Health: GPU temperature, utilization, and memory across the fleet
- Defect Rate Trend: is the defect detection rate changing? Could indicate a model issue or a real quality issue
- Alert Timeline: recent alerts with severity
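The fleet overview doesn't have to live only in Grafana; the same data is one query away via the Prometheus HTTP API. A sketch, assuming Prometheus is reachable at prometheus.internal:9090 (the hostname is illustrative):

```python
import requests

def fleet_status(prom_url='http://prometheus.internal:9090'):
    """Return {instance: is_up} for every device in the edge-ai-fleet job."""
    resp = requests.get(
        f'{prom_url}/api/v1/query',
        params={'query': 'up{job="edge-ai-fleet"}'},
        timeout=10,
    )
    resp.raise_for_status()
    return {
        sample['metric']['instance']: sample['value'][1] == '1'
        for sample in resp.json()['data']['result']
    }
```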
Monitoring edge AI is harder than monitoring cloud services because devices fail in ways cloud servers don't: dirty cameras, power fluctuations, physical damage. Build your dashboards with these failure modes in mind.
