Why Standard HPA Fails for AI Inference
Horizontal Pod Autoscaler uses CPU/memory metrics, but AI inference workloads have unique scaling signals:
- Queue depth: requests waiting for GPU availability
- Token throughput: tokens/second per pod
- GPU utilization: DCGM metrics showing compute saturation
- Time-to-first-token (TTFT): latency degradation signals overload
- KV-cache pressure: memory exhaustion before CPU/GPU max out
KEDA solves this with event-driven scaling from 60+ sources, including Prometheus metrics that expose these AI-specific signals.
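Each of these signals can be written as a Prometheus query that a KEDA trigger evaluates. As a minimal sketch for TTFT, assuming vLLM's `vllm:time_to_first_token_seconds` histogram is scraped by the Prometheus instance used throughout this article; the 2-second target is an illustrative assumption, not a recommendation:

```yaml
# Fragment of a ScaledObject spec; tune the threshold per model and hardware.
triggers:
  - type: prometheus
    metricType: Value            # compare the P95 directly instead of dividing it by replicas
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: |
        histogram_quantile(0.95,
          sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[2m])))
      threshold: "2"             # assumed target: P95 TTFT of 2 seconds
```

A short rate window such as 2m reacts faster to overload, at the cost of a noisier signal.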
Architecture
+---------------------------------------------------+
|                   KEDA Operator                   |
|  +------------+  +------------+  +------------+   |
|  | Prometheus |  |   Kafka    |  |    Cron    |   |
|  |   Scaler   |  |   Scaler   |  |   Scaler   |   |
|  +-----+------+  +-----+------+  +-----+------+   |
+--------|---------------|---------------|----------+
         |               |               |
         v               v               v
+---------------------------------------------------+
|                HPA (auto-managed)                 |
+-------------------------+-------------------------+
                          |
             +------------+------------+
             v            v            v
        +----------+ +----------+ +----------+
        | vLLM Pod | | vLLM Pod | | vLLM Pod |
        |  (GPU)   | |  (GPU)   | |  (GPU)   |
        +----------+ +----------+ +----------+

KEDA ScaledObject for vLLM
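KEDA handles activation (0 to 1) itself and delegates scaling from 1 to N to an HPA that it creates for each ScaledObject. As a rough sketch of what that generated HPA might look like for the `vllm-inference` object defined in this section: the `keda-hpa-` name prefix and the `s0-prometheus` metric name follow KEDA's usual conventions but can differ between versions, so treat both as assumptions.

```yaml
# Illustrative only: KEDA creates and owns this object, do not apply it by hand.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: keda-hpa-vllm-inference        # assumed naming convention
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: s0-prometheus          # assumed: first trigger of the ScaledObject
          selector:
            matchLabels:
              scaledobject.keda.sh/name: vllm-inference
        target:
          type: AverageValue           # total metric value divided by replica count
          averageValue: "5"
```

Running `kubectl get hpa -n ai-inference` after applying a ScaledObject shows the real object and its current metric values.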
Scale on Pending Requests (Queue Depth)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference
  namespace: ai-inference
spec:
  scaleTargetRef:
    name: vllm-deployment
  pollingInterval: 10
  cooldownPeriod: 120
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    # Scale when pending requests exceed 5 per pod.
    # KEDA's default metricType (AverageValue) already divides the total by the
    # replica count, so the query returns the sum rather than a per-pod value.
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: |
          sum(vllm:num_requests_waiting)
        threshold: "5"
        activationThreshold: "2"
    # Scale on GPU utilization. The query is already an average,
    # so metricType: Value compares it to the threshold directly.
    - type: prometheus
      metricType: Value
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: |
          avg(DCGM_FI_DEV_GPU_UTIL{namespace="ai-inference",pod=~"vllm.*"})
        threshold: "80"
    # Pre-scale for business hours
    - type: cron
      metadata:
        timezone: Europe/Amsterdam
        start: "30 7 * * 1-5"
        end: "0 20 * * 1-5"
        desiredReplicas: "3"

Scale on KV-Cache Pressure
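triggers:
  - type: prometheus
    metricType: Value  # The query is an average, so compare it directly
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: |
        avg(vllm:gpu_cache_usage_perc{namespace="ai-inference"})
      # The gauge is a 0-1 fraction (1 = 100%), so 85% usage is 0.85
      threshold: "0.85"
      metricName: kv_cache_pressure

Scale on Token Throughput Degradation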
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: |
        sum(rate(vllm:generation_tokens_total[2m]))
      # KEDA adds replicas when the metric exceeds the threshold, and the default
      # AverageValue metricType divides the sum by the replica count, so this
      # scales out once per-pod throughput passes 500 tokens/s (a saturation signal)
      threshold: "500"
      activationThreshold: "400"

Scale-to-Zero for Dev/Staging
Save GPU costs when no requests are queued:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-dev
spec:
  scaleTargetRef:
    name: vllm-dev-deployment
  minReplicaCount: 0  # Scale to zero!
  maxReplicaCount: 2
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        # num_requests_running is a gauge, so take its 5-minute peak instead of increase().
        # At zero replicas no vLLM pod exports it, so waking up needs a signal
        # from outside the pods (for example a gateway or queue metric).
        query: |
          sum(max_over_time(vllm:num_requests_running[5m]))
        threshold: "1"
        activationThreshold: "0"
  advanced:
    restoreToOriginalReplicaCount: false
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300

Cold start consideration: GPU pod initialization takes 30-120 seconds (model loading). Use a startup probe with a generous timeout:
startupProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 30  # 30 checks x 10s = up to 5 minutes for large models

NVIDIA DCGM Metrics for Scaling
Deploy DCGM Exporter to expose GPU metrics to Prometheus:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter  # Must match spec.selector, or the DaemonSet is rejected
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
          ports:
            - containerPort: 9400
          env:
            - name: DCGM_EXPORTER_COLLECTORS
              value: /etc/dcgm-exporter/dcp-metrics-included.csv
          volumeMounts:
            - name: nvidia
              mountPath: /usr/local/nvidia
      volumes:
        - name: nvidia
          hostPath:
            path: /usr/local/nvidia

Key GPU Metrics for Autoscaling
| Metric | Description | Scale Trigger |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization % | > 80% |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization | > 70% |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MB) | > 90% capacity |
| DCGM_FI_DEV_POWER_USAGE | Power draw (Watts) | Approaching TDP |
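The framebuffer row needs a ratio, because DCGM_FI_DEV_FB_USED reports an absolute amount rather than a percentage. A minimal sketch of such a trigger, assuming the exporter also publishes DCGM_FI_DEV_FB_FREE and that both series carry a `pod` label (both worth verifying against your exporter configuration):

```yaml
# Fragment of a ScaledObject spec; label names depend on your scrape setup.
triggers:
  - type: prometheus
    metricType: Value            # the ratio is compared to the threshold as-is
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: |
        max(
          DCGM_FI_DEV_FB_USED{pod=~"vllm.*"}
          / (DCGM_FI_DEV_FB_USED{pod=~"vllm.*"} + DCGM_FI_DEV_FB_FREE{pod=~"vllm.*"})
        )
      threshold: "0.9"           # 90% of framebuffer capacity on the fullest GPU
```

Since vLLM preallocates most of its GPU memory at startup, this ratio tends to sit near the configured limit regardless of load; KV-cache usage is usually the more informative memory signal for vLLM pods.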
Multi-Signal Scaling Strategy
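Production systems combine multiple signals. Each trigger is evaluated independently, and the HPA scales to the highest replica count any of them requests: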
triggers:
  # Primary: queue depth (immediate demand)
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: sum(vllm:num_requests_waiting)
      threshold: "10"
  # Secondary: latency degradation (quality signal)
  - type: prometheus
    metricType: Value  # Compare the quantile directly
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: histogram_quantile(0.95, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))
      threshold: "5"  # P95 > 5s = overloaded
  # Tertiary: GPU saturation (capacity signal)
  - type: prometheus
    metricType: Value  # Compare the average directly
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"vllm.*"})
      threshold: "85"
  # Scheduled pre-scaling (predictive)
  - type: cron
    metadata:
      timezone: UTC
      start: "0 8 * * 1-5"
      end: "0 10 * * 1-5"
      desiredReplicas: "5"

Scaling Best Practices for AI Inference
1. Use Activation Thresholds
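Keep an idle workload from waking up on noise. activationThreshold gates the 0-to-1 transition; once at least one replica is running, threshold drives scaling from 1 to N (with minReplicaCount of 1 or more, the activation value is ignored):
metadata:
  threshold: "80"
  activationThreshold: "60"  # Scaler becomes active only above 60

2. Asymmetric Scale Behavior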
Scale up fast, scale down slow (GPU pods are expensive to restart):
advanced:
  horizontalPodAutoscalerConfig:
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 0
        policies:
          - type: Pods
            value: 2
            periodSeconds: 30
      scaleDown:
        stabilizationWindowSeconds: 600
        policies:
          - type: Pods
            value: 1
            periodSeconds: 120

3. Pod Disruption Budget
Keep at least one inference pod available during voluntary disruptions such as node drains:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-inference

4. Topology-Aware Scheduling
Ensure new pods land on GPU nodes:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
                - NVIDIA-A100-SXM4-80GB
                - NVIDIA-H100-SXM5-80GB

Cost Impact
| Scenario | Without KEDA | With KEDA | Savings |
|---|---|---|---|
| Dev cluster (8h/day usage) | 24/7 GPU = $2,400/mo | 8h/day = $800/mo | 67% |
| Prod (variable traffic) | 8 pods always = $19,200/mo | 2-8 pods = $9,600/mo | 50% |
| Batch inference (nights) | Manual scaling | 0 to 4 pods on schedule | 80% |
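The batch row corresponds to a cron trigger combined with `minReplicaCount: 0`. A minimal sketch, in which the `vllm-batch` names and the 22:00 to 06:00 UTC window are hypothetical placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-batch                 # hypothetical name
  namespace: ai-inference
spec:
  scaleTargetRef:
    name: vllm-batch-deployment    # hypothetical Deployment
  minReplicaCount: 0               # no GPU pods outside the window
  maxReplicaCount: 4
  triggers:
    - type: cron
      metadata:
        timezone: UTC
        start: "0 22 * * *"        # scale to 4 pods at 22:00
        end: "0 6 * * *"           # release them at 06:00
        desiredReplicas: "4"
```

After the `end` time passes, the pods are removed once `cooldownPeriod` has elapsed, so the actual GPU hours run slightly longer than the cron window.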