Autoscale AI Inference on Kubernetes with KEDA + GPUs

Scale LLM inference pods based on queue depth, token throughput, and GPU utilization using KEDA. Production patterns for vLLM, TGI, and NIM on Kubernetes.

Luca Berton · 2 min read

Why Standard HPA Fails for AI Inference

By default, the Horizontal Pod Autoscaler scales on CPU and memory metrics, but AI inference workloads have their own scaling signals:

  • Queue depth: requests waiting for GPU availability
  • Token throughput: tokens/second per pod
  • GPU utilization: DCGM metrics showing compute saturation
  • Time-to-first-token (TTFT): latency degradation signals overload
  • KV-cache pressure: memory exhaustion before CPU/GPU max out

KEDA solves this with event-driven scaling from 60+ sources, including Prometheus metrics that expose these AI-specific signals.

Architecture

┌──────────────────────────────────────────────────┐
│                  KEDA Operator                   │
│  ┌────────────┐  ┌────────────┐  ┌──────────┐    │
│  │ Prometheus │  │   Kafka    │  │   Cron   │    │
│  │   Scaler   │  │   Scaler   │  │  Scaler  │    │
│  └─────┬──────┘  └─────┬──────┘  └────┬─────┘    │
└────────┼───────────────┼──────────────┼──────────┘
         │               │              │
         ▼               ▼              ▼
┌──────────────────────────────────────────────────┐
│              HPA (auto-managed)                  │
└────────────────────┬─────────────────────────────┘
                     │
         ┌───────────┼───────────┐
         ▼           ▼           ▼
   ┌──────────┐┌──────────┐┌──────────┐
   │ vLLM Pod ││ vLLM Pod ││ vLLM Pod │
   │  (GPU)   ││  (GPU)   ││  (GPU)   │
   └──────────┘└──────────┘└──────────┘

KEDA ScaledObject for vLLM

Scale on Pending Requests (Queue Depth)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference
  namespace: ai-inference
spec:
  scaleTargetRef:
    name: vllm-deployment
  pollingInterval: 10
  cooldownPeriod: 120
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
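    # Scale when pending requests exceed 5 per pod.
    # KEDA's default metric type is AverageValue, so the HPA already divides
    # the summed queue depth by the current replica count.
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: |
          sum(vllm:num_requests_waiting{namespace="ai-inference"})
        threshold: "5"
        activationThreshold: "2"
    # Scale on GPU utilization.
    # The query is already an average, so compare it to the threshold
    # directly (Value) instead of dividing it by the replica count again.
    - type: prometheus
      metricType: Value
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: |
          avg(DCGM_FI_DEV_GPU_UTIL{namespace="ai-inference",pod=~"vllm.*"})
        threshold: "80"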
    # Pre-scale for business hours
    - type: cron
      metadata:
        timezone: Europe/Amsterdam
        start: "30 7 * * 1-5"
        end: "0 20 * * 1-5"
        desiredReplicas: "3"
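
To see how a threshold turns into a replica count, it helps to write out the arithmetic the HPA applies to the value KEDA hands it. The sketch below is a simplified model of the documented HPA formula, not KEDA or Kubernetes code: the function names are hypothetical, the 10% tolerance is the Kubernetes default, and clamping to minReplicaCount/maxReplicaCount and the scale behavior policies are left out. By default KEDA exposes a trigger as an AverageValue target, where the metric is treated as a total shared across pods; a trigger can instead set metricType: Value, where the metric is compared to the threshold directly.

```python
import math

TOLERANCE = 0.1  # Kubernetes default: ignore deviations within 10% of target


def desired_replicas_average_value(metric_total, target_per_pod, current_replicas):
    """AverageValue target: the metric is a total that is shared by all pods."""
    ratio = metric_total / (target_per_pod * current_replicas)
    if abs(ratio - 1.0) <= TOLERANCE:
        return current_replicas  # close enough, no change
    return math.ceil(metric_total / target_per_pod)


def desired_replicas_value(metric_value, target, current_replicas):
    """Value target: the metric is compared to the target as-is."""
    ratio = metric_value / target
    if abs(ratio - 1.0) <= TOLERANCE:
        return current_replicas  # close enough, no change
    return math.ceil(ratio * current_replicas)


# 24 queued requests in total, target of 5 per pod, 2 pods running
print(desired_replicas_average_value(24, 5, 2))   # 5

# Average GPU utilization of 92% against a target of 80, 3 pods running
print(desired_replicas_value(92, 80, 3))          # 4

# The same averaged gauge fed through AverageValue
print(desired_replicas_average_value(92, 80, 3))  # 2
```

The last call shows why an already-averaged gauge such as GPU utilization is a poor fit for AverageValue: at 92% against a target of 80 it settles on 2 replicas no matter how many pods are running.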

Scale on KV-Cache Pressure

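triggers:
  - type: prometheus
    name: kv_cache_pressure  # trigger-level name; metadata.metricName is deprecated
    metricType: Value        # the query is an average, so compare it directly
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      # vllm:gpu_cache_usage_perc is a 0-1 fraction; convert it to percent
      query: |
        avg(vllm:gpu_cache_usage_perc{namespace="ai-inference"}) * 100
      threshold: "85"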

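Scale on Per-Pod Token Throughput

triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      # Total generation rate; with the default AverageValue metric type the
      # HPA divides it by the replica count to get per-pod throughput.
      query: |
        sum(rate(vllm:generation_tokens_total{namespace="ai-inference"}[2m]))
      # KEDA adds replicas when the value rises above the threshold, so this
      # keeps each pod at or below roughly 500 generated tokens/s. It does
      # not fire when throughput falls.
      threshold: "500"
      activationThreshold: "400"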

Scale-to-Zero for Dev/Staging

Save GPU costs when no requests are queued:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-dev
spec:
  scaleTargetRef:
    name: vllm-dev-deployment
  minReplicaCount: 0  # Scale to zero!
  maxReplicaCount: 2
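  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        # num_requests_running is a gauge, so take its peak over the window
        # rather than increase(), which is only meaningful for counters.
        # Once the last pod is gone this series disappears, so waking from
        # zero needs a signal from a component that stays up (for example
        # an ingress or gateway request metric).
        query: |
          sum(max_over_time(vllm:num_requests_running[5m]))
        threshold: "1"
        activationThreshold: "0"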
  advanced:
    restoreToOriginalReplicaCount: false
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300

Cold-start consideration: GPU pod initialization takes 30-120 seconds because the model has to load, so the first request after a scale-from-zero waits at least that long. Use a startup probe with a generous failure budget:

startupProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 30  # Up to 5 minutes for large models
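
As a quick check on the probe settings above, the time a container gets before the kubelet restarts it can be estimated from the three fields. This is a rough sketch with a hypothetical helper name; it ignores timeoutSeconds and assumes every probe attempt fails.

```python
def startup_budget_seconds(initial_delay, period, failure_threshold):
    """Approximate time a container has to pass its startup probe."""
    return initial_delay + period * failure_threshold


budget = startup_budget_seconds(initial_delay=30, period=10, failure_threshold=30)
print(budget)       # 330 seconds
print(budget / 60)  # 5.5 minutes

# The slowest model load quoted above (120 s) fits with room to spare.
print(120 <= budget)  # True
```

If a larger model needs more time, raising failureThreshold extends the budget without delaying the first probe for models that load quickly.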

NVIDIA DCGM Metrics for Scaling

Deploy DCGM Exporter to expose GPU metrics to Prometheus:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
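  template:
    metadata:
      labels:
        app: dcgm-exporter  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
          ports:
            - containerPort: 9400
          env:
            - name: DCGM_EXPORTER_COLLECTORS
              value: /etc/dcgm-exporter/dcp-metrics-included.csv
            # Attach pod and namespace labels to each GPU metric
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true"
          volumeMounts:
            - name: nvidia
              mountPath: /usr/local/nvidia
            # Kubelet pod-resources socket, used to map GPUs to pods
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
      volumes:
        - name: nvidia
          hostPath:
            path: /usr/local/nvidia
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources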

Key GPU Metrics for Autoscaling

| Metric | Description | Scale Trigger |
| --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization (%) | > 80% |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization (%) | > 70% |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MiB) | > 90% of capacity |
| DCGM_FI_DEV_POWER_USAGE | Power draw (W) | Approaching TDP |

Multi-Signal Scaling Strategy

Production systems combine multiple signals:

triggers:
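  # Primary: queue depth (immediate demand)
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: sum(vllm:num_requests_waiting)
      threshold: "10"
  # Secondary: latency degradation (quality signal)
  - type: prometheus
    metricType: Value  # a quantile is compared directly, not split across pods
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: histogram_quantile(0.95, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))
      threshold: "5"  # P95 > 5s = overloaded
  # Tertiary: GPU saturation (capacity signal)
  - type: prometheus
    metricType: Value  # an average is compared directly
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"vllm.*"})
      threshold: "85"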
  # Scheduled pre-scaling (predictive)
  - type: cron
    metadata:
      timezone: UTC
      start: "0 8 * * 1-5"
      end: "0 10 * * 1-5"
      desiredReplicas: "5"
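
When several triggers are listed, the HPA that KEDA manages computes a desired replica count for each one and acts on the largest, so any single signal can force a scale-up. A minimal sketch of that selection follows; the function name and the per-trigger numbers are hypothetical, and the replica bounds are example values.

```python
def combined_replicas(proposals, min_replicas, max_replicas):
    """Pick the largest per-trigger proposal, then clamp it to the allowed range."""
    wanted = max(proposals.values(), default=min_replicas)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical proposals during the morning cron window
morning = {"queue_depth": 3, "p95_latency": 2, "gpu_util": 4, "cron": 5}
print(combined_replicas(morning, min_replicas=1, max_replicas=8))  # 5

# A traffic spike that asks for more than the ceiling allows
spike = {"queue_depth": 12, "p95_latency": 6, "gpu_util": 7, "cron": 5}
print(combined_replicas(spike, min_replicas=1, max_replicas=8))    # 8
```

One consequence is that the cron trigger acts as a floor during its window: demand-driven triggers can still push the count above it, but nothing pulls it below.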

Scaling Best Practices for AI Inference

1. Use Activation Thresholds

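An activation threshold keeps low-level noise from waking a workload that is scaled to zero. It only governs the 0-to-1 transition; once pods are running, the regular threshold drives scaling:

metadata:
  threshold: "80"
  activationThreshold: "60"  # Stay at zero replicas until the metric exceeds 60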
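
The activation decision can be sketched as a small predicate. This is a simplified model based on the KEDA documentation on activation, with a hypothetical function name: the comparison is strictly greater-than, and the activation value is ignored when minReplicaCount is 1 or more because the workload never goes to zero.

```python
def scaler_is_active(metric_value, activation_threshold, min_replicas):
    """Return True when KEDA would keep (or bring) the workload above zero."""
    if min_replicas >= 1:
        return True  # never scaled to zero, so activation does not apply
    return metric_value > activation_threshold  # strictly greater, not >=


print(scaler_is_active(60, 60, min_replicas=0))  # False: 60 is not above 60
print(scaler_is_active(61, 60, min_replicas=0))  # True
print(scaler_is_active(10, 60, min_replicas=1))  # True: threshold ignored
```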

2. Asymmetric Scale Behavior

Scale up fast, scale down slow (GPU pods are expensive to restart):

advanced:
  horizontalPodAutoscalerConfig:
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 0
        policies:
          - type: Pods
            value: 2
            periodSeconds: 30
      scaleDown:
        stabilizationWindowSeconds: 600
        policies:
          - type: Pods
            value: 1
            periodSeconds: 120
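
To get a feel for what these policies mean in practice, the sketch below estimates how long a full ramp takes in each direction. It is an idealized lower bound with hypothetical helper names: it assumes the first step is applied immediately, and it ignores the HPA sync interval, pod scheduling, and model load time.

```python
import math


def scale_up_seconds(current, target, pods_per_period, period_seconds):
    """Time from the first scale-up step until the last one is allowed."""
    steps = math.ceil((target - current) / pods_per_period)
    return max(0, steps - 1) * period_seconds


def scale_down_seconds(current, target, pods_per_period, period_seconds,
                       stabilization_seconds):
    """Time from the load drop until the last pod removal is allowed."""
    steps = math.ceil((current - target) / pods_per_period)
    if steps <= 0:
        return 0
    return stabilization_seconds + (steps - 1) * period_seconds


# 1 -> 8 pods at 2 pods per 30 s: 1 -> 3 -> 5 -> 7 -> 8
print(scale_up_seconds(1, 8, pods_per_period=2, period_seconds=30))  # 90

# 8 -> 1 pods at 1 pod per 120 s after a 600 s stabilization window
print(scale_down_seconds(8, 1, pods_per_period=1, period_seconds=120,
                         stabilization_seconds=600))                 # 1320
```

With the values above, a full ramp up is allowed within about a minute and a half, while a full ramp down is spread over 22 minutes, which is the asymmetry the policy is after.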

3. Pod Disruption Budget

Keep at least one inference pod running during voluntary disruptions such as node drains and upgrades:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-inference

4. Topology-Aware Scheduling

Ensure new pods land on GPU nodes:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
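            # Label values are set by GPU Feature Discovery; confirm them with
            # kubectl get nodes -L nvidia.com/gpu.product
            - key: nvidia.com/gpu.product
              operator: In
              values:
                - NVIDIA-A100-SXM4-80GB
                - NVIDIA-H100-80GB-HBM3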

Cost Impact

| Scenario | Without KEDA | With KEDA | Savings |
| --- | --- | --- | --- |
| Dev cluster (8h/day usage) | 24/7 GPU = $2,400/mo | 8h/day = $800/mo | 67% |
| Prod (variable traffic) | 8 pods always = $19,200/mo | 2-8 pods = $9,600/mo | 50% |
| Batch inference (nights) | Manual scaling | 0→4 pods on schedule | 80% |
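
The dev and prod rows can be reproduced from the table's own figures. The sketch below only restates that arithmetic; the helper name is hypothetical, and the per-pod price is derived from the prod row rather than taken from any price list.

```python
def savings_percent(baseline, with_autoscaling):
    """Monthly savings as a whole-number percentage of the baseline cost."""
    return round(100 * (baseline - with_autoscaling) / baseline)


# Dev: one GPU billed 24/7 versus only 8 hours a day
dev_always_on = 2400
dev_scaled = dev_always_on * 8 / 24
print(dev_scaled)                                   # 800.0
print(savings_percent(dev_always_on, dev_scaled))   # 67

# Prod: 8 pods always on versus autoscaling between 2 and 8
prod_always_on = 19200
prod_scaled = 9600
per_pod = prod_always_on / 8
print(prod_scaled / per_pod)                        # 4.0 pods on average
print(savings_percent(prod_always_on, prod_scaled)) # 50
```

The prod figure therefore assumes traffic that averages out to four running pods; a busier service would see proportionally smaller savings.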
