Cut LLM Inference Costs 70% on Kubernetes (Proven Tactics)

Reduce AI inference costs with spot GPUs, KV-cache sharing, request batching, model quantization, and intelligent routing. Real-world savings from production deployments.

Luca Berton · 1 min read

The Cost Problem

Running LLMs in production is expensive. A single A100 80GB costs $2-4/hour in the cloud. At scale, running 24/7:

| Model | Min GPU | Cloud Cost/Hour | Monthly (24/7) |
|---|---|---|---|
| Llama 3.1 8B | 1x A10G | $1.00 | $730 |
| Llama 3.1 70B | 2x A100 80GB | $6.50 | $4,745 |
| Mixtral 8x22B | 4x A100 80GB | $13.00 | $9,490 |
| Llama 3.1 405B | 8x H100 | $32.00 | $23,360 |
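
The monthly column is the hourly price multiplied by roughly 730 hours per month. A minimal sketch of that arithmetic, with an illustrative function name that is not part of any tool mentioned here:

```python
HOURS_PER_MONTH = 730  # 24 h x 365 days / 12 months, rounded

def monthly_cost(hourly_usd: float) -> float:
    """Cost of keeping a node running 24/7 for one month at a fixed hourly price."""
    return hourly_usd * HOURS_PER_MONTH

print(monthly_cost(6.50))   # 4745.0  (2x A100 80GB)
print(monthly_cost(32.00))  # 23360.0 (8x H100)
```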

The good news: most teams overspend by 50-70%. Here's how to fix that.

1. Spot/Preemptible GPUs (Save 60-70%)

AWS Spot Instances with Karpenter

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge", "p4d.24xlarge"]
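        # nvidia.com/gpu is a resource name, not a node label; select GPU
        # instances through Karpenter's AWS instance label instead.
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]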
      nodeClassRef:
        name: gpu-nodes
  limits:
    nvidia.com/gpu: "16"
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m

Handling Spot Interruptions

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120
      containers:
        - name: vllm
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 90"]  # Drain in-flight requests

Key insight: Inference workloads are stateless, which makes them a good fit for spot capacity. Unlike training, a killed inference pod loses no durable state (no checkpoints needed); only its in-flight requests need to be drained or retried. Just route traffic to the remaining pods.
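
On the client or gateway side, "route traffic to the remaining pods" can be as simple as retrying on the next replica. The sketch below is a toy illustration: the backend names, the `BackendUnavailable` exception, and the `send` callable are all hypothetical stand-ins for whatever HTTP client and service discovery you already use.

```python
from typing import Callable, Iterable

class BackendUnavailable(Exception):
    """Raised by `send` when a pod is gone or refusing connections."""

def send_with_failover(backends: Iterable[str], send: Callable[[str], str]) -> str:
    """Try each backend in order and return the first successful response."""
    last_error = None
    for backend in backends:
        try:
            return send(backend)
        except BackendUnavailable as err:
            last_error = err  # pod was reclaimed; fall through to the next one
    raise RuntimeError("no inference backend available") from last_error

# Usage: pretend the first pod was reclaimed by a spot interruption.
def fake_send(backend: str) -> str:
    if backend == "vllm-0":
        raise BackendUnavailable(backend)
    return f"answered by {backend}"

print(send_with_failover(["vllm-0", "vllm-1"], fake_send))  # answered by vllm-1
```

In a real cluster the Kubernetes Service usually handles this once the terminating pod is removed from the endpoints; the retry only covers requests that were already in flight.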

2. Model Quantization (Save 50% GPU Memory)

GPTQ (4-bit) vs AWQ vs GGUF

| Method | Memory Reduction | Quality Loss | Speed Impact |
|---|---|---|---|
| FP16 (baseline) | 0% | 0% | 0% |
| INT8 | 50% | < 1% | +5% slower |
| GPTQ 4-bit | 75% | 1-3% | +10% slower |
| AWQ 4-bit | 75% | < 1% | Similar to FP16 |
| GGUF Q4_K_M | 75% | 1-2% | CPU-friendly |

vLLM with AWQ Quantized Model

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama70b-awq
spec:
  template:
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "TheBloke/Llama-3.1-70B-AWQ"
            - "--quantization"
            - "awq"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.95"
          resources:
            limits:
              nvidia.com/gpu: "1"  # 70B on SINGLE A100 with AWQ!

Cost impact: Llama 70B goes from 2x A100 ($6.50/hr) to 1x A100 ($3.25/hr) = 50% savings with negligible quality loss.
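
The reason a 4-bit 70B model fits on one 80GB card comes down to weight size. A rough sketch of that estimate, counting weights only (KV-cache, activations, and quantization metadata add to it, so treat the results as lower bounds):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8

print(weight_memory_gb(70, 16))  # 140.0 -> needs two 80GB GPUs
print(weight_memory_gb(70, 4))   # 35.0  -> fits on one 80GB GPU with room for KV-cache
```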

3. KV-Cache Optimization

After the model weights, the KV-cache is the largest memory consumer during inference, and it grows with batch size and context length. Optimizing it directly reduces GPU count.
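
To get a feel for the numbers, the KV-cache size per token can be estimated from the model shape. The sketch below assumes Llama 3.1 70B's published configuration (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and FP16 cache values; treat those figures as assumptions to check against the model's config file.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV-cache one token occupies across all layers."""
    # Factor 2: every layer stores both a key and a value tensor.
    return 2 * layers * kv_heads * head_dim * bytes_per_value

per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
print(per_token)                 # 327680 bytes, i.e. 320 KiB per token
print(per_token * 4096 / 2**30)  # 1.25 GiB for one 4096-token sequence
```

At 256 concurrent sequences of that length, the cache alone would exceed any single GPU, which is why the settings below matter.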

PagedAttention (vLLM)

vLLM's PagedAttention manages KV-cache like virtual memory pages:

args:
  - "--model"
  - "meta-llama/Llama-3.1-70B"
  - "--gpu-memory-utilization"
  - "0.92"           # Let vLLM use up to 92% of GPU memory (weights + activations + KV-cache)
  - "--max-num-seqs"
  - "256"            # Max concurrent sequences
  - "--enable-prefix-caching"  # Reuse KV for shared prefixes

Prefix Caching (System Prompt Sharing)

If all requests share the same system prompt (common in production):

args:
  - "--enable-prefix-caching"
  # Saves 20-40% KV-cache memory when system prompts are shared
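
The idea behind prefix caching can be illustrated with a toy model: split each prompt into fixed-size token blocks, key every block by a hash of the whole prefix up to that block, and skip prefill for blocks already seen. This is a simplified sketch of the concept, not vLLM's actual implementation; the block size and class name are made up for illustration.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)

class PrefixCache:
    """Toy prefix cache: remembers which full token blocks already have KV computed."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def tokens_to_prefill(self, tokens: list[int]) -> int:
        """Return how many tokens still need prefill, then record this prompt's blocks."""
        digest = hashlib.sha256()
        cached_tokens = 0
        full_blocks_end = len(tokens) - len(tokens) % BLOCK_SIZE
        for start in range(0, full_blocks_end, BLOCK_SIZE):
            # The running digest covers the whole prefix, so a block only
            # matches when everything before it matched as well.
            digest.update(repr(tokens[start:start + BLOCK_SIZE]).encode())
            key = digest.hexdigest()
            if key in self._seen:
                cached_tokens += BLOCK_SIZE
            else:
                self._seen.add(key)
        return len(tokens) - cached_tokens

system_prompt = list(range(64))          # stand-in for a 64-token system prompt
cache = PrefixCache()
print(cache.tokens_to_prefill(system_prompt + [1000, 1001]))        # 66 (cold cache)
print(cache.tokens_to_prefill(system_prompt + [2000, 2001, 2002]))  # 3 (prefix reused)
```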

KV-Cache Compression (llm-d / Dynamo)

A newer approach is disaggregated serving: prefill and decode run on separate nodes and exchange a compressed KV-cache through a shared store.

┌──────────┐    ┌──────────────┐    ┌──────────┐
│ Prefill  │───▶│  KV Store    │◀───│  Decode  │
│  Node    │    │ (Compressed) │    │  Nodes   │
└──────────┘    └──────────────┘    └──────────┘

4. Request Batching (Double Throughput)

Continuous Batching

vLLM and TGI use continuous batching, where new requests join the batch as soon as any sequence finishes:

args:
  - "--max-num-seqs"
  - "64"              # Batch up to 64 concurrent requests
  - "--max-num-batched-tokens"
  - "32768"           # Total tokens in a batch
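
Why continuous batching helps can be seen in a small simulation. The sketch below counts decode steps for a queue of requests under two policies; it is a simplified model (one token per step per slot, no prefill cost) with made-up request lengths, not a description of how vLLM or TGI schedule internally.

```python
import heapq

def static_batch_steps(lengths: list[int], slots: int) -> int:
    """Decode steps when each batch must finish entirely before the next one starts."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batch_steps(lengths: list[int], slots: int) -> int:
    """Decode steps when a waiting request takes over a slot as soon as it frees up."""
    finish_times = [0] * min(slots, len(lengths))  # min-heap of when each slot frees up
    for length in lengths:
        start = heapq.heappop(finish_times)
        heapq.heappush(finish_times, start + length)
    return max(finish_times, default=0)

lengths = [10, 2, 2, 2]  # output tokens per request (made-up numbers)
print(static_batch_steps(lengths, slots=2))      # 12
print(continuous_batch_steps(lengths, slots=2))  # 10
```

With static batching the short request in the first batch idles for eight steps while the long one finishes; continuous batching fills that slot with the waiting requests instead.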

Batch Size vs Latency Trade-off

| Batch Size | Throughput (tok/s) | P50 Latency | P99 Latency |
|---|---|---|---|
| 1 | 45 | 22ms | 28ms |
| 8 | 320 | 25ms | 45ms |
| 32 | 1,100 | 35ms | 120ms |
| 64 | 1,800 | 55ms | 250ms |
| 128 | 2,200 | 95ms | 500ms |

Sweet spot: a batch size of 32-64 balances throughput and latency for most production workloads.
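
Throughput translates directly into cost per token. The sketch below converts the throughput figures above into dollars per million tokens, assuming a single GPU at $3.25/hour (the per-A100 price used earlier in this article); the table does not say which hardware it was measured on, so the hourly price is an assumption.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Dollar cost of generating one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000

for batch_size, tok_s in [(1, 45), (32, 1100), (64, 1800)]:
    print(batch_size, round(cost_per_million_tokens(3.25, tok_s), 2))
# 1 20.06
# 32 0.82
# 64 0.5
```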

5. Intelligent Routing (Match Model to Task)

Not every request needs your largest model. Route by complexity:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ai-router
  annotations:
    nginx.ingress.kubernetes.io/configuration-snippet: |
      # Route based on request header
      if ($http_x_model_tier = "fast") {
        proxy_pass http://llama-8b.ai-inference:8000;
      }
      if ($http_x_model_tier = "quality") {
        proxy_pass http://llama-70b.ai-inference:8000;
      }
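
The header that this Ingress inspects has to be set by the caller. A minimal sketch of that client-side decision follows; the task names and the mapping are hypothetical, chosen to mirror the request types in the table below, and unknown tasks fall back to the larger model.

```python
# Hypothetical mapping from task type to model tier.
TIER_BY_TASK = {
    "classification": "fast",
    "summarization": "fast",
    "code_generation": "quality",
    "complex_reasoning": "quality",
}

def model_tier_header(task: str) -> dict[str, str]:
    """Build the X-Model-Tier header; unknown tasks default to the larger model."""
    return {"X-Model-Tier": TIER_BY_TASK.get(task, "quality")}

print(model_tier_header("summarization"))  # {'X-Model-Tier': 'fast'}
print(model_tier_header("translation"))    # {'X-Model-Tier': 'quality'}
```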

Cost Impact of Model Routing

| Request Type | Model | Cost/1M tokens | Quality |
|---|---|---|---|
| Classification | Llama 8B | $0.05 | 95% accuracy |
| Summarization | Llama 8B | $0.05 | 90% quality |
| Code generation | Llama 70B | $0.40 | 95% quality |
| Complex reasoning | Llama 70B | $0.40 | 98% quality |

Routing 60% of traffic to the small model lowers the blended price from $0.40 to $0.19 per 1M tokens (0.6 × $0.05 + 0.4 × $0.40), roughly a 52% cost reduction with minimal quality impact.
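
The exact saving depends on the traffic split and the price gap between the two models. A small sketch using the per-token prices from the table above (the function name is illustrative):

```python
def blended_cost(small_share: float, small_price: float = 0.05,
                 large_price: float = 0.40) -> float:
    """Average price per 1M tokens when `small_share` of traffic uses the small model."""
    return small_share * small_price + (1 - small_share) * large_price

cost = blended_cost(0.60)
saving = 1 - cost / 0.40
print(f"${cost:.2f} per 1M tokens, {saving:.1%} cheaper than using only the 70B model")
# $0.19 per 1M tokens, 52.5% cheaper than using only the 70B model
```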

6. Right-Size GPU Selection

| Model Size | Minimum GPU | Optimal GPU | Monthly Cost |
|---|---|---|---|
| 7-8B | A10G (24GB) | L4 (24GB) | $350 |
| 13B | A10G (24GB) | A100 40GB | $730 |
| 34B | A100 40GB | A100 80GB | $1,825 |
| 70B | 2x A100 80GB | H100 80GB | $3,650 |
| 70B (AWQ) | A100 80GB | H100 80GB | $1,825 |

L4 GPUs are the sweet spot for 7-8B models: 50% cheaper than A10G with better inference performance.

7. Auto-Scale to Zero (Dev/Test)

KEDA scale-to-zero eliminates idle GPU costs entirely:

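apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler                # placeholder name
spec:
  scaleTargetRef:
    name: vllm                     # placeholder: the Deployment to scale
  minReplicaCount: 0
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090  # placeholder Prometheus URL
        query: sum(vllm:num_requests_running)  # gauge: aggregate directly, no rate()
        threshold: "1"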

Dev clusters: Scale to zero outside working hours = 70% savings on GPU compute.
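
The 70% figure follows from how little of the week counts as working hours. A quick check, assuming a 10-hour day and a 5-day week (both assumptions; adjust to your team's schedule):

```python
def idle_fraction(hours_per_day: float = 10, days_per_week: int = 5) -> float:
    """Share of the week a dev cluster can sit at zero replicas."""
    return 1 - (hours_per_day * days_per_week) / (24 * 7)

print(f"{idle_fraction():.0%}")  # 70%
```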

Total Savings Calculator

| Optimization | Savings | Cumulative |
|---|---|---|
| Spot GPUs | 60% | 60% |
| Quantization (AWQ 4-bit) | 50% of remaining | 80% |
| Request batching | 2x throughput = 50% fewer pods | 90% |
| Model routing | 30% of remaining | 93% |
| Scale-to-zero (dev) | 70% of dev costs | - |
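
The cumulative column compounds: each optimization applies to whatever cost is left after the previous one, so the percentages multiply rather than add. A short sketch that reproduces the column (the function name is illustrative):

```python
def cumulative_savings(step_savings: list[float]) -> list[float]:
    """Running total saved when each step cuts a fraction of the remaining cost."""
    remaining = 1.0
    totals = []
    for saving in step_savings:
        remaining *= 1 - saving
        totals.append(round(1 - remaining, 4))
    return totals

# Spot, quantization, batching, routing
print(cumulative_savings([0.60, 0.50, 0.50, 0.30]))  # [0.6, 0.8, 0.9, 0.93]
```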

Real example: A team spending $20K/month on inference reduced that to $6K/month by implementing spot + quantization + routing, a 70% reduction.
