
Your Model Does Not Matter. Your Infrastructure Does.

Teams obsess over GPT-4o vs Claude 3.5 vs Gemini while their inference stack burns money, drops requests, and delivers 10-second latencies. In production.

Luca Berton · 5 min read

The Model Debate Is a Distraction

Every week, my feed fills with the same arguments: “Claude is better at coding.” “GPT-4o is faster.” “Gemini has the longest context.” “Llama 3 is catching up.”

None of this matters in production.

I have spent the past two years helping organizations deploy AI at scale, from financial services to manufacturing to defense. The pattern is always the same: teams spend months evaluating models, pick one, deploy it, and then discover that everything around the model determines whether the system actually works.

The model is a commodity. The infrastructure is the moat.

What Actually Breaks in Production

1. GPU Utilization Is Embarrassingly Low

Most enterprise GPU clusters run at 15-30% utilization. Teams request 8× A100 nodes, deploy one model, and leave 70% of the compute idle because they lack the orchestration layer to pack workloads efficiently.

At $30,000/month per 8× A100 node, 70% idle capacity means $21,000/month in waste, per node.
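The waste figure is simply the node price multiplied by the idle fraction. A minimal sketch of that arithmetic, using the node price and utilization range quoted above; the function name `monthly_waste` is illustrative, not part of any tool:

```python
def monthly_waste(node_cost_per_month: float, utilization: float) -> float:
    """Dollars per month paid for idle GPU capacity on one node."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be a fraction between 0 and 1")
    return node_cost_per_month * (1.0 - utilization)


# Figures from the text: a $30,000/month node at 30% and at 15% utilization.
print(round(monthly_waste(30_000, 0.30)))  # 21000
print(round(monthly_waste(30_000, 0.15)))  # 25500
```

Multiplying the result by the node count gives a rough cluster-wide number to put next to any proposed scheduling work.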

The fix is not a better model. It is proper GPU scheduling:

# NVIDIA MIG partitioning for mixed workloads
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      mixed-workload:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1   # Large inference model
            "2g.20gb": 1   # Medium fine-tuning job
            "1g.10gb": 2   # Small batch jobs

With Multi-Instance GPU (MIG) and proper scheduling, I have pushed utilization to 75-85% on the same hardware, effectively tripling capacity without buying a single additional GPU.
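Before applying a MIG layout, it can help to sanity-check that the requested profiles fit on one GPU. The sketch below assumes an A100 80GB with 7 compute slices and 8 memory slices of 10 GB each; the table and the function `fits_on_gpu` are illustrative and check totals only, not the driver's placement rules:

```python
# Slice cost per MIG profile on an A100 80GB: (compute slices, memory slices).
# Assumption: 7 compute slices and 8 memory slices (10 GB each) per GPU.
PROFILE_SLICES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "7g.80gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8


def fits_on_gpu(layout: dict[str, int]) -> bool:
    """Return True if the requested profile counts stay within the slice budget.

    Only totals are checked; a layout that passes here can still be
    rejected by the driver because of placement constraints.
    """
    compute = sum(PROFILE_SLICES[p][0] * n for p, n in layout.items())
    memory = sum(PROFILE_SLICES[p][1] * n for p, n in layout.items())
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES


# The mixed-workload layout from the ConfigMap above: 7 compute, 8 memory slices.
mixed_workload = {"3g.40gb": 1, "2g.20gb": 1, "1g.10gb": 2}
print(fits_on_gpu(mixed_workload))                 # True
print(fits_on_gpu({"3g.40gb": 2, "2g.20gb": 1}))   # False (8 compute slices)
```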

2. Cold Start Kills User Experience

Loading a 70B parameter model from storage to GPU memory takes 45-90 seconds. Users will not wait. But keeping every model hot in GPU memory costs a fortune.

Production pattern: Tiered model caching

┌──────────────────────────────┐
│  Hot Tier (GPU VRAM)         │  ← Active models, sub-100ms inference
│  Always loaded, highest cost │
├──────────────────────────────┤
│  Warm Tier (Host RAM)        │  ← Recently used models, 2-5s load to GPU
│  Preloaded, medium cost      │
├──────────────────────────────┤
│  Cold Tier (NVMe/S3)         │  ← All other models, 30-90s load
│  Storage only, lowest cost   │
└──────────────────────────────┘

The infrastructure decides which tier a model lives in based on request patterns, rather than the model team manually managing GPU allocations in a spreadsheet.
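One simple way to drive tier placement from request patterns is to rank models by recent traffic and fill the hot tier first. This is a minimal sketch under assumed inputs: the `ModelStats` type, the slot counts, and every model name except `llama-70b` are hypothetical, and a real policy would also account for model size and load time:

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    requests_last_hour: int


def assign_tiers(models: list[ModelStats], hot_slots: int, warm_slots: int) -> dict[str, str]:
    """Rank models by recent traffic and fill hot, then warm, then cold."""
    ranked = sorted(models, key=lambda m: m.requests_last_hour, reverse=True)
    tiers: dict[str, str] = {}
    for rank, model in enumerate(ranked):
        if model.requests_last_hour == 0:
            tiers[model.name] = "cold"   # idle models never occupy memory
        elif rank < hot_slots:
            tiers[model.name] = "hot"
        elif rank < hot_slots + warm_slots:
            tiers[model.name] = "warm"
        else:
            tiers[model.name] = "cold"
    return tiers


stats = [
    ModelStats("llama-70b", 1200),
    ModelStats("small-chat", 300),     # hypothetical model names
    ModelStats("embedder", 40),
    ModelStats("legacy-model", 0),
]
print(assign_tiers(stats, hot_slots=1, warm_slots=2))
# {'llama-70b': 'hot', 'small-chat': 'warm', 'embedder': 'warm', 'legacy-model': 'cold'}
```

Running this on a schedule and acting only on changes keeps promotions and evictions from thrashing.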

3. Autoscaling Inference Is Not Like Autoscaling Web Servers

You cannot scale GPU inference the way you scale HTTP pods. A Kubernetes HPA watching CPU utilization is useless for GPU workloads because:

  • GPU provisioning takes 2-5 minutes (node startup + model loading)
  • Request patterns are bursty (all-or-nothing, not gradual ramp)
  • Cost per replica is 100x higher than a web pod

Production autoscaling for inference:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: llm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: inference_queue_depth
        # sum(), not avg(): the default AverageValue target already divides by replica count
        query: |
          sum(inference_pending_requests{model="llama-70b"})
        threshold: "10"
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 30   # React fast
          policies:
            - type: Pods
              value: 2
              periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 300  # Cool down slow (GPUs are expensive to restart)
          policies:
            - type: Pods
              value: 1
              periodSeconds: 120

Scale up aggressively on queue depth. Scale down conservatively, because bringing a GPU pod back can cost up to 5 minutes of node startup and model loading on expensive hardware.
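The scaling decision itself is easy to reason about in isolation. The sketch below assumes the metric is the total number of pending requests and that the autoscaler targets `threshold` pending requests per replica; it mirrors the replica bounds and step limits from the manifest above but ignores the stabilization windows, and `desired_replicas` is an illustrative name rather than a KEDA API:

```python
import math


def desired_replicas(total_pending: int, threshold: int, current: int,
                     min_replicas: int = 1, max_replicas: int = 8,
                     max_step_up: int = 2, max_step_down: int = 1) -> int:
    """Replica target from queue depth, limited per step and clamped to bounds."""
    wanted = math.ceil(total_pending / threshold)
    if wanted > current:
        wanted = min(wanted, current + max_step_up)      # at most +2 pods per period
    elif wanted < current:
        wanted = max(wanted, current - max_step_down)    # at most -1 pod per period
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(45, 10, current=1))    # 3 (wants 5, limited to +2)
print(desired_replicas(45, 10, current=3))    # 5
print(desired_replicas(0, 10, current=5))     # 4 (scale down one pod at a time)
print(desired_replicas(200, 10, current=7))   # 8 (capped at maxReplicaCount)
```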

4. Networking Is the Silent Bottleneck

Loading a 70B model in FP16 moves roughly 140 GB of weights (70B parameters at 2 bytes each) from storage into GPU memory, usually across the network. Multi-node inference for 400B+ models requires NCCL all-reduce across nodes, and if your network cannot sustain 100 Gbps+ with RDMA, inter-node communication becomes the bottleneck.
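The 140 GB figure corresponds to 70B parameters at 2 bytes each (FP16 or BF16 is an assumption here). A small sketch of how long that takes to move at different link speeds, assuming an ideal link with no protocol overhead; both function names are illustrative:

```python
def weights_size_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Model weight size in GB (10^9 bytes); 2 bytes per parameter is FP16/BF16."""
    return params_billions * bytes_per_param


def transfer_seconds(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Time to move size_gb over a link of link_gbps (gigabits per second)."""
    return size_gb * 8 / (link_gbps * efficiency)


size = weights_size_gb(70)   # 140 GB
for gbps in (10, 100, 400):
    print(f"{gbps:>3} Gbps: {transfer_seconds(size, gbps):.1f} s")
#  10 Gbps: 112.0 s
# 100 Gbps: 11.2 s
# 400 Gbps: 2.8 s
```

Real links deliver less than line rate, so passing an `efficiency` below 1.0 gives a more conservative estimate.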


The networking infrastructure determines your maximum model size, not your GPU count. Eight GPUs connected by 10 GbE cannot serve models that four GPUs connected by 400 Gbps InfiniBand handle easily.

5. Observability Is Missing Entirely

Most teams have zero visibility into their inference stack. They know the model is “running” but cannot answer basic questions:

  • What is the P95 time-to-first-token?
  • How many requests are queued right now?
  • Which users are consuming 80% of GPU time?
  • Is the model degrading on specific input types?

Production inference observability stack:

# Essential inference metrics (Prometheus)
- inference_request_duration_seconds (histogram)
- inference_time_to_first_token_seconds (histogram)
- inference_tokens_per_second (gauge)
- inference_queue_depth (gauge)
- inference_gpu_utilization_percent (gauge)
- inference_gpu_memory_used_bytes (gauge)
- inference_cache_hit_rate (gauge)        # KV cache reuse
- inference_request_total (counter)        # by model, user, status
- inference_cost_dollars_total (counter)   # chargeback
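Once these metrics are collected, two of the questions above reduce to a few lines of analysis. The sketch below works on raw samples rather than Prometheus histograms, uses the nearest-rank percentile definition, and relies on made-up sample data; the function and user names are illustrative:

```python
import math
from collections import defaultdict


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def gpu_time_share(gpu_seconds_by_request: list[tuple[str, float]]) -> dict[str, float]:
    """Fraction of total GPU seconds consumed by each user."""
    totals = defaultdict(float)
    for user, seconds in gpu_seconds_by_request:
        totals[user] += seconds
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {user: secs / grand_total for user, secs in totals.items()}


# Made-up time-to-first-token samples in seconds; one slow outlier.
ttft = [0.21, 0.25, 0.19, 0.30, 0.22, 0.24, 0.27, 0.23, 0.20, 1.80]
print(percentile(ttft, 50))   # 0.23
print(percentile(ttft, 95))   # 1.8

usage = [("team-a", 60.0), ("team-b", 20.0), ("team-a", 20.0)]
print(gpu_time_share(usage))  # {'team-a': 0.8, 'team-b': 0.2}
```

The gap between the median and the P95 in the sample data is the kind of tail latency an average would hide.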

Without this, you are flying blind. And blind flying with $30K/month hardware is not a strategy.

The Infrastructure Stack That Wins

After dozens of production deployments, this is the stack I recommend:

Compute Layer

  • Kubernetes with NVIDIA GPU Operator for GPU lifecycle management
  • MIG partitioning for mixed workloads on A100/H100
  • Node auto-provisioning via Karpenter (not Cluster Autoscaler; Karpenter makes GPU-aware decisions)

Serving Layer

  • vLLM for LLM inference (PagedAttention for efficient KV cache management)
  • NVIDIA NIM for enterprise-supported model serving with model profiles
  • NVIDIA Dynamo for disaggregated serving at extreme scale

Networking Layer

  • RDMA over Converged Ethernet (RoCE) with PFC enabled for lossless GPU-to-GPU communication
  • SR-IOV with NVIDIA Network Operator for bare-metal network performance in containers
  • GPUDirect RDMA for zero-copy data transfer between NIC and GPU

Orchestration Layer

  • NVIDIA Run:ai or Volcano for GPU-aware scheduling and quota management
  • KEDA for queue-depth-based autoscaling
  • Argo Workflows for model deployment pipelines

Observability Layer

  • Prometheus + Grafana with DCGM exporter for GPU metrics
  • OpenTelemetry for distributed tracing across the inference pipeline
  • Custom dashboards for cost attribution and chargeback

The Cost of Getting Infrastructure Wrong

Here is a real scenario I see regularly:

| Decision | Wrong Approach | Right Approach | Annual Cost Delta |
|---|---|---|---|
| GPU sizing | Request 8× H100 “just in case” | Right-size with MIG + autoscaling | -$180,000 |
| Model loading | Cold start from S3 every time | Tiered caching with warm pool | -$45,000 (+ user retention) |
| Networking | Standard 25 GbE | RoCE with PFC | Enables 2x larger models |
| Observability | None | Full stack | Catches $10K+/month waste |
| Scheduling | Default K8s scheduler | GPU-aware (Run:ai/Volcano) | +60% utilization |

Total: $225,000+ in annual savings on a single cluster, from the two rows with dollar figures alone. And that is before counting the revenue impact of 3-second response times vs. 15-second response times.

The Model Commodity Cycle

We have seen this before:

  • Databases: In 2005, everyone debated Oracle vs. SQL Server vs. MySQL. Today, it is a commodity; the value is in schema design, query optimization, and operational excellence.
  • Cloud compute: In 2010, AWS vs. Azure was a religious war. Today, compute is fungible; the value is in architecture, cost optimization, and multi-cloud strategy.
  • AI models: In 2026, GPT vs. Claude vs. Gemini is the debate. By 2028, models will be largely interchangeable; the value will be entirely in infrastructure.

The organizations that invest in infrastructure now will have compounding advantages as models commoditize. Those that chase the latest model will keep rewriting their stack every six months.

What To Do Monday Morning

  1. Measure your GPU utilization. If it is under 50%, you have an infrastructure problem worth six figures annually.
  2. Instrument your inference pipeline. Time-to-first-token, queue depth, tokens/second, cost per request.
  3. Implement tiered model caching. Stop paying for cold starts.
  4. Evaluate your network. If you are running multi-node inference on standard Ethernet, you are leaving performance on the table.
  5. Abstract the model layer. Build your application against an interface, not a specific model. When you need to swap models (and you will), it should be a config change, not a rewrite.
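For the last item, the idea is that application code depends on a small interface and a config value selects the backend. A minimal sketch with two stand-in backends so it runs without any API; `TextModel`, `EchoModel`, `UpperModel`, and `load_model` are all hypothetical names, and real backends would wrap an actual inference client:

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface the application is allowed to depend on."""
    def generate(self, prompt: str, max_tokens: int) -> str: ...


class EchoModel:
    """Stand-in backend: returns the prompt, truncated to max_tokens characters."""
    def generate(self, prompt: str, max_tokens: int) -> str:
        return prompt[:max_tokens]


class UpperModel:
    """Second stand-in backend, to show a swap changes behavior but not callers."""
    def generate(self, prompt: str, max_tokens: int) -> str:
        return prompt.upper()[:max_tokens]


BACKENDS: dict[str, type] = {"echo": EchoModel, "upper": UpperModel}


def load_model(config: dict) -> TextModel:
    """Pick the backend named in config; swapping models is a config change."""
    try:
        return BACKENDS[config["backend"]]()
    except KeyError as exc:
        raise ValueError(f"unknown backend: {config.get('backend')!r}") from exc


def summarize(model: TextModel, text: str) -> str:
    """Application code: knows the interface, not the model behind it."""
    return model.generate(f"Summarize: {text}", max_tokens=64)


print(summarize(load_model({"backend": "echo"}), "hi"))    # Summarize: hi
print(summarize(load_model({"backend": "upper"}), "hi"))   # SUMMARIZE: HI
```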

The model is the easy part. The infrastructure is where production AI lives or dies.

