The Model Debate Is a Distraction
Every week, my feed fills with the same arguments: "Claude is better at coding." "GPT-4o is faster." "Gemini has the longest context." "Llama 3 is catching up."
None of this matters in production.
I have spent the past two years helping organizations deploy AI at scale, from financial services to manufacturing to defense. The pattern is always the same: teams spend months evaluating models, pick one, deploy it, and then discover that everything around the model determines whether the system actually works.
The model is a commodity. The infrastructure is the moat.
What Actually Breaks in Production
1. GPU Utilization Is Embarrassingly Low
Most enterprise GPU clusters run at 15-30% utilization. Teams request 8× A100 nodes, deploy one model, and leave 70% of the compute idle because they lack the orchestration layer to pack workloads efficiently.
At $30,000/month per 8× A100 node, that is $21,000/month in waste per node.
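The waste figure is straightforward arithmetic. A minimal sketch, where the function name is illustrative and the $30,000 node price and 30% utilization are the numbers quoted above:

```python
def idle_cost_per_month(node_cost: float, utilization: float) -> float:
    """Monthly spend on compute that sits idle, given average utilization in [0, 1]."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return node_cost * (1.0 - utilization)


# 8x A100 node at $30,000/month running at 30% utilization
print(round(idle_cost_per_month(30_000, 0.30)))  # 21000
```

Plugging in your own node price and measured utilization gives a quick first estimate of what idle capacity costs.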
The fix is not a better model. It is proper GPU scheduling:
# NVIDIA MIG partitioning for mixed workloads
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      mixed-workload:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1  # Large inference model
            "2g.20gb": 1  # Medium fine-tuning job
            "1g.10gb": 2  # Small batch jobs

With Multi-Instance GPU (MIG) and proper scheduling, I have pushed utilization to 75-85% on the same hardware, effectively tripling capacity without buying a single additional GPU.
2. Cold Start Kills User Experience
Loading a 70B parameter model from storage to GPU memory takes 45-90 seconds. Users will not wait. But keeping every model hot in GPU memory costs a fortune.
Production pattern: Tiered model caching
┌──────────────────────────────┐
│ Hot Tier (GPU VRAM)          │ ← Active models, sub-100ms inference
│ Always loaded, highest cost  │
├──────────────────────────────┤
│ Warm Tier (Host RAM)         │ ← Recently used models, 2-5s load to GPU
│ Preloaded, medium cost       │
├──────────────────────────────┤
│ Cold Tier (NVMe/S3)          │ ← All other models, 30-90s load
│ Storage only, lowest cost    │
└──────────────────────────────┘

The infrastructure decides which tier a model lives in based on request patterns, instead of the model team manually managing GPU allocations in a spreadsheet.
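One simple way to drive tier placement from request patterns is least-recently-used demotion. The sketch below is an assumption about how such a policy could look, not a description of any particular serving system. The class name, slot counts, and model names are hypothetical, and it only tracks placement; actually moving weights between VRAM, host RAM, and storage is left out.

```python
from collections import OrderedDict


class TieredModelCache:
    """Track which tier each model lives in, using least-recently-used demotion."""

    def __init__(self, hot_slots: int, warm_slots: int) -> None:
        self.hot_slots = hot_slots
        self.warm_slots = warm_slots
        self.hot = OrderedDict()   # models resident in GPU VRAM
        self.warm = OrderedDict()  # models preloaded in host RAM

    def tier_of(self, model: str) -> str:
        if model in self.hot:
            return "hot"
        if model in self.warm:
            return "warm"
        return "cold"

    def request(self, model: str) -> str:
        """Record a request and return the tier the model was served from."""
        served_from = self.tier_of(model)
        # Promote the requested model to the hot tier as most recently used.
        self.warm.pop(model, None)
        self.hot[model] = None
        self.hot.move_to_end(model)
        # Demote the least recently used hot models to the warm tier.
        while len(self.hot) > self.hot_slots:
            demoted, _ = self.hot.popitem(last=False)
            self.warm[demoted] = None
        # Evict the least recently used warm models back to cold storage.
        while len(self.warm) > self.warm_slots:
            self.warm.popitem(last=False)
        return served_from


cache = TieredModelCache(hot_slots=1, warm_slots=1)
print(cache.request("llama-70b"))   # cold
print(cache.request("mistral-7b"))  # cold
print(cache.request("llama-70b"))   # warm
print(cache.tier_of("mistral-7b"))  # warm
```

A production policy would likely weigh request frequency and model size as well as recency, but the promote-and-demote flow stays the same.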
3. Autoscaling Inference Is Not Like Autoscaling Web Servers
You cannot scale GPU inference the way you scale HTTP pods. A Kubernetes HPA watching CPU utilization is useless for GPU workloads because:
- GPU provisioning takes 2-5 minutes (node startup + model loading)
- Request patterns are bursty (all-or-nothing, not gradual ramp)
- Cost per replica is 100x higher than a web pod
Production autoscaling for inference:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: llm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: inference_queue_depth
        query: |
          avg(inference_pending_requests{model="llama-70b"})
        threshold: "10"
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 30  # React fast
          policies:
            - type: Pods
              value: 2
              periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 300  # Cool down slow (GPUs are expensive to restart)
          policies:
            - type: Pods
              value: 1
              periodSeconds: 120

Scale up aggressively on queue depth. Scale down conservatively because restarting a GPU pod wastes 5 minutes of expensive compute.
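To make the asymmetry concrete, here is a simplified model of a single scaling decision. It is a sketch under assumptions, not KEDA's or the HPA's actual algorithm: it treats the metric as total pending requests, ignores the stabilization windows, and applies the rate limits from the config (at most 2 pods up, 1 pod down per decision). The function and parameter names are illustrative.

```python
import math


def desired_replicas(
    queue_depth: float,
    threshold: float,
    current: int,
    min_replicas: int = 1,
    max_replicas: int = 8,
    max_step_up: int = 2,
    max_step_down: int = 1,
) -> int:
    """One scaling decision: ceil(queue_depth / threshold), rate-limited, then clamped."""
    target = math.ceil(queue_depth / threshold)
    # Rate limits mirror the Pods policies: +2 per scale-up step, -1 per scale-down step.
    target = min(target, current + max_step_up)
    target = max(target, current - max_step_down)
    # Never leave the configured replica range.
    return max(min_replicas, min(max_replicas, target))


print(desired_replicas(queue_depth=55, threshold=10, current=1))  # 3
print(desired_replicas(queue_depth=0, threshold=10, current=4))   # 3
```

With 55 queued requests the raw target is 6 replicas, but a single step only adds 2. With an empty queue the deployment sheds just one replica per step.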
4. Networking Is the Silent Bottleneck
Loading a 70B model at 16-bit precision pulls roughly 140 GB of weights across the network (70 billion parameters × 2 bytes each). Multi-node inference for 400B+ models requires NCCL all-reduce across nodes, and if your network cannot sustain 100 Gbps+ with RDMA, inter-node communication becomes the bottleneck.
I wrote about this in detail:
- Enable Priority Flow Control on Mellanox ConnectX NICs
- Linux NIC Tuning for Performance
- NVIDIA NIM Multi-Node Deployment on Kubernetes
The networking infrastructure determines your maximum model size, not your GPU count. 8 GPUs connected by 10 GbE cannot serve models that 4 GPUs connected by 400 Gbps InfiniBand handle easily.
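A back-of-the-envelope calculation shows how much the link speed matters for model loading alone. This sketch assumes 16-bit weights (2 bytes per parameter) and ideal line rate with no protocol or storage overhead, so real transfers will be slower. The function names are illustrative.

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Size of the weights in decimal GB, e.g. 2 bytes per parameter for FP16/BF16."""
    return params_billion * bytes_per_param


def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Ideal transfer time; link speed is in gigabits/s, so divide by 8 for GB/s."""
    return size_gb / (link_gbps / 8.0)


size = weights_gb(70)  # 140.0 GB for a 70B model
for gbps in (10, 100, 400):
    print(f"{gbps:>3} Gbps: {transfer_seconds(size, gbps):.1f} s")
# Prints 112.0 s, 11.2 s, and 2.8 s respectively.
```

Even in the ideal case, 10 GbE needs close to two minutes to move the weights once, while a 400 Gbps fabric needs under three seconds.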
5. Observability Is Missing Entirely
Most teams have zero visibility into their inference stack. They know the model is "running" but cannot answer basic questions:
- What is the P95 time-to-first-token?
- How many requests are queued right now?
- Which users are consuming 80% of GPU time?
- Is the model degrading on specific input types?
Production inference observability stack:
# Essential inference metrics (Prometheus)
- inference_request_duration_seconds (histogram)
- inference_time_to_first_token_seconds (histogram)
- inference_tokens_per_second (gauge)
- inference_queue_depth (gauge)
- inference_gpu_utilization_percent (gauge)
- inference_gpu_memory_used_bytes (gauge)
- inference_cache_hit_rate (gauge) # KV cache reuse
- inference_request_total (counter) # by model, user, status
- inference_cost_dollars_total (counter) # chargeback

Without this, you are flying blind. And blind flying with $30K/month hardware is not a strategy.
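Two of the questions above, P95 time-to-first-token and which users consume the most GPU time, reduce to small calculations once the raw measurements exist. The sketch below works on in-memory samples to show the logic; in a Prometheus setup you would more likely query the histograms with `histogram_quantile`. The sample values, user names, and function names are made up for illustration.

```python
import math
from collections import defaultdict


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def gpu_time_share(requests: list[tuple[str, float]]) -> dict[str, float]:
    """Fraction of total GPU seconds consumed by each user."""
    totals: dict[str, float] = defaultdict(float)
    for user, gpu_seconds in requests:
        totals[user] += gpu_seconds
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {user: 0.0 for user in totals}
    return {user: seconds / grand_total for user, seconds in totals.items()}


# Hypothetical time-to-first-token samples in seconds, with one slow outlier.
ttft = [0.21, 0.25, 0.19, 0.30, 1.40, 0.22, 0.27, 0.24, 0.26, 0.23]
print(percentile(ttft, 95))  # 1.4

# Hypothetical (user, gpu_seconds) records.
usage = [("team-a", 80.0), ("team-b", 15.0), ("team-a", 5.0)]
print(gpu_time_share(usage))
```

The outlier barely moves the average but dominates the P95, which is why the tail percentile is the number to watch.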
The Infrastructure Stack That Wins
After dozens of production deployments, this is the stack I recommend:
Compute Layer
- Kubernetes with NVIDIA GPU Operator for GPU lifecycle management
- MIG partitioning for mixed workloads on A100/H100
- Node auto-provisioning via Karpenter (not Cluster Autoscaler; Karpenter makes GPU-aware decisions)
Serving Layer
- vLLM for LLM inference (PagedAttention for efficient KV cache management)
- NVIDIA NIM for enterprise-supported model serving with model profiles
- NVIDIA Dynamo for disaggregated serving at extreme scale
Networking Layer
- RDMA over Converged Ethernet (RoCE) with PFC enabled for lossless GPU-to-GPU communication
- SR-IOV with NVIDIA Network Operator for bare-metal network performance in containers
- GPUDirect RDMA for zero-copy data transfer between NIC and GPU
Orchestration Layer
- NVIDIA Run:ai or Volcano for GPU-aware scheduling and quota management
- KEDA for queue-depth-based autoscaling
- Argo Workflows for model deployment pipelines
Observability Layer
- Prometheus + Grafana with DCGM exporter for GPU metrics
- OpenTelemetry for distributed tracing across the inference pipeline
- Custom dashboards for cost attribution and chargeback
The Cost of Getting Infrastructure Wrong
Here is a real scenario I see regularly:
| Decision | Wrong Approach | Right Approach | Annual Impact |
|---|---|---|---|
| GPU sizing | Request 8× H100 "just in case" | Right-size with MIG + autoscaling | -$180,000 |
| Model loading | Cold start from S3 every time | Tiered caching with warm pool | -$45,000 (+ user retention) |
| Networking | Standard 25 GbE | RoCE with PFC | Enables 2x larger models |
| Observability | None | Full stack | Catches $10K+/month waste |
| Scheduling | Default K8s scheduler | GPU-aware (Run:ai/Volcano) | +60% utilization |
Total: $225,000+ annual savings on a single cluster. And that is before counting the revenue impact of 3-second response times vs. 15-second response times.
The Model Commodity Cycle
We have seen this before:
- Databases: In 2005, everyone debated Oracle vs. SQL Server vs. MySQL. Today, the database is a commodity; the value is in schema design, query optimization, and operational excellence.
- Cloud compute: In 2010, AWS vs. Azure was a religious war. Today, compute is fungible; the value is in architecture, cost optimization, and multi-cloud strategy.
- AI models: In 2026, GPT vs. Claude vs. Gemini is the debate. By 2028, models will be largely interchangeable; the value will be entirely in infrastructure.
The organizations that invest in infrastructure now will have compounding advantages as models commoditize. Those that chase the latest model will keep rewriting their stack every six months.
What To Do Monday Morning
- Measure your GPU utilization. If it is under 50%, you have an infrastructure problem worth six figures annually.
- Instrument your inference pipeline. Time-to-first-token, queue depth, tokens/second, cost per request.
- Implement tiered model caching. Stop paying for cold starts.
- Evaluate your network. If you are running multi-node inference on standard ethernet, you are leaving performance on the table.
- Abstract the model layer. Build your application against an interface, not a specific model. When you need to swap models (and you will), it should be a config change, not a rewrite.
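For the last item, one possible shape for that interface is sketched below. Everything here is illustrative: the `TextModel` protocol, the stand-in backends, and the `model_backend` config key are assumptions for this example, and a real backend would call an inference endpoint instead of transforming the prompt.

```python
from typing import Callable, Protocol


class TextModel(Protocol):
    """The only surface the application is allowed to depend on."""

    def generate(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in backend; a real one would call an inference endpoint."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseModel:
    """Second stand-in backend, used to show a swap."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


# Registry mapping a config value to a backend factory.
BACKENDS: dict[str, Callable[[], TextModel]] = {
    "echo": EchoModel,
    "upper": UppercaseModel,
}


def load_model(config: dict[str, str]) -> TextModel:
    """Pick the backend named in the config."""
    name = config["model_backend"]
    if name not in BACKENDS:
        raise KeyError(f"unknown model backend: {name!r}")
    return BACKENDS[name]()


def answer(model: TextModel, question: str) -> str:
    """Application code sees only the TextModel interface."""
    return model.generate(question)


print(answer(load_model({"model_backend": "echo"}), "hello"))   # echo: hello
print(answer(load_model({"model_backend": "upper"}), "hello"))  # HELLO
```

Swapping backends changes one config value; `answer` and everything built on it stay untouched.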
The model is the easy part. The infrastructure is where production AI lives or dies.
Need help building production AI infrastructure? I design GPU clusters, inference pipelines, and Kubernetes platforms that actually work at enterprise scale.
Book an AI Infrastructure Assessment →