When a model doesn't fit on a single GPU, you have two paths: multi-GPU inference (multiple GPUs in one server) or distributed inference (multiple servers connected by a network). The choice determines your latency, throughput, cost, and failure modes.
The Core Difference
| Aspect | Multi-GPU (Single Node) | Distributed (Multi-Node) |
|---|---|---|
| Interconnect | NVLink/NVSwitch (900 GB/s) | InfiniBand/RoCE (400 Gb/s) |
| Latency per hop | 1-5 μs | 5-50 μs |
| Max GPUs | 8 (DGX H100) | Hundreds |
| Failure domain | Single machine | Per-node isolation |
| Complexity | Moderate | High |
| Cost per GPU | Higher (DGX premium) | Lower (commodity nodes) |
Multi-GPU Inference: Single Node
When to Use
- Model fits in 2-8 GPUs' worth of VRAM
- Latency is critical (sub-100ms Time-To-First-Token)
- You have DGX or HGX systems with NVLink
Tensor Parallelism (TP)
Split each layer's weight matrices across GPUs. Every GPU computes part of every token:

# vLLM with tensor parallelism on 4 GPUs
# (fragment: metadata, selector, and pod labels omitted)
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Llama-3.1-70B
            - --tensor-parallel-size=4
            - --gpu-memory-utilization=0.92
            - --max-model-len=32768
          resources:
            limits:
              nvidia.com/gpu: 4

How it works:
Input tokens
     │
     ▼
┌─────────────────────────────────────┐
│ Layer N: Weight matrix split 4 ways │
│   GPU 0: W[:,0:d/4]                 │
│   GPU 1: W[:,d/4:d/2]               │
│   GPU 2: W[:,d/2:3d/4]              │
│   GPU 3: W[:,3d/4:d]                │
└─────────────────────────────────────┘
     │  AllReduce across NVLink (900 GB/s)
     ▼
Combined output → next layer

Pros:
- Lowest latency: all GPUs work on every token simultaneously
- NVLink bandwidth keeps AllReduce overhead small
- vLLM, TensorRT-LLM, and NVIDIA NIM handle it natively
Cons:
- Limited to GPUs within one node (typically 8 max)
- AllReduce runs at least once per layer, so communication volume is high
- If one GPU fails, the entire inference process stops
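To make the weight split concrete, here is a minimal pure-Python sketch of a tensor-parallel MLP block. It assumes a Megatron-style layout (first matrix split by columns, second by rows), which the article does not spell out. All helper names are illustrative, the "GPUs" are just list entries, and a plain sum stands in for the NCCL AllReduce.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise ReLU."""
    return [[max(0, v) for v in row] for row in m]


def split_columns(w, parts):
    """Split w into `parts` equal column blocks: W[:, slice] per GPU."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def split_rows(w, parts):
    """Split w into `parts` equal row blocks: W[slice, :] per GPU."""
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]


def all_reduce_sum(partials):
    """Element-wise sum of same-shaped matrices (stand-in for AllReduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def tensor_parallel_mlp(x, w1, w2, num_gpus):
    """Each 'GPU' computes relu(x @ W1_shard) @ W2_shard; the sum combines them."""
    w1_shards = split_columns(w1, num_gpus)
    w2_shards = split_rows(w2, num_gpus)
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    return all_reduce_sum(partials)


def reference_mlp(x, w1, w2):
    """The same block computed on a single device."""
    return matmul(relu(matmul(x, w1)), w2)
```

Because the activation is element-wise, each shard can apply it locally, and only one sum across GPUs is needed per block. The sharded result matches `reference_mlp` for any `num_gpus` that divides the hidden dimension evenly.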
Memory Math
For Llama 3.1 70B in FP16:
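- Parameters: 70B × 2 bytes = 140 GB
- KV cache (32K context, batch 8, 8 KV heads via grouped-query attention): up to ~86 GB when every sequence fills its context
- Activations: ~10 GB
- Total: up to ~236 GB, which fits on 4× H100 80GB (320 GB)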
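The budget above can be checked with a few lines of arithmetic. This is a rough sketch, not a sizing tool: the function names are made up for illustration, and the 0.92 utilization factor is borrowed from the `--gpu-memory-utilization` flag in the earlier vLLM example.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for model weights in GB (2 bytes per parameter for FP16)."""
    return num_params * bytes_per_param / 1e9


def headroom_gb(num_params, num_gpus, gpu_memory_gb, utilization=0.92, bytes_per_param=2):
    """Memory left for KV cache and activations after loading the weights.

    `utilization` is the assumed fraction of each GPU the server may use.
    A negative result means the weights alone do not fit.
    """
    usable = num_gpus * gpu_memory_gb * utilization
    return usable - weight_memory_gb(num_params, bytes_per_param)


if __name__ == "__main__":
    print(weight_memory_gb(70e9))                 # 140.0 GB of FP16 weights
    print(round(headroom_gb(70e9, 4, 80), 1))     # 154.4 GB left on 4x H100 80GB
    print(round(headroom_gb(70e9, 2, 80), 1))     # 7.2 GB left on 2x H100 80GB
```

The headroom figure is what KV cache and activations have to share, which is why two GPUs hold the weights but leave little room for long contexts or large batches.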
Distributed Inference: Multi-Node
When to Use
- Model exceeds single-node GPU memory (405B+ parameters)
- You need massive throughput (hundreds of concurrent users)
- Running on commodity GPU nodes without NVLink
- Building fault-tolerant inference clusters
Pipeline Parallelism (PP)
Split layers across nodes. Each node processes a subset of layers sequentially:
# NVIDIA NIM multi-node deployment
# (fragment: selector, serviceName, and image omitted)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: llama-405b-node
spec:
  replicas: 2  # 2 nodes × 8 GPUs = 16 GPUs total
  template:
    spec:
      containers:
        - name: nim
          env:
            - name: TENSOR_PARALLEL_SIZE
              value: "8"
            - name: PIPELINE_PARALLEL_SIZE
              value: "2"
            - name: NIM_LEADER_ADDRESS
              value: "llama-405b-node-0.llama-405b-node:5557"

How it works:
Node 0 (Layers 0-39)          Node 1 (Layers 40-79)
┌──────────────────┐          ┌──────────────────┐
│ GPU 0-7: TP=8    │─────────▶│ GPU 0-7: TP=8    │
│ Layers 0-39      │ Network  │ Layers 40-79     │
│ (NVLink internal)│ (IB/RoCE)│ (NVLink internal)│
└──────────────────┘          └──────────────────┘

Pros:
- Scales to arbitrary model sizes (405B, 1T+)
- Each node holds only part of the weights, which lowers per-node memory
- Can use cheaper nodes without NVLink interconnect between nodes
- Micro-batching hides network latency
Cons:
- Pipeline bubbles: GPUs idle while waiting for activations
- Network bandwidth becomes the bottleneck for large activations
- Higher TTFT due to sequential node processing
- Complex orchestration and failure handling
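The trade-off between pipeline bubbles and micro-batching can be estimated with the standard fill-and-drain formula. The sketch below assumes a simple GPipe-style schedule with equal-cost stages, which the article does not specify; the function name is illustrative.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a fill-and-drain pipeline schedule.

    With p stages and m micro-batches, the schedule spans m + p - 1 time
    slots and each stage is busy for m of them, so the idle share is
    (p - 1) / (m + p - 1).
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    # PP=2 as in the deployment above, with a growing number of micro-batches.
    for m in (1, 4, 16):
        print(m, round(pipeline_bubble_fraction(2, m), 3))
```

With two stages and a single request in flight, half of the GPU time is idle; keeping more micro-batches in flight shrinks that share, which is how micro-batching hides the bubble.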
Expert Parallelism (EP) for MoE Models
For Mixture-of-Experts models (Mixtral, DeepSeek-V3):
Node 0: Experts 0-15      Node 1: Experts 16-31
┌──────────────────┐      ┌──────────────────┐
│ Router selects   │      │ Router selects   │
│ top-K experts    │─────▶│ top-K experts    │
│ for each token   │◀─────│ for each token   │
└──────────────────┘      └──────────────────┘

Only activated experts need computation, which reduces total FLOPs while maintaining model capacity.
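The routing step in the diagram can be sketched as a softmax followed by a top-K selection. This is a simplified illustration, not any particular model's router: the function names are hypothetical, renormalizing the selected weights is an assumption, and the 16-experts-per-node mapping simply mirrors the diagram.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k, experts_per_node):
    """Pick the top_k experts for one token.

    Returns a list of (expert_id, node_id, weight) tuples, where node_id is
    the node hosting that expert and the weights are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, i // experts_per_node, probs[i] / norm) for i in chosen]


if __name__ == "__main__":
    logits = [0.0] * 32          # 32 experts, as in the diagram
    logits[3], logits[20] = 5.0, 4.0
    for expert, node, weight in route_token(logits, top_k=2, experts_per_node=16):
        print(expert, node, round(weight, 3))
```

In this example the two selected experts live on different nodes, so the token's hidden state has to cross the network in both directions. That all-to-all traffic is the cost of expert parallelism.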
The Network Bottleneck
The critical difference between multi-GPU and distributed is interconnect bandwidth:
| Interconnect | Bandwidth | Latency | Use Case |
|---|---|---|---|
| NVLink 4.0 (H100) | 900 GB/s | ~1 μs | Intra-node TP |
| NVSwitch (DGX) | 900 GB/s all-to-all | ~1 μs | Intra-node TP |
| InfiniBand HDR | 200 Gb/s (25 GB/s) | ~2 μs | Multi-node PP |
| InfiniBand NDR | 400 Gb/s (50 GB/s) | ~2 μs | Multi-node PP |
| RoCE v2 | 100-400 Gb/s | ~5 μs | Budget multi-node |
| TCP/Ethernet | 25-100 Gb/s | ~50 μs | Don't do this |
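As an illustrative upper bound, take a 70B model with TP=8 across nodes (not recommended) and assume ~8 GB of AllReduce traffic per layer, far more than a single decode token generates:
- AllReduce per layer: ~8 GB of data (assumed)
- At 400 Gb/s (50 GB/s): 160 ms per layer × 80 layers = 12.8 seconds per forward pass
- At 900 GB/s NVLink: ~9 ms per layer × 80 layers ≈ 0.7 seconds per forward pass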
Rule of thumb: Tensor parallelism should stay within NVLink. Use pipeline parallelism across nodes.
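The bandwidth arithmetic behind this rule of thumb is easy to reproduce. The sketch below only models bandwidth-limited transfer time and ignores per-message latency and overlap with compute; the function names are illustrative, and the 8 GB per layer figure is the article's assumed traffic volume.

```python
def gbit_to_gbyte(gbit_per_s):
    """Convert a link speed from Gb/s to GB/s."""
    return gbit_per_s / 8


def transfer_time_ms(data_gb, bandwidth_gb_per_s):
    """Time in milliseconds to move data_gb over a link, bandwidth-bound only."""
    return data_gb / bandwidth_gb_per_s * 1000


if __name__ == "__main__":
    layers = 80
    per_layer_gb = 8  # assumed AllReduce traffic per layer

    ib_ms = transfer_time_ms(per_layer_gb, gbit_to_gbyte(400))   # about 160 ms
    nvlink_ms = transfer_time_ms(per_layer_gb, 900)              # about 9 ms

    print(round(ib_ms * layers / 1000, 1), "s over 400 Gb/s InfiniBand")
    print(round(nvlink_ms * layers / 1000, 1), "s over 900 GB/s NVLink")
```

The roughly 18x gap between the two links carries straight through to total communication time, which is why per-layer collectives belong on NVLink.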
NVIDIA Dynamo: Disaggregated Serving
NVIDIA Dynamo introduces a third option: disaggregated prefill and decode.

┌───────────────────┐     ┌───────────────────┐
│ Prefill Nodes     │     │ Decode Nodes      │
│ (Compute-heavy)   │────▶│ (Memory-bound)    │
│ 8× H100 each      │ KV  │ 4× H100 each      │
│ Batch prefill     │cache│ Continuous decode │
└───────────────────┘     └───────────────────┘

- Prefill is compute-bound → pack onto fewer, fully-utilized GPUs
- Decode is memory-bandwidth-bound → spread across more GPUs with lighter load
- KV cache transfers between nodes via RDMA
This disaggregation can improve throughput per dollar by 2-4x compared to monolithic serving.
Decision Framework
Model Size → Architecture
| Model | VRAM Needed (FP16) | Recommended Setup |
|---|---|---|
| 7-8B | 16 GB | 1× GPU |
| 13B | 26 GB | 1× H100 or 2× A100 |
| 34B | 68 GB | 1× H100 80GB |
| 70B | 140 GB | 2× H100 (TP=2) or 4× A100 (TP=4) |
| 70B + long context | 200 GB | 4× H100 (TP=4) |
| 405B | 810 GB | 2 nodes × 8 H100 (TP=8, PP=2) |
| 1T+ MoE | 1+ TB | 4+ nodes with EP |
Latency vs Throughput Optimization
Optimize for latency (chatbots, real-time):
- Maximize tensor parallelism within one node
- Use larger GPUs (H100 80GB over A100 40GB)
- Sacrifice throughput for faster single-request response
Optimize for throughput (batch processing, offline):
- Use pipeline parallelism with micro-batching
- Fill pipeline bubbles with concurrent requests
- Continuous batching to maximize GPU utilization
- Consider autoscaling inference based on queue depth
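Queue-depth autoscaling usually reduces to a clamped ratio. The sketch below shows that rule in isolation; the function name, the per-replica target, and the replica bounds are all hypothetical values chosen for illustration, not settings from any particular autoscaler.

```python
import math


def desired_replicas(queue_depth, target_queue_per_replica, min_replicas=1, max_replicas=8):
    """Number of inference replicas needed to keep each queue near its target.

    queue_depth: requests currently waiting across the service.
    target_queue_per_replica: waiting requests one replica should absorb.
    """
    if target_queue_per_replica <= 0:
        raise ValueError("target_queue_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_queue_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for depth in (0, 45, 1000):
        print(depth, desired_replicas(depth, target_queue_per_replica=10))
```

The lower bound keeps one replica warm when the queue is empty, and the upper bound caps spend when a burst arrives.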
KV Cache: The Hidden Bottleneck
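At 128K context length, KV cache can dominate memory:
KV cache per token per layer:
2 (K+V) × num_kv_heads × head_dim × 2 bytes (FP16)
Llama 3.1 70B at 128K context (grouped-query attention, 8 KV heads):
2 × 8 × 128 × 2 × 80 layers × 128,000 tokens ≈ 42 GB per sequence
Without GQA (64 KV heads), the same sequence would need ≈ 336 GB.
Four concurrent 128K sequences (≈ 168 GB) already exceed the model weights (140 GB)!

Strategies: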
- PagedAttention (vLLM): Allocate KV cache in pages, avoid fragmentation
- KV cache quantization: FP8 or INT8 halves the cache relative to FP16; 4-bit formats reach 4x
- Prefix caching: Share KV cache for common system prompts
- Offloading: Spill cold KV cache to CPU RAM or SSD
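The KV cache formula is simple enough to evaluate directly. The sketch below assumes the published Llama 3.1 70B shape (80 layers, head dimension 128, 8 KV heads under grouped-query attention, 64 query heads) and compares it with a hypothetical variant that keeps one KV head per query head; the function name is illustrative.

```python
def kv_cache_bytes(num_kv_heads, head_dim, num_layers, num_tokens, bytes_per_value=2):
    """Bytes of KV cache for one sequence.

    The leading factor of 2 covers keys and values; bytes_per_value is 2
    for FP16 and 1 for FP8 or INT8.
    """
    return 2 * num_kv_heads * head_dim * bytes_per_value * num_layers * num_tokens


if __name__ == "__main__":
    tokens = 128_000
    gqa = kv_cache_bytes(num_kv_heads=8, head_dim=128, num_layers=80, num_tokens=tokens)
    mha = kv_cache_bytes(num_kv_heads=64, head_dim=128, num_layers=80, num_tokens=tokens)
    fp8 = kv_cache_bytes(8, 128, 80, tokens, bytes_per_value=1)

    print(round(gqa / 1e9, 1), "GB with 8 KV heads (GQA), FP16")
    print(round(mha / 1e9, 1), "GB with 64 KV heads, FP16")
    print(round(fp8 / 1e9, 1), "GB with 8 KV heads, FP8")
```

The cache scales linearly with every factor, so halving the bytes per value or sharing KV heads across query heads gives a proportional saving, and batch size multiplies the per-sequence figure.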
Multi-GPU on Kubernetes
# Multi-GPU single node (simple)
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: inference
      resources:
        limits:
          nvidia.com/gpu: 8
      env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3,4,5,6,7"
---
# Multi-node distributed (complex)
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llama-405b
spec:
  replicas: 2            # number of leader-worker groups
  leaderWorkerTemplate:
    size: 2              # pods per group (leader + workers)
    workerTemplate:
      spec:
        containers:
          - name: nim
            resources:
              limits:
                nvidia.com/gpu: 8
                rdma/rdma_shared_device: 1

For multi-tenant GPU sharing, tools like MIG (Multi-Instance GPU), time-slicing, and MPS let multiple inference workloads share the same physical GPU.
Cost Comparison
Running Llama 3.1 70B inference (1000 req/min target):
| Setup | Hardware | Monthly Cost* | Throughput | $/1M tokens |
|---|---|---|---|---|
| 4× H100 SXM (1 node) | DGX H100 | ~$35K | 1200 req/min | $0.42 |
| 8× A100 80GB (1 node) | DGX A100 | ~$25K | 800 req/min | $0.47 |
| 2× 4-GPU nodes (A100) | Commodity | ~$18K | 700 req/min | $0.39 |
| Cloud API (GPT-4o) | - | Usage-based | Unlimited | $2.50 |
*Approximate cloud instance pricing, varies by provider.
The commodity multi-node approach can be cheaper per token, but operational complexity is higher.
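A per-token cost can be derived from monthly cost and sustained throughput. The sketch below is back-of-the-envelope arithmetic only: the function name is illustrative, it assumes the hardware runs at the stated request rate around the clock for a 30-day month, and the tokens-per-request value is an assumption because the article does not give one.

```python
def cost_per_million_tokens(monthly_cost_usd, requests_per_minute, tokens_per_request):
    """Dollars per one million tokens at full, continuous utilization."""
    minutes_per_month = 60 * 24 * 30
    tokens_per_month = requests_per_minute * minutes_per_month * tokens_per_request
    return monthly_cost_usd / tokens_per_month * 1_000_000


if __name__ == "__main__":
    # First table row, with an assumed 1,600 tokens per request.
    cost = cost_per_million_tokens(35_000, 1200, tokens_per_request=1600)
    print(round(cost, 2))  # about 0.42

    # Same hardware at half the sustained load costs twice as much per token.
    print(round(cost_per_million_tokens(35_000, 600, tokens_per_request=1600), 2))
```

Utilization drives the result as much as hardware price: any hour the cluster sits idle raises the effective cost per token.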
Failure Modes
| Failure | Multi-GPU Impact | Distributed Impact |
|---|---|---|
| Single GPU fault | Entire node down | One node down, others continue |
| Memory error (ECC) | Process crash | Affected node restarts |
| Network partition | N/A | Split-brain, request failures |
| Thermal throttling | Reduced throughput | Affected node slower |
Distributed systems offer better fault isolation but introduce network partition risks. For production, implement:
- Health checks per node
- Automatic failover to backup nodes
- Request retry with timeout
- Graceful degradation (serve smaller model as fallback)
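The last three items can be combined into one request path: retry with a timeout, then fail over to the next backend, ending with a smaller fallback model. The sketch below is a minimal illustration, not a production client. The function name and the `(name, call)` backend shape are hypothetical, and each `call(prompt, timeout_s)` is assumed to raise `TimeoutError` or `ConnectionError` on failure, as an HTTP client wrapper might.

```python
def generate_with_fallback(backends, prompt, timeout_s=2.0, max_attempts=2):
    """Try each backend in order, retrying transient failures.

    backends: list of (name, call) pairs, ordered from preferred to fallback.
    Returns (backend_name, response) from the first call that succeeds.
    """
    errors = []
    for name, call in backends:
        for attempt in range(1, max_attempts + 1):
            try:
                return name, call(prompt, timeout_s)
            except (TimeoutError, ConnectionError) as exc:
                # Record the failure and move on to the next attempt or backend.
                errors.append(f"{name} attempt {attempt}: {exc!r}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


if __name__ == "__main__":
    def primary(prompt, timeout_s):
        raise TimeoutError("no response from 70B cluster")  # simulated outage

    def small_model(prompt, timeout_s):
        return f"[small model] reply to: {prompt}"

    print(generate_with_fallback([("primary", primary), ("fallback", small_model)], "hello"))
```

Only timeouts and connection errors trigger a retry here; other exceptions propagate immediately so that real bugs are not hidden behind the fallback path.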
Start with the smallest setup that meets your latency SLA. Scale horizontally only when single-node capacity is exhausted. The best GPU cluster is the one you don't over-provision.