A single 8×H100 node has 640 GB of GPU memory. DeepSeek-R1, at 671 billion parameters, needs about 1.3 TB for its weights in FP16. The math does not work on one machine.
Multinode inference splits a model across multiple physical servers connected by high-speed networking and serves requests as if the cluster were a single endpoint. NVIDIA NIM makes this surprisingly straightforward, but the infrastructure requirements are serious.
When You Need Multinode
The decision is simple: if the model does not fit in one node's total GPU memory, you need multinode.
| Model | Parameters | FP16 Memory | FP8 Memory | Single 8×H100? |
|---|---|---|---|---|
| Llama 3.1 70B | 70B | 140 GB | 70 GB | ✅ Yes |
| Mixtral 8×22B | 141B (MoE) | ~90 GB active | ~45 GB | ✅ Yes |
| Llama 3.1 405B | 405B | 810 GB | 405 GB | ❌ FP16, ✅ FP8 |
| DeepSeek-R1 | 671B (MoE) | 1.3 TB | 671 GB | ❌ No |
| Nemotron 340B | 340B | 680 GB | 340 GB | ❌ FP16, ✅ FP8 |
The 8×H100 80GB node at 640 GB total is the reference point. Anything above that needs either quantization to fit or multinode to spread.
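The sizing rule above can be sketched as a few lines of arithmetic. This is a minimal illustration that counts weights only; the function names and the `headroom` parameter (a fraction of each node reserved for KV cache and activations) are assumptions made for this example, not part of NIM.

```python
import math

NODE_MEMORY_GB = 8 * 80  # one 8×H100 80GB node, the article's reference point


def weight_memory_gb(params_billion, bytes_per_param):
    """Weight-only footprint in GB (1 billion params × N bytes = N GB)."""
    return params_billion * bytes_per_param


def nodes_needed(params_billion, bytes_per_param, headroom=0.0):
    """Minimum node count for the weights alone.

    headroom is a hypothetical knob: the fraction of each node kept free
    for KV cache and activations.
    """
    usable_gb = NODE_MEMORY_GB * (1.0 - headroom)
    return math.ceil(weight_memory_gb(params_billion, bytes_per_param) / usable_gb)


for name, params in [("Llama 3.1 70B", 70), ("Llama 3.1 405B", 405), ("DeepSeek-R1", 671)]:
    fp16 = nodes_needed(params, bytes_per_param=2)
    fp8 = nodes_needed(params, bytes_per_param=1)
    print(f"{name}: {fp16} node(s) at FP16, {fp8} node(s) at FP8")
```

Setting `headroom` to something like 0.2 gives a more conservative estimate, since a node filled to the brim with weights has no room left to serve requests.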
Parallelism Strategies
Multinode inference uses two parallelism techniques, often combined:
Tensor Parallelism (TP)
Splits individual layers across GPUs. Each GPU holds a slice of every layer and they communicate during every forward pass.
Layer N:
┌─────────┬─────────┬─────────┬─────────┐
│  GPU 0  │  GPU 1  │  GPU 2  │  GPU 3  │
│ slice 0 │ slice 1 │ slice 2 │ slice 3 │
└────┬────┴────┬────┴────┬────┴────┬────┘
     └─────────┼─────────┼─────────┘
        AllReduce (every layer)
Within a node: NVLink at 900 GB/s, fast enough for TP across 8 GPUs.
Across nodes: InfiniBand at 400 Gb/s (50 GB/s), 18× slower than NVLink. TP across nodes works but adds latency to every single layer computation.
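To make the slicing concrete, here is a toy sketch of one tensor-parallel matrix-vector product in plain Python. It assumes a column-wise split of the weight matrix, which is one common TP layout; the helper names are invented for this example, and the final sum stands in for the AllReduce that real GPUs perform over NVLink or InfiniBand.

```python
def matvec(weight, x):
    """Compute y = W x, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def tensor_parallel_matvec(weight, x, num_gpus):
    """Split W by columns (and x to match) across num_gpus simulated devices.

    Each device computes a partial output from its own slice; summing the
    partial outputs element-wise plays the role of AllReduce.
    """
    cols = len(x)
    if cols % num_gpus != 0:
        raise ValueError("input width must divide evenly across GPUs")
    shard = cols // num_gpus
    partials = []
    for gpu in range(num_gpus):
        lo, hi = gpu * shard, (gpu + 1) * shard
        weight_slice = [row[lo:hi] for row in weight]
        partials.append(matvec(weight_slice, x[lo:hi]))
    # "AllReduce": every device ends up with the element-wise sum.
    return [sum(values) for values in zip(*partials)]


W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 0, -1, 2]

print(matvec(W, x))                     # single-device result
print(tensor_parallel_matvec(W, x, 2))  # same result from two slices
```

The sharded result matches the single-device one, but only after the reduction step, which is why every layer pays the communication cost.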
Pipeline Parallelism (PP)
Splits the model by layers. Each node holds a contiguous block of layers. Communication only happens between pipeline stages.
      Node 0                  Node 1
┌──────────────────┐     ┌──────────────────┐
│   Layers 0-39    │────▶│   Layers 40-79   │
│    (8× H100)     │     │    (8× H100)     │
└──────────────────┘     └──────────────────┘
      Stage 0        ──▶       Stage 1
             Inter-node transfer
       (once per stage, not per layer)
Key advantage: PP sends activations between nodes once per stage, not once per layer. This makes it far more tolerant of inter-node bandwidth limitations.
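A rough way to see the difference is to count inter-node transfers per forward pass. The sketch below assumes two AllReduce operations per transformer layer when TP spans nodes (a common figure for attention plus MLP, used here as an assumption) and an even split of layers across pipeline stages; both helpers are illustrative, not NIM internals.

```python
def split_layers(num_layers, num_stages):
    """Assign contiguous layer ranges to pipeline stages as evenly as possible."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def inter_node_transfers_tp(num_layers, allreduces_per_layer=2):
    """If TP spans nodes, every AllReduce crosses the inter-node link."""
    return num_layers * allreduces_per_layer


def inter_node_transfers_pp(num_stages):
    """With PP, activations cross a node boundary once between adjacent stages."""
    return num_stages - 1


stages = split_layers(num_layers=80, num_stages=2)
print([(s.start, s.stop - 1) for s in stages])  # layer ranges per stage
print("TP across nodes:", inter_node_transfers_tp(80), "transfers per forward pass")
print("PP across nodes:", inter_node_transfers_pp(2), "transfer per forward pass")
```

For the 80-layer, two-stage layout in the diagram, that is 160 latency-sensitive collectives versus a single activation handoff.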
The Practical Combination
Most multinode deployments use TP within nodes + PP across nodes:
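2-Node DeepSeek-R1 Deployment:
───────────────────────────
Node 0: TP=8 across 8× H100 (first half of the layers)
Node 1: TP=8 across 8× H100 (second half of the layers)
PP=2 across the two nodes
Total: TP=8, PP=2 → 16 GPUs
Two 8×H100 nodes provide 1,280 GB in total, so this layout holds DeepSeek-R1 at FP8 (671 GB) with room left for the KV cache; FP16 weights at about 1.3 TB would need a third node.
NVIDIA NIM Multinode Deployment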
Prerequisites
- NVIDIA NGC API key: access to NIM container images
- Multiple GPU nodes: each with 8× H100 or H200
- High-speed interconnect: InfiniBand HDR/NDR (200-400 Gb/s) between nodes
- Shared storage: for model weights (NFS, Lustre, or NVIDIA GPUDirect Storage)
- NCCL: NVIDIA Collective Communications Library for GPU-to-GPU data transfer
Docker Deployment (2 Nodes)
Node 0 (Leader):
docker run -d --name nim-node0 \
--gpus all \
--network host \
--ipc host \
--ulimit memlock=-1 \
-v /models:/models \
-e NGC_API_KEY="your-ngc-key" \
-e NIM_MODEL_NAME="deepseek-ai/deepseek-r1" \
-e NIM_TENSOR_PARALLEL_SIZE=8 \
-e NIM_PIPELINE_PARALLEL_SIZE=2 \
-e NIM_NODE_RANK=0 \
-e NIM_NUM_NODES=2 \
-e NIM_MASTER_ADDR="10.0.0.1" \
-e NIM_MASTER_PORT=29500 \
-e NCCL_IB_DISABLE=0 \
-e NCCL_SOCKET_IFNAME=ib0 \
-p 8000:8000 \
nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
Node 1 (Worker):
docker run -d --name nim-node1 \
--gpus all \
--network host \
--ipc host \
--ulimit memlock=-1 \
-v /models:/models \
-e NGC_API_KEY="your-ngc-key" \
-e NIM_MODEL_NAME="deepseek-ai/deepseek-r1" \
-e NIM_TENSOR_PARALLEL_SIZE=8 \
-e NIM_PIPELINE_PARALLEL_SIZE=2 \
-e NIM_NODE_RANK=1 \
-e NIM_NUM_NODES=2 \
-e NIM_MASTER_ADDR="10.0.0.1" \
-e NIM_MASTER_PORT=29500 \
-e NCCL_IB_DISABLE=0 \
-e NCCL_SOCKET_IFNAME=ib0 \
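nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
The API endpoint is only exposed on Node 0. Node 1 participates in computation but does not serve HTTP requests. Because both containers run with --network host, port 8000 is bound directly on Node 0's host interface, and Docker ignores the -p 8000:8000 mapping in that mode.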
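Before launching containers, it can help to sanity-check that the parallelism settings match the hardware. The sketch below validates that TP × PP equals the total GPU count and maps each global rank to a node and pipeline stage. The contiguous rank layout is an assumption chosen for illustration, and `plan_placement` is a hypothetical helper; NIM's actual rank assignment may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    global_rank: int
    node_rank: int   # which server (compare NIM_NODE_RANK in the commands above)
    local_gpu: int   # GPU index within that server
    pp_stage: int    # pipeline stage
    tp_rank: int     # position inside the tensor-parallel group


def plan_placement(tp, pp, num_nodes, gpus_per_node=8):
    """Map global ranks to nodes and stages, assuming contiguous TP groups."""
    world_size = tp * pp
    total_gpus = num_nodes * gpus_per_node
    if world_size != total_gpus:
        raise ValueError(
            f"TP x PP = {world_size}, but the cluster has {total_gpus} GPUs"
        )
    return [
        Placement(r, r // gpus_per_node, r % gpus_per_node, r // tp, r % tp)
        for r in range(world_size)
    ]


layout = plan_placement(tp=8, pp=2, num_nodes=2)
for p in layout:
    if p.local_gpu == 0:
        print(f"node {p.node_rank}: pipeline stage {p.pp_stage}, ranks start at {p.global_rank}")
```

With TP=8 and eight GPUs per node, each tensor-parallel group stays inside one server, which is the point of the TP-within, PP-across layout.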
Kubernetes Deployment
For production, deploy multinode NIM on Kubernetes with LeaderWorkerSet:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseek-r1-nim
  namespace: inference
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2 # 2 nodes total
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          app: deepseek-r1
          role: leader
      spec:
        containers:
          - name: nim
            image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
            env:
              - name: NGC_API_KEY
                valueFrom:
                  secretKeyRef:
                    name: ngc-secret
                    key: api-key
              - name: NIM_MODEL_NAME
                value: "deepseek-ai/deepseek-r1"
              - name: NIM_TENSOR_PARALLEL_SIZE
                value: "8"
              - name: NIM_PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: NIM_NUM_NODES
                value: "2"
              - name: NIM_NODE_RANK
                value: "0"
              - name: NIM_MASTER_ADDR
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
              - name: NIM_MASTER_PORT
                value: "29500"
            resources:
              limits:
                nvidia.com/gpu: 8
            ports:
              - containerPort: 8000
                name: http
            volumeMounts:
              - name: model-cache
                mountPath: /models
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: nim-model-cache
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
        nodeSelector:
          nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
    workerTemplate:
      metadata:
        labels:
          app: deepseek-r1
          role: worker
      spec:
        containers:
          - name: nim
            image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
            env:
              - name: NGC_API_KEY
                valueFrom:
                  secretKeyRef:
                    name: ngc-secret
                    key: api-key
              - name: NIM_MODEL_NAME
                value: "deepseek-ai/deepseek-r1"
              - name: NIM_TENSOR_PARALLEL_SIZE
                value: "8"
              - name: NIM_PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: NIM_NUM_NODES
                value: "2"
              - name: NIM_NODE_RANK
                value: "1"
              - name: NIM_MASTER_ADDR
                # Leader pod IP injected by LeaderWorkerSet.
                # LeaderWorkerSet also injects an LWS_LEADER_ADDRESS environment
                # variable into each container; consider using it if this
                # annotation is not populated in your LWS version.
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/leader-address']
              - name: NIM_MASTER_PORT
                value: "29500"
            resources:
              limits:
                nvidia.com/gpu: 8
            volumeMounts:
              - name: model-cache
                mountPath: /models
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: nim-model-cache
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
        nodeSelector:
          nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"
        # Same toleration as the leader, so the worker can also be scheduled
        # onto tainted GPU nodes.
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-nim
  namespace: inference
spec:
  selector:
    app: deepseek-r1
    role: leader
  ports:
    - port: 8000
      targetPort: 8000
      name: http
Testing the Endpoint
Once both nodes are running and synchronized:
curl -X POST http://deepseek-r1-nim:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/deepseek-r1",
"messages": [
{"role": "user", "content": "Explain tensor parallelism in 3 sentences."}
],
"max_tokens": 256,
"temperature": 0.7
}'
The response comes from the leader node, but computation happened across both nodes transparently.
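The same request can be issued from application code. The following is a minimal standard-library sketch; it assumes the endpoint returns an OpenAI-style chat completion body (`choices[0].message.content`), and the helper names `build_chat_request` and `chat` are invented for this example.

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, model="deepseek-ai/deepseek-r1",
                       max_tokens=256, temperature=0.7):
    """Build the POST request for the chat completions route used above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, prompt, timeout=120):
    """Send the request and return the assistant text.

    Assumes an OpenAI-compatible response schema.
    """
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running deployment):
# print(chat("http://deepseek-r1-nim:8000", "Explain tensor parallelism in 3 sentences."))
```

A generous timeout is worth keeping, since time to first token on a multinode deployment is noticeably higher than on a single node.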
Networking: The Bottleneck
Multinode inference performance lives and dies by inter-node bandwidth:
| Interconnect | Bandwidth | Latency | Multinode Viable? |
|---|---|---|---|
| 1 GbE | 125 MB/s | ~500 μs | ❌ Unusable |
| 25 GbE | 3.1 GB/s | ~10 μs | ❌ Too slow |
| 100 GbE (RoCE) | 12.5 GB/s | ~2 μs | ⚠️ Small models only |
| InfiniBand HDR | 25 GB/s | ~1 μs | ✅ Production viable |
| InfiniBand NDR | 50 GB/s | ~1 μs | ✅ Recommended |
| InfiniBand XDR | 100 GB/s | under 1 μs | ✅ Optimal |
Minimum for production multinode: InfiniBand HDR (200 Gb/s). Anything less and the inter-node communication overhead dominates inference latency.
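The table mixes link rates in gigabits with bandwidth in gigabytes, so a small conversion helper is useful when estimating transfer times. This sketch uses a simple latency-plus-size-over-bandwidth model; the 8192-element hidden size is an illustrative assumption, not a property of any specific model, and real transfers add protocol and software overhead.

```python
def gbit_to_gbyte(link_gbps):
    """Convert a link rate in Gb/s to GB/s (8 bits per byte)."""
    return link_gbps / 8


def transfer_time_us(num_bytes, link_gbps, latency_us):
    """Estimate one transfer as fixed latency plus size divided by bandwidth."""
    bytes_per_second = gbit_to_gbyte(link_gbps) * 1e9
    return latency_us + num_bytes / bytes_per_second * 1e6


# One token's activation vector in FP16, for an assumed hidden size of 8192.
activation_bytes = 8192 * 2

for name, gbps, latency in [("25 GbE", 25, 10.0),
                            ("InfiniBand HDR", 200, 1.0),
                            ("InfiniBand NDR", 400, 1.0)]:
    estimate = transfer_time_us(activation_bytes, gbps, latency)
    print(f"{name}: {gbit_to_gbyte(gbps):.1f} GB/s, about {estimate:.2f} us per transfer")
```

For small per-token transfers the fixed latency dominates, which is why the latency column matters as much as raw bandwidth during decoding.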
NCCL Configuration
# Essential NCCL environment variables for multinode
export NCCL_IB_DISABLE=0 # Keep the InfiniBand/RoCE transport enabled
export NCCL_SOCKET_IFNAME=ib0 # Interface for NCCL bootstrap and socket traffic
export NCCL_IB_HCA=mlx5 # Use HCAs whose names start with mlx5 (Mellanox)
export NCCL_IB_GID_INDEX=3 # GID index, only relevant on RoCE fabrics
export NCCL_NET_GDR_LEVEL=5 # GPUDirect RDMA level
export NCCL_DEBUG=INFO # Debug logging (remove in prod)
Performance Characteristics
Multinode adds latency. Here are rough, indicative numbers for DeepSeek-R1:
| Configuration | Time to First Token | Tokens/Second | Cost/Hour |
|---|---|---|---|
| 2× 8×H100 (PP=2) | ~800 ms | ~30-40 t/s | ~$60 |
| 4× 8×H100 (PP=4) | ~500 ms | ~60-80 t/s | ~$120 |
| 2× 8×H100 FP8 | ~600 ms | ~50-60 t/s | ~$60 |
Compared to a single-node 70B model at 100+ tokens/second, multinode 671B models are significantly slower and more expensive. The trade-off is capability: DeepSeek-R1's reasoning quality justifies the infrastructure for complex tasks.
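One way to compare these rows is cost per million generated tokens. The sketch below treats the Tokens/Second column as sustained aggregate throughput, which is an assumption; if the figure is a single-stream rate, batching several requests would lower the real cost. The function name is illustrative.

```python
def cost_per_million_tokens(cost_per_hour, tokens_per_second):
    """Dollars per one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000


# Midpoints of the ranges in the table above.
for label, cost, tps in [("2 nodes (PP=2)", 60, 35),
                         ("4 nodes (PP=4)", 120, 70),
                         ("2 nodes FP8", 60, 55)]:
    print(f"{label}: ${cost_per_million_tokens(cost, tps):,.0f} per million tokens")
```

Under this model, doubling the node count at double the throughput leaves the per-token cost unchanged, while FP8 on the same hardware reduces it.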
Alternatives to Multinode
Before committing to multinode infrastructure, consider:
Quantization
FP8 quantization halves memory requirements. Llama 405B at FP8 (405 GB) fits on a single 8×H100 node. Quality impact is measurable but often acceptable.
# NIM with FP8 quantization β single node
docker run -d --gpus all \
-e NIM_MODEL_NAME="meta/llama-3.1-405b-instruct" \
-e NIM_QUANTIZATION="fp8" \
-e NIM_TENSOR_PARALLEL_SIZE=8 \
nvcr.io/nim/meta/llama-3.1-405b-instruct:latest
Mixture-of-Experts Sparsity
DeepSeek-R1 is a 671B MoE model but only activates ~37B parameters per token. Clever memory management can page inactive experts to CPU/NVMe, reducing GPU memory needs. This is an active research area (offloading frameworks like DeepSpeed-Inference).
Smaller Distilled Models
DeepSeek-R1 has distilled variants (7B, 14B, 32B, 70B) that capture much of the reasoning capability on a single GPU or, for the 70B, a single node. For many use cases, the 70B distill on one node outperforms the full 671B on cost-adjusted metrics.
NIM Models with Multinode Support
As of mid-2026, the NGC catalog lists multinode support for these NIM containers:
- deepseek-ai/deepseek-r1 (671B MoE): the flagship multinode model
- meta/llama-3.1-405b-instruct: largest dense open model
- nvidia/nemotron-340b: NVIDIA's own large model
- Additional models are being added as NIM expands
The "Multinode Support: Yes" badge on NGC indicates that the container includes the orchestration logic for multi-node NCCL communication, leader election, and distributed weight loading.
Cost Analysis
Running multinode inference is expensive. Here is a monthly cost comparison:
| Setup | GPU Cost/Month | Model | Tokens/Second |
|---|---|---|---|
| 1× 8×H100 (70B) | ~$22,000 | Llama 70B | 100+ |
| 1× 8×H100 (405B FP8) | ~$22,000 | Llama 405B | 40-50 |
| 2× 8×H100 (671B) | ~$44,000 | DeepSeek-R1 | 30-40 |
| API (DeepSeek) | Pay per token | DeepSeek-R1 | Variable |
When multinode self-hosting makes sense:
- High volume (thousands of requests/hour amortizes fixed cost)
- Data sovereignty requirements (cannot use API)
- Latency requirements (co-located inference)
- Custom fine-tuned large models
When API is better:
- Low-medium volume
- Burst traffic patterns
- No data residency constraints
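The volume threshold behind these two lists can be estimated with a break-even calculation. In the sketch below, the API price of $2 per million tokens is a purely hypothetical placeholder, so substitute the provider's current rate; the function name is also invented for this example.

```python
def break_even_tokens_per_month(monthly_gpu_cost, api_price_per_million):
    """Monthly token volume at which self-hosting matches the API bill."""
    return monthly_gpu_cost / api_price_per_million * 1_000_000


HYPOTHETICAL_API_PRICE = 2.0  # dollars per million tokens, placeholder only

tokens = break_even_tokens_per_month(44_000, HYPOTHETICAL_API_PRICE)
print(f"Break-even at about {tokens / 1e9:.0f} billion tokens per month")
```

Below that volume the API is cheaper on raw cost alone, so self-hosting then has to be justified by sovereignty, latency, or custom-model requirements.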
For a detailed cost model, try the GPU Cost Calculator.
About the Author
I am Luca Berton, AI and Cloud Advisor. I design GPU inference infrastructure for enterprises running large language models at scale. Book a consultation to discuss your multinode deployment.