When a model does not fit on a single node, you need multi-node deployment. NIM LLM uses Ray for cluster formation and vLLM for distributed execution across nodes. The Helm chart wraps all of this into a LeaderWorkerSet that you deploy with one command.
This guide covers the official NIM multi-node deployment on Kubernetes: the Helm-native approach using Ray, as distinct from the Run:ai orchestrated approach or bare Docker deployment.
Architecture
NIM multi-node uses a leader/worker model:
┌────────────────────────────┐
│ Leader Node                │
│ ┌──────────┐ ┌───────────┐ │
│ │ Ray Head │ │ vLLM      │ │
│ │ :6379    │ │ Server    │ │
│ └──────────┘ │ :8000     │ │
│              └───────────┘ │
│ 8× GPU (TP=8, stage 0)     │
└─────────┬──────────────────┘
          │ Ray cluster + NCCL
┌─────────▼──────────────────┐
│ Worker Node 1              │
│ ┌──────────┐               │
│ │Ray Worker│               │
│ └──────────┘               │
│ 8× GPU (TP=8, stage 1)     │
└────────────────────────────┘
The same container image runs on both leader and worker. The role is determined by the command injected at deployment time via the LeaderWorkerSet controller.
The leader starts a Ray head node, downloads the model, and launches vLLM with distributed execution. Workers join the Ray cluster and provide GPU resources. vLLM then spawns execution actors on the worker nodes for model parallelism.
Parallelism Strategies
Two strategies split model weights across nodes:
| Strategy | What It Splits | Communication | Best For |
|---|---|---|---|
| Pipeline Parallelism (PP) | Model stages (layers) across nodes | Lower bandwidth, higher latency tolerance | Cross-node (default) |
| Tensor Parallelism (TP) | Individual layers across GPUs | High bandwidth, low latency required | Within-node (default) |
The standard configuration: TP = GPUs per node, PP = number of nodes.
For Llama 3.1 405B on two 8-GPU nodes:
- TP = 8 (split layers across 8 GPUs per node)
- PP = 2 (split model into 2 pipeline stages, one per node)
- Total: 16 GPUs
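The sizing rule above is plain arithmetic, and a small sketch can make it concrete. The Python below is illustrative only: `Layout` and `plan_layout` are hypothetical names, not part of NIM or the Helm chart. It assumes the standard configuration described here (TP = GPUs per node, PP = number of nodes) and derives the worker count from the rule that total nodes = workers + 1 leader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    tensor_parallel_size: int    # TP: GPUs that share each layer
    pipeline_parallel_size: int  # PP: number of pipeline stages
    total_gpus: int              # TP * PP
    workers: int                 # worker pods (all nodes except the leader)


def plan_layout(nodes: int, gpus_per_node: int) -> Layout:
    """Standard layout: TP = GPUs per node, PP = number of nodes."""
    if nodes < 1 or gpus_per_node < 1:
        raise ValueError("nodes and gpus_per_node must be positive")
    tp = gpus_per_node
    pp = nodes
    return Layout(
        tensor_parallel_size=tp,
        pipeline_parallel_size=pp,
        total_gpus=tp * pp,
        workers=nodes - 1,
    )


# Llama 3.1 405B on two 8-GPU nodes
layout = plan_layout(nodes=2, gpus_per_node=8)
print(layout)
```

The resulting fields map onto the chart values used later in this guide: `tensor_parallel_size` to multiNode.tensorParallelSize, `pipeline_parallel_size` to multiNode.pipelineParallelSize, and `workers` to multiNode.workers.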
Multi-Node Tensor Parallelism
In some cases, you can use TP across nodes instead of PP:
- TP = 16, PP = 1
- One tensor-parallel group spans 2 nodes
In practice this requires InfiniBand (or RoCE) with RDMA: every layer triggers cross-node GPU communication, which demands bandwidth and latency that standard Ethernet cannot deliver without severe performance degradation.
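A quick way to tell whether a given configuration falls into this case is to compare the TP size with the GPUs available on one node. The sketch below is a hypothetical helper (the function names are not part of NIM or vLLM) that assumes each node contributes the same number of GPUs, as the prerequisites in this guide require.

```python
import math


def nodes_per_tp_group(tensor_parallel_size: int, gpus_per_node: int) -> int:
    """Number of nodes a single tensor-parallel group has to span."""
    if tensor_parallel_size < 1 or gpus_per_node < 1:
        raise ValueError("sizes must be positive")
    return math.ceil(tensor_parallel_size / gpus_per_node)


def tp_crosses_nodes(tensor_parallel_size: int, gpus_per_node: int) -> bool:
    """True when tensor-parallel traffic leaves the node (RDMA fabric advised)."""
    return nodes_per_tp_group(tensor_parallel_size, gpus_per_node) > 1


for tp in (8, 16):
    spans = nodes_per_tp_group(tp, gpus_per_node=8)
    print(f"TP={tp}: spans {spans} node(s), cross-node={tp_crosses_nodes(tp, 8)}")
```

With 8 GPUs per node, TP=8 stays inside one node, while TP=16 spans two nodes and puts tensor-parallel traffic on the inter-node network.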
Prerequisites
- Kubernetes cluster with GPU nodes (same GPU type and count per node)
- LeaderWorkerSet CRD installed:
    kubectl apply --server-side -f \
      https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml
- Shared storage: PVC with ReadWriteMany access mode (recommended)
- High-speed networking: InfiniBand or RoCE for optimal NCCL performance
- NGC API key as Kubernetes secret:
    kubectl create secret generic ngc-api \
      --from-literal=NGC_API_KEY=<your-key>
Deploy with Helm
Minimal Configuration
Create values.yaml:
image:
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: "latest"
model:
  ngcAPISecret: ngc-api
  jsonLogging: false # REQUIRED: JSON logging breaks Ray workers
multiNode:
  enabled: true
  workers: 1 # Worker pods (total nodes = workers + 1 leader)
  tensorParallelSize: 8 # GPUs per tensor-parallel group
  pipelineParallelSize: 2 # Pipeline stages (typically = number of nodes)
resources:
  limits:
    nvidia.com/gpu: 8
  requests:
    nvidia.com/gpu: 8
persistence:
  enabled: true
  size: 500Gi # Must exceed the model size (~400 GB for Llama 405B, per Model Storage)
  accessMode: ReadWriteMany # REQUIRED for multi-node shared storage
  storageClass: <your-rwx-storage-class>
imagePullSecrets:
  - name: nvcr-imagepull
Install:
helm install nim-llm nim-llm/ -f values.yaml
Critical: Set model.jsonLogging: false. The NIM JSON log formatter is not available in vLLM Ray worker processes, so leaving JSON logging enabled causes worker initialization to fail.
Profile Selection
Two approaches are available; pick one:
Option A: TP/PP Values (Recommended)
Set tensorParallelSize and pipelineParallelSize. The Helm chart injects NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE environment variables. The correct model profile is selected automatically.
helm install nim-llm nim-llm/ \
  --set multiNode.enabled=true \
  --set multiNode.workers=1 \
  --set multiNode.tensorParallelSize=8 \
  --set multiNode.pipelineParallelSize=2
Option B: Explicit Profile
Set model.profile to a profile name or hash. The chart injects NIM_MODEL_PROFILE:
helm install nim-llm nim-llm/ \
  --set multiNode.enabled=true \
  --set multiNode.workers=1 \
  --set model.profile=vllm-fp16-tp8-pp2
You must use one of these approaches. The chart will fail to render if neither TP/PP nor model.profile is set.
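The selection rule can be written down as a small pre-flight check to run before calling helm. This is a sketch, not the chart's actual template logic: `select_profile_env` is a hypothetical helper, and rejecting the case where both options are set is an assumption made here, since this guide only specifies what happens when neither is set.

```python
def select_profile_env(tp: int = 0, pp: int = 0, profile: str = "") -> dict[str, str]:
    """Return the environment variables implied by the chosen option."""
    has_tp_pp = tp > 0 and pp > 0
    if profile and has_tp_pp:
        # Assumption: treat "both set" as ambiguous rather than guess a precedence.
        raise ValueError("set either TP/PP or model.profile, not both")
    if profile:
        # Option B: explicit profile name or hash.
        return {"NIM_MODEL_PROFILE": profile}
    if has_tp_pp:
        # Option A: let NIM pick the profile from the TP/PP sizes.
        return {
            "NIM_TENSOR_PARALLEL_SIZE": str(tp),
            "NIM_PIPELINE_PARALLEL_SIZE": str(pp),
        }
    # Mirrors the documented failure when neither option is provided.
    raise ValueError("set TP/PP or model.profile")


print(select_profile_env(tp=8, pp=2))
print(select_profile_env(profile="vllm-fp16-tp8-pp2"))
```

The zero defaults match the chart defaults for tensorParallelSize and pipelineParallelSize, so an untouched values file fails the check just as the chart fails to render.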
Helm Values Reference
| Parameter | Description | Default |
|---|---|---|
| multiNode.enabled | Enable multi-node mode | false |
| multiNode.workers | Number of worker pods per replica | 1 |
| multiNode.tensorParallelSize | GPUs per TP group (sets NIM_TENSOR_PARALLEL_SIZE) | 0 |
| multiNode.pipelineParallelSize | Pipeline stages (sets NIM_PIPELINE_PARALLEL_SIZE) | 0 |
| multiNode.ray.port | Ray head node communication port | 6379 |
| model.profile | Explicit profile name or hash (sets NIM_MODEL_PROFILE) | "" |
| model.ngcAPISecret | K8s secret name with NGC_API_KEY | required |
| model.hfTokenSecret | K8s secret name with HF_TOKEN (for HuggingFace models) | "" |
| model.jsonLogging | Enable JSON structured logging | true (set to false for multi-node) |
Model Storage
Shared Storage (Recommended)
With a ReadWriteMany PVC or NFS:
- Leader downloads model once to shared volume
- Workers detect cached blobs and materialize a local workspace at /tmp/nim_workspace, with no network download
Two backends:
PVC with existingClaim: pre-create an RWX PVC. It survives helm uninstall, so the model cache persists across deployments:
persistence:
  enabled: true
  existingClaim: nim-model-cache
  accessMode: ReadWriteMany
NFS direct mount:
nfs:
  enabled: true
  server: nfs-server.internal
  path: /exports/nim-models
Independent Downloads (Not Recommended)
Without shared storage, each node downloads the model independently to an emptyDir volume. This works but wastes time and bandwidth: Llama 405B is ~400 GB, and it is downloaded N+1 times (once per node, for N workers plus the leader) instead of once.
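To put a number on that overhead, the sketch below compares the two storage modes. `total_download_gb` is a hypothetical helper, and the ~400 GB figure is the approximate model size quoted in this section, not a measured value.

```python
def total_download_gb(model_size_gb: float, workers: int, shared_storage: bool) -> float:
    """Total data pulled over the network for one multi-node replica."""
    if workers < 0 or model_size_gb < 0:
        raise ValueError("workers and model_size_gb must be non-negative")
    nodes = workers + 1  # N workers plus the leader
    # Shared storage: the leader downloads once. Otherwise every node downloads.
    return model_size_gb if shared_storage else model_size_gb * nodes


for shared in (True, False):
    total = total_download_gb(model_size_gb=400, workers=1, shared_storage=shared)
    print(f"shared_storage={shared}: {total:.0f} GB downloaded")
```

For the 4-node layout (3 workers), the same ~400 GB model would be pulled four times without shared storage.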
Model-Free Multi-Node
Deploy any HuggingFace model across nodes using model-free NIM:
image:
  repository: nvcr.io/nim/nim-llm
  tag: "latest"
model:
  modelPath: "hf://meta-llama/Llama-3.1-405B-Instruct"
  ngcAPISecret: ngc-api
  hfTokenSecret: hf-token
  jsonLogging: false
multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 8
  pipelineParallelSize: 2
persistence:
  enabled: true
  size: 1000Gi # BF16 weights for 405B parameters are roughly 810 GB
  accessMode: ReadWriteMany
  storageClass: local-nfs
This uses the generic NIM container (nim-llm) instead of a model-specific container, and generates profiles at runtime.
NIM Operator Deployment
The NIM Operator provides a fully automated alternative:
- Manages the NIMService custom resource
- Automatically generates leader/worker pod specs
- Injects Ray startup commands
- Handles PVC setup through NIMCache
- Manages probes, networking, and secrets
If you are running NIM at scale across multiple models, the Operator reduces operational overhead compared to managing individual Helm releases.
Common Configurations
Llama 3.1 405B: 2 Nodes (TP=8, PP=2)
The standard configuration. Two 8-GPU nodes, 16 GPUs total:
multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 8
  pipelineParallelSize: 2
DeepSeek-R1 671B: 2 Nodes (TP=8, PP=2)
Same parallelism but larger model, needs more storage:
image:
  repository: nvcr.io/nim/deepseek-ai/deepseek-r1
  tag: "latest"
persistence:
  size: 800Gi # DeepSeek-R1 is ~650 GB; the PVC must be larger than the model
Multi-Node TP: 2 Nodes (TP=16, PP=1)
Single TP group spanning both nodes. Requires InfiniBand:
multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 16
  pipelineParallelSize: 1
4-Node Deployment (TP=8, PP=4)
Four nodes for extremely large models or higher throughput:
multiNode:
  enabled: true
  workers: 3 # 3 workers + 1 leader = 4 nodes
  tensorParallelSize: 8
  pipelineParallelSize: 4
resources:
  limits:
    nvidia.com/gpu: 8
Troubleshooting
Workers Cannot Join Ray Cluster
# Check worker logs for Ray connection errors
kubectl logs <worker-pod-name>
# Verify the Ray head port (6379) on the leader is reachable from a worker
kubectl exec <worker-pod> -- nc -zv <leader-pod-ip> 6379
# Check LWS_LEADER_ADDRESS injection
kubectl get pods -o yaml | grep LWS_LEADER_ADDRESS
Common causes: network policies blocking port 6379, LWS controller not running, DNS resolution issues.
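If nc is not available in the container, the same reachability check can be done with Python's standard library. This is a minimal sketch that assumes a Python interpreter exists inside the pod. `port_is_open` is a hypothetical helper name, and the host you pass in would be the leader address for your deployment.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Run from a worker pod, a call such as port_is_open("<leader-pod-ip>", 6379) returning False points to a network policy, DNS, or leader startup problem rather than a vLLM issue.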
Model Download Failures
# Verify NGC secret (prints the decoded key)
kubectl get secret ngc-api -o jsonpath='{.data.NGC_API_KEY}' | base64 -d
# Check PVC is bound with correct access mode
kubectl get pvc
NCCL Communication Errors
# Enable NCCL debug logging
env:
  - name: NCCL_DEBUG
    value: "INFO"
  - name: NCCL_IB_DISABLE
    value: "0" # Ensure InfiniBand is enabled
For multi-node TP, InfiniBand/RoCE is effectively required. Ethernet will work but with severe performance degradation.
JSON Logging Crash
If workers crash immediately at startup, check if jsonLogging is enabled:
model:
  jsonLogging: false # MUST be false for multi-node
The NIM JSON log formatter does not exist in vLLM Ray worker processes.
About the Author
I am Luca Berton, AI and Cloud Advisor. I design multi-node GPU inference architectures for enterprises.