
NVIDIA NIM Multinode: Serving 400B+ Models Across GPUs

When a model is too large for one server, you need multinode inference. This guide covers NVIDIA NIM multinode deployment for DeepSeek-R1 and Llama 405B, using tensor and pipeline parallelism.

Luca Berton

A single 8×H100 node has 640 GB of GPU memory. DeepSeek-R1, at 671 billion parameters, needs about 1.3 TB for its weights in FP16. The math does not work on one machine.

Multinode inference splits a model across multiple physical servers connected by high-speed networking, and serves requests through a single endpoint. NVIDIA NIM makes this surprisingly straightforward, but the infrastructure requirements are serious.

When You Need Multinode

The decision is simple: if the model does not fit in one node's total GPU memory, you need multinode.

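| Model | Parameters | FP16 Memory | FP8 Memory | Fits on a single 8×H100? |
| --- | --- | --- | --- | --- |
| Llama 3.1 70B | 70B | 140 GB | 70 GB | ✅ Yes |
| Mixtral 8×22B | 141B (MoE) | ~282 GB (all experts resident) | ~141 GB | ✅ Yes |
| Llama 3.1 405B | 405B | 810 GB | 405 GB | ❌ FP16, ✅ FP8 |
| DeepSeek-R1 | 671B (MoE) | 1.3 TB | 671 GB | ❌ No |
| Nemotron 340B | 340B | 680 GB | 340 GB | ❌ FP16, ✅ FP8 |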

The 8×H100 80GB node at 640 GB total is the reference point. Anything above that needs either quantization to fit, or multinode to spread.
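The fit check behind the table can be sketched in a few lines of Python. This is a rough estimate under stated assumptions: it counts weights only (2 bytes per parameter in FP16, 1 byte in FP8) and ignores KV cache, activations, and runtime overhead, so real deployments need extra headroom. The function names are illustrative and not part of NIM.

```python
import math

NODE_MEMORY_GB = 8 * 80  # one 8x H100 80GB node, the reference point above

BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def nodes_needed(params_billions: float, precision: str) -> int:
    """Smallest node count whose combined GPU memory holds the weights."""
    return math.ceil(weight_memory_gb(params_billions, precision) / NODE_MEMORY_GB)


models = [("Llama 3.1 70B", 70), ("Llama 3.1 405B", 405), ("DeepSeek-R1", 671)]
for name, params in models:
    for precision in ("fp16", "fp8"):
        print(
            f"{name:15s} {precision}: "
            f"{weight_memory_gb(params, precision):6.0f} GB, "
            f"{nodes_needed(params, precision)} node(s)"
        )
```

By this estimate, DeepSeek-R1's FP16 weights (1,342 GB) exceed the 1,280 GB of two nodes, while its FP8 weights (671 GB) fit on two nodes with room for KV cache.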

Parallelism Strategies

Multinode inference uses two parallelism techniques, often combined:

Tensor Parallelism (TP)

Splits individual layers across GPUs. Each GPU holds a slice of every layer and they communicate during every forward pass.

Layer N:
┌─────────┬─────────┬─────────┬─────────┐
│  GPU 0  │  GPU 1  │  GPU 2  │  GPU 3  │
│ slice 0 │ slice 1 │ slice 2 │ slice 3 │
└────┬────┴────┬────┴────┬────┴────┬────┘
     └─────────┼─────────┼─────────┘
          AllReduce (every layer)

Within a node: NVLink at 900 GB/s, fast enough for TP across 8 GPUs.

Across nodes: InfiniBand at 400 Gb/s (50 GB/s), 18× slower than NVLink. TP across nodes works but adds latency to every single layer computation.
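To get a feel for what that gap means per layer, here is a bandwidth-only estimate of one ring all-reduce. It is a sketch, not a benchmark: the tensor shape is a hypothetical example, link latency and compute overlap are ignored, and the headline bandwidth figures from the text are treated as fully usable.

```python
def ring_allreduce_seconds(message_bytes: float, num_gpus: int, bandwidth_gb_s: float) -> float:
    """Bandwidth-only estimate for a ring all-reduce.

    Each GPU sends (and receives) 2 * (n - 1) / n times the message size.
    """
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic_bytes / (bandwidth_gb_s * 1e9)


# Hypothetical activation tensor: 4096 tokens x 8192 hidden size, FP16 (2 bytes)
message_bytes = 4096 * 8192 * 2

nvlink = ring_allreduce_seconds(message_bytes, 8, 900)     # NVLink figure from the text
infiniband = ring_allreduce_seconds(message_bytes, 8, 50)  # InfiniBand figure from the text

print(f"NVLink:     {nvlink * 1e6:8.0f} us per all-reduce")
print(f"InfiniBand: {infiniband * 1e6:8.0f} us per all-reduce")
print(f"Slowdown:   {infiniband / nvlink:.0f}x")
```

Because TP performs an all-reduce in every layer, this per-operation cost is multiplied by the layer count on each forward pass.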

Pipeline Parallelism (PP)

Splits the model by layers. Each node holds a contiguous block of layers. Communication only happens between pipeline stages.

Node 0                    Node 1
┌──────────────────┐     ┌──────────────────┐
│  Layers 0-39     │────▶│  Layers 40-79    │
│  (8× H100)       │     │  (8× H100)       │
└──────────────────┘     └──────────────────┘
     Stage 0          ──▶      Stage 1
              Inter-node transfer
              (once per stage, not per layer)

Key advantage: PP sends activations between nodes once per stage, not once per layer. This makes it far more tolerant of inter-node bandwidth limitations.
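The contiguous split in the diagram can be expressed as a small helper. This is a minimal sketch of even layer partitioning. Real runtimes may balance stages differently (for example, to account for embedding or output layers), and `partition_layers` is an illustrative name, not a NIM function.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices into contiguous, near-equal pipeline stages."""
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


for index, layers in enumerate(partition_layers(80, 2)):
    print(f"Stage {index}: layers {layers.start}-{layers.stop - 1}")
# Stage 0: layers 0-39
# Stage 1: layers 40-79
```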

The Practical Combination

Most multinode deployments use TP within nodes + PP across nodes:

2-Node DeepSeek-R1 Deployment:
─────────────────────────────
Node 0: TP=8 across 8× H100 (first half of the layers)
Node 1: TP=8 across 8× H100 (second half of the layers)
PP=2 across the two nodes

Total: TP=8, PP=2 → 16 GPUs
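One way to picture the layout is to map each of the 16 global GPU ranks to a node and a tensor-parallel rank. The sketch below assumes TP ranks are contiguous within a node, which is a common convention in distributed inference frameworks. The rank ordering NIM actually uses is not confirmed by this article.

```python
TP_SIZE = 8  # GPUs per node (tensor parallel)
PP_SIZE = 2  # nodes (pipeline parallel)


def placement(global_rank: int) -> dict:
    """Map a global GPU rank to its pipeline stage (node) and TP rank."""
    world_size = TP_SIZE * PP_SIZE
    if not 0 <= global_rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    pp_stage, tp_rank = divmod(global_rank, TP_SIZE)
    return {"node": pp_stage, "pp_stage": pp_stage, "tp_rank": tp_rank}


for rank in (0, 7, 8, 15):
    print(rank, placement(rank))
```

With this ordering, all-reduce traffic stays on NVLink inside each node, and only the stage-to-stage activation transfer crosses InfiniBand.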

NVIDIA NIM Multinode Deployment

Prerequisites

  • NVIDIA NGC API key: access to NIM container images
  • Multiple GPU nodes: each with 8× H100 or H200
  • High-speed interconnect: InfiniBand HDR/NDR (200-400 Gb/s) between nodes
  • Shared storage: for model weights (NFS or Lustre, optionally with NVIDIA GPUDirect Storage)
  • NCCL: NVIDIA Collective Communications Library for GPU-to-GPU data transfer

Docker Deployment (2 Nodes)

Node 0 (Leader):

docker run -d --name nim-node0 \
  --gpus all \
  --network host \
  --ipc host \
  --ulimit memlock=-1 \
  -v /models:/models \
  -e NGC_API_KEY="your-ngc-key" \
  -e NIM_MODEL_NAME="deepseek-ai/deepseek-r1" \
  -e NIM_TENSOR_PARALLEL_SIZE=8 \
  -e NIM_PIPELINE_PARALLEL_SIZE=2 \
  -e NIM_NODE_RANK=0 \
  -e NIM_NUM_NODES=2 \
  -e NIM_MASTER_ADDR="10.0.0.1" \
  -e NIM_MASTER_PORT=29500 \
  -e NCCL_IB_DISABLE=0 \
  -e NCCL_SOCKET_IFNAME=ib0 \
  nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3

Node 1 (Worker):

docker run -d --name nim-node1 \
  --gpus all \
  --network host \
  --ipc host \
  --ulimit memlock=-1 \
  -v /models:/models \
  -e NGC_API_KEY="your-ngc-key" \
  -e NIM_MODEL_NAME="deepseek-ai/deepseek-r1" \
  -e NIM_TENSOR_PARALLEL_SIZE=8 \
  -e NIM_PIPELINE_PARALLEL_SIZE=2 \
  -e NIM_NODE_RANK=1 \
  -e NIM_NUM_NODES=2 \
  -e NIM_MASTER_ADDR="10.0.0.1" \
  -e NIM_MASTER_PORT=29500 \
  -e NCCL_IB_DISABLE=0 \
  -e NCCL_SOCKET_IFNAME=ib0 \
  nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3

Only Node 0 serves the HTTP API, on port 8000 of its host network. With --network host, Docker ignores -p port mappings, so none is needed. Node 1 participates in computation but does not serve HTTP requests.

Kubernetes Deployment

For production, deploy multinode NIM on Kubernetes with LeaderWorkerSet:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseek-r1-nim
  namespace: inference
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2  # 2 nodes total
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          app: deepseek-r1
          role: leader
      spec:
        containers:
          - name: nim
            image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
            env:
              - name: NGC_API_KEY
                valueFrom:
                  secretKeyRef:
                    name: ngc-secret
                    key: api-key
              - name: NIM_MODEL_NAME
                value: "deepseek-ai/deepseek-r1"
              - name: NIM_TENSOR_PARALLEL_SIZE
                value: "8"
              - name: NIM_PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: NIM_NUM_NODES
                value: "2"
              - name: NIM_NODE_RANK
                value: "0"
              - name: NIM_MASTER_ADDR
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
              - name: NIM_MASTER_PORT
                value: "29500"
            resources:
              limits:
                nvidia.com/gpu: 8
            ports:
              - containerPort: 8000
                name: http
            volumeMounts:
              - name: model-cache
                mountPath: /models
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: nim-model-cache
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
        nodeSelector:
          nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
    workerTemplate:
      metadata:
        labels:
          app: deepseek-r1
          role: worker
      spec:
        containers:
          - name: nim
            image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
            env:
              - name: NGC_API_KEY
                valueFrom:
                  secretKeyRef:
                    name: ngc-secret
                    key: api-key
              - name: NIM_MODEL_NAME
                value: "deepseek-ai/deepseek-r1"
              - name: NIM_TENSOR_PARALLEL_SIZE
                value: "8"
              - name: NIM_PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: NIM_NUM_NODES
                value: "2"
              - name: NIM_NODE_RANK
                value: "1"
              - name: NIM_MASTER_ADDR
                # Assumes the LWS_LEADER_ADDRESS env var that LeaderWorkerSet
                # injects into each pod; verify against your LWS version
                value: "$(LWS_LEADER_ADDRESS)"
              - name: NIM_MASTER_PORT
                value: "29500"
            resources:
              limits:
                nvidia.com/gpu: 8
            volumeMounts:
              - name: model-cache
                mountPath: /models
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: nim-model-cache
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
        nodeSelector:
          nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-nim
  namespace: inference
spec:
  selector:
    app: deepseek-r1
    role: leader
  ports:
    - port: 8000
      targetPort: 8000
      name: http

Testing the Endpoint

Once both nodes are running and synchronized:

curl -X POST http://deepseek-r1-nim:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/deepseek-r1",
    "messages": [
      {"role": "user", "content": "Explain tensor parallelism in 3 sentences."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

The response comes from the leader node, but computation happened across both nodes transparently.
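The same request can be issued from Python with only the standard library. This is a minimal sketch that assumes the endpoint returns an OpenAI-style chat completion body (`choices[0].message.content`), which the `/v1/chat/completions` path suggests but which this article does not show. `build_chat_request` and `ask` are illustrative helper names.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build the POST request the curl example sends."""
    payload = {
        "model": "deepseek-ai/deepseek-r1",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (assumed response shape)."""
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, from a pod or host that can resolve the Service name:
# print(ask("http://deepseek-r1-nim:8000", "Explain tensor parallelism in 3 sentences."))
```

A generous timeout is deliberate here, since time to first token is higher on multinode deployments.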

Networking: The Bottleneck

Multinode inference performance lives and dies by inter-node bandwidth:

| Interconnect | Bandwidth | Latency | Multinode Viable? |
| --- | --- | --- | --- |
| 1 GbE | 125 MB/s | ~500 μs | ❌ Unusable |
| 25 GbE | 3.1 GB/s | ~10 μs | ❌ Too slow |
| 100 GbE (RoCE) | 12.5 GB/s | ~2 μs | ⚠️ Small models only |
| InfiniBand HDR | 25 GB/s | ~1 μs | ✅ Production viable |
| InfiniBand NDR | 50 GB/s | ~1 μs | ✅ Recommended |
| InfiniBand XDR | 100 GB/s | under 1 μs | ✅ Optimal |

Minimum for production multinode: InfiniBand HDR (200 Gb/s). Anything less and the inter-node communication overhead dominates inference latency.
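The bandwidth column follows from dividing the link rate in Gb/s by 8. The sketch below reproduces that conversion and estimates how long a hypothetical 64 MB activation transfer would take on each link. The payload size is an assumption chosen for illustration, and the estimate ignores latency and protocol overhead.

```python
LINK_SPEEDS_GBIT = {
    "1 GbE": 1,
    "25 GbE": 25,
    "100 GbE (RoCE)": 100,
    "InfiniBand HDR": 200,
    "InfiniBand NDR": 400,
    "InfiniBand XDR": 800,
}

MIN_PRODUCTION_GBIT = 200  # the HDR floor stated above


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link rate from Gb/s to GB/s."""
    return gbit_per_s / 8


def transfer_ms(payload_mb: float, gbit_per_s: float) -> float:
    """Bandwidth-only time to move payload_mb megabytes, in milliseconds."""
    seconds = (payload_mb / 1000) / gbit_to_gbyte(gbit_per_s)
    return seconds * 1000


PAYLOAD_MB = 64  # hypothetical activation transfer between pipeline stages

for name, gbit in LINK_SPEEDS_GBIT.items():
    verdict = "ok" if gbit >= MIN_PRODUCTION_GBIT else "below minimum"
    print(f"{name:16s} {gbit_to_gbyte(gbit):7.3f} GB/s  "
          f"{transfer_ms(PAYLOAD_MB, gbit):8.2f} ms  {verdict}")
```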

NCCL Configuration

# Essential NCCL environment variables for multinode
export NCCL_IB_DISABLE=0           # Enable InfiniBand
export NCCL_SOCKET_IFNAME=ib0      # Interface for NCCL bootstrap/socket traffic
export NCCL_IB_HCA=mlx5            # Mellanox HCA device
export NCCL_IB_GID_INDEX=3         # GID index (relevant for RoCE)
export NCCL_NET_GDR_LEVEL=5        # GPUDirect RDMA level
export NCCL_DEBUG=INFO             # Debug logging (remove in prod)
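A misconfigured interface name is a common cause of multinode startup hangs, so a pre-flight check on each node can help. The following is an illustrative sketch, not an NCCL or NIM tool. It treats `NCCL_SOCKET_IFNAME` as a name prefix and does not handle the `^` (exclude) or `=` (exact match) forms that NCCL also accepts.

```python
import os
import socket


def check_nccl_env(env: dict[str, str], interfaces: list[str]) -> list[str]:
    """Return a list of human-readable problems with the NCCL settings."""
    problems = []

    if env.get("NCCL_IB_DISABLE", "0") != "0":
        problems.append("NCCL_IB_DISABLE is set, so InfiniBand will not be used")

    ifname = env.get("NCCL_SOCKET_IFNAME")
    if not ifname:
        problems.append("NCCL_SOCKET_IFNAME is not set")
    elif not any(name.startswith(ifname) for name in interfaces):
        problems.append(f"no network interface matches {ifname!r}")

    if env.get("NCCL_DEBUG", "").upper() in {"INFO", "TRACE"}:
        problems.append("NCCL_DEBUG is verbose; consider removing it in production")

    return problems


def local_interfaces() -> list[str]:
    """Names of the network interfaces on this host (Unix-like systems)."""
    return [name for _, name in socket.if_nameindex()]


# Usage on a node:
# for problem in check_nccl_env(dict(os.environ), local_interfaces()):
#     print("WARNING:", problem)
```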

Performance Characteristics

Multinode adds latency. Here are realistic numbers for DeepSeek-R1:

| Configuration | Time to First Token | Tokens/Second | Cost/Hour |
| --- | --- | --- | --- |
| 2× 8×H100 (PP=2) | ~800 ms | ~30-40 t/s | ~$60 |
| 4× 8×H100 (PP=4) | ~500 ms | ~60-80 t/s | ~$120 |
| 2× 8×H100 FP8 | ~600 ms | ~50-60 t/s | ~$60 |

Compared to a single-node 70B model at 100+ tokens/second, multinode 671B models are significantly slower and more expensive. The trade-off is capability: DeepSeek-R1's reasoning quality justifies the infrastructure for complex tasks.

Alternatives to Multinode

Before committing to multinode infrastructure, consider:

Quantization

FP8 quantization halves memory requirements relative to FP16. Llama 405B at FP8 (405 GB) fits on a single 8×H100 node, leaving roughly 235 GB for KV cache and activations. Quality impact is measurable but often acceptable.

# NIM with FP8 quantization, single node
docker run -d --gpus all \
  -e NIM_MODEL_NAME="meta/llama-3.1-405b-instruct" \
  -e NIM_QUANTIZATION="fp8" \
  -e NIM_TENSOR_PARALLEL_SIZE=8 \
  nvcr.io/nim/meta/llama-3.1-405b-instruct:latest

Mixture-of-Experts Sparsity

DeepSeek-R1 is a 671B MoE model but only activates ~37B parameters per token. Clever memory management can page inactive experts to CPU/NVMe, reducing GPU memory needs. This is an active research area (offloading frameworks like DeepSpeed-Inference).

Smaller Distilled Models

DeepSeek-R1 has distilled variants (7B, 14B, 32B, 70B) that capture much of the reasoning capability on a single GPU or, for the 70B distill, a single node. For many use cases, the 70B distill on one node outperforms the full 671B on cost-adjusted metrics.

NIM Models with Multinode Support

As of mid-2026, the NGC catalog lists multinode support for these NIM containers:

  • deepseek-ai/deepseek-r1 (671B MoE): the flagship multinode model
  • meta/llama-3.1-405b-instruct: largest dense open model
  • nvidia/nemotron-340b: NVIDIA's own large model
  • Additional models are being added as NIM expands

The "Multinode Support: Yes" badge on NGC indicates that the container includes the orchestration logic for multi-node NCCL communication, leader election, and distributed weight loading.

Cost Analysis

Running multinode inference is expensive. Here is a monthly cost comparison:

| Setup | GPU Cost/Month | Model | Tokens/Second |
| --- | --- | --- | --- |
| 1× 8×H100 (70B) | ~$22,000 | Llama 70B | 100+ |
| 1× 8×H100 (405B FP8) | ~$22,000 | Llama 405B | 40-50 |
| 2× 8×H100 (671B) | ~$44,000 | DeepSeek-R1 | 30-40 |
| API (DeepSeek) | Pay per token | DeepSeek-R1 | Variable |
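The monthly figures are consistent with the hourly rates quoted earlier multiplied by about 730 hours. The sketch below reproduces that and converts to a cost per million tokens, which is easier to compare against API pricing. It assumes the tokens/second values describe a single request stream running continuously. Batched serving raises aggregate throughput, so treat the per-token result as an upper bound rather than a measured price.

```python
HOURS_PER_MONTH = 730  # average hours in a month, running 24/7


def monthly_cost(hourly_usd: float) -> float:
    """Cost of running around the clock for one month."""
    return hourly_usd * HOURS_PER_MONTH


def usd_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Cost per million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


# Two 8x H100 nodes at ~$60/hour, DeepSeek-R1 at ~35 tokens/second (table midpoint)
print(f"Monthly: ${monthly_cost(60):,.0f}")
print(f"Per million tokens: ${usd_per_million_tokens(60, 35):,.2f}")
```

The per-token cost falls in direct proportion to sustained throughput, which is why request volume dominates the self-hosting decision.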

When multinode self-hosting makes sense:

  • High volume (thousands of requests/hour amortizes fixed cost)
  • Data sovereignty requirements (cannot use API)
  • Latency requirements (co-located inference)
  • Custom fine-tuned large models

When API is better:

  • Low-medium volume
  • Burst traffic patterns
  • No data residency constraints

For a detailed cost model, try the GPU Cost Calculator.

About the Author

I am Luca Berton, AI and Cloud Advisor. I design GPU inference infrastructure for enterprises running large language models at scale. Book a consultation to discuss your multinode deployment.
