
Distributed Inference vs Multi-GPU Inference

A practical comparison of multi-GPU inference within a single node versus distributed inference across multiple nodes. Covers tensor parallelism and pipeline parallelism.

Luca Berton · 5 min read

When a model doesn't fit on a single GPU, you have two paths: multi-GPU inference (multiple GPUs in one server) or distributed inference (multiple servers connected by network). The choice determines your latency, throughput, cost, and failure modes.

The Core Difference

| Aspect | Multi-GPU (Single Node) | Distributed (Multi-Node) |
|---|---|---|
| Interconnect | NVLink/NVSwitch (900 GB/s) | InfiniBand/RoCE (400 Gb/s = 50 GB/s) |
| Latency per hop | 1-5 μs | 5-50 μs |
| Max GPUs | 8 (DGX H100) | Hundreds |
| Failure domain | Single machine | Per-node isolation |
| Complexity | Moderate | High |
| Cost per GPU | Higher (DGX premium) | Lower (commodity nodes) |

Multi-GPU Inference: Single Node

When to Use

  • Model fits in 2-8 GPUs worth of VRAM
  • Latency is critical (sub-100ms Time-To-First-Token)
  • You have DGX or HGX systems with NVLink

Tensor Parallelism (TP)

Split each layer's weight matrices across GPUs. Every GPU computes part of every token:

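# vLLM with tensor parallelism on 4 GPUs
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b  # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama-70b
  template:
    metadata:
      labels:
        app: vllm-llama-70b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Llama-3.1-70B
            - --tensor-parallel-size=4
            - --gpu-memory-utilization=0.92
            - --max-model-len=32768
          resources:
            limits:
              nvidia.com/gpu: 4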

How it works:

Input tokens
     │
     ▼
┌──────────────────────────────────────┐
│  Layer N: Weight matrix split 4 ways │
│  GPU 0: W[:,0:d/4]                   │
│  GPU 1: W[:,d/4:d/2]                 │
│  GPU 2: W[:,d/2:3d/4]                │
│  GPU 3: W[:,3d/4:d]                  │
└──────────────────────────────────────┘
     │  AllReduce across NVLink (900 GB/s)
     ▼
  Combined output → next layer

Pros:

  • Lowest latency: all GPUs work on every token simultaneously
  • NVLink bandwidth keeps AllReduce overhead small
  • vLLM, TensorRT-LLM, and NVIDIA NIM handle it natively

Cons:

  • Limited to GPUs within one node (typically 8 max)
  • Communication is frequent: Megatron-style TP needs two AllReduces per transformer layer (one after attention, one after the MLP)
  • If one GPU fails, entire inference stops
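
The split-then-AllReduce pattern can be checked on a toy example. The sketch below simulates the GPUs as list entries and uses the common column-parallel then row-parallel arrangement for a two-matrix MLP block; the function names and the tiny integer matrices are illustrative assumptions, not code from any serving framework.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU, standing in for the MLP activation."""
    return [[max(0, v) for v in row] for row in m]

def split_columns(w, tp):
    """Shard a weight matrix column-wise into tp equal slices (one per GPU)."""
    step = len(w[0]) // tp
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(tp)]

def split_rows(w, tp):
    """Shard a weight matrix row-wise into tp equal slices (one per GPU)."""
    step = len(w) // tp
    return [w[i * step:(i + 1) * step] for i in range(tp)]

def all_reduce_sum(partials):
    """Element-wise sum of the per-GPU partial outputs, which is what AllReduce computes."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def tp_forward(x, w1, w2, tp):
    """Run x -> relu(x @ w1) @ w2 with w1 column-sharded and w2 row-sharded."""
    w1_shards = split_columns(w1, tp)
    w2_shards = split_rows(w2, tp)
    # Each simulated GPU works on its own shard pair with no communication...
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    # ...and a single AllReduce combines the partial results.
    return all_reduce_sum(partials)

x = [[1, -2, 3, 0]]
w1 = [[1, 0, -1, 2], [0, 1, 2, -1], [3, -1, 0, 1], [1, 1, 1, 1]]
w2 = [[1, 2, 0, -1], [0, 1, 1, 0], [2, 0, -1, 1], [1, -1, 0, 2]]

reference = matmul(relu(matmul(x, w1)), w2)   # single-GPU result
print(tp_forward(x, w1, w2, tp=4) == reference)
```

Because the activation is element-wise, each GPU can apply it to its own column slice, so only one AllReduce is needed for the block.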

Memory Math

For Llama 3.1 70B in FP16:

  • Parameters: 70B × 2 bytes = 140 GB
  • KV cache (32K context, batch 8, 8 KV heads with GQA): ~86 GB
  • Activations: ~10 GB
  • Total: ~236 GB → fits on 4× H100 80GB (320 GB)
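
The same sizing arithmetic can be written as a small helper. This is a rough sketch: the function names are made up for illustration, the 0.92 utilization mirrors the vLLM flag shown earlier, and rounding up to a power of two reflects the assumption that the TP size must divide the attention head count evenly.

```python
import math

def weights_gb(num_params, bytes_per_param):
    """Memory for the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

def tp_gpus_needed(total_gb, gpu_gb=80, utilization=0.92):
    """Smallest power-of-two GPU count whose usable VRAM covers total_gb."""
    raw = math.ceil(total_gb / (gpu_gb * utilization))
    return 1 << (raw - 1).bit_length()   # round up to 1, 2, 4, 8, ...

fp16 = weights_gb(70e9, 2)               # Llama 3.1 70B in FP16
print(f"weights: {fp16:.0f} GB")

# Two illustrative budgets for KV cache plus activations, in GB.
for extra_gb in (50, 100):
    print(f"+{extra_gb} GB overhead -> {tp_gpus_needed(fp16 + extra_gb)} GPUs")
```

Both overhead budgets land on four 80 GB GPUs, which is why TP=4 is the usual starting point for this model at moderate context lengths.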

Distributed Inference: Multi-Node

When to Use

  • Model exceeds single-node GPU memory (405B+ parameters)
  • You need massive throughput (hundreds of concurrent users)
  • Running on commodity GPU nodes without NVLink
  • Building fault-tolerant inference clusters

Pipeline Parallelism (PP)

Split layers across nodes. Each node processes a subset of layers sequentially:

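# NVIDIA NIM multi-node deployment (sketch: verify the env variable names against the NIM docs for your release)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: llama-405b-node
spec:
  serviceName: llama-405b-node  # headless Service behind the pod DNS name used below
  replicas: 2  # 2 nodes × 8 GPUs = 16 GPUs total
  selector:
    matchLabels:
      app: llama-405b-node
  template:
    metadata:
      labels:
        app: llama-405b-node
    spec:
      containers:
        - name: nim
          # image omitted here; set it to the NIM container for your model
          env:
            - name: TENSOR_PARALLEL_SIZE
              value: "8"
            - name: PIPELINE_PARALLEL_SIZE
              value: "2"
            - name: NIM_LEADER_ADDRESS
              value: "llama-405b-node-0.llama-405b-node:5557"
          resources:
            limits:
              nvidia.com/gpu: 8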

How it works:

Node 0 (Layers 0-39)          Node 1 (Layers 40-79)
┌──────────────────┐          ┌──────────────────┐
│ GPU 0-7: TP=8    │──────────│ GPU 0-7: TP=8    │
│ Layers 0-39      │ Network  │ Layers 40-79     │
│ (NVLink internal)│ (IB/RoCE)│ (NVLink internal)│
└──────────────────┘          └──────────────────┘

Pros:

  • Scales to arbitrary model sizes (405B, 1T+)
  • Each node only holds partial weights, so per-node memory is lower
  • Can use cheaper nodes without NVLink interconnect between nodes
  • Micro-batching hides network latency

Cons:

  • Pipeline bubbles: GPUs idle while waiting for activations
  • Network bandwidth becomes the bottleneck for large activations
  • Higher TTFT due to sequential node processing
  • Complex orchestration and failure handling
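
The trade-off between pipeline bubbles and micro-batching can be estimated with the standard formula for an idealized GPipe-style schedule, where every stage takes the same time per micro-batch. The function name is illustrative, and real serving stacks with continuous batching will deviate from this simple model.

```python
def bubble_fraction(stages, micro_batches):
    """Share of each stage's time spent idle in an idealized pipeline schedule.

    With p stages and m micro-batches the pipeline runs for m + p - 1 time
    slots, and each stage is busy for only m of them.
    """
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be >= 1")
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 2, 8, 32):
    print(f"PP=2, {m:>2} micro-batches: {bubble_fraction(2, m):.1%} idle")
```

With a single request in flight, a two-stage pipeline leaves each stage idle half the time; keeping many micro-batches in flight is what brings the idle share down.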

Expert Parallelism (EP) for MoE Models

For Mixture-of-Experts models (Mixtral, DeepSeek-V3):

Node 0: Experts 0-15     Node 1: Experts 16-31
┌──────────────────┐     ┌──────────────────┐
│ Router selects   │     │ Router selects   │
│ top-K experts    │────▶│ top-K experts    │
│ for each token   │◀────│ for each token   │
└──────────────────┘     └──────────────────┘

Only activated experts need computation, which reduces FLOPs per token while maintaining model capacity. All expert weights must still be resident in memory, which is why they are spread across nodes.
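
A minimal sketch of the routing step, assuming each token arrives with one router score per expert and that experts are placed on nodes in contiguous blocks. The function names and the four-expert example are hypothetical; production routers also handle capacity limits and load balancing, which are left out here.

```python
def top_k_experts(router_scores, k):
    """Indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda e: router_scores[e], reverse=True)
    return ranked[:k]

def dispatch(token_scores, k, experts_per_node):
    """Group (token, expert) assignments by the node that hosts the expert."""
    per_node = {}
    for token_id, scores in enumerate(token_scores):
        for expert in top_k_experts(scores, k):
            node = expert // experts_per_node   # contiguous expert placement
            per_node.setdefault(node, []).append((token_id, expert))
    return per_node

# Two tokens, four experts, two experts per node, top-2 routing.
token_scores = [
    [0.10, 0.70, 0.15, 0.05],
    [0.05, 0.05, 0.30, 0.60],
]
assignments = dispatch(token_scores, k=2, experts_per_node=2)
for node, work in sorted(assignments.items()):
    print(f"node {node}: {work}")
```

Each node receives only the tokens routed to its experts, so the traffic between nodes is an all-to-all exchange of token activations rather than an AllReduce.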

The Network Bottleneck

The critical difference between multi-GPU and distributed is interconnect bandwidth:

| Interconnect | Bandwidth | Latency | Use Case |
|---|---|---|---|
| NVLink 4.0 (H100) | 900 GB/s | ~1 μs | Intra-node TP |
| NVSwitch (DGX) | 900 GB/s all-to-all | ~1 μs | Intra-node TP |
| InfiniBand HDR | 200 Gb/s (25 GB/s) | ~2 μs | Multi-node PP |
| InfiniBand NDR | 400 Gb/s (50 GB/s) | ~2 μs | Multi-node PP |
| RoCE v2 | 100-400 Gb/s | ~5 μs | Budget multi-node |
| TCP/Ethernet | 25-100 Gb/s | ~50 μs | Don't do this |

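For a 70B model with TP=8 across nodes (not recommended), take a large prefill batch as the worst case:

  • AllReduce payload per layer: ~8 GB (roughly 500K batched tokens × 8192 hidden size × 2 bytes)
  • At 400 Gb/s (50 GB/s): 160 ms per layer × 80 layers = 12.8 seconds per forward pass
  • At 900 GB/s NVLink: ~9 ms per layer × 80 layers = ~0.7 seconds per forward pass

During decode the payload is only about 16 KB per sequence per AllReduce, so per-hop latency rather than bandwidth dominates, and NVLink still wins.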

Rule of thumb: Tensor parallelism should stay within NVLink. Use pipeline parallelism across nodes.
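
The bandwidth arithmetic above can be reproduced with a bandwidth-only model. This sketch deliberately ignores per-hop latency and the extra traffic factor of real AllReduce algorithms, so treat the results as lower bounds; the function names are illustrative and the 8192 hidden size is an assumption taken from the published Llama 70B configuration.

```python
def transfer_ms(payload_bytes, bandwidth_gb_per_s):
    """Time to move payload_bytes over a link, in milliseconds (bandwidth only)."""
    return payload_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

def allreduce_payload_bytes(tokens, hidden_size, bytes_per_elem=2):
    """Activation bytes exchanged per AllReduce for a batch of tokens."""
    return tokens * hidden_size * bytes_per_elem

LAYERS = 80
links = {"InfiniBand NDR (50 GB/s)": 50, "NVLink (900 GB/s)": 900}

for name, gb_per_s in links.items():
    per_layer = transfer_ms(8e9, gb_per_s)           # the 8 GB payload from the text
    print(f"{name}: {per_layer:.1f} ms/layer, "
          f"{per_layer * LAYERS / 1e3:.2f} s for {LAYERS} layers")

# Single-token decode moves far less data per AllReduce.
print(allreduce_payload_bytes(tokens=1, hidden_size=8192), "bytes per decoded token")
```

The ratio between the two links is the bandwidth ratio, 18x, regardless of the payload size chosen.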

NVIDIA Dynamo: Disaggregated Serving

NVIDIA Dynamo introduces a third option: disaggregated prefill and decode.

┌─────────────────┐     ┌───────────────────┐
│ Prefill Nodes   │     │ Decode Nodes      │
│ (Compute-heavy) │────▶│ (Memory-bound)    │
│ 8× H100 each    │ KV  │ 4× H100 each      │
│ Batch prefill   │cache│ Continuous decode │
└─────────────────┘     └───────────────────┘

  • Prefill is compute-bound → pack onto fewer, fully-utilized GPUs
  • Decode is memory-bandwidth-bound → spread across more GPUs with lighter load
  • KV cache transfers between nodes via RDMA

This disaggregation can improve throughput per dollar by as much as 2-4x compared to monolithic serving, depending on the prefill/decode mix of the workload.

Decision Framework

Model Size → Architecture

| Model | VRAM Needed (FP16) | Recommended Setup |
|---|---|---|
| 7-8B | 16 GB | 1× GPU |
| 13B | 26 GB | 1× H100 or 2× A100 |
| 34B | 68 GB | 1× H100 80GB |
| 70B | 140 GB | 2× H100 (TP=2) or 4× A100 (TP=4) |
| 70B + long context | 200 GB | 4× H100 (TP=4) |
| 405B | 810 GB | 2 nodes × 8 H100 (TP=8, PP=2) |
| 1T+ MoE | 1+ TB | 4+ nodes with EP |

Latency vs Throughput Optimization

Optimize for latency (chatbots, real-time):

  • Maximize tensor parallelism within one node
  • Use larger GPUs (H100 80GB over A100 40GB)
  • Sacrifice throughput for faster single-request response

Optimize for throughput (batch processing, offline):

  • Use pipeline parallelism with micro-batching
  • Fill pipeline bubbles with concurrent requests
  • Continuous batching to maximize GPU utilization
  • Consider autoscaling inference based on queue depth
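
Queue-depth autoscaling reduces to a small sizing rule. The sketch below is one plausible form of it, with a hypothetical target of pending requests per replica and hypothetical replica bounds; it is not taken from any specific autoscaler.

```python
import math

def desired_replicas(queue_depth, target_queue_per_replica,
                     min_replicas=1, max_replicas=8):
    """Replica count that keeps the per-replica queue at or below the target."""
    if target_queue_per_replica <= 0:
        raise ValueError("target_queue_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_queue_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

for depth in (0, 12, 35, 1000):
    print(f"queue depth {depth:>4} -> {desired_replicas(depth, 10)} replicas")
```

The upper bound matters for GPU workloads in particular, since each extra replica claims a full set of GPUs.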

KV Cache: The Hidden Bottleneck

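At 128K context length, KV cache dominates memory:

KV cache per token per layer:
  2 (K+V) × num_kv_heads × head_dim × 2 bytes (FP16)

Llama 3.1 70B at 128K context (GQA: 8 KV heads shared by 64 query heads):
  2 × 8 × 128 × 2 × 80 layers × 128,000 tokens ≈ 42 GB per sequence

With plain multi-head attention (64 KV heads) the same context would need about 336 GB.
Four concurrent 128K sequences already need about 168 GB, which is MORE than the model weights (140 GB)!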
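
The KV cache formula is easy to get wrong by a factor of the head count, so it helps to compute it directly. The sketch below assumes the published Llama 3.1 70B configuration (80 layers, 64 query heads, 8 KV heads under grouped-query attention, head dimension 128); the function name is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, tokens, bytes_per_elem=2):
    """Bytes of KV cache for `tokens` cached positions (K and V, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * tokens * bytes_per_elem

CONTEXT = 128_000

gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, tokens=CONTEXT)
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, tokens=CONTEXT)

print(f"GQA, 8 KV heads:  {gqa / 1e9:.1f} GB per 128K sequence")
print(f"MHA, 64 KV heads: {mha / 1e9:.1f} GB per 128K sequence")
print(f"4 GQA sequences:  {4 * gqa / 1e9:.1f} GB vs 140 GB of FP16 weights")
```

Grouped-query attention cuts the cache by the ratio of query heads to KV heads, 8x here, which is the difference between one long sequence fitting on a node and not fitting at all.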

Strategies:

  • PagedAttention (vLLM): Allocate KV cache in pages, avoid fragmentation
  • KV cache quantization: FP8 or INT8 halves the cache relative to FP16; 4-bit formats reach about 4x
  • Prefix caching: Share KV cache for common system prompts
  • Offloading: Spill cold KV cache to CPU RAM or SSD
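
To see why paging helps, compare how many token slots each allocation scheme reserves. This is a toy accounting model rather than vLLM's allocator; the 16-token page size and the function names are assumptions made for the example.

```python
import math

def reserved_tokens_contiguous(seq_lens, max_len):
    """Slots reserved when every sequence gets a contiguous max_len buffer."""
    return len(seq_lens) * max_len

def reserved_tokens_paged(seq_lens, page_size=16):
    """Slots reserved when each sequence gets only the pages it has filled."""
    return sum(math.ceil(n / page_size) * page_size for n in seq_lens)

seq_lens = [100, 2000, 35]      # tokens currently cached per sequence
used = sum(seq_lens)

contiguous = reserved_tokens_contiguous(seq_lens, max_len=32768)
paged = reserved_tokens_paged(seq_lens)

print(f"contiguous: {contiguous} slots reserved, {used / contiguous:.1%} used")
print(f"paged:      {paged} slots reserved, {used / paged:.1%} used")
```

Paged allocation wastes at most one partly filled page per sequence, so the freed memory can go to a larger batch.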

Multi-GPU on Kubernetes

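# Multi-GPU single node (simple)
apiVersion: v1
kind: Pod
metadata:
  name: inference-8gpu  # illustrative name
spec:
  containers:
    - name: inference
      # image omitted here; set it to your serving container
      resources:
        limits:
          nvidia.com/gpu: 8
      env:
        # Usually redundant: the NVIDIA device plugin already exposes
        # only the allocated GPUs to the container
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3,4,5,6,7"
---
# Multi-node distributed (complex)
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llama-405b
spec:
  replicas: 2  # independent groups (model replicas); use 1 for a single 2-node deployment
  leaderWorkerTemplate:
    size: 2  # pods per group: 1 leader + 1 worker = 2 nodes × 8 GPUs
    workerTemplate:
      spec:
        containers:
          - name: nim
            # image omitted here; set it to your serving container
            resources:
              limits:
                nvidia.com/gpu: 8
                rdma/rdma_shared_device: 1  # resource name depends on your RDMA device plugin config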

For multi-tenant GPU sharing, tools like MIG (Multi-Instance GPU), time-slicing, and MPS allow multiple inference workloads on the same physical GPU.

Cost Comparison

Running Llama 3.1 70B inference (1000 req/min target):

| Setup | Hardware | Monthly Cost* | Throughput | $/1M tokens |
|---|---|---|---|---|
| 4× H100 SXM (1 node) | DGX H100 | ~$35K | 1200 req/min | $0.42 |
| 8× A100 80GB (1 node) | DGX A100 | ~$25K | 800 req/min | $0.47 |
| 2× 4-GPU nodes (A100) | Commodity | ~$18K | 700 req/min | $0.39 |
| Cloud API (GPT-4o) | - | Usage-based | Unlimited | $2.50 |

*Approximate cloud instance pricing, varies by provider.

The commodity multi-node approach can be cheaper per token, but operational complexity is higher.
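
To adapt the per-token figures to your own traffic, the calculation is monthly cost divided by monthly tokens. The sketch below assumes the hardware runs at its rated throughput all month and uses a hypothetical 1,600 tokens per request, since the article does not state the value behind its table.

```python
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200, assuming a 30-day month

def cost_per_million_tokens(monthly_cost_usd, requests_per_min, tokens_per_request):
    """Dollars per one million tokens at sustained full utilization."""
    tokens_per_month = requests_per_min * MINUTES_PER_MONTH * tokens_per_request
    return monthly_cost_usd / tokens_per_month * 1e6

setups = {
    "4x H100 (1 node)":    (35_000, 1200),
    "8x A100 (1 node)":    (25_000, 800),
    "2x 4-GPU A100 nodes": (18_000, 700),
}

for name, (cost, rpm) in setups.items():
    usd = cost_per_million_tokens(cost, rpm, tokens_per_request=1600)
    print(f"{name}: ${usd:.2f} per 1M tokens")
```

Real utilization is rarely 100%, so dividing the result by your expected utilization gives a more honest comparison against a usage-based API.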

Failure Modes

| Failure | Multi-GPU Impact | Distributed Impact |
|---|---|---|
| Single GPU fault | Entire node down | One node down, others continue |
| Memory error (ECC) | Process crash | Affected node restarts |
| Network partition | N/A | Split-brain, request failures |
| Thermal throttling | Reduced throughput | Affected node slower |

Distributed systems offer better fault isolation but introduce network partition risks. For production, implement:

  • Health checks per node
  • Automatic failover to backup nodes
  • Request retry with timeout
  • Graceful degradation (serve smaller model as fallback)
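
Retry, failover, and graceful degradation can be combined in one small wrapper. The sketch below is an assumption-laden outline: the backends are plain callables with made-up names, and each one is expected to enforce its own timeout and raise `TimeoutError` or `ConnectionError` on failure.

```python
def call_with_failover(backends, request, retries_per_backend=2):
    """Try each backend in order, retrying transient failures before moving on.

    `backends` is a list of (name, handler) pairs, ordered from preferred
    to last-resort fallback. Returns (name, response) from the first success.
    """
    last_error = None
    for name, handler in backends:
        for _attempt in range(retries_per_backend):
            try:
                return name, handler(request)
            except (TimeoutError, ConnectionError) as exc:
                last_error = exc          # transient: retry, then fail over
    raise RuntimeError("all backends failed") from last_error

# Hypothetical handlers standing in for real inference clients.
def primary_cluster(request):
    raise TimeoutError("primary cluster did not answer in time")

def small_fallback_model(request):
    return f"fallback answer to {request!r}"

backends = [
    ("llama-70b-cluster", primary_cluster),
    ("llama-8b-fallback", small_fallback_model),
]
served_by, response = call_with_failover(backends, "hello")
print(served_by, "->", response)
```

Ordering the list from largest to smallest model gives graceful degradation for free: users get a weaker answer instead of an error when the main cluster is unhealthy.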

Start with the smallest setup that meets your latency SLA. Scale horizontally only when single-node capacity is exhausted. The best GPU cluster is the one you don't over-provision.
