
Fix NCCL_TIMEOUT: Multi-GPU and Multi-Node Troubleshooting

Fix NCCL timeout errors in multi-GPU training and distributed inference. Covers timeout tuning, NCCL environment variables, and network debugging.

Luca Berton
· 2 min read

The error

You are running multi-GPU training or distributed inference and hit:

RuntimeError: NCCL communicator was aborted on rank 0.
Original reason for failure was: Watchdog caught collective operation timeout:
WorkNCCL(SeqNum=12345, OpType=ALLREDUCE, ...) ran for 1800000 milliseconds before timing out.

Or you see:

[Rank 0] NCCL operation timed out after 1800 seconds

NCCL (NVIDIA Collective Communications Library) handles GPU-to-GPU communication in distributed workloads. When a collective operation (AllReduce, AllGather, Broadcast) runs longer than the configured timeout, the PyTorch watchdog aborts the NCCL communicator, which produces the errors above.
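When triaging logs from many ranks, it can help to pull the sequence number, operation type, and elapsed time out of the watchdog message. The sketch below assumes the message layout quoted above; the regular expression and the function name parse_watchdog_timeout are illustrative, and real messages vary between PyTorch versions.

```python
import re

# Pattern modeled on the sample message quoted in this article;
# other PyTorch versions may format the line differently.
_WATCHDOG_RE = re.compile(
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+).*?\)"
    r" ran for (?P<ms>\d+) milliseconds"
)


def parse_watchdog_timeout(message):
    """Return (seq_num, op_type, seconds), or None if the text does not match."""
    match = _WATCHDOG_RE.search(message)
    if match is None:
        return None
    return int(match["seq"]), match["op"], int(match["ms"]) / 1000.0


sample = (
    "WorkNCCL(SeqNum=12345, OpType=ALLREDUCE, ...) "
    "ran for 1800000 milliseconds before timing out."
)
print(parse_watchdog_timeout(sample))  # -> (12345, 'ALLREDUCE', 1800.0)
```

Comparing the SeqNum reported by each rank is a quick way to see whether all ranks were stuck on the same collective.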

Quick fix: increase the timeout

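The most common fix is increasing the timeout. NCCL_TIMEOUT is not listed among NCCL's documented environment variables, so whether it has any effect depends on the launcher or container reading it; in PyTorch, the timeout argument shown further down is the reliable control:

# Set the timeout to 1 hour (3600 seconds), where the launcher or container honors this variable
export NCCL_TIMEOUT=3600

# For PyTorch distributed (recent releases use the TORCH_ prefix;
# older ones read NCCL_ASYNC_ERROR_HANDLING and NCCL_BLOCKING_WAIT)
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_NCCL_BLOCKING_WAIT=0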

In PyTorch code:

import torch.distributed as dist
from datetime import timedelta

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(seconds=3600)
)
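To change the timeout without editing code for every run, the value can be read from the environment and passed to init_process_group. This is a minimal sketch: the variable name COLLECTIVE_TIMEOUT_S and the helper collective_timeout are hypothetical, not something PyTorch or NCCL reads on its own, and the 1800-second fallback mirrors the default quoted in the error above.

```python
import os
from datetime import timedelta


def collective_timeout(environ=os.environ, default_seconds=1800):
    """Build a timedelta from COLLECTIVE_TIMEOUT_S (a hypothetical variable name)."""
    raw = environ.get("COLLECTIVE_TIMEOUT_S")
    if raw is None:
        return timedelta(seconds=default_seconds)
    seconds = int(raw)  # raises ValueError on non-numeric input
    if seconds <= 0:
        raise ValueError(f"COLLECTIVE_TIMEOUT_S must be positive, got {raw!r}")
    return timedelta(seconds=seconds)


# Usage (needs torch and a configured rendezvous, so it is left commented out):
# import torch.distributed as dist
# dist.init_process_group(backend="nccl", timeout=collective_timeout())
```

Failing fast on a non-positive or non-numeric value avoids silently running with a timeout you did not intend.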

For vLLM and NIM containers (confirm that the release you run honors these variables):

# vLLM
export VLLM_NCCL_TIMEOUT=3600000  # milliseconds

# NVIDIA NIM
docker run --gpus all \
  -e NCCL_TIMEOUT=3600 \
  -e VLLM_NCCL_TIMEOUT=3600000 \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

Common causes and real fixes

Increasing the timeout is a band-aid. Here are the actual root causes:

1. Network misconfiguration between nodes

This is the most common cause in multi-node setups: NCCL cannot reach other ranks over the expected network interface.

# Force NCCL to use a specific network interface
export NCCL_SOCKET_IFNAME=eth0

# For InfiniBand
export NCCL_IB_HCA=mlx5_0

# For RoCE (RDMA over Converged Ethernet)
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0

Verify connectivity between nodes:

# Check InfiniBand status
ibstat

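# Test RDMA bandwidth: start the server side on one node...
ib_write_bw -d mlx5_0 --report_gbits
# ...then run the client on the other node, pointing at the server
ib_write_bw -d mlx5_0 --report_gbits <server-hostname>

# Print the NCCL version bundled with PyTorch
# (transport and interface selection only appear in NCCL_DEBUG=INFO logs once a communicator is created)
NCCL_DEBUG=INFO python -c "import torch; print(torch.cuda.nccl.version())"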
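Before setting NCCL_SOCKET_IFNAME, it is worth checking which interface names actually exist on each node, since eth0 is only an example. The sketch below lists local interfaces with the standard library and filters out names that are rarely right for inter-node traffic. The helper candidate_interfaces and its skip list are assumptions for illustration, not NCCL's own selection logic.

```python
import socket


def candidate_interfaces(names, skip_prefixes=("lo", "docker", "veth")):
    """Drop loopback and container-bridge style names from a list of interfaces."""
    return [name for name in names if not name.startswith(skip_prefixes)]


if __name__ == "__main__":
    # socket.if_nameindex() returns (index, name) pairs for local interfaces.
    local = [name for _, name in socket.if_nameindex()]
    print("all interfaces:", local)
    print("candidates for NCCL_SOCKET_IFNAME:", candidate_interfaces(local))
```

Run it on every node: a name that exists on one host but not another is a common reason ranks fail to connect.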

2. Firewall blocking NCCL ports

NCCL uses a range of TCP/IP ports for socket-based communication. If ranks cannot connect:

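# Socket transport tuning (helper threads and sockets per thread); these do not select ports
export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=4

# Open the PyTorch rendezvous ports (29400 is the torchrun default, 29500 the default MASTER_PORT).
# NCCL's own sockets use dynamically assigned ports, so nodes may also need to reach
# each other on the ephemeral port range.
sudo ufw allow 29400:29500/tcp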

# Or in Kubernetes, ensure pods can communicate on all ports
# (most CNI plugins allow this by default within a namespace)
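A quick way to test whether a port is reachable from another node is a plain TCP connection attempt. This is a minimal sketch using only the standard library; the helper name can_connect is illustrative, and the port 29500 in the usage block is just the rendezvous port from the firewall rule above.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, unreachable, timed out, or DNS failure
        return False


if __name__ == "__main__":
    # Replace 127.0.0.1 with the address of the rank 0 node.
    print("rendezvous port reachable:", can_connect("127.0.0.1", 29500))
```

A False result only means the connection failed: the port may be blocked, or nothing may be listening on it yet, so run the check while the rank 0 process is up.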

3. GPU memory exhaustion

If one GPU runs out of memory around a collective operation, its rank can stall or exit without completing the collective, which shows up as a timeout on the other ranks:

# Monitor GPU memory during training
watch -n 1 nvidia-smi

# Reduce allocator fragmentation by capping the size of blocks the caching allocator will split
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
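Watching nvidia-smi by hand does not scale to long runs, so the same information can be checked programmatically. The sketch below parses the CSV output of nvidia-smi's --query-gpu mode and flags GPUs above a usage threshold. The function names and the 0.95 threshold are arbitrary choices for illustration, and the sample text is made up to show the format.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse 'index, used_mib, total_mib' lines into (index, used_fraction) pairs."""
    rows = []
    for line in text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append((int(index), int(used) / int(total)))
    return rows


def gpus_near_limit(text, threshold=0.95):
    """Return the indices of GPUs whose memory use is at or above threshold."""
    return [index for index, fraction in parse_memory_csv(text) if fraction >= threshold]


def read_gpu_memory():
    """Run nvidia-smi and return its raw CSV output (requires an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


sample = "0, 79000, 81920\n1, 40000, 81920\n"  # illustrative values in MiB
print(gpus_near_limit(sample))  # -> [0]
```

On a GPU node, call gpus_near_limit(read_gpu_memory()) periodically and log the result next to the training step number.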

4. Asymmetric GPU topology

When GPUs within a node have different interconnect speeds (NVLink vs PCIe), NCCL may choose a suboptimal path:

# Check GPU topology
nvidia-smi topo -m

# Force NCCL to use specific algorithms
export NCCL_ALGO=Ring    # or Tree, CollnetDirect, CollnetChain
export NCCL_PROTO=Simple # or LL, LL128

5. One rank is slower (straggler)

Data loading, preprocessing, or gradient computation takes longer on one rank. All other ranks wait at the collective barrier:

# Enable async error handling to avoid indefinite hangs
# (older PyTorch releases read NCCL_ASYNC_ERROR_HANDLING instead)
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

# Verbose per-rank NCCL logs
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
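NCCL logs show where a collective is waiting, but not which rank arrived late. One way to find a straggler is to time the work each rank does before the collective and compare across ranks. The sketch below shows only the comparison step; the function name find_stragglers, the 1.5x threshold, and the sample timings are assumptions for illustration.

```python
from statistics import median


def find_stragglers(step_seconds, slowdown=1.5):
    """Return ranks whose pre-collective step time exceeds slowdown x the median."""
    typical = median(step_seconds.values())
    return sorted(
        rank for rank, seconds in step_seconds.items() if seconds > slowdown * typical
    )


# Illustrative per-rank timings in seconds (rank -> time spent before the collective).
timings = {0: 1.02, 1: 0.98, 2: 1.00, 3: 2.40}
print(find_stragglers(timings))  # -> [3]
```

In a real job, each rank could measure its own time with time.perf_counter() and share it via torch.distributed.all_gather_object. Time the work before the collective rather than the collective itself, because fast ranks spend the collective waiting for the slow one.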

Essential NCCL environment variables

Variable               Default            Description
NCCL_TIMEOUT           1800               Timeout in seconds for collective operations (only where the launcher or container reads it; not an NCCL-documented variable)
NCCL_DEBUG             WARN               Logging level: VERSION, WARN, INFO, TRACE
NCCL_DEBUG_SUBSYS      INIT,BOOTSTRAP,ENV Which subsystems to log (ALL enables every subsystem)
NCCL_SOCKET_IFNAME     auto               Network interface for socket communication
NCCL_IB_HCA            auto               InfiniBand HCA device to use
NCCL_IB_GID_INDEX      0                  GID index for RoCE
NCCL_IB_DISABLE        0                  Set to 1 to force TCP instead of IB
NCCL_P2P_DISABLE       0                  Set to 1 to disable GPU peer-to-peer (NVLink/PCIe)
NCCL_SHM_DISABLE       0                  Set to 1 to disable the shared memory transport
NCCL_ALGO              auto               Algorithm: Ring, Tree, CollnetDirect
NCCL_PROTO             auto               Protocol: LL, LL128, Simple
NCCL_NET_GDR_LEVEL     auto               GPU Direct RDMA level
NCCL_CROSS_NIC         2                  Whether rings and trees may use different NICs across nodes (0, 1, or 2)
NCCL_SOCKET_NTHREADS   1                  CPU helper threads per socket connection
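Because these variables must match across ranks, it helps to log what each process actually sees at startup. This is a small sketch using only the standard library; the helper names are illustrative, and reading RANK from the environment assumes a launcher such as torchrun that sets it.

```python
import os

PREFIXES = ("NCCL_", "TORCH_NCCL_")


def nccl_env_snapshot(environ=os.environ):
    """Return the NCCL-related variables from environ, sorted by name."""
    return {
        key: value
        for key, value in sorted(environ.items())
        if key.startswith(PREFIXES)
    }


def format_snapshot(rank, snapshot):
    """Render a snapshot as a single log line tagged with the rank."""
    if not snapshot:
        return f"[rank {rank}] no NCCL-related variables set"
    pairs = " ".join(f"{key}={value}" for key, value in snapshot.items())
    return f"[rank {rank}] {pairs}"


if __name__ == "__main__":
    rank = os.environ.get("RANK", "?")  # set by torchrun; "?" when launched by hand
    print(format_snapshot(rank, nccl_env_snapshot()))
```

Diffing these lines across ranks quickly reveals a node that was launched with a different interface name or debug setting.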

Kubernetes-specific configuration

In Kubernetes multi-node GPU workloads, NCCL issues often come from pod networking:

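apiVersion: v1
kind: Pod
metadata:
  name: nccl-training  # placeholder name
spec:
  containers:
    - name: training
      image: <your-training-image>  # placeholder: replace with your image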
      env:
        - name: NCCL_TIMEOUT
          value: "3600"
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
        - name: NCCL_IB_DISABLE
          value: "0"
      resources:
        limits:
          nvidia.com/gpu: 8
          rdma/rdma_shared_device_a: 1
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]

For NVIDIA NIM multi-node deployments, the Helm chart handles most NCCL configuration. Override with:

# values.yaml
env:
  - name: NCCL_TIMEOUT
    value: "3600"
  - name: NCCL_DEBUG
    value: "INFO"

Debugging workflow

  1. Enable NCCL_DEBUG=INFO: see which transport NCCL selects and where it gets stuck
  2. Check nvidia-smi topo -m: verify GPU interconnect topology
  3. Test network: ib_write_bw for RDMA, iperf3 for TCP
  4. Check firewall: ensure all ranks can reach each other
  5. Monitor GPU memory: out-of-memory on one rank causes timeout on others
  6. Increase timeout temporarily: if the operation eventually completes, the issue is performance, not connectivity
