Single-node training hits a ceiling fast. When your model does not fit on eight GPUs, or training takes weeks instead of days, you need multi-node distributed training. Slurm is the scheduler that makes this manageable.
I have debugged enough distributed training failures to know where things break. This is a practical guide to getting multi-node GPU training working reliably with Slurm.
The Distributed Training Stack
A multi-node training job involves several layers:
Your training script (PyTorch / DeepSpeed)
↓
torch.distributed / DeepSpeed runtime
↓
NCCL (GPU-to-GPU communication)
↓
InfiniBand / RoCE (network fabric)
↓
Slurm (resource allocation and job launch)
Slurm handles the bottom layer: allocating nodes, setting up the environment, launching processes, and cleaning up when the job finishes or fails.
Basic Multi-Node Job Script
Here is a real-world Slurm batch script for distributed PyTorch training:
#!/bin/bash
#SBATCH --job-name=llm-pretrain
#SBATCH --partition=gpu-large
#SBATCH --nodes=8
#SBATCH --gres=gpu:h100:8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=96
#SBATCH --mem=0
#SBATCH --exclusive
#SBATCH --time=168:00:00
#SBATCH --output=logs/pretrain_%j.log
#SBATCH --error=logs/pretrain_%j.err
# NCCL configuration
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_LEVEL=5
export NCCL_TOPO_DUMP_FILE=/tmp/nccl_topo_${SLURM_JOB_ID}.xml
# Get master node address
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export WORLD_SIZE=$((SLURM_NNODES * 8))
# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker it spawns,
# so each srun task (one per node) only needs to pass its node rank.
srun --kill-on-bad-exit=1 bash -c '
torchrun \
--nproc_per_node=8 \
--nnodes=$SLURM_NNODES \
--node_rank=$SLURM_NODEID \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
train.py \
--model-config configs/7b.yaml \
--data-path /shared/datasets/pile \
--checkpoint-dir /shared/checkpoints/$SLURM_JOB_ID
'
Key decisions in this script:
- --ntasks-per-node=1, with torchrun handling per-GPU processes internally
- --exclusive ensures no other jobs share the nodes
- --mem=0 allocates all available memory on each node
- NCCL environment variables tuned for InfiniBand with GPUDirect RDMA
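To make the rank layout concrete, here is a minimal Python sketch of the arithmetic torchrun effectively applies when each node runs the same number of workers. The function names and the default of 8 GPUs per node are illustrative assumptions taken from the script above, not part of any library.

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_node: int = 8) -> int:
    """Global rank of a worker, assuming every node runs gpus_per_node workers."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return node_rank * gpus_per_node + local_rank


def world_size(num_nodes: int, gpus_per_node: int = 8) -> int:
    """Total number of workers across the job."""
    return num_nodes * gpus_per_node


if __name__ == "__main__":
    # 8 nodes x 8 GPUs, as in the batch script above
    print(world_size(8))
    print(global_rank(node_rank=7, local_rank=7))
```

Every (node, GPU) pair maps to a distinct rank, which is why the training script itself should read RANK and LOCAL_RANK from the environment instead of recomputing them.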
NCCL Tuning for Multi-Node
NCCL (NVIDIA Collective Communications Library) is the communication backbone. Getting it right is the difference between 50% and 90% scaling efficiency.
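Scaling efficiency here means measured throughput divided by the ideal of perfect linear scaling. A small sketch of that calculation follows; the function name and the throughput figures are hypothetical and only illustrate the ratio.

```python
def scaling_efficiency(single_gpu_throughput: float,
                       multi_gpu_throughput: float,
                       num_gpus: int) -> float:
    """Fraction of ideal linear scaling actually achieved (1.0 is perfect)."""
    if single_gpu_throughput <= 0 or num_gpus <= 0:
        raise ValueError("throughput and GPU count must be positive")
    ideal = single_gpu_throughput * num_gpus
    return multi_gpu_throughput / ideal


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 samples/s on one GPU, 57,600 samples/s on 64 GPUs
    print(f"{scaling_efficiency(1000.0, 57600.0, 64):.0%}")
```

Measuring this on two nodes and again at full scale shows whether communication is the bottleneck before you commit to a long run.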
Essential NCCL Variables
# Enable InfiniBand
export NCCL_IB_DISABLE=0
# Use GPUDirect RDMA (bypass CPU for GPU-to-GPU across nodes)
export NCCL_NET_GDR_LEVEL=5
# Restrict NCCL to InfiniBand HCAs whose device names start with mlx5
export NCCL_IB_HCA=mlx5
# Socket interface for out-of-band communication
export NCCL_SOCKET_IFNAME=eth0
# Debug logging (use for troubleshooting, disable in production)
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
Verifying NCCL Performance
Before running real training, test all-reduce bandwidth:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
srun /usr/local/bin/all_reduce_perf \
-b 8 -e 4G -f 2 -g 1 -c 1 -n 100
You should see bandwidth close to your InfiniBand line rate (200 Gbps for HDR, 400 Gbps for NDR). If you see significantly less, check your MOFED driver configuration and network topology.
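When comparing results, keep the units straight: link speeds are quoted in gigabits per second, while all_reduce_perf reports gigabytes per second. The sketch below converts between the two and applies the all-reduce bus-bandwidth correction factor 2(n-1)/n that nccl-tests documents. The function names are illustrative, and the sample algorithm bandwidth is a made-up figure.

```python
def line_rate_gbytes_per_s(gbits_per_s: float) -> float:
    """Convert a link speed in Gb/s to GB/s (8 bits per byte, no protocol overhead)."""
    return gbits_per_s / 8.0


def allreduce_bus_bandwidth(alg_bw_gbytes_per_s: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algorithm bandwidth times 2(n-1)/n."""
    if n_ranks < 2:
        raise ValueError("all-reduce needs at least two ranks")
    return alg_bw_gbytes_per_s * 2.0 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    print(line_rate_gbytes_per_s(200))   # HDR, per port
    print(line_rate_gbytes_per_s(400))   # NDR, per port
    # Hypothetical measurement: 10 GB/s algorithm bandwidth on 16 ranks
    print(allreduce_bus_bandwidth(10.0, 16))
```

Nodes with several HCAs can exceed the single-port figure, so compare against the aggregate line rate of the ports NCCL actually uses.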
DeepSpeed on Slurm
DeepSpeed adds ZeRO optimization, pipeline parallelism, and mixed precision. Here is a Slurm script for DeepSpeed:
#!/bin/bash
#SBATCH --job-name=deepspeed-train
#SBATCH --nodes=4
#SBATCH --gres=gpu:a100:8
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
# Generate hostfile for DeepSpeed
scontrol show hostnames $SLURM_JOB_NODELIST | \
while read host; do echo "$host slots=8"; done > /tmp/hostfile_${SLURM_JOB_ID}
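# Run the launcher once, from the batch host. Given a hostfile, it starts the
# remote workers itself (pdsh over SSH by default), so do not wrap it in srun.
deepspeed \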
--hostfile /tmp/hostfile_${SLURM_JOB_ID} \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
train.py \
--deepspeed \
--deepspeed_config configs/ds_zero3.json
DeepSpeed ZeRO-3 Config
{
  "train_batch_size": 256,
  "gradient_accumulation_steps": 4,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "none" },
    "offload_param": { "device": "none" },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8
  }
}
Checkpointing Strategy
Long training jobs will fail. Plan for it:
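# Send SIGUSR1 to the batch shell 120 seconds before the time limit
#SBATCH --signal=B:USR1@120
# Let Slurm requeue the job after preemption or node failure
#SBATCH --requeue
# In the batch script, forward the warning to the training process.
# PID must hold the process ID of the backgrounded launch command (PID=$! after "... &"),
# followed by "wait", so that bash can run the trap while training is still in progress.
trap 'echo "Saving checkpoint..."; kill -INT $PID' USR1
Save checkpoints to shared storage (NFS, Lustre, or GPFS) every N steps. When the job restarts, it resumes from the latest checkpoint.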
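Resuming requires finding the newest checkpoint on restart. Here is a minimal sketch that assumes the step_<N>.pt naming used in this guide's save code; the helper name latest_checkpoint is hypothetical. It compares step numbers numerically, because a plain string sort would rank step_900.pt above step_1000.pt.

```python
import re
from pathlib import Path
from typing import Optional, Union

_STEP_RE = re.compile(r"^step_(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir: Union[str, Path]) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if there is none."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best_step, best_path = -1, None
    for path in directory.iterdir():
        match = _STEP_RE.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    # Hypothetical directory; prints None when nothing has been saved yet
    print(latest_checkpoint("/shared/checkpoints/example"))
```

A requeued job keeps its job ID, so a checkpoint directory keyed on the job ID is still valid after the restart.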
# In your PyTorch training loop
# With plain DDP, save from rank 0 only; ZeRO-3 and FSDP shard state
# across ranks and need their own checkpoint APIs.
if step % checkpoint_interval == 0:
    torch.save({
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, f'/shared/checkpoints/{job_id}/step_{step}.pt')
Job Arrays for Hyperparameter Sweeps
Slurm job arrays are perfect for parallel experiments:
#!/bin/bash
#SBATCH --job-name=hparam-sweep
#SBATCH --array=0-19
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
LEARNING_RATES=(1e-5 2e-5 5e-5 1e-4 2e-4)
BATCH_SIZES=(8 16 32 64)
LR_IDX=$((SLURM_ARRAY_TASK_ID / 4))
BS_IDX=$((SLURM_ARRAY_TASK_ID % 4))
python train.py \
--lr ${LEARNING_RATES[$LR_IDX]} \
--batch-size ${BATCH_SIZES[$BS_IDX]} \
--output results/run_${SLURM_ARRAY_TASK_ID}
This launches 20 single-GPU jobs covering all learning rate and batch size combinations.
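The index arithmetic is easy to get wrong when the grid changes, so it is worth checking offline. This sketch mirrors the shell mapping in Python (integer division for the learning rate, remainder for the batch size); sweep_params is an illustrative name, not something the batch script calls.

```python
LEARNING_RATES = ["1e-5", "2e-5", "5e-5", "1e-4", "2e-4"]
BATCH_SIZES = [8, 16, 32, 64]


def sweep_params(task_id: int):
    """Map a Slurm array task ID to a (learning_rate, batch_size) pair."""
    lr_idx, bs_idx = divmod(task_id, len(BATCH_SIZES))
    return LEARNING_RATES[lr_idx], BATCH_SIZES[bs_idx]


if __name__ == "__main__":
    combos = [sweep_params(i) for i in range(len(LEARNING_RATES) * len(BATCH_SIZES))]
    # Every combination should appear exactly once
    assert len(set(combos)) == len(combos)
    for task_id, (lr, bs) in enumerate(combos):
        print(task_id, lr, bs)
```

If you add a batch size, the divisor in the shell script and the --array range both have to change with it.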
Common Failures and Fixes
NCCL timeout on multi-node: Usually a firewall or network config issue. Ensure all GPU nodes can reach each other on the NCCL port range.
# Check connectivity
srun --nodes=2 bash -c 'hostname && ibstat | grep State'
OOM on large models: Use DeepSpeed ZeRO-3 or FSDP to shard model parameters across GPUs.
Hanging jobs: Set NCCL_DEBUG=INFO and check for asymmetric collective operations. One node entering all-reduce while another has not reached it yet is the most common cause.
Slow data loading: Use parallel data loaders and ensure your shared filesystem can handle the I/O. Consider staging data to local NVMe before training starts.
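Staging to local NVMe can be as simple as a guarded copy at job start. The sketch below is one way to do it, assuming the dataset is a directory tree; stage_dataset and the .staged marker file are hypothetical names. The marker lets a restarted job skip the copy and replace a partial one.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Union


def stage_dataset(shared_path: Union[str, Path], local_root: Union[str, Path]) -> Path:
    """Copy a dataset directory to node-local storage once and return the local path."""
    src = Path(shared_path)
    dst = Path(local_root) / src.name
    marker = dst / ".staged"
    if marker.exists():
        return dst              # a previous run finished the copy
    if dst.exists():
        shutil.rmtree(dst)      # discard a partial copy from an interrupted run
    shutil.copytree(src, dst)
    marker.touch()              # written last, so it only exists after a full copy
    return dst


if __name__ == "__main__":
    # Self-contained demo using temporary directories in place of real storage
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp) / "shared" / "pile"
        shared.mkdir(parents=True)
        (shared / "shard0.bin").write_bytes(b"example")
        print(stage_dataset(shared, Path(tmp) / "nvme"))
```

In a multi-GPU job, run this from one process per node and have the other local ranks wait, otherwise eight workers will race to copy the same files.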
Monitoring GPU Utilization
Track cluster-wide GPU usage:
# Real-time GPU status across all nodes
srun --nodes=$SLURM_JOB_NUM_NODES nvidia-smi --query-gpu=name,utilization.gpu,memory.used --format=csv
# Historical accounting
sacct -j $SLURM_JOB_ID --format=JobID,Elapsed,MaxRSS,TRESUsageInTot
For deeper observability, integrate with OpenTelemetry or NVIDIA DCGM (Data Center GPU Manager).
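Before reaching for a full monitoring stack, the CSV from the nvidia-smi query above can be summarized with a few lines of Python. This is a rough sketch that assumes the usual output shape, a header row followed by rows such as "NVIDIA H100 80GB HBM3, 97 %, 71234 MiB", repeated once per node under srun. The function names and sample values are hypothetical.

```python
import csv
import io


def parse_gpu_csv(text: str):
    """Parse 'name, utilization.gpu, memory.used' CSV rows, skipping repeated headers."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 3 or row[0] == "name":
            continue  # blank line, malformed row, or a per-node header
        name, util, mem = row
        try:
            gpus.append({
                "name": name,
                "util_pct": float(util.split()[0]),
                "mem_mib": float(mem.split()[0]),
            })
        except (ValueError, IndexError):
            continue  # values such as [N/A] are skipped
    return gpus


def mean_utilization(gpus) -> float:
    """Average GPU utilization in percent across all parsed rows."""
    if not gpus:
        raise ValueError("no GPU rows parsed")
    return sum(g["util_pct"] for g in gpus) / len(gpus)
```

Piping the srun output into a script like this gives a quick job-wide average; sustained low utilization usually points at data loading or communication stalls.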
What I Recommend
Start with a two-node test before scaling to your full cluster. Get NCCL all-reduce working at line rate first; if the communication layer is slow, nothing else matters.
Use Slurm's --exclusive flag for training jobs. GPU memory fragmentation from shared nodes causes mysterious OOMs.
Invest in checkpointing from day one. A 72-hour training job that fails at hour 60 with no checkpoint is an expensive lesson.