If you are running GPU infrastructure at scale, you will eventually meet Slurm. It is the workload manager behind six of the top ten supercomputers in the world, and it is the standard scheduler for AI training clusters running NVIDIA hardware.
I have worked with Slurm in environments ranging from a handful of GPU nodes to multi-thousand node clusters. This is how it works, why it dominates, and how to configure it for modern AI workloads.
What Slurm Actually Does
Slurm (Simple Linux Utility for Resource Management) handles three things:
- Resource allocation: decides which nodes and GPUs your job gets
- Job scheduling: queues, prioritizes, and dispatches work
- Job monitoring: tracks running jobs, handles failures, collects accounting data
It is open source, maintained by SchedMD, and scales to millions of cores and tens of thousands of GPUs.
Why Slurm Dominates GPU/AI Infrastructure
Kubernetes is great for microservices. But for HPC and large-scale AI training, Slurm wins because:
- Native GPU awareness: Slurm understands NVIDIA GPUs as first-class resources (GRES)
- MIG support: schedule jobs on specific Multi-Instance GPU partitions
- Topology-aware scheduling: places jobs on nodes connected by NVLink or InfiniBand for maximum throughput
- No container overhead: jobs run directly on bare metal when needed
- Mature accounting: track GPU-hours by user, project, or department
Core Architecture
A minimal Slurm cluster has three components:
slurmctld (controller): manages the cluster
slurmd (daemon): runs on each compute node
slurmdbd (database): optional accounting daemon
The controller is the brain. It maintains the job queue, tracks node states, and dispatches work. Every compute node runs slurmd, which receives and executes jobs.
Configuring Slurm for NVIDIA GPUs
GPU resources are configured through the Generic Resource (GRES) system. You need two files.
slurm.conf: Cluster Configuration
# Define GPU resources on nodes
GresTypes=gpu
NodeName=gpu-node[01-08] Gres=gpu:a100:8 CPUs=128 RealMemory=1024000
PartitionName=training Nodes=gpu-node[01-08] Default=YES MaxTime=72:00:00
gres.conf: GPU Details Per Node
# /etc/slurm/gres.conf on each GPU node
AutoDetect=nvml
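Name=gpu Type=a100 File=/dev/nvidia[0-7]
The AutoDetect=nvml option uses the NVIDIA Management Library to detect GPU properties automatically, provided Slurm was built with NVML support. The explicit Name=gpu line is then optional and serves as a cross-check against what NVML reports. This is the recommended approach for modern setups.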
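The bracket notation in NodeName=gpu-node[01-08] is Slurm's hostlist shorthand. As a rough illustration of what it stands for, here is a minimal Python sketch. The helper expand_hostlist is hypothetical (it is not part of Slurm) and it only handles a single bracket group; on a real cluster, scontrol show hostnames does the expansion for you.

```python
import re


def expand_hostlist(expr: str) -> list[str]:
    """Expand a simple host expression such as 'gpu-node[01-08]'.

    Hypothetical helper for illustration: supports one bracket group
    holding comma-separated numbers or ranges, and keeps zero padding.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # plain hostname, nothing to expand
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a single number is a range of one
        width = len(start)  # preserve zero padding from the range start
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


print(expand_hostlist("gpu-node[01-08]"))
```

Each expanded hostname must match what the node itself reports, which is why a typo in the range usually shows up as nodes that never register with the controller.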
Submitting GPU Jobs
A basic GPU training job submission:
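#!/bin/bash
#SBATCH --job-name=llm-finetune
#SBATCH --partition=training
#SBATCH --nodes=4
#SBATCH --gres=gpu:a100:8
# One torchrun launcher per node; torchrun itself spawns the 8 GPU workers
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --mem=900G
#SBATCH --time=48:00:00
#SBATCH --output=training_%j.log
module load cuda/12.4
module load nccl/2.21
# Recent Slurm releases do not pass --cpus-per-task from sbatch to srun
export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"
# Use the first node of the allocation as the rendezvous host
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# 29500 is an assumed free port; pick any port open between the nodes
srun torchrun \
--nproc_per_node=8 \
--nnodes=4 \
--rdzv_id="$SLURM_JOB_ID" \
--rdzv_backend=c10d \
--rdzv_endpoint="$head_node:29500" \
train.py --config configs/finetune.yaml
Submit it:
sbatch train.sh
List jobs across the cluster along with the GRES each one requested (squeue shows allocations, not live GPU utilization):
squeue --format="%.18i %.9P %.30j %.8u %.8T %.10M %.6D %R %b"
MIG Partitioning with Slurm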
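For multi-tenant clusters, MIG partitioning lets you split A100 or H100 GPUs into isolated instances. Slurm schedules each instance as its own gpu GRES, but it does not create the instances for you: partition the GPUs with nvidia-smi first, then let NVML autodetection pick them up:
# gres.conf for MIG instances
AutoDetect=nvml
# Manual entries must list the parent GPU device plus that instance's
# /dev/nvidia-caps files via MultipleFiles=; File=/dev/nvidia[0-7] alone describes whole GPUs.
Users request specific MIG profiles, using the type names declared for the node in slurm.conf:
#SBATCH --gres=gpu:a100_3g.20gb:1
This is how you run inference workloads alongside training without interference.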
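Profile names such as a100_3g.20gb encode the size of the instance: the number before "g" is the compute slice count and the number before "gb" is the memory in GB. The sketch below, in Python, reads those two numbers and checks a proposed mix against the budget of one A100 40GB (7 compute slices, 40 GB). Both function names are hypothetical, and the check is only a necessary condition: NVIDIA's real placement rules are stricter, so a mix that passes here may still be rejected by the driver.

```python
import re

# Budget of a single A100 40GB in MIG mode.
A100_40GB_COMPUTE_SLICES = 7
A100_40GB_MEMORY_GB = 40


def parse_mig_profile(gres_type: str) -> tuple[int, int]:
    """Return (compute_slices, memory_gb) from e.g. 'a100_3g.20gb' or '1g.5gb'."""
    match = re.search(r"(\d+)g\.(\d+)gb$", gres_type)
    if match is None:
        raise ValueError(f"not a MIG profile name: {gres_type!r}")
    return int(match.group(1)), int(match.group(2))


def within_slice_budget(
    profiles: list[str],
    compute_slices: int = A100_40GB_COMPUTE_SLICES,
    memory_gb: int = A100_40GB_MEMORY_GB,
) -> bool:
    """Necessary (not sufficient) check that the profiles fit on one GPU."""
    used_compute = sum(parse_mig_profile(p)[0] for p in profiles)
    used_memory = sum(parse_mig_profile(p)[1] for p in profiles)
    return used_compute <= compute_slices and used_memory <= memory_gb


print(parse_mig_profile("a100_3g.20gb"))
print(within_slice_budget(["a100_3g.20gb", "a100_3g.20gb"]))
```

A check like this is useful when planning how many instances of each profile to advertise per node, before committing the layout with nvidia-smi.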
Slurm vs Kubernetes for AI Workloads
| Aspect | Slurm | Kubernetes |
|---|---|---|
| Best for | HPC, large training jobs | Microservices, inference |
| GPU scheduling | Native GRES, topology-aware | Device plugin, less mature |
| Multi-node training | First-class support | Requires operators (MPI, PyTorch) |
| Overhead | Minimal | Container + orchestration |
| Learning curve | HPC community | Cloud-native community |
Many organizations run both: Slurm for training, Kubernetes for inference and serving. On the Kubernetes side, the NVIDIA GPU Operator installs and manages the drivers, container toolkit, and device plugin that expose GPUs to pods.
InfiniBand and RDMA Integration
For distributed training, network topology matters. With the tree topology plugin enabled (TopologyPlugin=topology/tree in slurm.conf), Slurm's topology-aware scheduling tries to place multi-node jobs under as few InfiniBand switches as possible:
# topology.conf
SwitchName=leaf1 Nodes=gpu-node[01-04]
SwitchName=leaf2 Nodes=gpu-node[05-08]
SwitchName=spine1 Switches=leaf[1-2]
Combined with MOFED drivers and RDMA networking, this keeps all-reduce traffic on the shortest available path, which is what lets it approach line rate.
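To make the effect of the tree concrete, the following Python sketch finds the lowest switch that covers a given set of nodes. It is an illustration only: the two dictionaries are a hand-written mirror of the topology.conf above, lowest_common_switch is a hypothetical helper, and Slurm's own placement logic considers more than this.

```python
# Hand-written mirror of the two-level tree in topology.conf.
LEAF_NODES = {
    "leaf1": [f"gpu-node{i:02d}" for i in range(1, 5)],
    "leaf2": [f"gpu-node{i:02d}" for i in range(5, 9)],
}
SPINE_SWITCHES = {"spine1": ["leaf1", "leaf2"]}


def leaf_of(node: str) -> str:
    """Return the leaf switch a node hangs off."""
    for leaf, members in LEAF_NODES.items():
        if node in members:
            return leaf
    raise ValueError(f"unknown node: {node}")


def lowest_common_switch(nodes: list[str]) -> str:
    """Return the lowest switch in the tree that connects all given nodes."""
    leaves = {leaf_of(node) for node in nodes}
    if len(leaves) == 1:
        return leaves.pop()  # all traffic stays on a single leaf switch
    for spine, children in SPINE_SWITCHES.items():
        if leaves <= set(children):
            return spine  # traffic has to cross the spine
    raise ValueError("nodes are not connected by any known switch")


print(lowest_common_switch(["gpu-node01", "gpu-node03"]))
print(lowest_common_switch(["gpu-node02", "gpu-node06"]))
```

A four-node job that fits under leaf1 never touches the spine, while a job spanning both leaves pays an extra hop on every collective.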
Job Accounting and GPU Tracking
Enable accounting to track GPU usage by team:
# slurm.conf
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdb01
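JobAcctGatherType=jobacct_gather/linux
# Track GPUs as a trackable resource (TRES) so they appear in accounting records
AccountingStorageTRES=gres/gpu
List per-job GPU allocations and elapsed time for all users; GPU-hours are the allocated GPU count multiplied by the elapsed time:
sacct --allusers --allocations \
--format=JobID,User,Partition,AllocTRES%40,Elapsed,State \
--starttime=2026-01-01 --endtime=2026-03-01
This is essential for chargeback models in shared GPU clusters.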
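sacct reports allocations and durations, not GPU-hours, so the multiplication has to happen afterwards. Here is a minimal Python sketch of that step, assuming sacct was run with --parsable2 --noheader --format=User,AllocTRES,Elapsed so that each line has three pipe-separated fields. The function names and the sample records are made up for illustration.

```python
import re
from collections import defaultdict


def elapsed_to_hours(elapsed: str) -> float:
    """Convert a sacct Elapsed value such as '1-02:30:00' or '02:30:00' to hours."""
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in elapsed.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # tolerate a bare MM:SS value
    hours, minutes, seconds = parts
    return days * 24 + hours + minutes / 60 + seconds / 3600


def gpus_in_tres(alloc_tres: str) -> int:
    """Extract the total GPU count from an AllocTRES string (0 if none)."""
    match = re.search(r"gres/gpu=(\d+)", alloc_tres)
    return int(match.group(1)) if match else 0


def gpu_hours_by_user(sacct_text: str) -> dict[str, float]:
    """Sum GPU-hours per user from 'User|AllocTRES|Elapsed' lines."""
    totals: dict[str, float] = defaultdict(float)
    for line in sacct_text.strip().splitlines():
        user, alloc_tres, elapsed = line.split("|")
        totals[user] += gpus_in_tres(alloc_tres) * elapsed_to_hours(elapsed)
    return dict(totals)


# Made-up sample records, not real cluster output.
SAMPLE = """\
alice|billing=512,cpu=512,gres/gpu=32,mem=3600G,node=4|1-00:00:00
bob|cpu=16,gres/gpu=1,mem=64G,node=1|02:30:00
alice|cpu=128,gres/gpu=8,mem=900G,node=1|00:30:00
"""

print(gpu_hours_by_user(SAMPLE))
```

This counts allocated time rather than utilized time, which is usually what a chargeback model wants: an idle GPU that nobody else can use still costs money.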
Pyxis and Enroot: Containers on Slurm
You do not have to choose between bare metal and containers. NVIDIA's Pyxis plugin lets you run container images directly within Slurm jobs:
srun --container-image=nvcr.io/nvidia/pytorch:24.03-py3 \
--container-mounts=/data:/data \
python train.py
Enroot handles the container runtime underneath. It suits HPC workloads better than Docker because it is daemonless and runs containers unprivileged, as the submitting user.
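If jobs are launched from a wrapper script, the container flags can be assembled programmatically. The Python sketch below builds the same srun command line as above from its parts; build_container_srun is a hypothetical helper, and it uses only the two Pyxis flags already shown in this section.

```python
import shlex


def build_container_srun(
    image: str, mounts: dict[str, str], command: list[str]
) -> list[str]:
    """Assemble an srun argv that runs `command` inside a Pyxis container."""
    argv = ["srun", f"--container-image={image}"]
    if mounts:
        # Pyxis takes mounts as comma-separated SRC:DST pairs.
        pairs = ",".join(f"{src}:{dst}" for src, dst in mounts.items())
        argv.append(f"--container-mounts={pairs}")
    argv.extend(command)
    return argv


argv = build_container_srun(
    image="nvcr.io/nvidia/pytorch:24.03-py3",
    mounts={"/data": "/data"},
    command=["python", "train.py"],
)
print(shlex.join(argv))
```

Keeping the command as a list until the last moment avoids quoting mistakes when a mount path or argument contains spaces.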
Getting Started
For a small GPU cluster:
- Install Slurm packages from SchedMD
- Configure slurm.conf with your node inventory
- Set up gres.conf with AutoDetect=nvml
- Start slurmctld on the controller, slurmd on compute nodes
- Submit your first job with sbatch
For production clusters, consider Ansible automation to manage Slurm configuration across hundreds of nodes. Tools like DeepOps provide ready-made playbooks.
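Whatever tool does the templating, the idea is the same: keep the node inventory as data and render the slurm.conf lines from it. A minimal Python sketch of that idea follows. The NodeGroup class and both render functions are hypothetical and cover only the fields used in the configuration shown earlier; they are not taken from DeepOps or any Ansible role.

```python
from dataclasses import dataclass


@dataclass
class NodeGroup:
    """One homogeneous group of GPU nodes in the inventory."""
    hosts: str            # hostlist expression, e.g. "gpu-node[01-08]"
    gpu_type: str
    gpus: int
    cpus: int
    real_memory_mb: int


def render_node_line(group: NodeGroup) -> str:
    """Render the NodeName line for one node group."""
    return (
        f"NodeName={group.hosts} Gres=gpu:{group.gpu_type}:{group.gpus} "
        f"CPUs={group.cpus} RealMemory={group.real_memory_mb}"
    )


def render_partition_line(
    name: str, groups: list[NodeGroup], max_time: str, default: bool = False
) -> str:
    """Render a PartitionName line covering the given node groups."""
    nodes = ",".join(group.hosts for group in groups)
    default_flag = "YES" if default else "NO"
    return f"PartitionName={name} Nodes={nodes} Default={default_flag} MaxTime={max_time}"


inventory = [NodeGroup("gpu-node[01-08]", "a100", 8, 128, 1024000)]
print("GresTypes=gpu")
for group in inventory:
    print(render_node_line(group))
print(render_partition_line("training", inventory, "72:00:00", default=True))
```

Generating the file from one inventory keeps slurm.conf identical on every node, which Slurm expects.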
What I Recommend
If you are building a GPU cluster for AI training, start with Slurm. It is battle-tested, the NVIDIA ecosystem assumes it, and the operational overhead is lower than Kubernetes for batch workloads.
If you need both training and inference, run Slurm for the training partition and Kubernetes for the inference serving layer. The GPUs can even be the same hardware model, as long as each node pool is managed by exactly one scheduler.
For consulting on GPU infrastructure architecture, check my services or connect on LinkedIn.