Slurm for GPU Clusters: The Workload Manager

Slurm is the dominant workload manager for GPU clusters and HPC. How to configure it for NVIDIA GPUs, MIG, and AI training jobs.

Luca Berton

If you are running GPU infrastructure at scale, you will eventually meet Slurm. It is the workload manager behind six of the top ten supercomputers in the world, and it is the standard scheduler for AI training clusters running NVIDIA hardware.

I have worked with Slurm in environments ranging from a handful of GPU nodes to multi-thousand node clusters. This is how it works, why it dominates, and how to configure it for modern AI workloads.

What Slurm Actually Does

Slurm (originally short for Simple Linux Utility for Resource Management) handles three things:

  1. Resource allocation: decides which nodes and GPUs your job gets
  2. Job scheduling: queues, prioritizes, and dispatches work
  3. Job monitoring: tracks running jobs, handles failures, collects accounting data

It is open source, maintained by SchedMD, and scales to millions of cores and tens of thousands of GPUs.

Why Slurm Dominates GPU/AI Infrastructure

Kubernetes is great for microservices. But for HPC and large-scale AI training, Slurm wins because:

  • Native GPU awareness: Slurm understands NVIDIA GPUs as first-class resources (GRES)
  • MIG support: schedule jobs on specific Multi-Instance GPU partitions
  • Topology-aware scheduling: places jobs on nodes connected by NVLink or InfiniBand for maximum throughput
  • No container overhead: jobs run directly on bare metal when needed
  • Mature accounting: track GPU-hours by user, project, or department

Core Architecture

A minimal Slurm cluster has three components:

slurmctld (controller) → manages the cluster
slurmd (daemon)        → runs on each compute node
slurmdbd (database)    → optional accounting daemon

The controller is the brain. It maintains the job queue, tracks node states, and dispatches work. Every compute node runs slurmd, which receives and executes jobs.
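Since the controller is the source of truth for node state, a quick health check is to summarize what it reports. The sketch below is illustrative only: count_node_states is a hypothetical helper name, and it assumes its input has the two-column shape produced by sinfo -N -h -o "%N %T" (node name, then state).

```shell
# Count nodes per state from "NODENAME STATE" lines on stdin.
# sinfo -N prints a node once per partition it belongs to, so each
# node name is only counted the first time it appears.
count_node_states() {
  awk '!seen[$1]++ { count[$2]++ }
       END { for (state in count) print state, count[state] }' | sort
}
```

On a live cluster this would be used as sinfo -N -h -o "%N %T" | count_node_states. A growing count of down or drained nodes is usually the first thing to investigate.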

Configuring Slurm for NVIDIA GPUs

GPU resources are configured through the Generic Resource (GRES) system. You need two files.

slurm.conf: Cluster Configuration

# Define GPU resources on nodes
GresTypes=gpu
NodeName=gpu-node[01-08] Gres=gpu:a100:8 CPUs=128 RealMemory=1024000
PartitionName=training Nodes=gpu-node[01-08] Default=YES MaxTime=72:00:00

gres.conf: GPU Details Per Node

# /etc/slurm/gres.conf on each GPU node
AutoDetect=nvml
Name=gpu Type=a100 File=/dev/nvidia[0-7]

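The AutoDetect=nvml option uses the NVIDIA Management Library (NVML) to detect GPU properties automatically, which is the recommended approach for modern setups. It requires a slurmd built with NVML support, and any explicit Name= line you keep alongside it has to agree with what NVML reports.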
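A common source of trouble is a Gres= count in slurm.conf that does not match the devices actually present on the node. As a small sketch, the hypothetical helper below pulls the GPU count out of a single GRES string so it can be compared with the number of device files. It assumes one GRES entry per string, not a comma-separated list.

```shell
# Print the GPU count from a GRES string such as "gpu:a100:8" or "gpu:8".
# A trailing socket suffix like "(S:0-1)", as printed by scontrol, is dropped.
gres_gpu_count() {
  local gres="${1%%\(*}"   # strip everything from the first "(" onward
  echo "${gres##*:}"       # keep what follows the last ":"
}

gres_gpu_count "gpu:a100:8"   # prints 8
```

On a compute node, the result can be compared with ls /dev/nvidia[0-9]* | wc -l. If the two numbers differ, slurmd will typically refuse to register the node with the configured GRES.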

Submitting GPU Jobs

A basic GPU training job submission:

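#!/bin/bash
#SBATCH --job-name=llm-finetune
#SBATCH --partition=training
#SBATCH --nodes=4
#SBATCH --gres=gpu:a100:8
# One torchrun launcher per node; torchrun itself spawns the 8 GPU workers.
#SBATCH --ntasks-per-node=1
# Give the launcher every core on the node (16 per worker).
#SBATCH --cpus-per-task=128
#SBATCH --mem=900G
#SBATCH --time=48:00:00
#SBATCH --output=training_%j.log

module load cuda/12.4
module load nccl/2.21

# Use the first allocated node as the rendezvous host.
# Port 29500 is an arbitrary choice; any free port works.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
  --nproc_per_node=8 \
  --nnodes=4 \
  --rdzv_id="$SLURM_JOB_ID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$head_node:29500" \
  train.py --config configs/finetune.yaml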

Submit it:

sbatch train.sh
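Before submitting, it helps to sanity-check the arithmetic behind the directives: the job script asks for 4 nodes with 8 GPUs each on nodes that have 128 CPUs. The helper below is a hypothetical sketch (plan_job is not a Slurm command) that derives the total number of training processes and the CPU cores available to each one.

```shell
# Usage: plan_job NODES GPUS_PER_NODE CPUS_PER_NODE
# Prints the distributed world size and the CPU cores available per GPU worker.
plan_job() {
  local nodes="$1" gpus_per_node="$2" cpus_per_node="$3"
  echo "world_size=$(( nodes * gpus_per_node )) cpus_per_worker=$(( cpus_per_node / gpus_per_node ))"
}

plan_job 4 8 128   # prints: world_size=32 cpus_per_worker=16
```

If cpus_per_worker comes out lower than the number of data-loader threads each process starts, the job is likely to be CPU-bound no matter how many GPUs it holds.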

List running and pending jobs together with the GPUs they requested (%b prints each job's generic resources):

squeue --format="%.18i %.9P %.30j %.8u %.8T %.10M %.6D %R %b"
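To turn that listing into a per-user summary, the output can be piped through a small filter. This is a sketch under assumptions: gpus_per_user is a hypothetical helper, it expects three columns from squeue -h -o "%u %D %b" (user, node count, per-node GRES), and it assumes the GRES column ends in a GPU count such as gres:gpu:a100:8 or gres/gpu:2. The exact text of that column varies between Slurm versions, so check it on your cluster first.

```shell
# Sum requested GPUs per user from "USER NODES GRES" lines on stdin.
# The GRES request is per node, so it is multiplied by the node count.
gpus_per_user() {
  awk '
    $3 ~ /gpu/ {
      n = split($3, parts, ":")
      # A request with no trailing number means a single GPU per node.
      count = (parts[n] ~ /^[0-9]+$/) ? parts[n] : 1
      total[$1] += $2 * count
    }
    END { for (user in total) print user, total[user] }
  ' | sort
}
```

A typical invocation would be squeue -h -t RUNNING -o "%u %D %b" | gpus_per_user, which restricts the sum to jobs that currently hold GPUs.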

MIG Partitioning with Slurm

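For multi-tenant clusters, MIG partitioning lets you split A100 or H100 GPUs into isolated instances. Slurm schedules existing MIG instances as typed GRES, but it does not create them, so partition the GPUs with nvidia-smi before slurmd starts: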

# gres.conf for MIG instances
Name=gpu Type=a100_3g.20gb File=/dev/nvidia[0-7] Cores=[0-1]
Name=gpu Type=a100_1g.5gb File=/dev/nvidia[0-7] Cores=[2-3]

Users request specific MIG profiles:

#SBATCH --gres=gpu:a100_3g.20gb:1

This is how you run inference workloads alongside training without interference.
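When planning which profiles to offer, a rough feasibility check saves a round trip to the node. The sketch below relies on the naming convention of MIG profiles, where Ng.Mgb means N compute slices and M GB of memory, and on the fact that an A100 exposes 7 compute slices. The function name mig_layout_fits is hypothetical, and the check is necessary but not sufficient: nvidia-smi remains the authority on which placements are actually valid.

```shell
# Usage: mig_layout_fits TOTAL_MEM_GB PROFILE...
# Succeeds if the profiles fit within 7 compute slices and the GPU memory.
mig_layout_fits() {
  local total_mem_gb="$1" compute=0 mem=0 profile size
  shift
  for profile in "$@"; do
    compute=$(( compute + ${profile%%g.*} ))   # "3g.20gb" -> 3 compute slices
    size="${profile#*g.}"                      # "3g.20gb" -> "20gb"
    mem=$(( mem + ${size%gb} ))                # "20gb"    -> 20
  done
  [ "$compute" -le 7 ] && [ "$mem" -le "$total_mem_gb" ]
}

mig_layout_fits 40 3g.20gb 2g.10gb 1g.5gb && echo "layout fits"
```

For a 40 GB A100, the combination 3g.20gb 3g.20gb 1g.5gb is rejected: it uses exactly 7 compute slices but would need 45 GB of memory.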

Slurm vs Kubernetes for AI Workloads

| Aspect              | Slurm                       | Kubernetes                        |
|---------------------|-----------------------------|-----------------------------------|
| Best for            | HPC, large training jobs    | Microservices, inference          |
| GPU scheduling      | Native GRES, topology-aware | Device plugin, less mature        |
| Multi-node training | First-class support         | Requires operators (MPI, PyTorch) |
| Overhead            | Minimal                     | Container + orchestration         |
| Learning curve      | HPC community               | Cloud-native community            |

Many organizations run both: Slurm for training, Kubernetes for inference and serving. On the Kubernetes side, the NVIDIA GPU Operator installs the drivers, container toolkit, and device plugin that expose GPUs to pods.

InfiniBand and RDMA Integration

For distributed training, network topology matters. With TopologyPlugin=topology/tree enabled in slurm.conf, Slurm's topology-aware scheduling tries to place multi-node jobs on nodes under the same InfiniBand switch:

# topology.conf
SwitchName=leaf1 Nodes=gpu-node[01-04]
SwitchName=leaf2 Nodes=gpu-node[05-08]
SwitchName=spine1 Switches=leaf[1-2]

Combined with MOFED drivers and RDMA networking, this keeps all-reduce traffic on as few switch hops as possible, which is what lets it approach line rate.
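Writing topology.conf by hand stops scaling after a few racks. As a minimal sketch, the hypothetical make_topology function below generates a two-level tree in the same shape as the example, assuming nodes are named gpu-nodeNN and are cabled to leaf switches in consecutive groups. Both assumptions need adjusting to match a real fabric.

```shell
# Usage: make_topology TOTAL_NODES NODES_PER_LEAF
# Emits one leaf switch per group of nodes, then a spine joining all leaves.
make_topology() {
  local total="$1" per_leaf="$2" leaf=0 first=1 last
  while [ "$first" -le "$total" ]; do
    leaf=$(( leaf + 1 ))
    last=$(( first + per_leaf - 1 ))
    [ "$last" -gt "$total" ] && last=$total   # the final leaf may be partly filled
    printf 'SwitchName=leaf%d Nodes=gpu-node[%02d-%02d]\n' "$leaf" "$first" "$last"
    first=$(( last + 1 ))
  done
  printf 'SwitchName=spine1 Switches=leaf[1-%d]\n' "$leaf"
}

make_topology 8 4
```

For 8 nodes with 4 per leaf this prints two leaf lines (gpu-node[01-04] and gpu-node[05-08]) followed by a spine line that references leaf[1-2] in Slurm's bracketed hostlist syntax.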

Job Accounting and GPU Tracking

Enable accounting to track GPU usage by team:

# slurm.conf
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdb01
# Record GPU allocations as a trackable resource (TRES)
AccountingStorageTRES=gres/gpu
JobAcctGatherType=jobacct_gather/linux

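List every user's jobs with their allocated resources and elapsed time, the raw inputs for GPU-hours per user:

sacct --allusers --allocations \
  --format=JobID,User,Partition,AllocTRES%40,Elapsed,State \
  --starttime=2026-01-01 --endtime=2026-03-01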

This is essential for chargeback models in shared GPU clusters.
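sacct reports per-job records, so the GPU-hours figure itself has to be computed. The sketch below is one way to do that, under assumptions: gpu_hours_per_user is a hypothetical helper, and it expects pipe-separated User|Elapsed|AllocTRES lines as produced by sacct -a -X -n -P --format=User,Elapsed,AllocTRES, with the GPU count recorded as gres/gpu=N in the TRES string.

```shell
# Sum GPU-hours per user from "User|Elapsed|AllocTRES" lines on stdin.
gpu_hours_per_user() {
  awk -F'|' '
    {
      # Pull N out of "gres/gpu=N"; skip jobs that allocated no GPUs.
      if (!match($3, /gres\/gpu=[0-9]+/)) next
      gpus = substr($3, RSTART + 9, RLENGTH - 9)

      # Elapsed has the form [DD-]HH:MM:SS.
      days = 0; hms = $2
      if (split($2, d, "-") == 2) { days = d[1]; hms = d[2] }
      split(hms, t, ":")
      hours = days * 24 + t[1] + t[2] / 60 + t[3] / 3600

      total[$1] += gpus * hours
    }
    END { for (user in total) printf "%s %.1f\n", user, total[user] }
  ' | sort
}
```

Multiplying the result by an internal rate per GPU-hour gives a simple chargeback figure. Elapsed is wall-clock time, so a job that held GPUs while idle is billed the same as a busy one.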

Pyxis and Enroot: Containers on Slurm

You do not have to choose between bare metal and containers. NVIDIA's Pyxis plugin lets you run container images directly within Slurm jobs:

# Enroot-style image URI: '#' separates the registry from the image path
srun --container-image=nvcr.io#nvidia/pytorch:24.03-py3 \
     --container-mounts=/data:/data \
     python train.py

Enroot handles the container runtime underneath. It is faster than Docker for HPC workloads because it avoids the daemon overhead.
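Image names copied from Docker documentation put a slash after the registry, while Enroot's documented URI form is [REGISTRY#]IMAGE[:TAG]. The hypothetical helper below sketches the conversion. It uses the usual Docker heuristic that the first path component is a registry only if it contains a dot or a colon, or is localhost. Treat that heuristic as an assumption, not as Pyxis behavior.

```shell
# Convert a Docker-style reference (registry/image:tag) to Enroot form
# (registry#image:tag). References without a registry are left unchanged.
to_enroot_uri() {
  local ref="$1" first="${1%%/*}"
  case "$ref" in
    */*)
      case "$first" in
        *.*|*:*|localhost) echo "${first}#${ref#*/}"; return ;;
      esac ;;
  esac
  echo "$ref"
}

to_enroot_uri "nvcr.io/nvidia/pytorch:24.03-py3"   # prints nvcr.io#nvidia/pytorch:24.03-py3
```

The converted string can then be passed to --container-image in an srun command.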

Getting Started

For a small GPU cluster:

  1. Install Slurm packages from SchedMD
  2. Configure slurm.conf with your node inventory
  3. Set up gres.conf with AutoDetect=nvml
  4. Start slurmctld on the controller, slurmd on compute nodes
  5. Submit your first job with sbatch
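A useful first job is a smoke test that confirms a GPU request really yields that many GPUs. As a sketch, the hypothetical count_gpus helper below counts the physical GPU lines in nvidia-smi -L output. It assumes each GPU is listed on a line starting with "GPU <index>:".

```shell
# Count physical GPUs in `nvidia-smi -L` output read from stdin.
# Indented MIG device lines are ignored because they do not start with "GPU".
count_gpus() {
  grep -c '^GPU [0-9][0-9]*:'
}
```

For example, srun --partition=training --gres=gpu:a100:2 nvidia-smi -L | count_gpus should print 2. If it prints 8 on an eight-GPU node, device isolation is probably not being enforced; in cgroup-based setups that is governed by ConstrainDevices in cgroup.conf.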

For production clusters, consider Ansible automation to manage Slurm configuration across hundreds of nodes. Tools like DeepOps provide ready-made playbooks.

What I Recommend

If you are building a GPU cluster for AI training, start with Slurm. It is battle-tested, the NVIDIA ecosystem assumes it, and the operational overhead is lower than Kubernetes for batch workloads.

If you need both training and inference, run Slurm for the training partition and Kubernetes for the inference serving layer. Both can draw on the same physical fleet, provided each node sits in exactly one scheduler's node pool at a time.

