Storage is the silent bottleneck in distributed AI training. Your GPUs can deliver teraflops of compute, but if the data pipeline can't feed them fast enough, those expensive accelerators sit idle. Before committing to a storage solution for multi-node training, benchmark it properly with FIO.
Why Storage Matters for AI Training
In a typical training pipeline:
Storage (PVC) → DataLoader workers → CPU preprocessing → GPU batch
     │                                            │
     └── THIS is often the bottleneck ────────────┘
Key storage operations during training:
- Sequential read: Loading large dataset files (images, parquet, tokenized text)
- Random read: Shuffled access to training samples
- Checkpoint write: Saving model state every N steps (can be 10-200 GB)
- Logging: Continuous small writes (metrics, profiler data)
FIO on Kubernetes: Multi-Pod Benchmark
Deploy FIO pods that each target the same PVC (simulating multi-node training):
apiVersion: batch/v1
kind: Job
metadata:
  name: fio-benchmark
spec:
  parallelism: 4  # Simulate 4 training nodes
  template:
    spec:
      containers:
        - name: fio
          image: nixery.dev/fio
          command: ["fio"]
          args:
            - "--name=ai-training-read"
            - "--directory=/data/benchmark"
            - "--rw=read"
            - "--bs=1M"
            - "--size=10G"
            - "--numjobs=4"
            - "--ioengine=libaio"  # async engine; fio's default sync engine ignores iodepth > 1
            - "--iodepth=32"
            - "--direct=1"
            - "--group_reporting"
            - "--output-format=json"
          volumeMounts:
            - name: shared-storage
              mountPath: /data
      volumes:
        - name: shared-storage
          persistentVolumeClaim:
            claimName: ai-training-data
      restartPolicy: Never
Benchmark Profiles for AI Workloads
Profile 1: Dataset Loading (Sequential Read)
Simulates loading large training files (HDF5, Parquet, WebDataset):
[ai-dataset-load]
directory=/data/benchmark
rw=read
bs=1M
size=10G
numjobs=4
# libaio makes iodepth effective; fio's default sync engine ignores iodepth > 1
ioengine=libaio
iodepth=32
direct=1
runtime=120
time_based=1
group_reporting=1
Target: 2+ GB/s per node for large model training, 500 MB/s minimum for vision workloads.
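To check a run against these targets, the JSON that fio emits with `--output-format=json` can be parsed directly. The following is a minimal sketch, assuming fio 3.x field names (`jobs[].read.bw_bytes`, in bytes per second); the trimmed sample report and the `classify_read_bandwidth` helper are hypothetical and only illustrate the comparison.

```python
import json

# Thresholds taken from the targets stated above (decimal units).
LLM_TARGET_BPS = 2e9     # 2 GB/s per node for large model training
VISION_MIN_BPS = 500e6   # 500 MB/s minimum for vision workloads


def read_bandwidth_bps(fio_json: str) -> float:
    """Sum read bandwidth (bytes/s) over all job entries in a fio JSON report."""
    report = json.loads(fio_json)
    return float(sum(job["read"]["bw_bytes"] for job in report["jobs"]))


def classify_read_bandwidth(bps: float) -> str:
    """Map a measured bandwidth onto the two targets above."""
    if bps >= LLM_TARGET_BPS:
        return "ok for large model training"
    if bps >= VISION_MIN_BPS:
        return "ok for vision workloads"
    return "below minimum"


# Hypothetical, heavily trimmed report; a real one carries many more fields.
sample = '{"jobs": [{"jobname": "ai-dataset-load", "read": {"bw_bytes": 1200000000}}]}'
bps = read_bandwidth_bps(sample)
print(f"{bps / 1e9:.2f} GB/s -> {classify_read_bandwidth(bps)}")
```

With `group_reporting` enabled, fio collapses the parallel jobs into a single entry, so the sum usually covers one element; field names are worth confirming against the fio version in your image.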
Profile 2: Shuffled Sample Access (Random Read)
Simulates PyTorch DataLoader with shuffle=True on image datasets:
[ai-random-read]
directory=/data/benchmark
rw=randread
bs=256K
size=10G
numjobs=8
# libaio makes iodepth effective; fio's default sync engine ignores iodepth > 1
ioengine=libaio
iodepth=16
direct=1
runtime=120
time_based=1
group_reporting=1
Target: 50K+ IOPS for fast random sample access. Below 10K IOPS, the GPU will starve.
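An IOPS figure only translates into throughput once the block size is fixed, which matters when comparing this profile's 256K reads with IOPS numbers quoted at 4K. A small sketch of the conversion follows; the helper name is illustrative, and block sizes follow fio's default convention that `256K` means 256 KiB.

```python
def iops_to_gb_per_s(iops: float, block_size_bytes: int) -> float:
    """Throughput in decimal GB/s implied by an IOPS figure at a given block size."""
    return iops * block_size_bytes / 1e9


BS_256K = 256 * 1024  # block size used by this profile (bs=256K)
BS_4K = 4 * 1024      # a common block size for quoted IOPS figures

for iops in (10_000, 50_000):
    print(
        f"{iops:>6} IOPS -> {iops_to_gb_per_s(iops, BS_256K):.2f} GB/s at 256K, "
        f"{iops_to_gb_per_s(iops, BS_4K):.3f} GB/s at 4K"
    )
```

At 256K, 10K IOPS already implies about 2.6 GB/s, so it helps to state the block size alongside any IOPS threshold you compare against.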
Profile 3: Checkpoint Writing (Sequential Write)
Simulates saving model checkpoints (10-200 GB bursts):
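[ai-checkpoint]
directory=/data/benchmark
rw=write
bs=4M
size=50G
numjobs=1
# libaio makes iodepth effective; fio's default sync engine ignores iodepth > 1
ioengine=libaio
iodepth=16
direct=1
group_reporting=1
Target: 1+ GB/s write throughput. Checkpoint saves should complete in under 60 seconds to minimize training interruption; at 1 GB/s that covers checkpoints up to about 60 GB, and larger checkpoints need proportionally more bandwidth.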
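The relationship between checkpoint size, write throughput, and stall time is simple division, but it is worth tabulating across the 10-200 GB range mentioned above. A minimal sketch, with hypothetical helper names and decimal gigabytes throughout:

```python
def checkpoint_seconds(size_gb: float, write_gb_per_s: float) -> float:
    """Time to write one checkpoint at a sustained write throughput."""
    return size_gb / write_gb_per_s


def required_write_gb_per_s(size_gb: float, budget_s: float = 60.0) -> float:
    """Sustained throughput needed to finish a checkpoint within the time budget."""
    return size_gb / budget_s


for size_gb in (10, 50, 200):
    print(
        f"{size_gb:>3} GB: {checkpoint_seconds(size_gb, 1.0):.0f} s at 1 GB/s, "
        f"needs {required_write_gb_per_s(size_gb):.2f} GB/s for a 60 s save"
    )
```

This assumes the storage sustains its benchmarked rate for the whole burst, which is exactly what the `size=50G` write profile is meant to verify.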
Profile 4: Mixed Workload (Read + Checkpoint)
Simulates training with periodic checkpoint saves:
[ai-mixed]
directory=/data/benchmark
rw=randrw
rwmixread=90
bs=1M
size=10G
numjobs=4
# libaio makes iodepth effective; fio's default sync engine ignores iodepth > 1
ioengine=libaio
iodepth=16
direct=1
runtime=120
time_based=1
group_reporting=1
NFS vs SMB for AI Training
NFS (Network File System)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ai-training-nfs  # placeholder name, not from the original manifest
spec:
  capacity:
    storage: 10Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: storage-node.internal
    path: /exports/ai-training
  mountOptions:
    - nfsvers=4.2
    - rsize=1048576
    - wsize=1048576
    - hard
    - intr  # accepted but ignored by current Linux kernels
    - noatime
    - tcp
NFS tuning for AI workloads:
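- rsize=1048576/wsize=1048576: 1 MB read/write buffers (the client and server negotiate the effective size, and small values such as 64 KB are too small for large sequential reads)
- noatime: skip access time updates (reduces metadata overhead)
- tcp: reliable transport (UDP drops packets under high load)
- nfsvers=4.2: includes parallel NFS (pNFS, introduced in NFSv4.1) for multi-path I/O when the server supports it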
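Because rsize and wsize are negotiated with the server, the values requested in the PersistentVolume are not guaranteed to be the ones in effect. One way to check is to read the mount's entry in `/proc/mounts` from inside a pod. The sketch below parses a single such line; the sample line and the `parse_mount_options` helper are hypothetical, and the sample deliberately shows a server that negotiated the size down.

```python
def parse_mount_options(mounts_line: str) -> dict:
    """Parse the options field (4th column) of one /proc/mounts line."""
    fields = mounts_line.split()
    options = {}
    for opt in fields[3].split(","):
        key, _, value = opt.partition("=")
        options[key] = value if value else True  # flags like "hard" have no value
    return options


# Hypothetical entry; on a real node, read the matching line from /proc/mounts.
sample_line = (
    "storage-node.internal:/exports/ai-training /data nfs4 "
    "rw,noatime,vers=4.2,rsize=262144,wsize=262144,hard,proto=tcp 0 0"
)

opts = parse_mount_options(sample_line)
requested = 1048576
effective = int(opts["rsize"])
if effective < requested:
    print(f"rsize negotiated down: requested {requested}, effective {effective}")
```

If the effective value is lower than requested, the limit usually sits on the server side, and benchmarking with `bs=1M` will not exercise 1 MB transfers on the wire.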
SMB (CIFS)
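apiVersion: v1
kind: PersistentVolume
metadata:
  name: ai-training-smb  # placeholder name, not from the original manifest
spec:
  capacity:
    storage: 10Ti
  accessModes:
    - ReadWriteMany
  csi:
    driver: smb.csi.k8s.io
    volumeHandle: ai-training-smb  # placeholder; CSI volumes require a unique handle
    volumeAttributes:
      source: //storage-node/ai-training
    nodeStageSecretRef:
      name: smb-creds
      namespace: default  # placeholder; use the namespace that holds the secret
  mountOptions:
    - dir_mode=0755
    - file_mode=0644
    - vers=3.0
    - multichannel
    - max_channels=4
SMB tuning: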
- multichannel + max_channels=4: aggregate bandwidth across multiple network connections
- vers=3.0: required for multichannel and encryption support
Benchmark Comparison
Typical results on enterprise storage (measured with FIO):
| Metric | NFS v4.2 | SMB 3.0 | PScale (parallel) |
|---|---|---|---|
| Sequential Read (1 pod) | 1.2 GB/s | 900 MB/s | 3.5 GB/s |
| Sequential Read (4 pods) | 2.8 GB/s | 2.1 GB/s | 12 GB/s |
| Random Read IOPS | 45K | 32K | 120K |
| Sequential Write | 800 MB/s | 650 MB/s | 2.8 GB/s |
| Latency P99 (4K random) | 2.1ms | 3.5ms | 0.4ms |
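In these results, PScale (a parallel file system) delivers roughly 3x to 4x the sequential throughput of NFS v4.2 and about a fifth of its P99 latency. It outperforms traditional NFS/SMB for multi-node access because it distributes data across multiple storage nodes and supports concurrent access without lock contention.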
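The single-pod and four-pod rows of the table also show how well each backend scales with clients. A short sketch that computes scaling efficiency (aggregate throughput divided by client count times single-client throughput) from the sequential read figures above; the function name is illustrative.

```python
# Sequential read throughput in GB/s from the table: (1 pod, 4 pods).
results_gb_s = {
    "NFS v4.2": (1.2, 2.8),
    "SMB 3.0": (0.9, 2.1),
    "PScale": (3.5, 12.0),
}


def scaling_efficiency(single: float, aggregate: float, clients: int) -> float:
    """Fraction of ideal linear scaling achieved by `clients` concurrent readers."""
    return aggregate / (single * clients)


for backend, (one_pod, four_pods) in results_gb_s.items():
    eff = scaling_efficiency(one_pod, four_pods, clients=4)
    print(f"{backend:<9} {eff:.0%} of linear scaling at 4 pods")
```

An efficiency well below 100% means added clients are contending for a shared resource, which is the behavior the multi-client test in the next section is designed to expose.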
Multi-Client Stress Test
Simulate the actual training scenario: multiple pods reading from the same PVC simultaneously. The storage node each pod reaches is determined by the volume behind the PVC; the script records the node IP only for reference:
#!/bin/bash
# Deploy FIO pods that all read from the same PVC, backed by a specific PScale node.
# Compare per-client and aggregate bandwidth in the pod logs to see whether
# storage bandwidth scales linearly with clients.
STORAGE_NODE_IP="10.0.1.100"  # for reference only; not passed to the pods
NUM_CLIENTS=8
for i in $(seq 1 "$NUM_CLIENTS"); do
  kubectl run "fio-client-$i" \
    --image=nixery.dev/fio \
    --restart=Never \
    --overrides='{
      "spec": {
        "containers": [{
          "name": "fio",
          "image": "nixery.dev/fio",
          "command": ["fio"],
          "args": [
            "--name=client-'"$i"'",
            "--directory=/data",
            "--rw=read",
            "--bs=1M",
            "--size=5G",
            "--numjobs=2",
            "--ioengine=libaio",
            "--iodepth=16",
            "--direct=1",
            "--output-format=json"
          ],
          "volumeMounts": [{
            "name": "data",
            "mountPath": "/data"
          }]
        }],
        "volumes": [{
          "name": "data",
          "persistentVolumeClaim": {
            "claimName": "ai-training-pvc"
          }
        }]
      }
    }'
done
Interpreting Results for AI Training
Will My GPUs Starve?
Calculate the minimum storage throughput needed:
Required bandwidth = batch_size × sample_size × batches_per_second
Example (Vision - RetinaNet):
  batch_size = 4 images/GPU × 4 GPUs = 16 images
  sample_size = 800×800×3 = 1.92 MB
  batches_per_second = 3 (based on GPU compute time)
  Required = 16 × 1.92 × 3 = 92 MB/s → NFS easily handles this
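The vision estimate above can be reproduced in a few lines, which makes it easy to substitute your own batch size and image dimensions. This is a sketch under the same assumptions as the example (uncompressed 800×800 RGB at one byte per channel, decimal megabytes); compressed images on disk would need less bandwidth, and the function name is illustrative.

```python
def required_bandwidth_mb_s(
    batch_size: int, sample_size_mb: float, batches_per_second: float
) -> float:
    """Required bandwidth = batch_size x sample_size x batches_per_second."""
    return batch_size * sample_size_mb * batches_per_second


# Assumption: uncompressed 800x800 RGB, 1 byte per channel, decimal MB.
sample_size_mb = 800 * 800 * 3 / 1e6
global_batch = 4 * 4  # 4 images per GPU across 4 GPUs

required = required_bandwidth_mb_s(global_batch, sample_size_mb, batches_per_second=3)
print(f"sample: {sample_size_mb:.2f} MB, required: {required:.1f} MB/s")
```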
Example (LLM - Mistral training):
  Tokenized dataset loaded once into CPU RAM
  Checkpoint size = 238 GB (full model)
  Checkpoint frequency = every 200 steps (~30 minutes)
  Required burst = 238 GB / 60s target ≈ 4 GB/s → needs parallel FS
Red Flags in FIO Results
- Latency spikes (P99 over 10× P50): storage contention or network congestion
- Throughput drops with multiple clients: lock contention on NFS metadata
- Write stalls: storage controller buffering full, need faster backing drives
- IOPS below 10K random: shuffle-based training will bottleneck
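Two of these checks (the P99-to-P50 ratio and the random-read IOPS floor) can be automated from a single fio JSON report. The sketch below assumes fio 3.x field names (`clat_ns.percentile` keyed by strings such as `"99.000000"`, and `iops`); the sample report and the `red_flags` helper are hypothetical. Throughput drops across clients and write stalls need several runs compared against each other, so they are not covered here.

```python
import json
from typing import List


def red_flags(fio_json: str, op: str = "read") -> List[str]:
    """Return warnings for latency spikes and low IOPS in a fio JSON report."""
    flags = []
    for job in json.loads(fio_json)["jobs"]:
        stats = job[op]
        pct = stats["clat_ns"]["percentile"]
        p50, p99 = pct["50.000000"], pct["99.000000"]
        if p50 > 0 and p99 > 10 * p50:
            flags.append(f"{job['jobname']}: P99 is {p99 / p50:.1f}x P50")
        # The IOPS floor is only meaningful for random-read runs.
        if stats["iops"] < 10_000:
            flags.append(f"{job['jobname']}: {stats['iops']:.0f} IOPS is below 10K")
    return flags


# Hypothetical trimmed report from a random-read run (latencies in nanoseconds).
sample = json.dumps({
    "jobs": [{
        "jobname": "ai-random-read",
        "read": {
            "iops": 8000.0,
            "clat_ns": {"percentile": {"50.000000": 400000, "99.000000": 6000000}},
        },
    }]
})

flags = red_flags(sample)
for flag in flags:
    print(flag)
```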
Related Articles
- Fine-Tuning Mistral with FSDP: the training workload that needs this storage
- RetinaNet DDP Training: vision training data pipeline
- Databases on Kubernetes (Memory Overcommit): storage performance patterns
- Distributed vs Multi-GPU Inference: model serving storage needs
- NVIDIA DOCA Perftest: network benchmarking companion
Pairing fast GPUs with slow storage is like putting a Formula 1 engine on bicycle tires. Benchmark your storage before you blame your model for slow training.