📖 Book Reference: This article is based on Chapter 4: Advanced Features of Practical RHEL AI, providing a deep dive into DeepSpeed optimization for distributed AI training.
Training large language models requires efficient memory management and distributed computing strategies. DeepSpeed, Microsoft's open-source deep learning optimization library, is a core component of RHEL AI that enables training models that would otherwise be too large for available GPU memory.
Practical RHEL AI covers DeepSpeed extensively in Chapter 4, showing how to leverage ZeRO memory optimization, MiCS communication sharding, multi-node training launches, and FP8 inference.
ZeRO (Zero Redundancy Optimizer) partitions model states across data-parallel processes to reduce memory redundancy. ZeRO Stage 3 is the most aggressive optimization level:
| ZeRO Stage | Partitions | Memory Savings |
|---|---|---|
| Stage 1 | Optimizer states | ~4x |
| Stage 2 | + Gradients | ~8x |
| Stage 3 | + Parameters | Linear with GPU count |
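To make those savings concrete, here is a rough back-of-the-envelope sketch (an illustration, not a formula from the book) of per-GPU memory for model states under mixed-precision Adam training:

```python
# Back-of-the-envelope per-GPU memory for model states under each ZeRO stage.
# Assumes mixed-precision Adam: 2 B (FP16 params) + 2 B (FP16 grads) + 12 B
# (FP32 master weights, momentum, variance) per parameter.
def model_state_gb(n_params: float, n_gpus: int, stage: int) -> float:
    params, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        optim /= n_gpus   # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= n_gpus   # Stage 2 also shards gradients
    if stage >= 3:
        params /= n_gpus  # Stage 3 also shards the parameters
    return (params + grads + optim) / 1e9

# A 7B-parameter model across 8 GPUs:
for stage in range(4):
    print(f"stage {stage}: {model_state_gb(7e9, 8, stage):.1f} GB/GPU")
# stage 0: 112.0, stage 1: 38.5, stage 2: 26.2, stage 3: 14.0
# (savings approach the table's ~4x / ~8x figures as the GPU count grows)
```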
Create a DeepSpeed configuration file (saved as `ds_config.json`, the name used at launch below):

```json
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9
}
}
```
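To see where this file plugs in, here is a minimal, hypothetical training loop; `deepspeed.initialize` reads the JSON and returns a ZeRO-aware engine. The toy model and random batches are stand-ins, and DeepSpeed will also expect batch-size keys such as `train_micro_batch_size_per_gpu` in the config before it starts:

```python
import deepspeed
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# deepspeed.initialize wraps the model in an engine that applies the ZeRO
# Stage 3 settings from ds_config.json.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for step in range(10):
    batch = torch.randn(8, 1024, device=engine.device)
    loss = engine(batch).pow(2).mean()  # dummy loss for illustration
    engine.backward(loss)  # ZeRO-partitioned backward pass
    engine.step()          # optimizer step + gradient clearing
```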
MiCS (Minimizing Communication Scale) shards model states within small groups of GPUs rather than across the entire cluster, so collective operations run over fewer ranks and communication overhead drops during distributed training:

```python
# MiCS configuration for multi-node training
deepspeed_config = {
"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 8,
"steps_per_print": 100,
"optimizer": {
"type": "AdamW",
"params": {
"lr": 1e-5,
"weight_decay": 0.01
}
},
"zero_optimization": {
"stage": 3,
"mics_shard_size": 4,
"mics_hierarchical_params_gather": true
}
}
```

For training across multiple RHEL AI nodes, describe the cluster in a hostfile and pass it to the `deepspeed` launcher, as sketched below:
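The hostfile is plain text, one node per line with its GPU slot count; a minimal sketch with hypothetical hostnames:

```text
# hostfile.txt — two 8-GPU nodes
node1 slots=8
node2 slots=8
```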
```bash
# Launch distributed training
deepspeed --hostfile hostfile.txt \
--num_gpus 8 \
--master_port 29500 \
train.py \
--deepspeed ds_config.json
```
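Before launching, it is worth confirming the effective global batch size, which DeepSpeed derives as micro-batch × gradient-accumulation steps × world size. With the two 8-GPU nodes from the hostfile sketch:

```python
# Effective global batch size implied by the MiCS config above
micro_batch_per_gpu = 4   # train_micro_batch_size_per_gpu
grad_accum_steps = 8      # gradient_accumulation_steps
world_size = 2 * 8        # two nodes x eight GPUs each

global_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
print(global_batch_size)  # 512 samples per optimizer step
```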
Chapter 4 of Practical RHEL AI covers FP8 (8-bit floating point) inference for production deployments:

```python
from vllm import LLM, SamplingParams

# Load model with FP8 quantization; dtype "auto" lets vLLM pick the base
# precision while quantization="fp8" handles the 8-bit (E4M3) format
llm = LLM(
    model="granite-3b-instruct",
    dtype="auto",
    quantization="fp8",
    tensor_parallel_size=4
)
# Configure sampling
sampling_params = SamplingParams(
temperature=0.7,
max_tokens=512
)
# Run inference on a batch of prompts
prompts = ["Explain ZeRO Stage 3 in one paragraph."]  # example input
outputs = llm.generate(prompts, sampling_params)
```
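`generate` returns one `RequestOutput` per prompt, with the completions under its `outputs` field:

```python
# Each RequestOutput pairs a prompt with its generated completions
for output in outputs:
    print(f"Prompt:     {output.prompt!r}")
    print(f"Completion: {output.outputs[0].text!r}")
```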
"train_micro_batch_size_per_gpu": 8,
"gradient_accumulation_steps": 4,
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"reduce_scatter": true
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"initial_scale_power": 16
}
}
```

For larger models, ZeRO Stage 3 with BF16 and offloading disabled keeps all sharded states on the GPUs:

```json
{
"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 8,
"zero_optimization": {
"stage": 3,
"offload_optimizer": {"device": "none"},
"offload_param": {"device": "none"}
},
"bf16": {
"enabled": true
}
}
```

For the largest models, ZeRO Stage 3 with full CPU offloading trades some throughput for maximum capacity:

```json
{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 32,
"zero_optimization": {
"stage": 3,
"offload_optimizer": {"device": "cpu"},
"offload_param": {"device": "cpu"}
}
}
```

Track DeepSpeed metrics with Prometheus:

```yaml
# prometheus-deepspeed-rules.yaml
groups:
- name: deepspeed_training
rules:
- alert: DeepSpeedOOMWarning
expr: deepspeed_memory_usage_bytes > 0.9 * deepspeed_memory_total_bytes
for: 2m
labels:
severity: warning
annotations:
summary: "DeepSpeed memory usage > 90%"
- alert: DeepSpeedTrainingStalled
expr: rate(deepspeed_training_steps_total[5m]) == 0
for: 10m
labels:
severity: critical
```
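DeepSpeed does not publish these metrics on its own, so the training loop has to export them. A minimal sketch with `prometheus_client`, using metric names chosen to match the (assumed) names in the rules above:

```python
# Publish training metrics on :9400 for Prometheus to scrape.
# Call record_step() once per optimizer step from the training loop.
import torch
from prometheus_client import Counter, Gauge, start_http_server

steps_total = Counter("deepspeed_training_steps_total", "Optimizer steps completed")
mem_usage = Gauge("deepspeed_memory_usage_bytes", "GPU memory currently allocated")
mem_total = Gauge("deepspeed_memory_total_bytes", "Total GPU memory on this device")

start_http_server(9400)  # exposes /metrics

def record_step() -> None:
    steps_total.inc()
    mem_usage.set(torch.cuda.memory_allocated())
    mem_total.set(torch.cuda.get_device_properties(0).total_memory)
```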
Use `deepspeed --profile` to identify bottlenecks.

This article covers material from Chapter 4: Advanced Features of Practical RHEL AI.

Ready to train large models efficiently?
Practical RHEL AI provides comprehensive DeepSpeed guidance, showing you how to squeeze maximum performance from your GPU infrastructure.
Learn More → Buy on Amazon →