
DeepSpeed Optimization for RHEL AI Training

Luca Berton
#rhel-ai #deepspeed #zero-3 #mics #fp8 #distributed-training #gpu-optimization #model-training

📘 Book Reference: This article is based on Chapter 4: Advanced Features of Practical RHEL AI, providing a deep dive into DeepSpeed optimization for distributed AI training.

Introduction

Training large language models requires efficient memory management and distributed computing strategies. DeepSpeed, Microsoft's open-source deep learning optimization library, is a core component of RHEL AI that enables training models that would otherwise be too large for available GPU memory.

Practical RHEL AI covers DeepSpeed extensively in Chapter 4, showing how to leverage ZeRO Stage 3 memory optimization, MiCS communication scaling for multi-node training, and FP8 inference for production deployments.

Understanding ZeRO 3 Memory Optimization

What is ZeRO?

ZeRO (Zero Redundancy Optimizer) partitions model states across data-parallel processes to reduce memory redundancy. ZeRO Stage 3 is the most aggressive optimization level:

ZeRO Stage | Partitions       | Memory Savings
Stage 1    | Optimizer states | ~4x
Stage 2    | + Gradients      | ~8x
Stage 3    | + Parameters     | Linear with GPU count
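
Before picking a stage, it can help to estimate the memory footprint up front. DeepSpeed ships a helper for ZeRO-3 estimates; the sketch below assumes transformers is installed and reuses the article's model name as a placeholder for your own checkpoint:

# Estimate ZeRO-3 memory needs before training (sketch)
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Load the model on CPU just to inspect its parameter count
model = AutoModelForCausalLM.from_pretrained("granite-3b-instruct")

# Prints per-GPU and per-CPU memory estimates for ZeRO-3,
# with and without optimizer/parameter offload
estimate_zero3_model_states_mem_needs_all_live(
    model, num_gpus_per_node=8, num_nodes=1
)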

Configuring ZeRO 3

Create a DeepSpeed configuration file:

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  }
}
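
Saved as ds_config.json, this configuration can be passed to deepspeed.initialize (note that the "auto" values above are resolved by the Hugging Face Trainer integration; with a hand-rolled loop you would set them explicitly). A minimal sketch, where build_model and build_dataloader are placeholders for your own code:

# Minimal training loop using the ZeRO-3 config above (sketch)
import deepspeed

model = build_model()              # placeholder: your model definition
train_loader = build_dataloader()  # placeholder: your dataset/dataloader

# deepspeed.initialize wraps the model and applies ZeRO-3 partitioning/offload
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"
)

for batch in train_loader:
    loss = model_engine(batch)     # assumes the forward pass returns the loss
    model_engine.backward(loss)    # DeepSpeed-managed backward pass
    model_engine.step()            # optimizer step and gradient zeroing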

MiCS Communication Scaling

Why MiCS Matters

MiCS shards model states within smaller groups of GPUs rather than across the entire cluster, shrinking the scale of collective operations and reducing communication overhead during distributed training:

# MiCS configuration for multi-node training
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "steps_per_print": 100,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-5,
            "weight_decay": 0.01
        }
    },
    "zero_optimization": {
        "stage": 3,
        "mics_shard_size": 4,
        "mics_hierarchical_params_gather": true
    }
}

Multi-Node Setup

For training across multiple RHEL AI nodes:

# Launch distributed training
deepspeed --hostfile hostfile.txt \
  --num_gpus 8 \
  --master_port 29500 \
  train.py \
  --deepspeed ds_config.json
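
The hostfile lists each node and how many GPU slots it exposes. A minimal example with two hypothetical hostnames:

# hostfile.txt
node-01 slots=8
node-02 slots=8

DeepSpeed launches workers on these hosts over SSH, so passwordless SSH from the launch node to every worker is required.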

FP8 Inference Optimization

Chapter 4 of Practical RHEL AI covers FP8 (8-bit floating point) inference for production deployments:

Benefits of FP8

FP8 roughly halves the memory footprint of FP16 weights and KV cache, and delivers higher throughput on GPUs with native FP8 tensor cores (Hopper and Ada Lovelace), typically with minimal accuracy impact for inference workloads.

Implementation

from vllm import LLM, SamplingParams

# Load model with FP8 quantization
# (FP8 is selected via quantization="fp8"; native FP8 kernels require
# Hopper- or Ada Lovelace-class GPUs)
llm = LLM(
    model="granite-3b-instruct",
    dtype="auto",
    quantization="fp8",
    tensor_parallel_size=4
)

# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512
)

# Run inference
prompts = ["Explain ZeRO Stage 3 in one paragraph."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
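
For production serving rather than offline batch generation, the same FP8 settings can be passed to vLLM's OpenAI-compatible server. A sketch, reusing the model name from the example above:

# Serve the FP8-quantized model behind an OpenAI-compatible API (sketch)
vllm serve granite-3b-instruct \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --port 8000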

Training Configuration Recipes

Recipe 1: Single GPU (A100 80GB)

{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 4,
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "reduce_scatter": true
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  }
}

Recipe 2: Multi-GPU (8x H100)

{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "none"},
    "offload_param": {"device": "none"}
  },
  "bf16": {
    "enabled": true
  }
}

Recipe 3: Memory-Constrained (CPU Offload)

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 32,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"}
  }
}
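
Across the three recipes, what actually changes is how the effective batch size is assembled: effective batch per optimizer step = micro-batch size × gradient accumulation steps × number of data-parallel GPUs. A quick sanity check (GPU counts are the ones assumed in each recipe title; Recipe 3 assumes a single GPU):

# Effective batch size = micro_batch * grad_accum * data_parallel_gpus
recipes = {
    "single_a100":      (8, 4, 1),   # Recipe 1
    "multi_h100":       (4, 8, 8),   # Recipe 2
    "cpu_offload_1gpu": (1, 32, 1),  # Recipe 3 (assuming one GPU)
}
for name, (micro, accum, gpus) in recipes.items():
    print(name, micro * accum * gpus)   # 32, 256, 32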

Monitoring Training Performance

Track DeepSpeed metrics with Prometheus:

# prometheus-deepspeed-rules.yaml
groups:
  - name: deepspeed_training
    rules:
      - alert: DeepSpeedOOMWarning
        expr: deepspeed_memory_usage_bytes > 0.9 * deepspeed_memory_total_bytes
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "DeepSpeed memory usage > 90%"
      
      - alert: DeepSpeedTrainingStalled
        expr: rate(deepspeed_training_steps_total[5m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "DeepSpeed training has made no progress for 10 minutes"
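
DeepSpeed does not export these metrics by itself; the rules above assume a small exporter running alongside the training job. A minimal sketch using prometheus_client, with metric names matching the alert expressions and torch.cuda calls as one possible source for the values:

# Minimal Prometheus exporter for the metrics used above (sketch)
import torch
from prometheus_client import Gauge, Counter, start_http_server

memory_usage = Gauge("deepspeed_memory_usage_bytes", "GPU memory currently allocated")
memory_total = Gauge("deepspeed_memory_total_bytes", "Total GPU memory available")
training_steps = Counter("deepspeed_training_steps_total", "Completed training steps")

start_http_server(9400)  # scrape target; port is arbitrary

def record_step():
    """Call once per training step from the training loop."""
    memory_usage.set(torch.cuda.memory_allocated())
    memory_total.set(torch.cuda.get_device_properties(0).total_memory)
    training_steps.inc()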

Best Practices from Chapter 4

  1. Start with ZeRO Stage 2 if your model fits in memory
  2. Enable CPU offload only when necessary (adds latency)
  3. Use MiCS for training across >4 nodes
  4. Profile first: enable DeepSpeed's flops profiler to identify bottlenecks (see the config sketch below)
  5. Tune bucket sizes based on your model architecture
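
For point 4, the flops profiler is enabled directly in the DeepSpeed config; a minimal sketch:

{
  "flops_profiler": {
    "enabled": true,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
}

At the chosen step, DeepSpeed prints a per-module breakdown of FLOPs and latency that points to the layers worth tuning.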

This article covers material from Chapter 4: Advanced Features of Practical RHEL AI.


📚 Accelerate Your Model Training

Ready to train large models efficiently? Practical RHEL AI provides comprehensive DeepSpeed guidance and shows you how to squeeze maximum performance from your GPU infrastructure.
