
GPU Cluster Architecture: Designing for Both Training and Inference

Training and inference have different resource profiles. Design GPU clusters that handle both workloads efficiently with dynamic partitioning.

Luca Berton · 2 min read

GPU clusters for AI workloads come in two fundamentally different architectures: training clusters optimized for throughput and collective communication, and inference clusters optimized for latency and cost efficiency. Choosing the wrong one can waste hundreds of thousands of dollars.

Training vs Inference: Different Requirements

| Dimension | Training Cluster | Inference Cluster |
|---|---|---|
| Priority | Throughput (tokens/second) | Latency (ms per request) |
| GPU utilization target | 95%+ sustained | 50-70% (headroom for bursts) |
| Network | InfiniBand 400 Gb/s (RDMA) | Standard 100GbE sufficient |
| Storage | High-throughput parallel FS | Fast model loading, minimal storage |
| Batch size | Large (thousands) | Small (1-64) |
| Scaling pattern | All GPUs active simultaneously | Scale up/down with demand |
| Cost model | $/training run | $/1000 tokens served |
| Failure tolerance | Checkpoint and resume | Redundancy and failover |

Training Cluster Architecture

Network Topology

Data-parallel training synchronizes gradients with an all-reduce across all GPUs on every step, so the network is often the bottleneck:

┌────── Spine Switch (InfiniBand 400G) ──────┐
└──────┬───────────────┬───────────────┬─────┘
┌──────┴─────┐  ┌──────┴─────┐  ┌──────┴─────┐
│    Leaf    │  │    Leaf    │  │    Leaf    │
│   Switch   │  │   Switch   │  │   Switch   │
│ ┌──┐  ┌──┐ │  │ ┌──┐  ┌──┐ │  │ ┌──┐  ┌──┐ │
│ │N1│  │N2│ │  │ │N3│  │N4│ │  │ │N5│  │N6│ │
│ │8G│  │8G│ │  │ │8G│  │8G│ │  │ │8G│  │8G│ │
│ └──┘  └──┘ │  │ └──┘  └──┘ │  │ └──┘  └──┘ │
└────────────┘  └────────────┘  └────────────┘
       48 GPUs total (6 nodes × 8 GPUs)
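
A rough estimate shows why the network tends to dominate. The sketch below assumes a ring all-reduce, in which each GPU sends about 2(N-1)/N times the buffer size, and it treats one 400 Gb/s link per GPU as fully usable. The function name, the 70B-parameter BF16 gradient buffer, and the bandwidth figure are illustrative assumptions, not measurements from a real cluster.

```python
def ring_allreduce_seconds(buffer_bytes: float, num_gpus: int, link_gbit_per_s: float) -> float:
    """Lower-bound time for one ring all-reduce, ignoring latency and compute overlap."""
    if num_gpus < 2:
        return 0.0  # nothing to synchronize
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    # Reduce-scatter plus all-gather: each GPU sends 2 * (N - 1) / N of the buffer.
    bytes_sent_per_gpu = 2 * (num_gpus - 1) / num_gpus * buffer_bytes
    return bytes_sent_per_gpu / link_bytes_per_s


# Assumed example: 70B gradients in BF16 (2 bytes each) across 48 GPUs.
grad_bytes = 70e9 * 2
print(f"{ring_allreduce_seconds(grad_bytes, 48, 400):.2f} s per full gradient all-reduce")
```

Under these assumptions a full gradient synchronization takes about 5.5 seconds of pure transfer time. Real jobs overlap communication with compute and shard the model, but dropping to 100GbE would make the same transfer four times slower.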

Storage Architecture

Training data needs high-throughput parallel access:

  • GPFS/Lustre/WekaFS: Parallel filesystem for dataset access
  • Throughput target: 100+ GB/s aggregate read for large clusters
  • NVMe cache: Local NVMe on each node for checkpoint writes
  • Object storage: S3/MinIO for long-term dataset and checkpoint storage

NCCL Configuration

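# Common NCCL environment variables for training clusters
export NCCL_IB_HCA=mlx5_0,mlx5_1    # InfiniBand adapters to use (device names vary per host)
export NCCL_SOCKET_IFNAME=eth0        # Interface for bootstrap and socket (non-IB) traffic
export NCCL_DEBUG=INFO                # Verbose logging; lower to WARN once the setup is stable
# NCCL_TIMEOUT is not a documented NCCL variable. Set the collective timeout in the
# framework instead, e.g. the timeout argument of torch.distributed.init_process_group
# (30 min is a common choice for large all-reduce operations).
# Leave NCCL_ALGO and NCCL_PROTO unset so NCCL can auto-tune; override only when benchmarking:
# export NCCL_ALGO=Ring               # or Tree for different topologies
# export NCCL_PROTO=Simple            # or LL128 for low latency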

Inference Cluster Architecture

Autoscaling Architecture

Inference clusters must scale with demand:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth  # Custom metric; needs a metrics adapter such as Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "5"  # Scale out when average queue depth exceeds 5 per pod
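
To reason about how this autoscaler reacts, it helps to reproduce the core of the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the min/max bounds. The sketch below is a simplified model with hypothetical names; the real controller also accounts for pod readiness, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_avg: float, target_avg: float = 5.0,
                     min_replicas: int = 2, max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule for a Pods metric with an AverageValue target."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling action
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, current_avg=12))   # queue well above target -> scale out
print(desired_replicas(4, current_avg=5.2))  # within tolerance -> unchanged
```

With 4 pods averaging 12 queued requests each, the rule asks for 10 pods; at an average of 5.2 the ratio is inside the tolerance band and nothing changes.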

Multi-Model Serving

Run multiple models on shared GPU infrastructure:

  • Time-slicing: Multiple models share one GPU via time division, without memory isolation
  • MIG (Multi-Instance GPU): Hardware partitioning on A100/H100 with isolated memory and compute
  • MPS (Multi-Process Service): Kernels from multiple processes run concurrently on one GPU
  • KV-cache reuse: Requests that share a prompt prefix reuse cached KV blocks within a model (vLLM, NVIDIA Dynamo)

Cost Optimization

Inference cost optimization strategies:

  1. Right-size GPUs: Use L4 for small models, A100 for medium, H100 for large
  2. Quantization: INT8/INT4 cuts weight memory 2-4x relative to FP16 and can raise throughput; the gain depends on the model, kernels, and batch size
  3. Batching: Dynamic batching increases GPU utilization significantly
  4. Spot/preemptible instances: Use for non-latency-sensitive batch inference
  5. Model routing: Route simple queries to smaller/cheaper models
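
The memory side of quantization (item 2) is simple arithmetic, and it often decides which GPU from item 1 a model fits on. The sketch below counts weight bytes only; the helper name and the precision table are illustrative, and real deployments also need room for the KV cache, activations, and quantization scales.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weight-only memory footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"70B @ {precision}: {weight_memory_gb(70e9, precision):.0f} GB")
```

For a 70B model this gives 140 GB at FP16, 70 GB at INT8, and 35 GB at INT4, so only the quantized variants can fit their weights on a single 80 GB GPU.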

Hybrid Cluster Design

Some organizations run both training and inference on shared infrastructure:

  • Partition by time: Training at night (cheap electricity), inference during business hours
  • Partition by node: Dedicated training nodes (InfiniBand) + inference nodes (Ethernet)
  • Partition by GPU: MIG slices for inference on some GPUs, full GPUs for training
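
Partitioning by time can be expressed as a small scheduling rule. The sketch below is purely illustrative: the function name, the 08:00-20:00 business window, and the 75%/25% inference shares are assumptions to be replaced with your own traffic data.

```python
import math


def split_nodes(total_nodes: int, hour: int,
                business_hours: range = range(8, 20),
                day_inference_share: float = 0.75,
                night_inference_share: float = 0.25) -> tuple[int, int]:
    """Return (training_nodes, inference_nodes) for an hour of the day (0-23)."""
    share = day_inference_share if hour in business_hours else night_inference_share
    # Round up in favor of inference and always keep at least one serving node.
    inference_nodes = min(total_nodes, max(1, math.ceil(total_nodes * share)))
    return total_nodes - inference_nodes, inference_nodes


print(split_nodes(8, hour=14))  # business hours: (2, 6)
print(split_nodes(8, hour=2))   # night: (6, 2)
```

Handing nodes back to inference in the morning only works if training jobs checkpoint before the switch, which is why checkpoint-and-resume matters on the training side.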

Sizing Guide

Training Cluster Sizing

GPUs needed = ceil(Training Memory / (GPU Memory × Parallelism Efficiency))

Example: Training a 70B model
  - Model: 70B parameters × 4 bytes (FP32) = 280 GB
  - With mixed precision (2 bytes per parameter): ~140 GB
  - Per H100 (80 GB): Need at least 2 GPUs for the weights alone
  - With optimizer states: 4-8× the mixed-precision size = 560 GB - 1.1 TB
  - Minimum: 8× H100 (one node, 640 GB) covers only the low end of that range; the high end needs 14+ GPUs (two nodes), before activation memory
  - For reasonable training time: 32-64× H100
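
The example above can be turned into a small calculator. This sketch interprets the sizing rule as total training memory divided by usable memory per GPU. The function name, the bytes-per-parameter values (8 for the low end of the range, 16 for the high end), and the 0.8 efficiency factor are assumptions for illustration, and activation memory is ignored.

```python
import math


def training_gpus(num_params: float, bytes_per_param: float,
                  gpu_memory_gb: float = 80.0, efficiency: float = 1.0) -> int:
    """Minimum GPUs to hold weights plus optimizer state, ignoring activations."""
    total_gb = num_params * bytes_per_param / 1e9
    return math.ceil(total_gb / (gpu_memory_gb * efficiency))


# 70B model on 80 GB H100s: low end (560 GB) and high end (1120 GB) of the range.
print(training_gpus(70e9, bytes_per_param=8))    # 7
print(training_gpus(70e9, bytes_per_param=16))   # 14
# Assuming only 80% of each GPU's memory is usable for model state:
print(training_gpus(70e9, bytes_per_param=16, efficiency=0.8))  # 18
```

These numbers are a memory floor only. The 32-64 GPU recommendation comes from training-time targets, not from fitting the model.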

Inference Cluster Sizing

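GPUs needed = ceil(Peak QPS / Throughput per GPU at the latency target)

Example: Serving a 70B model
  - Target: 100 QPS, under 2s latency
  - Per H100 with vLLM: ~30 QPS at 2s latency (illustrative; at FP16 the 140 GB of weights must be sharded or quantized to fit 80 GB GPUs, so benchmark your own setup)
  - GPUs needed: ceil(100/30) = 4 GPUs
  - With redundancy (N+1): 5 GPUs
  - Plus 50% headroom for burst: ceil(5 × 1.5) = 8 GPUs (1 node)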
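
The same steps can be written as a function so the estimate is easy to rerun with measured numbers. This is a minimal sketch: the function and parameter names are hypothetical, and the per-GPU throughput must come from benchmarking your own model at the latency target.

```python
import math


def inference_gpus(peak_qps: float, qps_per_gpu: float,
                   spare_gpus: int = 1, burst_headroom: float = 0.5) -> int:
    """GPUs for peak load, plus N+1 redundancy, plus burst headroom."""
    base = math.ceil(peak_qps / qps_per_gpu)     # capacity for steady peak load
    with_redundancy = base + spare_gpus          # survive the loss of one GPU
    return math.ceil(with_redundancy * (1 + burst_headroom))


print(inference_gpus(peak_qps=100, qps_per_gpu=30))  # 8
```

With the example's figures this reproduces the 4, then 5, then 8 GPU progression. Halving the measured throughput to 15 QPS per GPU raises the result to 12, which shows how sensitive the estimate is to the benchmark.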
