
GPU Hardware Selection Guide for RHEL AI

Luca Berton
#rhel-ai #gpu #nvidia-a100 #nvidia-h100 #amd-mi300x #hardware-selection #benchmarks #infrastructure

📘 Book Reference: This article is based on Chapter 2: Setup and Chapter 4: Advanced Features of Practical RHEL AI, covering GPU hardware requirements and optimization strategies.

Introduction

Choosing the right GPU for your RHEL AI deployment is one of the most consequential infrastructure decisions you'll make. The wrong choice can mean wasted budget, performance bottlenecks, or an inability to run your target models.

Practical RHEL AI provides detailed hardware guidance that I'll summarize here, helping you match GPU capabilities to your specific workload requirements.

Supported GPU Hardware

RHEL AI officially supports:

| GPU | Memory | Interconnect | Best For |
|---|---|---|---|
| NVIDIA A100 | 40GB / 80GB | NVLink 3.0 | Training & Inference |
| NVIDIA H100 | 80GB | NVLink 4.0 | Large Model Training |
| AMD MI300X | 192GB | Infinity Fabric | Memory-Bound Workloads |

NVIDIA A100: The Proven Workhorse

Specifications

Architecture: Ampere
CUDA Cores: 6,912
Tensor Cores: 432 (3rd Gen)
Memory: 40GB or 80GB HBM2e
Memory Bandwidth: 2 TB/s
TDP: 400W

When to Choose A100

✅ Ideal for:

Inference and fine-tuning of models up to roughly 13B parameters on a single GPU
Budget-conscious deployments where cost per GPU is the deciding factor
Teams that want mature, widely available hardware with a proven CUDA stack

❌ Limitations:

No FP8 Tensor Core support (Ampere Tensor Cores top out at FP16/BF16)
Roughly 2-3x slower than H100 for large-model training (see the comparison below)
70B-class models require multiple GPUs or aggressive quantization

RHEL AI Configuration

# Verify A100 detection
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv

# Expected output:
# NVIDIA A100-SXM4-80GB, 81920 MiB, 8.0

# Optimal vLLM settings for A100
python -m vllm.entrypoints.openai.api_server \
  --model granite-7b-instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 8192
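
Once the server is up, any OpenAI-compatible client can talk to it. Here is a minimal client sketch, assuming the default localhost:8000 endpoint and the openai Python package; the model name must match the --model value above:

# Minimal client sketch for the vLLM OpenAI-compatible server started above.
# Assumes the server listens on localhost:8000 and `pip install openai`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="granite-7b-instruct",   # must match the --model value passed to vLLM
    prompt="Summarize the benefits of the NVIDIA A100 for LLM inference.",
    max_tokens=128,
    temperature=0.2,
)
print(response.choices[0].text)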

NVIDIA H100: Maximum Performance

Specifications

Architecture: Hopper
CUDA Cores: 16,896
Tensor Cores: 528 (4th Gen)
Memory: 80GB HBM3
Memory Bandwidth: 3.35 TB/s
TDP: 700W
FP8 Tensor Core Performance: 3,958 TFLOPS

When to Choose H100

✅ Ideal for:

Training and fine-tuning large models (34B+ parameters) where time-to-result matters
FP8 inference for maximum throughput on 4th-generation Tensor Cores
Mixed training and inference workloads that need headroom to grow

❌ Limitations:

Highest acquisition and cloud cost of the three options
700W TDP requires more power and cooling per node
80GB per GPU still means 70B-class models need at least two cards

Performance Comparison

| Workload | A100 80GB | H100 80GB | H100 Advantage |
|---|---|---|---|
| GPT-3 175B Training | 1.0x | 3.0x | 3x faster |
| Llama 70B Inference | 1.0x | 2.4x | 2.4x faster |
| Granite 7B Inference | 1.0x | 1.6x | 1.6x faster |
| Fine-tuning 13B | 1.0x | 2.2x | 2.2x faster |
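
These multipliers vary with batch size and sequence length, so it is worth measuring throughput on your own hardware. A rough sketch using the offline vLLM API (model and prompt are illustrative):

# Sketch: rough generated-tokens/sec measurement with the offline vLLM API.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="granite-7b-instruct", dtype="float16")
params = SamplingParams(max_tokens=256, temperature=0.0)
prompts = ["Explain NUMA in one paragraph."] * 32   # small fixed batch

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tokens/sec")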

H100 with FP8 Inference

from vllm import LLM, SamplingParams

# H100 with FP8 quantization
llm = LLM(
    model="granite-34b-instruct",
    dtype="auto",  # weights load as FP16/BF16; FP8 comes from the quantization flag
    quantization="fp8",
    tensor_parallel_size=2,
    enforce_eager=False  # Use CUDA graphs
)

# 50% memory reduction, minimal accuracy loss
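
SamplingParams is imported above but not yet used; a minimal generation sketch with the same llm object might look like this (prompt and sampling values are illustrative):

# Sketch: run one generation with the FP8-quantized llm object above.
params = SamplingParams(temperature=0.2, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain FP8 quantization in two sentences."], params)
print(outputs[0].outputs[0].text)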

AMD MI300X: Memory Champion

Specifications

Architecture: CDNA 3
Compute Units: 304
Memory: 192GB HBM3
Memory Bandwidth: 5.3 TB/s
TDP: 750W

When to Choose MI300X

✅ Ideal for:

Memory-bound workloads: 192GB HBM3 fits a 70B-parameter model in FP16 on a single GPU
Long-context inference where the KV cache dominates memory use
Reducing GPU count and interconnect traffic for large models

❌ Limitations:

ROCm software ecosystem is less mature than CUDA, so plan extra validation time
No CUDA support; workloads must run on ROCm-compatible builds (for example, vLLM with the ROCm backend)
750W TDP, the highest of the three

RHEL AI with ROCm

# Install AMD GPU support
sudo dnf install -y rocm-hip-runtime rocm-hip-sdk

# Verify MI300X detection
rocm-smi --showproductname

# vLLM with ROCm backend (a ROCm build of vLLM detects the GPU automatically,
# so no separate device flag is needed)
python -m vllm.entrypoints.openai.api_server \
  --model granite-70b-instruct \
  --dtype float16
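
For a quick check from Python, a ROCm build of PyTorch exposes the MI300X through the familiar torch.cuda APIs; this sketch assumes torch was installed from AMD's ROCm wheels:

# Sketch: verify that a ROCm build of PyTorch sees the MI300X.
import torch

print(torch.version.hip)               # ROCm/HIP version string (None on CUDA builds)
print(torch.cuda.is_available())       # True when a ROCm GPU is usable
print(torch.cuda.get_device_name(0))   # device name reported by the driver
print(torch.cuda.get_device_properties(0).total_memory / 1e9, "GB")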

Memory Requirements by Model Size

| Model Size | Minimum GPU Memory | Recommended Configuration |
|---|---|---|
| 3B params | 8GB | A100 40GB (comfortable) |
| 7B params | 16GB | A100 40GB or H100 |
| 13B params | 32GB | A100 80GB |
| 34B params | 80GB | H100 or 2x A100 |
| 70B params | 150GB+ | MI300X or 2x H100 |

Memory Calculation Formula

def estimate_gpu_memory(params_billions, dtype_bytes=2, overhead=1.2):
    """
    Estimate GPU memory required for inference.
    
    Args:
        params_billions: Model parameters in billions
        dtype_bytes: 2 for FP16, 1 for INT8, 0.5 for INT4
        overhead: KV cache and activation memory (1.2 = 20% overhead)
    
    Returns:
        Required GPU memory in GB
    """
    base_memory = params_billions * dtype_bytes
    return base_memory * overhead

# Examples:
# 7B FP16:  7 * 2 * 1.2 = 16.8 GB
# 70B FP16: 70 * 2 * 1.2 = 168 GB
# 70B INT8: 70 * 1 * 1.2 = 84 GB
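
To tie the estimate back to real hardware, the sketch below compares the formula's output with the memory actually free on GPU 0 (assuming a GPU-enabled PyTorch install; the 7B figure is just an example):

# Sketch: compare the estimate above with free memory on GPU 0.
import torch

required_gb = estimate_gpu_memory(7, dtype_bytes=2)   # Granite 7B in FP16
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
free_gb = free_bytes / 1024**3

print(f"Need ~{required_gb:.1f} GB, {free_gb:.1f} GB free")
if free_gb < required_gb:
    print("Consider INT8/INT4 quantization or a larger GPU")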

Multi-GPU Configurations

Training Configurations

| Configuration | Total Memory | Use Case |
|---|---|---|
| 4x A100 80GB | 320GB | Fine-tune up to 34B models |
| 8x A100 80GB | 640GB | Train 70B models |
| 8x H100 80GB | 640GB | Train 70B+ models faster |
| 4x MI300X | 768GB | Memory-intensive training |

Interconnect Bandwidth

NVLink 4.0 (H100): 900 GB/s bidirectional
NVLink 3.0 (A100): 600 GB/s bidirectional
PCIe 5.0: 128 GB/s bidirectional
PCIe 4.0: 64 GB/s bidirectional

Recommendation: Always use NVLink for multi-GPU training
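
To confirm NVLink is actually active before launching a multi-GPU job, you can query NVML from Python; this is only a sketch using the pynvml bindings (nvidia-smi topo -m shows the same information on the command line):

# Sketch: count active NVLink links per GPU via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    active = 0
    for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
        try:
            if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
                active += 1
        except pynvml.NVMLError:
            break  # link index not present on this GPU
    print(f"GPU {i}: {active} active NVLink links")
pynvml.nvmlShutdown()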

Cost Analysis

On-Premises (3-Year TCO)

| GPU | Hardware Cost | Power (3yr) | Total TCO |
|---|---|---|---|
| A100 80GB | $15,000 | $4,200 | $19,200 |
| H100 80GB | $35,000 | $7,350 | $42,350 |
| MI300X | $20,000 | $7,875 | $27,875 |

Cloud Hourly Rates (Approximate)

| GPU | AWS | Azure | GCP |
|---|---|---|---|
| A100 40GB | $3.06/hr | $3.40/hr | $2.95/hr |
| A100 80GB | $4.10/hr | $4.50/hr | $4.00/hr |
| H100 80GB | $8.00/hr | $8.50/hr | $8.25/hr |
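
A quick sanity check on cloud versus on-prem: compute how many GPU-hours the 3-year TCO buys at the cloud hourly rate. The sketch below uses the approximate AWS figures from the tables above:

# Sketch: break-even GPU-hours between on-prem 3-year TCO and AWS hourly rates,
# using the approximate figures from the tables above.
tco = {"A100 80GB": 19_200, "H100 80GB": 42_350}
cloud_per_hour = {"A100 80GB": 4.10, "H100 80GB": 8.00}

hours_in_3_years = 3 * 365 * 24  # 26,280

for gpu, cost in tco.items():
    breakeven_hours = cost / cloud_per_hour[gpu]
    utilization = breakeven_hours / hours_in_3_years
    print(f"{gpu}: break-even at {breakeven_hours:,.0f} GPU-hours "
          f"(~{utilization:.0%} utilization over 3 years)")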

Decision Framework

START
  │
  ├─ Budget constrained?
  │   YES → A100 80GB
  │   NO ↓
  │
  ├─ Model size > 34B parameters?
  │   YES → H100 or MI300X
  │   NO ↓
  │
  ├─ Training large models?
  │   YES → H100 (speed matters)
  │   NO ↓
  │
  ├─ Inference-only workload?
  │   YES → A100 (cost-effective)
  │   NO ↓
  │
  └─ Mixed workload → H100 (future-proof)
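
The same flow can be captured in a small helper function; this is only a sketch of the tree above, with illustrative parameter names:

# Sketch of the decision tree above; parameter names are illustrative.
def recommend_gpu(budget_constrained: bool, model_size_b: float,
                  training_large_models: bool, inference_only: bool) -> str:
    if budget_constrained:
        return "A100 80GB"
    if model_size_b > 34:
        return "H100 or MI300X"
    if training_large_models:
        return "H100 (speed matters)"
    if inference_only:
        return "A100 (cost-effective)"
    return "H100 (future-proof)"

print(recommend_gpu(False, 70, True, False))  # -> "H100 or MI300X"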

RHEL AI Hardware Validation

# Run RHEL AI hardware validation
rhel-ai-validate --check hardware

# Expected output:
# ✓ GPU detected: NVIDIA H100 80GB
# ✓ Driver version: 535.129.03
# ✓ CUDA version: 12.4
# ✓ Memory: 80GB HBM3
# ✓ NVLink: Connected (4 links)
# ✓ PCIe: Gen5 x16
# 
# Hardware validation: PASSED

This article covers material from:

Chapter 2: Setup (GPU hardware requirements)
Chapter 4: Advanced Features (optimization strategies)

📚 Complete Hardware Planning Guide

Need help sizing your RHEL AI infrastructure?

Practical RHEL AI includes comprehensive hardware planning resources.

💰 Make the Right Investment

Don't overspend (or underspend) on GPU hardware. Practical RHEL AI helps you choose the perfect configuration for your workloads and budget.

Learn More → | Buy on Amazon →