Book Reference: This article is based on Chapter 2: Setup and Chapter 4: Advanced Features of Practical RHEL AI, covering GPU hardware requirements and optimization strategies.
Choosing the right GPU for your RHEL AI deployment is one of the most consequential infrastructure decisions you'll make. The wrong choice can mean wasted budget, performance bottlenecks, or an inability to run your target models.
Practical RHEL AI provides detailed hardware guidance that I'll summarize here, helping you match GPU capabilities to your specific workload requirements.
RHEL AI officially supports:
| GPU | Memory | Interconnect | Best For |
|---|---|---|---|
| NVIDIA A100 | 40GB / 80GB | NVLink 3.0 | Training & Inference |
| NVIDIA H100 | 80GB | NVLink 4.0 | Large Model Training |
| AMD MI300X | 192GB | Infinity Fabric | Memory-Bound Workloads |
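Before committing to an install, it's worth confirming that the hosts you plan to use actually report one of these parts. The sketch below is a convenience helper under stated assumptions (NVIDIA hardware only, `nvidia-smi` on the PATH; AMD hosts would query `rocm-smi` instead) that matches reported product names against the support matrix above:

```python
import subprocess

# GPUs from the support matrix above (substring match on the product name)
SUPPORTED = ("A100", "H100", "MI300X")

def detected_gpus():
    """Return the product name reported by nvidia-smi for each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

for name in detected_gpus():
    ok = any(part in name for part in SUPPORTED)
    print(f"{name}: {'supported' if ok else 'not in the RHEL AI support matrix'}")
```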
**NVIDIA A100**
- Architecture: Ampere
- CUDA Cores: 6,912
- Tensor Cores: 432 (3rd Gen)
- Memory: 40GB or 80GB HBM2e
- Memory Bandwidth: 2 TB/s
- TDP: 400W

**Ideal for:**

**Limitations:**
```bash
# Verify A100 detection
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv

# Expected output:
# NVIDIA A100-SXM4-80GB, 81920 MiB, 8.0
```
```bash
# Optimal vLLM settings for A100
python -m vllm.entrypoints.openai.api_server \
    --model granite-7b-instruct \
    --dtype float16 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 8192
```

**NVIDIA H100**
- Architecture: Hopper
- CUDA Cores: 16,896
- Tensor Cores: 528 (4th Gen)
- Memory: 80GB HBM3
- Memory Bandwidth: 3.35 TB/s
- TDP: 700W
- FP8 Tensor Core Performance: 3,958 TFLOPS

**Ideal for:**

**Limitations:**
| Workload | A100 80GB | H100 80GB | H100 Advantage |
|---|---|---|---|
| GPT-3 175B Training | 1.0x | 3.0x | 3x faster |
| Llama 70B Inference | 1.0x | 2.4x | 2.4x faster |
| Granite 7B Inference | 1.0x | 1.6x | 1.6x faster |
| Fine-tuning 13B | 1.0x | 2.2x | 2.2x faster |
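These ratios are easy to sanity-check on your own hardware: time a fixed batch of prompts and divide the number of generated tokens by wall-clock time. A rough sketch with vLLM (the model name and prompts are placeholders, not figures from the book):

```python
import time
from vllm import LLM, SamplingParams

# Placeholder model; substitute whatever you are actually benchmarking
llm = LLM(model="granite-7b-instruct", dtype="float16")

prompts = ["Summarize the benefits of GPU-accelerated inference."] * 32
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")
```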
```python
from vllm import LLM, SamplingParams

# H100 with FP8 quantization
llm = LLM(
    model="granite-34b-instruct",
    dtype="auto",                # weight precision is handled by the fp8 quantization below
    quantization="fp8",
    tensor_parallel_size=2,
    enforce_eager=False,         # use CUDA graphs
)

# Roughly 50% memory reduction versus FP16, with minimal accuracy loss
sampling = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain FP8 quantization in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```

**AMD MI300X**
- Architecture: CDNA 3
- Compute Units: 304
- Memory: 192GB HBM3
- Memory Bandwidth: 5.3 TB/s
- TDP: 750W

**Ideal for:**

**Limitations:**
```bash
# Install AMD GPU support
sudo dnf install -y rocm-hip-runtime rocm-hip-sdk

# Verify MI300X detection
rocm-smi --showproductname

# vLLM with ROCm backend
python -m vllm.entrypoints.openai.api_server \
    --model granite-70b-instruct \
    --dtype float16 \
    --device rocm
```

| Model Size | Minimum GPU Memory | Recommended Configuration |
|---|---|---|
| 3B params | 8GB | A100 40GB (comfortable) |
| 7B params | 16GB | A100 40GB or H100 |
| 13B params | 32GB | A100 80GB |
| 34B params | 80GB | H100 or 2x A100 |
| 70B params | 150GB+ | MI300X or 2x H100 |
```python
def estimate_gpu_memory(params_billions, dtype_bytes=2, overhead=1.2):
    """
    Estimate GPU memory required for inference.

    Args:
        params_billions: Model parameters in billions
        dtype_bytes: 2 for FP16, 1 for INT8, 0.5 for INT4
        overhead: KV cache and activation memory (1.2 = 20% overhead)

    Returns:
        Required GPU memory in GB
    """
    base_memory = params_billions * dtype_bytes
    return base_memory * overhead

# Examples:
# 7B FP16:  7 * 2 * 1.2 = 16.8 GB
# 70B FP16: 70 * 2 * 1.2 = 168 GB
# 70B INT8: 70 * 1 * 1.2 = 84 GB
```

| Configuration | Total Memory | Use Case |
|---|---|---|
| 4x A100 80GB | 320GB | Fine-tune up to 34B models |
| 8x A100 80GB | 640GB | Train 70B models |
| 8x H100 80GB | 640GB | Train 70B+ models faster |
| 4x MI300X | 768GB | Memory-intensive training |
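For any of these multi-GPU configurations, vLLM spreads the model across devices with tensor parallelism; `tensor_parallel_size` simply has to match the number of GPUs you dedicate to the model. A minimal sketch for a 70B-class model on four GPUs (model name assumed):

```python
from vllm import LLM

# Shard one 70B-class model across 4 GPUs; fast NVLink / Infinity Fabric
# interconnects matter here (see the bandwidth figures below).
llm = LLM(
    model="granite-70b-instruct",   # placeholder model name
    dtype="float16",
    tensor_parallel_size=4,         # one shard per GPU
    gpu_memory_utilization=0.90,
)
```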
- NVLink 4.0 (H100): 900 GB/s bidirectional
- NVLink 3.0 (A100): 600 GB/s bidirectional
- PCIe 5.0: 128 GB/s bidirectional
- PCIe 4.0: 64 GB/s bidirectional
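To see which GPU pairs in a server are actually NVLink-connected rather than routed over PCIe, print the driver's topology matrix; `NV#` entries indicate NVLink links, while `PIX`/`PHB`/`SYS` indicate PCIe paths:

```python
import subprocess

# Dump the GPU interconnect topology matrix reported by the driver
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)
```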
Recommendation: Always use NVLink for multi-GPU training.

| GPU | Hardware Cost | Power Cost (3 yr) | Total TCO |
|---|---|---|---|
| A100 80GB | $15,000 | $4,200 | $19,200 |
| H100 80GB | $35,000 | $7,350 | $42,350 |
| MI300X | $20,000 | $7,875 | $27,875 |
| GPU | AWS | Azure | GCP |
|---|---|---|---|
| A100 40GB | $3.06/hr | $3.40/hr | $2.95/hr |
| A100 80GB | $4.10/hr | $4.50/hr | $4.00/hr |
| H100 80GB | $8.00/hr | $8.50/hr | $8.25/hr |
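Taken together, the two tables give a rough break-even point: divide the on-prem three-year TCO by the lowest listed cloud rate to see how many GPU-hours of sustained use it takes to justify buying the card outright. A quick calculation using the figures above (ignoring reserved-instance discounts, staffing, and facility costs):

```python
# Figures taken from the TCO and cloud pricing tables above
tco = {"A100 80GB": 19_200, "H100 80GB": 42_350}       # 3-year on-prem TCO ($)
cloud_rate = {"A100 80GB": 4.00, "H100 80GB": 8.00}    # lowest listed $/hr

HOURS_3YR = 3 * 365 * 24  # hours in the 3-year TCO window

for gpu, cost in tco.items():
    hours = cost / cloud_rate[gpu]
    print(f"{gpu}: break-even at {hours:,.0f} GPU-hours "
          f"(~{hours / HOURS_3YR:.0%} utilization over 3 years)")
```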
```
START
 │
 ├─ Budget constrained?
 │    YES → A100 80GB
 │    NO ↓
 │
 ├─ Model size > 34B parameters?
 │    YES → H100 or MI300X
 │    NO ↓
 │
 ├─ Training large models?
 │    YES → H100 (speed matters)
 │    NO ↓
 │
 ├─ Inference-only workload?
 │    YES → A100 (cost-effective)
 │    NO ↓
 │
 └─ Mixed workload → H100 (future-proof)
```
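If you want the same logic inside a planning script, the flowchart translates directly into a small helper function; this is just a restatement of the tree above:

```python
def recommend_gpu(budget_constrained: bool, model_params_b: float,
                  training: bool, inference_only: bool) -> str:
    """Mirror of the decision flowchart above."""
    if budget_constrained:
        return "A100 80GB"
    if model_params_b > 34:
        return "H100 or MI300X"
    if training:
        return "H100 (speed matters)"
    if inference_only:
        return "A100 (cost-effective)"
    return "H100 (future-proof)"

print(recommend_gpu(budget_constrained=False, model_params_b=70,
                    training=True, inference_only=False))
# -> H100 or MI300X
```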
```bash
# Run RHEL AI hardware validation
rhel-ai-validate --check hardware

# Expected output:
# ✓ GPU detected: NVIDIA H100 80GB
# ✓ Driver version: 535.129.03
# ✓ CUDA version: 12.4
# ✓ Memory: 80GB HBM3
# ✓ NVLink: Connected (4 links)
# ✓ PCIe: Gen5 x16
#
# Hardware validation: PASSED
```

This article covers material from Chapter 2 (Setup) and Chapter 4 (Advanced Features) of Practical RHEL AI.
Need help sizing your RHEL AI infrastructure?
Practical RHEL AI includes comprehensive hardware planning resources:
Don't overspend (or underspend) on GPU hardware. Practical RHEL AI helps you choose the perfect configuration for your workloads and budget.
Learn More → | Buy on Amazon →