The GPU Provisioning Challenge
Setting up a GPU cluster manually is painful. NVIDIA drivers, CUDA toolkit, container runtime configuration, GPU operator deployment, MIG partitioning: miss one step and your data scientists are staring at "CUDA out of memory" errors instead of training models.
I've automated GPU cluster provisioning across bare-metal, VMware, and cloud environments using Ansible, and the patterns I've developed save days of manual work per cluster.
Architecture Overview
┌─────────────────────────────────────────────┐
│              Ansible Controller             │
│          (AAP / AWX / Command Line)         │
├─────────────────────────────────────────────┤
│  ┌───────────┐  ┌───────────┐  ┌─────────┐  │
│  │ Inventory │  │ Playbooks │  │  Roles  │  │
│  │ (dynamic) │  │           │  │         │  │
│  └───────────┘  └───────────┘  └─────────┘  │
├─────────────────────────────────────────────┤
│        GPU Worker Nodes (bare metal)        │
│   ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐    │
│   │ H100 │  │ H100 │  │ A100 │  │ A100 │    │
│   │  ×8  │  │  ×8  │  │  ×4  │  │  ×4  │    │
│   └──────┘  └──────┘  └──────┘  └──────┘    │
└─────────────────────────────────────────────┘

Base OS Preparation
# roles/gpu_base/tasks/main.yml
---
- name: Ensure kernel headers match running kernel
  ansible.builtin.dnf:
    name:
      - "kernel-devel-{{ ansible_kernel }}"
      - "kernel-headers-{{ ansible_kernel }}"
      - dkms
      - gcc
      - make
    state: present

- name: Disable nouveau driver
  ansible.builtin.copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    content: |
      blacklist nouveau
      options nouveau modeset=0
    mode: '0644'
  register: nouveau_blacklist

- name: Rebuild initramfs if nouveau was blacklisted
  ansible.builtin.command: dracut --force
  when: nouveau_blacklist.changed

- name: Configure IOMMU for GPU passthrough
  ansible.builtin.lineinfile:
    path: /etc/default/grub
    regexp: '^GRUB_CMDLINE_LINUX='
    line: 'GRUB_CMDLINE_LINUX="crashkernel=auto intel_iommu=on iommu=pt rd.driver.blacklist=nouveau"'
  register: grub_config

- name: Rebuild GRUB config
  ansible.builtin.command: grub2-mkconfig -o /boot/grub2/grub.cfg
  when: grub_config.changed

NVIDIA Driver Installation
# roles/nvidia_driver/tasks/main.yml
---
- name: Add NVIDIA CUDA repository
  ansible.builtin.dnf:
    name: "https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-repo-rhel9-12-6-local-12.6.0_560.28.03-1.x86_64.rpm"
    state: present
    disable_gpg_check: true

- name: Install NVIDIA driver and CUDA toolkit
  ansible.builtin.dnf:
    name:
      - nvidia-driver-latest
      - cuda-toolkit-12-6
      - nvidia-fabric-manager
    state: present

- name: Enable and start nvidia-fabricmanager
  ansible.builtin.systemd:
    name: nvidia-fabricmanager
    enabled: true
    state: started

- name: Verify GPU detection
  ansible.builtin.command: nvidia-smi
  register: nvidia_smi_output
  changed_when: false

- name: Display GPU information
  ansible.builtin.debug:
    var: nvidia_smi_output.stdout_lines

Container Runtime Configuration
# roles/gpu_container_runtime/tasks/main.yml
---
- name: Install NVIDIA Container Toolkit
  ansible.builtin.dnf:
    name: nvidia-container-toolkit
    state: present

- name: Configure containerd for NVIDIA runtime
  ansible.builtin.template:
    src: containerd-config.toml.j2
    dest: /etc/containerd/config.toml
    mode: '0644'
  notify: Restart containerd

- name: Verify NVIDIA container runtime
  ansible.builtin.command: >
    ctr run --rm --gpus 0
    docker.io/nvidia/cuda:12.6.0-base-ubi9
    gpu-test nvidia-smi
  register: container_gpu_test
  changed_when: false

MIG Partitioning for Multi-Tenancy
# roles/nvidia_mig/tasks/main.yml
---
- name: Enable MIG mode on A100/H100 GPUs
  ansible.builtin.command: "nvidia-smi -i {{ item }} -mig 1"
  loop: "{{ gpu_indices }}"
  register: mig_enable
  changed_when: "'Enabled' in mig_enable.stdout"

- name: Create MIG GPU instances
  ansible.builtin.command: >
    nvidia-smi mig -i {{ item.gpu }}
    -cgi {{ item.profile }}
    -C
  loop: "{{ mig_profiles }}"
  when: mig_enable.changed

# defaults/main.yml
# gpu_indices: [0, 1, 2, 3]
# mig_profiles:
#   - { gpu: 0, profile: "9,9" }  # 2x 3g.20gb slices on GPU 0
#   - { gpu: 1, profile: "0" }    # 1x 7g.40gb full profile on GPU 1

I covered the multi-tenant GPU orchestration patterns extensively at KubeCon EU 2026; the MIG partitioning decisions have real-world cost implications that most teams underestimate.
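Before committing to a set of MIG profiles, it pays to sanity-check that the requested slices actually fit on the GPU. Here is a minimal Python sketch for an A100-40GB; the profile table is my assumption based on NVIDIA's published MIG geometry, so verify it against `nvidia-smi mig -lgip` on your own hardware:

```python
# Sketch: first-pass check of a MIG partition plan against A100-40GB limits.
# Profile sizes are assumed from NVIDIA's published MIG profile tables;
# real placement rules (memory-slice alignment) are stricter than this sum.
MIG_PROFILES = {
    "1g.5gb":  {"compute": 1, "mem_gb": 5},
    "2g.10gb": {"compute": 2, "mem_gb": 10},
    "3g.20gb": {"compute": 3, "mem_gb": 20},
    "4g.20gb": {"compute": 4, "mem_gb": 20},
    "7g.40gb": {"compute": 7, "mem_gb": 40},
}

def check_plan(slices, max_compute=7, max_mem_gb=40):
    """Return (total_compute, total_mem_gb, fits) for a list of profile names."""
    compute = sum(MIG_PROFILES[s]["compute"] for s in slices)
    mem = sum(MIG_PROFILES[s]["mem_gb"] for s in slices)
    return compute, mem, compute <= max_compute and mem <= max_mem_gb

print(check_plan(["3g.20gb", "3g.20gb"]))            # (6, 40, True)
print(check_plan(["3g.20gb", "3g.20gb", "3g.20gb"]))  # (9, 60, False)
```

A plan that passes this check can still be rejected by the driver because of slice-placement constraints, but it catches the obvious over-subscriptions before you loop `nvidia-smi mig -cgi` over a fleet.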
Full Cluster Playbook
# playbooks/provision_gpu_cluster.yml
---
- name: Provision GPU Compute Cluster
  hosts: gpu_workers
  become: true
  vars:
    cuda_version: "12.6"
    driver_branch: "560"
    mig_enabled: true
  roles:
    - role: gpu_base
    - role: nvidia_driver
    - role: gpu_container_runtime
    - role: nvidia_mig
      when: mig_enabled
    - role: gpu_monitoring
  post_tasks:
    - name: Run GPU validation suite
      ansible.builtin.include_role:
        name: gpu_validation

Dynamic Inventory for GPU Nodes
#!/usr/bin/env python3
# inventory/gpu_inventory.py
import json


def get_gpu_hosts():
    # Query DCIM/CMDB (NetBox) for hosts with GPU hardware
    hosts = query_netbox_gpu_hosts()
    # Inventory scripts return group hosts as a list, with per-host
    # variables under _meta.hostvars
    inventory = {"gpu_workers": {"hosts": []}, "_meta": {"hostvars": {}}}
    for host in hosts:
        inventory["gpu_workers"]["hosts"].append(host["name"])
        inventory["_meta"]["hostvars"][host["name"]] = {
            "ansible_host": host["ip"],
            "gpu_count": host["gpu_count"],
            "gpu_model": host["gpu_model"],
            "gpu_indices": list(range(host["gpu_count"])),
        }
    return inventory


if __name__ == "__main__":
    print(json.dumps(get_gpu_hosts()))

Monitoring and Validation
# roles/gpu_monitoring/tasks/main.yml
---
- name: Deploy DCGM Exporter for Prometheus
  ansible.builtin.command: >
    ctr run -d --gpus 0
    --net-host
    nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
    dcgm-exporter

- name: Configure Prometheus scrape target
  ansible.builtin.template:
    src: prometheus-gpu-target.yml.j2
    dest: /etc/prometheus/targets.d/gpu-nodes.yml
    mode: '0644'

For the observability stack integration, I detail Prometheus + Grafana patterns on Kubernetes Recipes; the same dashboards work for GPU metrics.
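Once DCGM Exporter is up, validation scripts can read its Prometheus text endpoint directly instead of shelling out to nvidia-smi. A small sketch, using the standard dcgm-exporter gauge `DCGM_FI_DEV_GPU_UTIL`; in production you would fetch `http://<node>:9400/metrics` rather than parse a hard-coded sample:

```python
# Sketch: pull per-GPU utilisation out of dcgm-exporter's Prometheus
# exposition text. The SAMPLE string stands in for a real HTTP fetch
# of the exporter's /metrics endpoint (default port 9400).
import re

SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb"} 12
"""

def gpu_utilisation(metrics_text):
    """Map GPU index -> utilisation % from Prometheus exposition text."""
    pattern = re.compile(
        r'^DCGM_FI_DEV_GPU_UTIL\{[^}]*gpu="(\d+)"[^}]*\}\s+(\d+)', re.M
    )
    return {int(gpu): int(val) for gpu, val in pattern.findall(metrics_text)}

print(gpu_utilisation(SAMPLE))  # {0: 87, 1: 12}
```

The same pattern works for any DCGM field (memory used, NVLink throughput, XID errors), which makes it easy to turn scrape output into pass/fail checks in the validation role.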
Lessons from Production
After provisioning GPU clusters across multiple enterprise environments, here's what I wish I'd known earlier:
- Fabric Manager is critical for multi-GPU NVLink communication: without it, your H100s run at PCIe speeds
- MIG profiles are a business decision, not a technical one: talk to your ML teams first
- Driver upgrades require coordination: never auto-update GPU drivers in production
- Power and cooling are the real bottleneck: 8× H100 pulls 5.6 kW, so plan your rack power accordingly
- Test with real workloads, not just nvidia-smi: a passing health check doesn't mean training will work
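The power point deserves arithmetic. A back-of-envelope sketch, assuming ~700 W per H100 SXM (where the 5.6 kW figure above comes from) plus a host overhead that you should replace with your vendor's numbers:

```python
# Back-of-envelope node and rack power check. 700 W per H100 SXM is the
# commonly quoted TDP; host_overhead_w (CPUs, fans, NICs) is an assumption
# to replace with your chassis spec sheet.
def node_power_kw(gpus, watts_per_gpu=700, host_overhead_w=0):
    """Total node draw in kW for a given GPU count."""
    return (gpus * watts_per_gpu + host_overhead_w) / 1000

def nodes_per_rack(rack_budget_kw, node_kw):
    """Whole nodes that fit inside a rack power budget."""
    return int(rack_budget_kw // node_kw)

print(node_power_kw(8))                       # 5.6 kW for the GPUs alone
print(node_power_kw(8, host_overhead_w=2000))  # ~7.6 kW with host overhead
print(nodes_per_rack(17.0, node_power_kw(8, host_overhead_w=2000)))
```

Run the last line against your actual rack budget before ordering hardware; a 17 kW rack holds far fewer 8-GPU nodes than the floor space suggests.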
The complete Ansible roles for GPU cluster provisioning are available in my Ansible Pilot tutorials, where I walk through each component step by step.
