Deploying RHEL AI with Ansible: From Bare Metal to Production
Red Hat Enterprise Linux AI (RHEL AI) brings InstructLab, Granite models, and optimized inference to the enterprise. But deploying it at scale, across multiple nodes with proper GPU configuration, model management, and monitoring, requires automation.
I've been deploying RHEL AI platforms for enterprise clients through Open Empower, and Ansible is the backbone of every deployment.
What is RHEL AI?
RHEL AI is a foundation model platform built on RHEL that includes:
- InstructLab: fine-tuning framework for customizing Granite models
- vLLM: high-performance inference server
- Granite models: IBM's open-source LLMs optimized for enterprise
- bootc: image-based OS for immutable infrastructure
Architecture
┌────────────────────────────────────────────────┐
│          Ansible Automation Platform           │
│        (Orchestration & Configuration)         │
├────────────────────────────────────────────────┤
│                                                │
│  ┌──────────────┐  ┌────────────────────────┐  │
│  │   RHEL AI    │  │        RHEL AI         │  │
│  │    Node 1    │  │         Node 2         │  │
│  │  ┌────────┐  │  │     ┌────────────┐     │  │
│  │  │  vLLM  │  │  │     │InstructLab │     │  │
│  │  │Granite │  │  │     │Fine-tuning │     │  │
│  │  │ 3.1 8B │  │  │     │Granite 3.1 │     │  │
│  │  │ H100×4 │  │  │     │   A100×8   │     │  │
│  │  └────────┘  │  │     └────────────┘     │  │
│  └──────────────┘  └────────────────────────┘  │
└────────────────────────────────────────────────┘

Prerequisite: GPU Setup
First, provision the GPU infrastructure; I covered this in detail in my GPU cluster provisioning guide. Ensure the NVIDIA drivers and container toolkit are installed.
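Before running the main playbook, I like to gate the deployment on a quick pre-flight check. A minimal sketch (the play, file name, and `expected_gpu_count` variable are illustrative, not part of the deployment playbook below):

```yaml
# playbooks/preflight_gpu.yml -- illustrative pre-flight check
---
- name: Verify GPU stack before RHEL AI install
  hosts: rhel_ai_nodes
  become: true
  vars:
    expected_gpu_count: 4  # adjust per node class
  tasks:
    - name: Query visible GPUs
      ansible.builtin.command: >
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
      register: gpu_query
      changed_when: false

    - name: Assert the expected GPU count is visible
      ansible.builtin.assert:
        that:
          - gpu_query.stdout_lines | length == expected_gpu_count
        fail_msg: "Expected {{ expected_gpu_count }} GPUs, found {{ gpu_query.stdout_lines | length }}"

    - name: Confirm the NVIDIA container toolkit is installed
      ansible.builtin.command: nvidia-ctk --version
      changed_when: false
```

Failing fast here is much cheaper than discovering a missing driver halfway through a model download.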
RHEL AI Base Deployment
# playbooks/deploy_rhel_ai.yml
---
- name: Deploy RHEL AI Platform
hosts: rhel_ai_nodes
become: true
vars:
rhel_ai_version: "1.4"
model_name: "granite-3.1-8b-instruct"
vllm_port: 8000
gpu_count: 4
tasks:
- name: Register system with Red Hat Subscription
community.general.redhat_subscription:
state: present
username: "{{ rhsm_user }}"
password: "{{ rhsm_password }}"
pool_ids: "{{ rhel_ai_pool_id }}"
- name: Install RHEL AI packages
ansible.builtin.dnf:
name:
- instructlab
- vllm
- python3.11-vllm
state: present
- name: Download Granite model
ansible.builtin.command: >
ilab model download
--repository instructlab/granite-3.1-8b-instruct
--release v3.1
args:
creates: "/var/lib/instructlab/models/granite-3.1-8b-instruct"
environment:
HF_TOKEN: "{{ huggingface_token }}"
- name: Configure vLLM serving
ansible.builtin.template:
src: vllm-config.yml.j2
dest: /etc/vllm/config.yml
mode: '0644'
notify: Restart vLLM service
- name: Deploy vLLM systemd service
ansible.builtin.template:
src: vllm.service.j2
dest: /etc/systemd/system/vllm.service
mode: '0644'
notify:
- Reload systemd
- Restart vLLM service
- name: Enable and start vLLM
ansible.builtin.systemd:
name: vllm
enabled: true
        state: started

  handlers:
    - name: Reload systemd
      ansible.builtin.systemd:
        daemon_reload: true

    - name: Restart vLLM service
      ansible.builtin.systemd:
        name: vllm
        state: restarted

vLLM Service Template
# templates/vllm.service.j2
[Unit]
Description=vLLM Inference Server
After=network.target nvidia-fabricmanager.service
[Service]
Type=simple
User=vllm
Group=vllm
Environment=CUDA_VISIBLE_DEVICES=0,1,2,3
ExecStart=/usr/bin/python3.11 -m vllm.entrypoints.openai.api_server \
    --model /var/lib/instructlab/models/{{ model_name }} \
    --tensor-parallel-size {{ gpu_count }} \
    --port {{ vllm_port }} \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target

InstructLab Fine-Tuning Automation
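The fine-tuning play below pulls a custom taxonomy from Git. For orientation, a knowledge contribution in that repo is a qna.yaml roughly along these lines. This is a simplified sketch: the exact schema is set by the InstructLab taxonomy version you pin, and the path, domain, and content here are invented:

```yaml
# taxonomy/knowledge/acme/products/qna.yaml -- simplified, invented example
version: 3
domain: acme_products
created_by: platform-team
seed_examples:
  - context: |
      The Acme Widget Pro supports firmware versions 2.x and later.
    questions_and_answers:
      - question: Which firmware versions does the Acme Widget Pro support?
        answer: Firmware versions 2.x and later.
document_outline: Acme product support matrix
```

The real schema requires more seed examples per entry than shown; validate against the upstream taxonomy docs before committing.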
# playbooks/finetune_model.yml
---
- name: Fine-tune Granite Model with InstructLab
hosts: training_nodes
become: true
vars:
taxonomy_repo: "https://gitlab.internal.acme.com/ai/custom-taxonomy.git"
base_model: "granite-3.1-8b-instruct"
output_model: "granite-3.1-8b-acme-v1"
tasks:
- name: Clone custom taxonomy
ansible.builtin.git:
repo: "{{ taxonomy_repo }}"
dest: /var/lib/instructlab/taxonomy
version: main
- name: Initialize InstructLab
ansible.builtin.command: ilab config init
args:
creates: /var/lib/instructlab/config.yaml
- name: Generate synthetic training data
ansible.builtin.command: >
ilab data generate
--model {{ base_model }}
--taxonomy-path /var/lib/instructlab/taxonomy
--output-dir /var/lib/instructlab/datasets/{{ output_model }}
--num-instructions 500
environment:
CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
async: 7200 # 2 hour timeout for data generation
poll: 60
- name: Train model
ansible.builtin.command: >
ilab model train
--input-dir /var/lib/instructlab/datasets/{{ output_model }}
--model-path /var/lib/instructlab/models/{{ base_model }}
--output-dir /var/lib/instructlab/models/{{ output_model }}
--device cuda
--num-epochs 3
environment:
CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
async: 14400 # 4 hour timeout for training
poll: 120
- name: Evaluate fine-tuned model
ansible.builtin.command: >
ilab model evaluate
--model /var/lib/instructlab/models/{{ output_model }}
--benchmark mmlu
register: eval_result
- name: Display evaluation results
ansible.builtin.debug:
        var: eval_result.stdout_lines

Health Checks and Monitoring
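The monitoring role below renders a Prometheus file-service-discovery target list from templates/vllm-metrics.yml.j2. A minimal sketch of that template (the label names are my own convention; vLLM's OpenAI-compatible server exposes Prometheus metrics at /metrics on its serving port):

```yaml
# templates/vllm-metrics.yml.j2 -- illustrative file_sd target list
- targets:
    - "{{ inventory_hostname }}:{{ vllm_port }}"
  labels:
    job: vllm
    model: "{{ model_name }}"
```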
# roles/rhel_ai_monitoring/tasks/main.yml
---
- name: Deploy inference health check script
ansible.builtin.copy:
dest: /usr/local/bin/vllm-healthcheck.sh
mode: '0755'
content: |
#!/bin/bash
curl -sf http://localhost:{{ vllm_port }}/health || exit 1
# Check GPU memory usage
GPU_MEM=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk '{s+=$1} END {print s}')
if [ "$GPU_MEM" -gt "{{ gpu_memory_threshold }}" ]; then
echo "WARNING: GPU memory usage high: ${GPU_MEM}MB"
fi
- name: Configure Prometheus metrics endpoint
ansible.builtin.template:
src: vllm-metrics.yml.j2
dest: /etc/prometheus/targets.d/vllm.yml
mode: '0644'
- name: Test inference endpoint
ansible.builtin.uri:
url: "http://localhost:{{ vllm_port }}/v1/chat/completions"
method: POST
body_format: json
body:
model: "{{ model_name }}"
messages:
- role: user
content: "Hello, are you operational?"
max_tokens: 50
status_code: 200
  register: inference_test
  until: inference_test.status == 200
  retries: 3
  delay: 5

Production Deployment Checklist
Hereβs what I verify on every RHEL AI deployment:
- GPU verification: nvidia-smi shows all expected GPUs with the correct driver version
- Model integrity: checksums match the published hashes
- Inference latency: first-token latency under SLA (typically under 500 ms for 8B models)
- Memory headroom: at least 10% GPU memory free under load
- TLS termination: never expose vLLM directly; always put it behind a reverse proxy with mTLS
- Rate limiting: prevent a single tenant from monopolizing inference capacity
- Logging: request/response logging for compliance (with PII redaction)
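On the memory-headroom point: the healthcheck script above sums per-GPU usage with awk, and that aggregation is easy to sanity-check offline by piping in sample values in place of real nvidia-smi output:

```shell
# Feed fake per-GPU memory readings (MiB) through the same awk
# aggregation used in vllm-healthcheck.sh; on a live node these
# numbers would come from nvidia-smi --query-gpu=memory.used.
printf '40960\n39870\n40110\n40500\n' \
  | awk '{s+=$1} END {print s}'   # prints 161440
```

If the summed value crosses your threshold, the script only warns; wire it into your alerting pipeline if you want paging behavior.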
I presented the full multi-tenant GPU orchestration story at Red Hat Summit 2026 and KubeCon EU 2026; the production patterns for RHEL AI scale directly from these Ansible foundations.
Resources
- Ansible Pilot: step-by-step RHEL AI deployment tutorials
- Ansible by Example: role development patterns used in these playbooks
- Kubernetes Recipes: for deploying vLLM on OpenShift with GPU operators
- Open Empower: consulting services for enterprise RHEL AI deployments
