Ansible + AI: Using LLMs to Generate and Validate Playbooks
LLMs can write Ansible playbooks, but should you trust them? Here's how to use AI for playbook generation with proper validation, linting, and safety guardrails.
The promise of self-healing infrastructure has been around for years, but LLMs have finally made it practical. Instead of writing rigid if-then rules for every possible failure, you can build an agent that understands context and selects the right remediation.
I've built several of these systems. Here's the practical blueprint.
Monitoring Stack (Prometheus/Alertmanager)
    ↓ webhook
AI Decision Engine (LLM)
    ↓ selects
Ansible Playbook Library
    ↓ executes
Infrastructure (via Ansible)
    ↓ verifies
Health Check → Success/Escalate

Configure Alertmanager to send webhooks to your agent:
# alertmanager.yml
receivers:
  - name: ai-agent
    webhook_configs:
      - url: 'http://healing-agent.internal:8080/alert'
        send_resolved: true
        max_alerts: 5

Your agent receives structured alert data:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AlertPayload(BaseModel):
    alerts: list[dict]

@app.post("/alert")
async def handle_alert(payload: AlertPayload):
    for alert in payload.alerts:
        if alert["status"] == "firing":
            await process_alert(alert)

The LLM analyzes the alert and selects from your playbook library:
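For context, each element of `payload.alerts` looks roughly like this. The shape follows Alertmanager's v4 webhook format; the label and annotation values here are illustrative, not from a real cluster:

```python
# A trimmed example of one alert object as delivered by Alertmanager's
# webhook (v4 payload format). Real payloads carry more fields
# (endsAt, generatorURL, ...).
sample_alert = {
    "status": "firing",
    "labels": {
        "alertname": "DiskSpaceLow",
        "instance": "web-01:9100",
        "severity": "warning",
    },
    "annotations": {
        "summary": "Disk usage above 90% on web-01",
    },
    "startsAt": "2025-01-15T10:32:00Z",
    "fingerprint": "a1b2c3d4e5f60718",
}

# The agent mostly keys off the labels, plus the fingerprint, which is
# used later to check whether the alert resolved.
alert_name = sample_alert["labels"]["alertname"]
```

The playbook library the LLM chooses from is a plain catalog: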
import json

PLAYBOOK_CATALOG = {
    "restart_service": {
        "description": "Restart a systemd service",
        "params": ["service_name", "target_hosts"],
        "risk": "low",
        "playbook": "playbooks/restart-service.yml",
    },
    "clear_disk_space": {
        "description": "Clean temp files and old logs to free disk space",
        "params": ["target_hosts", "min_free_gb"],
        "risk": "low",
        "playbook": "playbooks/clear-disk.yml",
    },
    "scale_pods": {
        "description": "Scale Kubernetes deployment replicas",
        "params": ["namespace", "deployment", "replicas"],
        "risk": "medium",
        "playbook": "playbooks/scale-deployment.yml",
    },
    "failover_database": {
        "description": "Promote database replica to primary",
        "params": ["cluster_name"],
        "risk": "high",
        "playbook": "playbooks/db-failover.yml",
    },
}

# `llm` is your async LLM client wrapper -- any provider works here.
async def decide_action(alert: dict) -> dict:
    prompt = f"""You are an infrastructure reliability agent.
Analyze this alert and select the best remediation action.

Alert: {json.dumps(alert, indent=2)}

Available playbooks: {json.dumps(PLAYBOOK_CATALOG, indent=2)}

Respond with JSON: {{"playbook": "name", "params": {{}}, "reasoning": "..."}}
If no automated action is appropriate, respond with {{"playbook": "escalate", "reasoning": "..."}}
"""
    response = await llm.generate(prompt, temperature=0)
    return json.loads(response)

Never skip the safety layer. Risk-based approval:
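The first line of that safety layer is mechanical validation: no model output should reach Ansible unchecked, even at temperature 0. A minimal sketch of my own, assuming the catalog structure above, that rejects unknown playbooks and missing required parameters:

```python
# Minimal sketch: reject any LLM decision that names an unknown playbook
# or omits a required parameter. "escalate" always passes through.
def validate_decision(decision: dict, catalog: dict) -> list[str]:
    """Return a list of validation errors; empty list means OK."""
    name = decision.get("playbook")
    if name == "escalate":
        return []
    entry = catalog.get(name)
    if entry is None:
        return [f"unknown playbook: {name!r}"]
    params = decision.get("params", {})
    return [
        f"missing param: {required}"
        for required in entry["params"]
        if required not in params
    ]
```

Any non-empty error list should route straight to escalation. With the decision validated, risk-based approval decides how it executes: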
import asyncio

async def execute_with_safety(decision: dict, alert: dict):
    playbook = PLAYBOOK_CATALOG.get(decision["playbook"])
    if not playbook:
        await escalate_to_human(alert, "Unknown playbook selected")
        return

    risk = playbook["risk"]

    if risk == "low":
        # Auto-execute, notify after
        result = await run_ansible(playbook, decision["params"])
        await notify_team(f"Auto-remediated: {decision['reasoning']}")
    elif risk == "medium":
        # Execute with 5-minute delay (cancel window)
        msg = await notify_team(
            f"⚠️ Will execute in 5 min: {playbook['description']}\n"
            f"Reason: {decision['reasoning']}\n"
            f"React with ❌ to cancel."
        )
        await asyncio.sleep(300)
        if not await was_cancelled(msg):
            result = await run_ansible(playbook, decision["params"])
    elif risk == "high":
        # Always require human approval
        await escalate_to_human(alert, decision["reasoning"])

import ansible_runner
async def run_ansible(playbook: dict, params: dict) -> dict:
    result = ansible_runner.run(
        playbook=playbook["playbook"],
        extravars=params,
        timeout=300,
    )
    return {
        "status": result.status,
        "rc": result.rc,
        "stats": result.stats,
    }

Example playbook for disk cleanup:
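One caveat before the playbook: the extravars originate from a model, so I'd sanitize them before they reach `ansible_runner`. This is a conservative allowlist sketch of my own convention, not an ansible_runner feature:

```python
import re

# Conservative allowlist for values the LLM may pass through as
# extravars: hostnames/groups (including wildcards), service names,
# simple identifiers. Anything else is rejected outright.
SAFE_VALUE = re.compile(r"^[A-Za-z0-9._:*\[\]-]+$")

def sanitize_params(params: dict) -> dict:
    """Coerce values to strings and reject anything outside the allowlist."""
    clean = {}
    for key, value in params.items():
        text = str(value)  # Ansible templating casts numerics back as needed
        if not SAFE_VALUE.match(text):
            raise ValueError(f"rejected unsafe value for {key!r}: {text!r}")
        clean[key] = text
    return clean
```

Call something like `sanitize_params(decision["params"])` inside `run_ansible` and treat a `ValueError` as an escalation. Now the disk-cleanup playbook itself: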
# playbooks/clear-disk.yml
---
- name: Clear disk space
  hosts: "{{ target_hosts }}"
  become: true
  tasks:
    - name: Remove old journal logs
      ansible.builtin.command: journalctl --vacuum-time=7d

    - name: Remove unused packages
      ansible.builtin.dnf:
        autoremove: true

    - name: Find temp files older than 7 days
      ansible.builtin.find:
        paths: /tmp
        age: 7d
        recurse: true
      register: old_tmp

    - name: Delete old temp files
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_tmp.files }}"

    - name: Verify free space
      ansible.builtin.assert:
        that:
          - >-
            ansible_facts.mounts | selectattr('mount', 'equalto', '/')
            | map(attribute='size_available') | first
            > (min_free_gb | int * 1073741824)
        fail_msg: "Disk space still below {{ min_free_gb }}GB after cleanup"

After remediation, verify the fix worked:
async def verify_remediation(alert: dict, action_result: dict):
    # Wait for metrics to stabilize
    await asyncio.sleep(60)

    # Check if the alert resolved
    resolved = await check_alert_resolved(alert["fingerprint"])
    if resolved:
        await log_success(alert, action_result)
    else:
        # Try once more or escalate
        await escalate_to_human(
            alert,
            f"Auto-remediation attempted but alert persists. "
            f"Action taken: {action_result}"
        )
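A single fixed sleep is the crude version. In production I'd poll with a deadline instead, so the agent reacts as soon as the alert clears but still gives slow metric pipelines a bounded window. A sketch; the `check` argument stands in for `check_alert_resolved`:

```python
import asyncio

async def wait_for_resolution(check, fingerprint: str,
                              interval: float = 30.0,
                              deadline: float = 300.0) -> bool:
    """Poll `check(fingerprint)` until it returns True or the deadline passes."""
    elapsed = 0.0
    while elapsed < deadline:
        if await check(fingerprint):
            return True
        await asyncio.sleep(interval)
        elapsed += interval
    return False
```

A `False` return feeds the same escalation path as above, with the full action result attached.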
The best self-healing systems augment your team, not replace them. Start small, build trust, expand gradually.
Want to build self-healing infrastructure for your team? I help organizations design AI-powered automation systems. Let's connect.
AI & Cloud Advisor with 18+ years experience. Author of 8 technical books, creator of Ansible Pilot, and instructor at CopyPasteLearn Academy. Speaker at KubeCon EU & Red Hat Summit 2026.