Ansible + AI: Using LLMs to Generate and Validate Playbooks
LLMs can write Ansible playbooks, but should you trust them? Here's how to use AI for playbook generation with proper validation, linting, and safety guardrails.
PagerDuty fires. Disk usage on prod-db-01 hit 90%. An SRE wakes up, SSHs in, cleans up old logs, and goes back to sleep. The exact same alert, the exact same fix, for the third time this month.
Event-Driven Ansible (EDA) eliminates this. The event fires, EDA catches it, runs the remediation playbook, and the SRE sleeps through the night.
Event Source → EDA Controller → Rule Match → Action (Playbook)
Examples:
Prometheus alert → EDA → disk_cleanup.yml
Kafka message → EDA → scale_deployment.yml
Webhook → EDA → security_response.yml

pip install ansible-rulebook ansible-runner
# Or via container
podman run -it --rm \
  -v ./rulebooks:/rulebooks \
  -v ./playbooks:/playbooks \
  quay.io/ansible/ansible-rulebook:latest

# rulebooks/disk-cleanup.yml
---
- name: Auto-remediate disk space alerts
  hosts: all
  sources:
    - ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 5000
  rules:
    - name: Disk space critical
      condition: event.alerts[0].labels.alertname == "DiskSpaceCritical"
      action:
        run_playbook:
          name: playbooks/disk-cleanup.yml
          extra_vars:
            target_host: "{{ event.alerts[0].labels.instance }}"
            mount_point: "{{ event.alerts[0].labels.mountpoint }}"
    - name: Pod CrashLoopBackOff
      condition: event.alerts[0].labels.alertname == "PodCrashLooping"
      action:
        run_playbook:
          name: playbooks/pod-restart.yml
          extra_vars:
            namespace: "{{ event.alerts[0].labels.namespace }}"
            pod: "{{ event.alerts[0].labels.pod }}"

# playbooks/disk-cleanup.yml
---
- name: Clean up disk space
  hosts: "{{ target_host }}"
  become: true
  tasks:
    # Read-only check, so never report it as a change
    - name: Get current disk usage
      command: df -h {{ mount_point }}
      register: disk_before
      changed_when: false
    - name: Clean old logs (>7 days)
      find:
        paths: /var/log
        age: 7d
        recurse: true
        file_type: file
      register: old_logs
    - name: Remove old logs
      file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_logs.files }}"
      when: old_logs.files | length > 0
    # Debian/Ubuntu hosts only: the apt module is not available elsewhere
    - name: Clean package cache
      apt:
        autoclean: true
        autoremove: true
    - name: Clean old journal logs
      command: journalctl --vacuum-time=3d
    - name: Get disk usage after cleanup
      command: df -h {{ mount_point }}
      register: disk_after
      changed_when: false
    # slack_webhook must be supplied as a variable (for example via extra_vars)
    - name: Notify Slack
      uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: |
            ✅ Auto-remediated disk space on {{ target_host }}
            Before: {{ disk_before.stdout_lines[1] }}
            After: {{ disk_after.stdout_lines[1] }}
            Cleaned: {{ old_logs.files | length }} old log files

- name: Certificate expiring
  condition: event.alerts[0].labels.alertname == "SSLCertExpiring"
  action:
    run_playbook:
      name: playbooks/renew-cert.yml

- name: Node NotReady
  condition: event.alerts[0].labels.alertname == "KubeNodeNotReady"
  action:
    run_playbook:
      name: playbooks/node-recovery.yml

- name: Brute force detected
  condition: event.alerts[0].labels.alertname == "SSHBruteForce"
  action:
    run_playbook:
      name: playbooks/block-ip.yml
      extra_vars:
        source_ip: "{{ event.alerts[0].labels.source_ip }}"

Auto-remediation without guardrails is dangerous. Always include:
# In every remediation playbook
- name: Safety checks
  block:
    - name: Check this host isn't in maintenance
      uri:
        url: "{{ cmdb_api }}/hosts/{{ inventory_hostname }}/status"
      register: host_status
      failed_when: host_status.json.status == "maintenance"
    - name: Check remediation hasn't run in last hour
      stat:
        path: "/tmp/eda-lock-{{ ansible_play_name }}"
      register: lock_file
    # ansible_date_time comes from gathered facts, so fact gathering must be enabled
    - name: Skip if recently remediated
      fail:
        msg: "Remediation already ran recently; escalating to human"
      when: lock_file.stat.exists and ((ansible_date_time.epoch | int) - (lock_file.stat.mtime | int)) < 3600
    - name: Create lock file
      file:
        path: "/tmp/eda-lock-{{ ansible_play_name }}"
        state: touch

EDA pairs perfectly with the Prometheus/Alertmanager stack I detail at Kubernetes Recipes:
# Alertmanager config: route alerts to EDA
route:
  receiver: pagerduty  # default route; everything not matched below still pages a human
  routes:
    - match:
        auto_remediate: "true"
      receiver: eda
receivers:
  - name: pagerduty
    # your existing pagerduty_configs go here (omitted)
  - name: eda
    webhook_configs:
      - url: http://eda-controller:5000/endpoint

Only alerts tagged auto_remediate: "true" go to EDA. Everything else goes to PagerDuty as usual. Start small, prove value, expand gradually.
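For an alert to take the EDA route, it has to carry the auto_remediate label when it leaves Prometheus. Here is a minimal sketch of an alerting rule that sets it. It assumes node_exporter filesystem metrics are being scraped. The group name, the 10% free-space threshold, the 10m hold time, and the severity label are illustrative choices, not part of the setup described above.

```yaml
# prometheus/rules/disk.yml (hypothetical file name)
groups:
  - name: disk-remediation          # illustrative group name
    rules:
      - alert: DiskSpaceCritical    # matches the alertname the rulebook condition checks
        # Fire when less than 10% of a filesystem is free (roughly 90% used).
        # The fstype filter is an assumption to skip pseudo filesystems.
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 10m                    # assumed hold time to avoid flapping
        labels:
          severity: critical
          auto_remediate: "true"    # opt this alert in to the EDA route
        annotations:
          summary: "Low disk space on {{ $labels.instance }} ({{ $labels.mountpoint }})"
```

The expression keeps the instance and mountpoint labels, which the rulebook passes to the playbook as target_host and mount_point. The instance label often includes a port (for example host:9100), so it may need to be mapped to an inventory hostname before the playbook can target it.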
For comprehensive Ansible automation patterns, see Ansible Pilot and step-by-step examples at Ansible by Example.
Before EDA:
15 auto-remediable alerts/week × 30 min/alert × $80/hr = $600/week
After EDA:
15 alerts auto-resolved in seconds = $0/week
Setup cost: 2 days of engineering
Payback period: about 2 weeks (assuming 8-hour days: 2 days × 8 hr × $80/hr = $1,280 setup cost ÷ $600/week saved)

Let humans handle novel problems. Let EDA handle the repeatable ones.
AI & Cloud Advisor with 18+ years experience. Author of 8 technical books, creator of Ansible Pilot, and instructor at CopyPasteLearn Academy. Speaker at KubeCon EU & Red Hat Summit 2026.