Self-Healing Infrastructure
The promise of self-healing infrastructure has been around for years, but LLMs have finally made it practical. Instead of writing rigid if-then rules for every possible failure, you can build an agent that understands context and selects the right remediation.
I've built several of these systems. Here's the practical blueprint.
Architecture Overview
```
Monitoring Stack (Prometheus/Alertmanager)
        ↓ webhook
AI Decision Engine (LLM)
        ↓ selects
Ansible Playbook Library
        ↓ executes
Infrastructure (via Ansible)
        ↓ verifies
Health Check → Success / Escalate
```

Step 1: Alert Ingestion
Configure Alertmanager to send webhooks to your agent:
```yaml
# alertmanager.yml
receivers:
  - name: ai-agent
    webhook_configs:
      - url: 'http://healing-agent.internal:8080/alert'
        send_resolved: true
        max_alerts: 5
```

Your agent receives structured alert data:
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AlertPayload(BaseModel):
    alerts: list[dict]

@app.post("/alert")
async def handle_alert(payload: AlertPayload):
    for alert in payload.alerts:
        if alert["status"] == "firing":
            await process_alert(alert)
```

Step 2: LLM Decision Engine
The LLM analyzes the alert and selects from your playbook library:
```python
import json

PLAYBOOK_CATALOG = {
    "restart_service": {
        "description": "Restart a systemd service",
        "params": ["service_name", "target_hosts"],
        "risk": "low",
        "playbook": "playbooks/restart-service.yml",
    },
    "clear_disk_space": {
        "description": "Clean temp files and old logs to free disk space",
        "params": ["target_hosts", "min_free_gb"],
        "risk": "low",
        "playbook": "playbooks/clear-disk.yml",
    },
    "scale_pods": {
        "description": "Scale Kubernetes deployment replicas",
        "params": ["namespace", "deployment", "replicas"],
        "risk": "medium",
        "playbook": "playbooks/scale-deployment.yml",
    },
    "failover_database": {
        "description": "Promote database replica to primary",
        "params": ["cluster_name"],
        "risk": "high",
        "playbook": "playbooks/db-failover.yml",
    },
}

async def decide_action(alert: dict) -> dict:
    prompt = f"""You are an infrastructure reliability agent.
Analyze this alert and select the best remediation action.

Alert: {json.dumps(alert, indent=2)}

Available playbooks: {json.dumps(PLAYBOOK_CATALOG, indent=2)}

Respond with JSON: {{"playbook": "name", "params": {{}}, "reasoning": "..."}}
If no automated action is appropriate, respond with {{"playbook": "escalate", "reasoning": "..."}}
"""
    # `llm` is whatever async client you use; temperature 0 keeps decisions deterministic
    response = await llm.generate(prompt, temperature=0)
    return json.loads(response)
```

Step 3: Safety Gates
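One hardening note before the gates: `decide_action` ends with a bare `json.loads`, which raises if the model wraps its reply in a markdown fence or returns garbage. A defensive parser is cheap insurance; this is a sketch, and the escalate fallback simply mirrors the prompt's contract:

```python
import json

def parse_decision(raw: str) -> dict:
    """Parse the model's reply, tolerating a ```json fence, and fall
    back to escalation instead of raising on malformed output."""
    text = raw.strip()
    if text.startswith("```"):
        # Strip a ```json ... ``` wrapper if the model added one
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        decision = json.loads(text)
    except json.JSONDecodeError:
        decision = None
    if not isinstance(decision, dict) or "playbook" not in decision:
        return {"playbook": "escalate", "reasoning": "unparseable model output"}
    return decision
```

With this in front, the safety gates only ever see a dict that has a `playbook` key.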
Never skip the safety layer. Approval is tiered by risk:

```python
import asyncio

async def execute_with_safety(decision: dict, alert: dict):
    playbook = PLAYBOOK_CATALOG.get(decision["playbook"])
    if not playbook:
        await escalate_to_human(alert, "Unknown playbook selected")
        return

    risk = playbook["risk"]

    if risk == "low":
        # Auto-execute, notify after
        result = await run_ansible(playbook, decision["params"])
        await notify_team(f"Auto-remediated: {decision['reasoning']}")
    elif risk == "medium":
        # Execute with a 5-minute delay (cancel window)
        msg = await notify_team(
            f"⚠️ Will execute in 5 min: {playbook['description']}\n"
            f"Reason: {decision['reasoning']}\n"
            f"React with ❌ to cancel."
        )
        await asyncio.sleep(300)
        if not await was_cancelled(msg):
            result = await run_ansible(playbook, decision["params"])
    elif risk == "high":
        # Always require human approval
        await escalate_to_human(alert, decision["reasoning"])
```

Step 4: Ansible Execution
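Execution deserves one more boundary check: pass Ansible only the extravars the chosen playbook declares in its `params` list, so a confused model cannot inject arbitrary variables. A hypothetical filter (`filter_params` is my addition, not part of the pipeline above):

```python
def filter_params(params: dict, allowed: list[str]) -> dict:
    """Keep only the extravars the playbook catalog declares."""
    dropped = sorted(set(params) - set(allowed))
    if dropped:
        # In production, log or escalate here instead of printing
        print(f"Dropping undeclared params: {dropped}")
    return {k: v for k, v in params.items() if k in allowed}
```

Call it at the handoff: `run_ansible(playbook, filter_params(decision["params"], playbook["params"]))`.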
The runner itself is a thin wrapper around the `ansible-runner` library:

```python
import ansible_runner

async def run_ansible(playbook: dict, params: dict) -> dict:
    # ansible_runner.run() is synchronous; in a busy agent, push it
    # onto a thread executor so the event loop keeps serving webhooks
    result = ansible_runner.run(
        playbook=playbook["playbook"],
        extravars=params,
        timeout=300,
    )
    return {
        "status": result.status,
        "rc": result.rc,
        "stats": result.stats,
    }
```

Example playbook for disk cleanup:
```yaml
# playbooks/clear-disk.yml
---
- name: Clear disk space
  hosts: "{{ target_hosts }}"
  become: true
  tasks:
    - name: Remove old journal logs
      ansible.builtin.command: journalctl --vacuum-time=7d

    - name: Clean package cache
      ansible.builtin.dnf:
        autoremove: true

    - name: Find temp files older than 7 days
      ansible.builtin.find:
        paths: /tmp
        age: 7d
        recurse: true
      register: old_tmp

    - name: Delete old temp files
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_tmp.files }}"

    - name: Re-gather facts so mount sizes reflect the cleanup
      ansible.builtin.setup:

    - name: Verify free space
      ansible.builtin.assert:
        that:
          - >-
            ansible_facts.mounts | selectattr('mount', 'equalto', '/')
            | map(attribute='size_available') | first
            > (min_free_gb | int * 1073741824)
        fail_msg: "Disk space still below {{ min_free_gb }}GB after cleanup"
```

Step 5: Verification Loop
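The loop's `check_alert_resolved` call needs a way to ask Alertmanager whether the alert is still active. Alertmanager's `GET /api/v2/alerts` returns the currently firing alerts, each carrying a `fingerprint`; the matching itself is trivial, so here is just that piece, with the HTTP fetch left out so the logic stands alone:

```python
def alert_still_firing(active_alerts: list[dict], fingerprint: str) -> bool:
    """Given the JSON list from Alertmanager's GET /api/v2/alerts,
    report whether our alert's fingerprint is still among them."""
    return any(a.get("fingerprint") == fingerprint for a in active_alerts)
```

`check_alert_resolved` is then just `not alert_still_firing(fetched, fingerprint)`, with the fetch done by your HTTP client of choice.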
After remediation, verify the fix worked:
```python
async def verify_remediation(alert: dict, action_result: dict):
    # Wait for metrics to stabilize
    await asyncio.sleep(60)

    # Check if the alert resolved
    resolved = await check_alert_resolved(alert["fingerprint"])

    if resolved:
        await log_success(alert, action_result)
    else:
        # Try once more or escalate
        await escalate_to_human(
            alert,
            f"Auto-remediation attempted but alert persists. "
            f"Action taken: {action_result}"
        )
```

Results I've Seen
From a recent deployment:
- Mean Time to Remediation dropped from 23 minutes to 2 minutes for low-risk issues
- 70% of disk space alerts auto-resolved without human intervention
- Service restart alerts handled in under 30 seconds
- False positive rate for LLM decisions: 3% (all caught by safety gates)
Key Takeaways
- Start with low-risk playbooks only: disk cleanup, service restarts, log rotation
- The LLM selects, Ansible executes: never let the LLM generate arbitrary commands
- Safety gates are mandatory: use risk-based approval tiers
- Always verify: check that the remediation actually worked
- Log everything: every decision, every action, every outcome
The best self-healing systems augment your team, not replace them. Start small, build trust, expand gradually.
Want to build self-healing infrastructure for your team? I help organizations design AI-powered automation systems. Let's connect.
