Self-Healing Infrastructure
The promise of self-healing infrastructure has been around for years, but LLMs have finally made it practical. Instead of writing rigid if-then rules for every possible failure, you can build an agent that understands context and selects the right remediation.
I've built several of these systems. Here's the practical blueprint.
Architecture Overview
```
Monitoring Stack (Prometheus/Alertmanager)
        ↓ webhook
AI Decision Engine (LLM)
        ↓ selects
Ansible Playbook Library
        ↓ executes
Infrastructure (via Ansible)
        ↓ verifies
Health Check → Success / Escalate
```

Step 1: Alert Ingestion
Configure Alertmanager to send webhooks to your agent:
```yaml
# alertmanager.yml
receivers:
  - name: ai-agent
    webhook_configs:
      - url: 'http://healing-agent.internal:8080/alert'
        send_resolved: true
        max_alerts: 5
```

Your agent receives structured alert data:
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AlertPayload(BaseModel):
    alerts: list[dict]

@app.post("/alert")
async def handle_alert(payload: AlertPayload):
    for alert in payload.alerts:
        if alert["status"] == "firing":
            await process_alert(alert)
```

Step 2: LLM Decision Engine
The LLM analyzes the alert and selects from your playbook library:
```python
import json

PLAYBOOK_CATALOG = {
    "restart_service": {
        "description": "Restart a systemd service",
        "params": ["service_name", "target_hosts"],
        "risk": "low",
        "playbook": "playbooks/restart-service.yml",
    },
    "clear_disk_space": {
        "description": "Clean temp files and old logs to free disk space",
        "params": ["target_hosts", "min_free_gb"],
        "risk": "low",
        "playbook": "playbooks/clear-disk.yml",
    },
    "scale_pods": {
        "description": "Scale Kubernetes deployment replicas",
        "params": ["namespace", "deployment", "replicas"],
        "risk": "medium",
        "playbook": "playbooks/scale-deployment.yml",
    },
    "failover_database": {
        "description": "Promote database replica to primary",
        "params": ["cluster_name"],
        "risk": "high",
        "playbook": "playbooks/db-failover.yml",
    },
}

async def decide_action(alert: dict) -> dict:
    prompt = f"""You are an infrastructure reliability agent.
Analyze this alert and select the best remediation action.

Alert: {json.dumps(alert, indent=2)}

Available playbooks: {json.dumps(PLAYBOOK_CATALOG, indent=2)}

Respond with JSON: {{"playbook": "name", "params": {{}}, "reasoning": "..."}}
If no automated action is appropriate, respond with {{"playbook": "escalate", "reasoning": "..."}}
"""
    # `llm` is whatever async client you use; temperature 0 keeps decisions deterministic
    response = await llm.generate(prompt, temperature=0)
    return json.loads(response)
```

Step 3: Safety Gates
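One hardening note before the gates: `decide_action` ends with a bare `json.loads`, which raises if the model wraps its reply in a markdown fence or returns garbage. A defensive parser is cheap insurance; this is a sketch, and the escalate fallback simply mirrors the prompt's contract:

```python
import json

def parse_decision(raw: str) -> dict:
    """Parse the model's reply, tolerating a ```json fence, and fall
    back to escalation instead of raising on malformed output."""
    text = raw.strip()
    if text.startswith("```"):
        # Strip a ```json ... ``` wrapper if the model added one
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        decision = json.loads(text)
    except json.JSONDecodeError:
        decision = None
    if not isinstance(decision, dict) or "playbook" not in decision:
        return {"playbook": "escalate", "reasoning": "unparseable model output"}
    return decision
```

With this in front, the safety gates only ever see a dict that has a `playbook` key.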
Never skip the safety layer. Approval is tiered by risk:

```python
import asyncio

async def execute_with_safety(decision: dict, alert: dict):
    playbook = PLAYBOOK_CATALOG.get(decision["playbook"])
    if not playbook:
        await escalate_to_human(alert, "Unknown playbook selected")
        return

    risk = playbook["risk"]

    if risk == "low":
        # Auto-execute, notify after
        result = await run_ansible(playbook, decision["params"])
        await notify_team(f"Auto-remediated: {decision['reasoning']}")
    elif risk == "medium":
        # Execute with a 5-minute delay (cancel window)
        msg = await notify_team(
            f"⚠️ Will execute in 5 min: {playbook['description']}\n"
            f"Reason: {decision['reasoning']}\n"
            f"React with ❌ to cancel."
        )
        await asyncio.sleep(300)
        if not await was_cancelled(msg):
            result = await run_ansible(playbook, decision["params"])
    elif risk == "high":
        # Always require human approval
        await escalate_to_human(alert, decision["reasoning"])
```

Step 4: Ansible Execution
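Execution deserves one more boundary check: pass Ansible only the extravars the chosen playbook declares in its `params` list, so a confused model cannot inject arbitrary variables. A hypothetical filter (`filter_params` is my addition, not part of the pipeline above):

```python
def filter_params(params: dict, allowed: list[str]) -> dict:
    """Keep only the extravars the playbook catalog declares."""
    dropped = sorted(set(params) - set(allowed))
    if dropped:
        # In production, log or escalate here instead of printing
        print(f"Dropping undeclared params: {dropped}")
    return {k: v for k, v in params.items() if k in allowed}
```

Call it at the handoff: `run_ansible(playbook, filter_params(decision["params"], playbook["params"]))`.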
The runner itself is a thin wrapper around the `ansible-runner` library:

```python
import ansible_runner

async def run_ansible(playbook: dict, params: dict) -> dict:
    # ansible_runner.run() is synchronous; in a busy agent, push it
    # onto a thread executor so the event loop keeps serving webhooks
    result = ansible_runner.run(
        playbook=playbook["playbook"],
        extravars=params,
        timeout=300,
    )
    return {
        "status": result.status,
        "rc": result.rc,
        "stats": result.stats,
    }
```

Example playbook for disk cleanup:
```yaml
# playbooks/clear-disk.yml
---
- name: Clear disk space
  hosts: "{{ target_hosts }}"
  become: true
  tasks:
    - name: Remove old journal logs
      ansible.builtin.command: journalctl --vacuum-time=7d

    - name: Clean package cache
      ansible.builtin.dnf:
        autoremove: true

    - name: Find temp files older than 7 days
      ansible.builtin.find:
        paths: /tmp
        age: 7d
        recurse: true
      register: old_tmp

    - name: Delete old temp files
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_tmp.files }}"

    - name: Re-gather facts so mount sizes reflect the cleanup
      ansible.builtin.setup:

    - name: Verify free space
      ansible.builtin.assert:
        that:
          - >-
            ansible_facts.mounts | selectattr('mount', 'equalto', '/')
            | map(attribute='size_available') | first
            > (min_free_gb | int * 1073741824)
        fail_msg: "Disk space still below {{ min_free_gb }}GB after cleanup"
```

Step 5: Verification Loop
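The loop's `check_alert_resolved` call needs a way to ask Alertmanager whether the alert is still active. Alertmanager's `GET /api/v2/alerts` returns the currently firing alerts, each carrying a `fingerprint`; the matching itself is trivial, so here is just that piece, with the HTTP fetch left out so the logic stands alone:

```python
def alert_still_firing(active_alerts: list[dict], fingerprint: str) -> bool:
    """Given the JSON list from Alertmanager's GET /api/v2/alerts,
    report whether our alert's fingerprint is still among them."""
    return any(a.get("fingerprint") == fingerprint for a in active_alerts)
```

`check_alert_resolved` is then just `not alert_still_firing(fetched, fingerprint)`, with the fetch done by your HTTP client of choice.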
After remediation, verify the fix worked:
```python
async def verify_remediation(alert: dict, action_result: dict):
    # Wait for metrics to stabilize
    await asyncio.sleep(60)

    # Check if the alert resolved
    resolved = await check_alert_resolved(alert["fingerprint"])

    if resolved:
        await log_success(alert, action_result)
    else:
        # Try once more or escalate
        await escalate_to_human(
            alert,
            f"Auto-remediation attempted but alert persists. "
            f"Action taken: {action_result}"
        )
```

Results I've Seen
From a recent deployment:
- Mean Time to Remediation dropped from 23 minutes to 2 minutes for low-risk issues
- 70% of disk space alerts auto-resolved without human intervention
- Service restart alerts handled in under 30 seconds
- False positive rate for LLM decisions: 3% (all caught by safety gates)
Key Takeaways
- Start with low-risk playbooks only: disk cleanup, service restarts, log rotation
- The LLM selects, Ansible executes: never let the LLM generate arbitrary commands
- Safety gates are mandatory: use risk-based approval tiers
- Always verify: check that the remediation actually worked
- Log everything: every decision, every action, every outcome
The best self-healing systems augment your team, not replace them. Start small, build trust, expand gradually.
Want to build self-healing infrastructure for your team? I help organizations design AI-powered automation systems. Let's connect.
