Automation

How to Build a Self-Healing Infrastructure Agent with Ansible and LLMs

Luca Berton • 1 min read
#ai #ansible #llm #automation #infrastructure

🔧 Self-Healing Infrastructure

The promise of self-healing infrastructure has been around for years, but LLMs have finally made it practical. Instead of writing rigid if-then rules for every possible failure, you can build an agent that understands context and selects the right remediation.

I’ve built several of these systems. Here’s the practical blueprint.

Architecture Overview

Monitoring Stack (Prometheus/Alertmanager)
         ↓ webhook
   AI Decision Engine (LLM)
         ↓ selects
   Ansible Playbook Library
         ↓ executes
   Infrastructure (via Ansible)
         ↓ verifies
   Health Check → Success/Escalate

Step 1: Alert Ingestion

Configure Alertmanager to send webhooks to your agent:

# alertmanager.yml
receivers:
  - name: ai-agent
    webhook_configs:
      - url: 'http://healing-agent.internal:8080/alert'
        send_resolved: true
        max_alerts: 5

Your agent receives structured alert data:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AlertPayload(BaseModel):
    alerts: list[dict]

@app.post("/alert")
async def handle_alert(payload: AlertPayload):
    for alert in payload.alerts:
        if alert["status"] == "firing":
            await process_alert(alert)
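
Passing the whole webhook body to the model is rarely necessary. As a minimal sketch, the helper below trims one alert down to the fields a remediation decision is likely to need. The field names (`status`, `labels`, `annotations`, `startsAt`, `fingerprint`) follow Alertmanager's webhook format; the `summarize_alert` helper and the sample values are hypothetical and not part of the original agent.

```python
from typing import Any


def summarize_alert(alert: dict[str, Any]) -> dict[str, Any]:
    """Keep only the alert fields a remediation decision is likely to need."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return {
        "name": labels.get("alertname", "unknown"),
        "severity": labels.get("severity", "unknown"),
        "instance": labels.get("instance"),
        "summary": annotations.get("summary", ""),
        "started_at": alert.get("startsAt"),
        "fingerprint": alert.get("fingerprint"),
    }


# Hypothetical alert, shaped like one entry of the webhook's "alerts" list
sample_alert = {
    "status": "firing",
    "labels": {"alertname": "DiskSpaceLow", "severity": "warning", "instance": "web01:9100"},
    "annotations": {"summary": "Root filesystem above 90% usage"},
    "startsAt": "2026-01-15T08:30:00Z",
    "fingerprint": "a1b2c3d4e5f6",
}

summary = summarize_alert(sample_alert)
```

A smaller, predictable structure also makes the prompt in the next step easier to review when a decision looks wrong.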

Step 2: LLM Decision Engine

The LLM analyzes the alert and selects from your playbook library:

import json

PLAYBOOK_CATALOG = {
    "restart_service": {
        "description": "Restart a systemd service",
        "params": ["service_name", "target_hosts"],
        "risk": "low",
        "playbook": "playbooks/restart-service.yml",
    },
    "clear_disk_space": {
        "description": "Clean temp files and old logs to free disk space",
        "params": ["target_hosts", "min_free_gb"],
        "risk": "low",
        "playbook": "playbooks/clear-disk.yml",
    },
    "scale_pods": {
        "description": "Scale Kubernetes deployment replicas",
        "params": ["namespace", "deployment", "replicas"],
        "risk": "medium",
        "playbook": "playbooks/scale-deployment.yml",
    },
    "failover_database": {
        "description": "Promote database replica to primary",
        "params": ["cluster_name"],
        "risk": "high",
        "playbook": "playbooks/db-failover.yml",
    },
}

async def decide_action(alert: dict) -> dict:
    prompt = f"""You are an infrastructure reliability agent. 
    
Analyze this alert and select the best remediation action.

Alert: {json.dumps(alert, indent=2)}

Available playbooks: {json.dumps(PLAYBOOK_CATALOG, indent=2)}

Respond with JSON: {{"playbook": "name", "params": {{}}, "reasoning": "..."}}
If no automated action is appropriate, respond with {{"playbook": "escalate", "reasoning": "..."}}
"""
    
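    # `llm` stands for whatever async client wraps your model; it is not defined here
    response = await llm.generate(prompt, temperature=0)
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        # Malformed output means no safe action can be chosen: hand off to a human
        return {"playbook": "escalate", "reasoning": "LLM returned invalid JSON"}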
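
Valid JSON is not the same as a valid decision. Before anything reaches the safety gates, it helps to check that the model picked a playbook that exists and supplied exactly the parameters the catalog lists. The sketch below is one way to do that; `validate_decision` is a hypothetical helper, and `CATALOG` is a trimmed stand-in for `PLAYBOOK_CATALOG` so the example runs on its own.

```python
from typing import Any

# Trimmed stand-in for PLAYBOOK_CATALOG; only the fields validation needs
CATALOG = {
    "restart_service": {"params": ["service_name", "target_hosts"]},
    "clear_disk_space": {"params": ["target_hosts", "min_free_gb"]},
}


def validate_decision(decision: Any, catalog: dict) -> tuple[bool, str]:
    """Return (ok, reason). Anything unexpected is rejected, never repaired."""
    if not isinstance(decision, dict):
        return False, "decision is not a JSON object"
    name = decision.get("playbook")
    if not isinstance(name, str):
        return False, "playbook must be a string"
    if name == "escalate":
        return True, "model asked for escalation"
    if name not in catalog:
        return False, f"unknown playbook: {name!r}"
    params = decision.get("params")
    if not isinstance(params, dict):
        return False, "params must be an object"
    expected = set(catalog[name]["params"])
    if set(params) != expected:
        return False, f"expected params {sorted(expected)}, got {sorted(params)}"
    # Only plain scalars may reach Ansible as extra vars
    for key, value in params.items():
        if isinstance(value, bool) or not isinstance(value, (str, int)):
            return False, f"param {key!r} must be a string or integer"
    return True, "ok"


good = validate_decision(
    {"playbook": "restart_service",
     "params": {"service_name": "nginx", "target_hosts": "web01"},
     "reasoning": "service is down"},
    CATALOG,
)
missing_param = validate_decision(
    {"playbook": "restart_service", "params": {"service_name": "nginx"}}, CATALOG
)
unknown = validate_decision({"playbook": "rm_rf", "params": {}}, CATALOG)
```

Rejecting rather than repairing keeps the rule simple: if the decision does not match the catalog exactly, a human looks at it.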

Step 3: Safety Gates

Never skip the safety layer. Each playbook's risk level decides how much human approval it needs before it runs:

import asyncio

async def execute_with_safety(decision: dict, alert: dict):
    playbook = PLAYBOOK_CATALOG.get(decision["playbook"])
    
    if not playbook:
        await escalate_to_human(alert, "Unknown playbook selected")
        return
    
    risk = playbook["risk"]
    
    if risk == "low":
        # Auto-execute, notify after
        result = await run_ansible(playbook, decision["params"])
        await notify_team(f"Auto-remediated: {decision['reasoning']}")
        
    elif risk == "medium":
        # Execute with 5-minute delay (cancel window)
        msg = await notify_team(
            f"⚠️ Will execute in 5 min: {playbook['description']}\n"
            f"Reason: {decision['reasoning']}\n"
            f"React with ❌ to cancel."
        )
        await asyncio.sleep(300)
        if not await was_cancelled(msg):
            result = await run_ansible(playbook, decision["params"])
            
    elif risk == "high":
        # Always require human approval
        await escalate_to_human(alert, decision["reasoning"])
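
The medium-risk branch sleeps for the full five minutes and only then checks for a cancellation. An alternative, sketched here, is to wait on an `asyncio.Event` with a timeout so a cancel ends the wait immediately. This swaps the post's `was_cancelled` poll for an event; `wait_for_cancel` is a hypothetical helper, and how your chat integration sets the event when someone reacts is an assumption left out of the sketch.

```python
import asyncio


async def wait_for_cancel(cancel_event: asyncio.Event, window_seconds: float) -> bool:
    """Return True if cancelled within the window, False if the window expired."""
    try:
        await asyncio.wait_for(cancel_event.wait(), timeout=window_seconds)
        return True
    except asyncio.TimeoutError:
        return False


async def demo() -> tuple[bool, bool]:
    # Case 1: nobody reacts, so the window expires and execution may proceed
    untouched = asyncio.Event()
    expired = await wait_for_cancel(untouched, window_seconds=0.05)

    # Case 2: a simulated chat handler sets the event partway through the window
    cancel_event = asyncio.Event()
    asyncio.get_running_loop().call_later(0.01, cancel_event.set)
    cancelled = await wait_for_cancel(cancel_event, window_seconds=5)
    return expired, cancelled


results = asyncio.run(demo())
```

In the agent, the medium-risk branch would run the playbook only when `wait_for_cancel(...)` returns `False`.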

Step 4: Ansible Execution

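import asyncio

import ansible_runner

async def run_ansible(playbook: dict, params: dict) -> dict:
    # ansible_runner.run() blocks until the play finishes, so run it in a
    # worker thread to keep the event loop free for other alerts
    result = await asyncio.to_thread(
        ansible_runner.run,
        playbook=playbook["playbook"],
        extravars=params,
        timeout=300,
    )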
    
    return {
        "status": result.status,
        "rc": result.rc,
        "stats": result.stats,
    }
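
The dictionary returned by `run_ansible` still has to be interpreted before the agent reports success. A minimal sketch, assuming the `status` string is `"successful"` for a clean run and that `stats` carries per-host counts under `failures` and `dark` (unreachable hosts), which is how I understand ansible-runner's summary; check both against the version you run. `remediation_succeeded` and the sample results are hypothetical.

```python
def remediation_succeeded(result: dict) -> bool:
    """True only when the run finished cleanly and no host failed or was unreachable."""
    if result.get("status") != "successful" or result.get("rc") != 0:
        return False
    stats = result.get("stats") or {}
    # Either mapping being non-empty means at least one host did not complete
    return not stats.get("failures") and not stats.get("dark")


# Hypothetical results in the shape run_ansible returns
clean_run = {"status": "successful", "rc": 0,
             "stats": {"ok": {"web01": 5}, "failures": {}, "dark": {}}}
failed_run = {"status": "failed", "rc": 2,
              "stats": {"ok": {"web01": 2}, "failures": {"web01": 1}, "dark": {}}}
timed_out = {"status": "timeout", "rc": 254, "stats": None}

outcomes = [remediation_succeeded(r) for r in (clean_run, failed_run, timed_out)]
```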

Example playbook for disk cleanup:

# playbooks/clear-disk.yml
---
- name: Clear disk space
  hosts: "{{ target_hosts }}"
  become: true
  tasks:
    - name: Remove old journal logs
      command: journalctl --vacuum-time=7d
      
    - name: Remove unneeded dependency packages
      ansible.builtin.dnf:
        autoremove: true
        
    - name: Remove temp files older than 7 days
      find:
        paths: /tmp
        age: 7d
        recurse: true
      register: old_tmp
      
    - name: Delete old temp files
      file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_tmp.files }}"
      
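    - name: Re-gather mount facts so the check sees post-cleanup usage
      ansible.builtin.setup:
        filter: ansible_mounts

    - name: Verify free space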
      assert:
        that:
          - ansible_facts.mounts | selectattr('mount', 'equalto', '/') 
            | map(attribute='size_available') | first > (min_free_gb | int * 1073741824)
        fail_msg: "Disk space still below {{ min_free_gb }}GB after cleanup"

Step 5: Verification Loop

After remediation, verify the fix worked:

async def verify_remediation(alert: dict, action_result: dict):
    # Wait for metrics to stabilize
    await asyncio.sleep(60)
    
    # Check if the alert resolved
    resolved = await check_alert_resolved(alert["fingerprint"])
    
    if resolved:
        await log_success(alert, action_result)
    else:
        # Alert still firing: escalate with the details of what was tried
        await escalate_to_human(
            alert, 
            f"Auto-remediation attempted but alert persists. "
            f"Action taken: {action_result}"
        )
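
Step 1 calls `process_alert`, which is never shown. One way it might chain Steps 2 to 5 is sketched below. The steps are passed in as arguments so the example runs with stubs; in the agent they would be `decide_action`, `execute_with_safety`, `verify_remediation` and `escalate_to_human`. The sketch assumes `execute_with_safety` is changed to return the run result, or `None` when nothing ran, which the version above does not do.

```python
import asyncio
from typing import Any, Awaitable, Callable

Step = Callable[..., Awaitable[Any]]


async def process_alert(alert: dict, decide: Step, execute: Step,
                        verify: Step, escalate: Step) -> str:
    """Run decide -> execute -> verify for one alert; escalate on any error."""
    try:
        decision = await decide(alert)
        result = await execute(decision, alert)
        if result is None:
            # Nothing ran (escalated, cancelled, or high risk), so nothing to verify
            return "no_action"
        await verify(alert, result)
        return "attempted"
    except Exception as exc:
        # A crash in the agent must never leave an alert silently unhandled
        await escalate(alert, f"Agent error: {exc!r}")
        return "escalated"


async def _demo() -> tuple[str, list[str], str]:
    calls: list[str] = []

    async def decide(alert):
        calls.append("decide")
        return {"playbook": "restart_service", "params": {}, "reasoning": "stub"}

    async def execute(decision, alert):
        calls.append("execute")
        return {"status": "successful", "rc": 0, "stats": {}}

    async def verify(alert, result):
        calls.append("verify")

    async def escalate(alert, reason):
        calls.append("escalate")

    async def broken_decide(alert):
        raise ValueError("model unavailable")

    alert = {"fingerprint": "a1b2c3"}
    ok = await process_alert(alert, decide, execute, verify, escalate)
    ok_calls = list(calls)
    failed = await process_alert(alert, broken_decide, execute, verify, escalate)
    return ok, ok_calls, failed


outcome, call_order, failure_outcome = asyncio.run(_demo())
```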

📊 Results I’ve Seen

From a recent deployment:

  • Mean Time to Remediation dropped from 23 minutes to 2 minutes for low-risk issues
  • 70% of disk space alerts auto-resolved without human intervention
  • Service restart alerts handled in under 30 seconds
  • False positive rate for LLM decisions: 3% (all caught by safety gates)

Key Takeaways

  1. Start with low-risk playbooks only: disk cleanup, service restarts, log rotation
  2. The LLM selects, Ansible executes: never let the LLM generate arbitrary commands
  3. Safety gates are mandatory: risk-based approval tiers
  4. Always verify: check that the remediation actually worked
  5. Log everything: every decision, every action, every outcome
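
For the last takeaway, one simple approach is an append-only log with one JSON object per line, keyed by the alert fingerprint so a decision, its action and its outcome can be joined later. This is a sketch: `log_event`, the record fields and the sample values are all hypothetical, and an in-memory buffer stands in for a real log file.

```python
import io
import json
from datetime import datetime, timezone
from typing import TextIO


def log_event(stream: TextIO, kind: str, fingerprint: str, **details) -> dict:
    """Append one audit record as a single JSON line and return it."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": kind,
        "fingerprint": fingerprint,
        **details,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record


# In-memory buffer for the example; the agent would open a file in append mode
audit = io.StringIO()
log_event(audit, "decision", "a1b2c3", playbook="clear_disk_space",
          reasoning="root filesystem above 90%")
log_event(audit, "action", "a1b2c3", status="successful", rc=0)
log_event(audit, "outcome", "a1b2c3", resolved=True)

records = [json.loads(line) for line in audit.getvalue().splitlines()]
```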

The best self-healing systems augment your team rather than replace it. Start small, build trust, expand gradually.


Want to build self-healing infrastructure for your team? I help organizations design AI-powered automation systems. Let’s connect.


Luca Berton

AI & Cloud Advisor with 18+ years experience. Author of 8 technical books, creator of Ansible Pilot, and instructor at CopyPasteLearn Academy. Speaker at KubeCon EU & Red Hat Summit 2026.
