
Event-Driven Ansible for Automated Incident Response

Luca Berton • 1 min read
#ansible #eda #event-driven #incident-response #automation #sre

The 3 AM Alert Problem

PagerDuty fires. Disk usage on prod-db-01 hit 90%. An SRE wakes up, SSHs in, cleans up old logs, and goes back to sleep. The exact same alert, the exact same fix, for the third time this month.

Event-Driven Ansible (EDA) eliminates this. The event fires, EDA catches it, runs the remediation playbook, and the SRE sleeps through the night.

How EDA Works

Event Source → EDA Controller → Rule Match → Action (Playbook)

Examples:
  Prometheus alert → EDA → disk_cleanup.yml
  Kafka message    → EDA → scale_deployment.yml
  Webhook          → EDA → security_response.yml
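
To make the flow concrete, here is a toy model of the rule-match step in plain Python. It is only a sketch of the idea, not how ansible-rulebook works internally: the Rule class, the match_rules function, and the flat event dictionaries are made up for illustration. The playbook names come from the examples above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, str]


@dataclass
class Rule:
    """Hypothetical rule: a name, a condition on the event, and a playbook to run."""
    name: str
    condition: Callable[[Event], bool]
    playbook: str


# Illustrative rules mirroring the examples above (event keys are assumptions).
RULES = [
    Rule("Disk alert", lambda e: e.get("alertname") == "DiskSpaceCritical", "disk_cleanup.yml"),
    Rule("Scale request", lambda e: e.get("topic") == "scale", "scale_deployment.yml"),
    Rule("Security webhook", lambda e: e.get("kind") == "security", "security_response.yml"),
]


def match_rules(event: Event, rules: List[Rule] = RULES) -> List[str]:
    """Return the playbook of every rule whose condition holds for this event."""
    return [rule.playbook for rule in rules if rule.condition(event)]


if __name__ == "__main__":
    print(match_rules({"alertname": "DiskSpaceCritical"}))
```

An event that matches no rule yields an empty list, which is the "nothing to automate, let a human look" case.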

Setting Up EDA

Install

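# ansible-rulebook also needs a Java runtime (JDK 17 or newer) for its rule engine
pip install ansible-rulebook ansible-runner

# The alertmanager source plugin used below ships in the ansible.eda collection
ansible-galaxy collection install ansible.eda

# Or via container
podman run -it --rm \
  -v ./rulebooks:/rulebooks \
  -v ./playbooks:/playbooks \
  quay.io/ansible/ansible-rulebook:latest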

Your First Rulebook

# rulebooks/disk-cleanup.yml
---
- name: Auto-remediate disk space alerts
  hosts: all
  sources:
    # Listens for Alertmanager webhooks; each alert in the payload
    # arrives as its own event under event.alert
    - ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 5000

  rules:
    - name: Disk space critical
      # Match firing alerts only, so the "resolved" notification
      # does not trigger a second cleanup
      condition: event.alert.labels.alertname == "DiskSpaceCritical" and event.alert.status == "firing"
      action:
        run_playbook:
          name: playbooks/disk-cleanup.yml
          extra_vars:
            # instance must resolve to an inventory hostname
            target_host: "{{ event.alert.labels.instance }}"
            mount_point: "{{ event.alert.labels.mountpoint }}"

    - name: Pod CrashLoopBackOff
      condition: event.alert.labels.alertname == "PodCrashLooping" and event.alert.status == "firing"
      action:
        run_playbook:
          name: playbooks/pod-restart.yml
          extra_vars:
            namespace: "{{ event.alert.labels.namespace }}"
            pod: "{{ event.alert.labels.pod }}"
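
Before pointing a real Alertmanager at the rulebook, it helps to post a hand-built alert at the listener. The sketch below builds a payload in the shape of Alertmanager's webhook format and sends it with the standard library. The helper names (build_test_payload, post_alert), the example URLs, and the label values are assumptions for illustration; adjust them to your setup.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_payload(alertname, instance, mountpoint, status="firing"):
    """Build a minimal Alertmanager-style webhook payload carrying one alert."""
    labels = {"alertname": alertname, "instance": instance, "mountpoint": mountpoint}
    return {
        "version": "4",
        "status": status,
        "receiver": "eda",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": {},
        "externalURL": "http://alertmanager.example:9093",  # placeholder
        "alerts": [
            {
                "status": status,
                "labels": labels,
                "annotations": {},
                "startsAt": datetime.now(timezone.utc).isoformat(),
                "endsAt": "0001-01-01T00:00:00Z",
                "generatorURL": "",
            }
        ],
    }


def post_alert(url, payload, timeout=5.0):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With the rulebook running (for example via ansible-rulebook --rulebook rulebooks/disk-cleanup.yml -i inventory.yml --verbose, where inventory.yml is your own inventory), calling post_alert("http://localhost:5000/endpoint", build_test_payload("DiskSpaceCritical", "prod-db-01", "/var")) should trigger the disk cleanup rule.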

The Remediation Playbook

# playbooks/disk-cleanup.yml
---
- name: Clean up disk space
  hosts: "{{ target_host }}"
  become: true
  tasks:
    - name: Get current disk usage
      command: df -h {{ mount_point }}
      register: disk_before
      changed_when: false  # read-only check

    - name: Clean old logs (>7 days)
      find:
        paths: /var/log
        age: 7d
        recurse: yes
        file_type: file
      register: old_logs

    - name: Remove old logs
      file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_logs.files }}"
      when: old_logs.files | length > 0

    - name: Clean package cache (Debian/Ubuntu hosts only)
      apt:
        autoclean: yes
        autoremove: yes
      when: ansible_facts.pkg_mgr == "apt"

    - name: Clean old journal logs
      command: journalctl --vacuum-time=3d

    - name: Get disk usage after cleanup
      command: df -h {{ mount_point }}
      register: disk_after
      changed_when: false  # read-only check

    # slack_webhook must be supplied, e.g. from a vault or extra vars
    - name: Notify Slack
      uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: |
            ✅ Auto-remediated disk space on {{ target_host }}
            Before: {{ disk_before.stdout_lines[1] }}
            After: {{ disk_after.stdout_lines[1] }}
            Cleaned: {{ old_logs.files | length }} old log files

Common Auto-Remediation Patterns

SSL Certificate Renewal

- name: Certificate expiring
  condition: event.alert.labels.alertname == "SSLCertExpiring"
  action:
    run_playbook:
      name: playbooks/renew-cert.yml

Kubernetes Node Not Ready

- name: Node NotReady
  condition: event.alert.labels.alertname == "KubeNodeNotReady"
  action:
    run_playbook:
      name: playbooks/node-recovery.yml

Security: Suspicious Login

- name: Brute force detected
  condition: event.alert.labels.alertname == "SSHBruteForce"
  action:
    run_playbook:
      name: playbooks/block-ip.yml
      extra_vars:
        source_ip: "{{ event.alert.labels.source_ip }}"
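
An IP address taken from an alert label is external input, so it is worth validating before anything gets blocked. Here is a minimal sketch of such a check using Python's ipaddress module. The function name is_blockable and the allowlist networks are assumptions for illustration; a real allowlist would hold your own office, VPN, and monitoring ranges.

```python
import ipaddress

# Hypothetical networks that must never be blocked (replace with your own).
DEFAULT_ALLOWLIST = ("10.0.0.0/8", "192.168.0.0/16")


def is_blockable(source_ip, allowlist=DEFAULT_ALLOWLIST):
    """Return True only for a well-formed address outside the allowlist."""
    try:
        address = ipaddress.ip_address(source_ip)
    except ValueError:
        # Not an IP address at all: refuse rather than pass junk to a firewall.
        return False
    if address.is_loopback or address.is_link_local or address.is_unspecified:
        return False
    return not any(address in ipaddress.ip_network(net) for net in allowlist)


if __name__ == "__main__":
    for candidate in ("203.0.113.7", "10.1.2.3", "not-an-ip"):
        print(candidate, is_blockable(candidate))
```

The same check can live inside block-ip.yml as an assert task; the point is that a malformed or internal address ends the run instead of reaching the firewall.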

Safety Guardrails

Auto-remediation without guardrails is dangerous. Always include:

# In every remediation playbook (needs gather_facts for ansible_date_time)
- name: Safety checks
  block:
    - name: Check this host isn't in maintenance
      uri:
        url: "{{ cmdb_api }}/hosts/{{ inventory_hostname }}/status"
      register: host_status
      failed_when: host_status.json.status == "maintenance"

    - name: Check remediation hasn't run in last hour
      stat:
        path: "/tmp/eda-lock-{{ ansible_play_name }}"
      register: lock_file

    - name: Skip if recently remediated
      fail:
        msg: "Remediation already ran recently, escalating to human"
      when:
        - lock_file.stat.exists
        - (ansible_date_time.epoch | int) - (lock_file.stat.mtime | int) < 3600

    - name: Create lock file
      file:
        path: "/tmp/eda-lock-{{ ansible_play_name }}"
        state: touch

Integration with Kubernetes Monitoring

EDA pairs perfectly with the Prometheus/Alertmanager stack I detail at Kubernetes Recipes:

# Alertmanager config: route tagged alerts to EDA, everything else to PagerDuty
route:
  receiver: pagerduty
  routes:
    - match:
        auto_remediate: "true"
      receiver: eda

receivers:
  - name: pagerduty
    # your existing pagerduty_configs go here
  - name: eda
    webhook_configs:
      - url: http://eda-controller:5000/endpoint

Only alerts tagged auto_remediate: "true" go to EDA. Everything else goes to PagerDuty as usual. Start small, prove value, expand gradually.
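
The routing decision described here boils down to an exact match on one label. As a sketch in Python (pick_receiver is a made-up name, and this models only the intent, not Alertmanager's routing tree):

```python
def pick_receiver(labels):
    """Send an alert to EDA only when it carries auto_remediate="true"."""
    # Label values are strings, so compare against "true", not the boolean True.
    if labels.get("auto_remediate") == "true":
        return "eda"
    return "pagerduty"


if __name__ == "__main__":
    print(pick_receiver({"alertname": "DiskSpaceCritical", "auto_remediate": "true"}))
    print(pick_receiver({"alertname": "DatabaseDown"}))
```

Because the match is opt-in, a new alert rule stays with humans until someone deliberately adds the label to it.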

For comprehensive Ansible automation patterns, see Ansible Pilot and step-by-step examples at Ansible by Example.

The ROI

Before EDA:
  15 auto-remediable alerts/week × 30 min/alert × $80/hr = $600/week

After EDA:
  15 alerts auto-resolved in seconds = $0/week
  Setup cost: 2 days of engineering

Payback period: about 2 weeks (assuming 8-hour days, 2 days × 8 hr × $80/hr = $1,280 of setup ÷ $600/week saved)
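
The arithmetic is easy to rerun with your own numbers. In the sketch below, the alert volume, handling time, hourly rate, and setup days come from the figures above; the 8-hour engineering day is an assumption not stated in the article. Under that assumption the setup costs $1,280, which a $600/week saving repays in a little over two weeks.

```python
ALERTS_PER_WEEK = 15
MINUTES_PER_ALERT = 30
HOURLY_RATE = 80      # dollars per hour
SETUP_DAYS = 2
HOURS_PER_DAY = 8     # assumption: not stated in the article


def weekly_cost(alerts=ALERTS_PER_WEEK, minutes=MINUTES_PER_ALERT, rate=HOURLY_RATE):
    """Dollars per week spent handling the alerts by hand."""
    return alerts * minutes / 60 * rate


def setup_cost(days=SETUP_DAYS, hours_per_day=HOURS_PER_DAY, rate=HOURLY_RATE):
    """One-off engineering cost of building the automation."""
    return days * hours_per_day * rate


def payback_weeks():
    """Weeks of savings needed to cover the setup cost."""
    return setup_cost() / weekly_cost()


if __name__ == "__main__":
    print(f"weekly cost: ${weekly_cost():.0f}")
    print(f"setup cost: ${setup_cost():.0f}")
    print(f"payback: {payback_weeks():.1f} weeks")
```

A higher loaded rate or more alerts per week shortens the payback; a longer setup lengthens it.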

Let humans handle novel problems. Let EDA handle the repeatable ones.


Luca Berton

AI & Cloud Advisor with 18+ years experience. Author of 8 technical books, creator of Ansible Pilot, and instructor at CopyPasteLearn Academy. Speaker at KubeCon EU & Red Hat Summit 2026.
