Why AI-Powered Alerting
Traditional alerting is noisy. Prometheus fires alerts, Alertmanager routes them, and you get 47 messages at 3 AM because a disk is at 81%. OpenClaw adds intelligence: it analyzes alerts, correlates events, and only wakes you up when it matters.
Architecture
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Prometheus  │─────▶│ Alertmanager │─────▶│   OpenClaw   │
│  (metrics)   │      │  (routing)   │      │  (analysis)  │
└──────────────┘      └──────────────┘      └──────────────┘
        ▲                                          │
┌───────┴──────┐                                   ▼
│   Grafana    │                          Discord/WhatsApp
│ (dashboards) │                           (smart alerts)
└──────────────┘
```
Setting Up the Webhook
Configure Alertmanager to send to OpenClaw:
```yaml
# alertmanager.yml
receivers:
  - name: 'openclaw'
    webhook_configs:
      - url: 'http://openclaw-pi:3000/api/webhook'
        send_resolved: true

route:
  receiver: 'openclaw'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```
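On the receiving side, the webhook body follows Alertmanager's standard payload format (version 4): a JSON object with an `alerts` array, each entry carrying `labels` and `annotations`. A minimal sketch of parsing it, assuming OpenClaw's endpoint does something along these lines (the function name is illustrative, not OpenClaw's actual API):

```python
import json

def parse_alertmanager_payload(body: str) -> list[dict]:
    """Extract the fields an alert processor cares about from an
    Alertmanager webhook POST body (v4 payload format)."""
    payload = json.loads(body)
    alerts = []
    for alert in payload.get("alerts", []):
        alerts.append({
            "name": alert["labels"].get("alertname", "unknown"),
            "instance": alert["labels"].get("instance", ""),
            "severity": alert["labels"].get("severity", "none"),
            "summary": alert["annotations"].get("summary", ""),
            # With send_resolved: true, "resolved" notifications arrive too.
            "status": alert.get("status", "firing"),
        })
    return alerts

# Example body shaped like Alertmanager's webhook format
body = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "HighMemoryUsage",
                   "instance": "node-2:9100", "severity": "warning"},
        "annotations": {"summary": "Memory usage above 90%"},
    }],
})
parsed = parse_alertmanager_payload(body)
print(parsed[0]["name"], parsed[0]["severity"])
```

Because `send_resolved: true` is set above, the handler should expect resolved notifications as well as firing ones.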
Smart Alert Processing
When OpenClaw receives a Prometheus alert, it doesn't just forward it. It:
- Checks severity: critical alerts wake you up, warnings wait for morning
- Correlates: "disk full" + "backup running" = expected, don't alert
- Suggests fixes: "Node memory at 95%. Top process: java (elasticsearch). Consider increasing heap or adding a node."
- Tracks trends: "This is the 3rd time this week node-2 has high CPU. Might need a hardware upgrade."
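The triage steps above can be sketched as a small decision function. This is a simplified illustration of the severity/correlation/trend rules, not OpenClaw's actual logic; the alert names and return values are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str
    instance: str

def triage(alert: Alert, active: list[Alert], history: list[Alert]) -> str:
    """Decide what to do with an incoming alert (sketch)."""
    # Correlate: disk pressure while a backup runs is expected -> suppress.
    active_names = {a.name for a in active}
    if alert.name == "DiskSpaceLow" and "BackupRunning" in active_names:
        return "suppress"
    # Severity: only critical alerts page immediately.
    if alert.severity == "critical":
        return "page-now"
    # Trend: repeated warnings on the same host get flagged for review.
    repeats = sum(1 for a in history
                  if a.name == alert.name and a.instance == alert.instance)
    if repeats >= 2:
        return "flag-trend"
    return "queue-for-morning"

print(triage(Alert("DiskSpaceLow", "warning", "node-1"),
             [Alert("BackupRunning", "info", "node-1")], []))
```

A real implementation would also need to expire history entries and handle resolved alerts, but the shape of the decision is the same.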
Example Alerts
Raw Prometheus alert:

```json
{
  "labels": {
    "alertname": "HighMemoryUsage",
    "instance": "node-2:9100",
    "severity": "warning"
  },
  "annotations": {
    "summary": "Memory usage above 90%",
    "value": "93.2%"
  }
}
```
What OpenClaw sends:

```
⚠️ node-2: Memory at 93%
Top consumers: elasticsearch (4.2GB), prometheus (1.8GB), grafana (600MB)
Trend: Memory has been climbing 2% per day since Monday.
Suggestion: Elasticsearch heap is undersized. Consider ES_JAVA_OPTS="-Xms4g -Xmx4g"
This is a warning; I'll escalate if it hits 97%.
```
Grafana Integration
OpenClaw can also query Grafana dashboards on demand:
"How's the cluster doing?"

```
# OpenClaw queries the Grafana API
GET /api/datasources/proxy/1/api/v1/query?query=up

# Response
All 4 nodes reporting. CPU avg: 23%. Memory avg: 61%.
Network: 2.4 Gbps aggregate throughput. No anomalies.
```
"Show me the CPU graph for the last hour"

```
# OpenClaw fetches a rendered Grafana panel
GET /render/d-solo/abc123/cluster?panelId=2&from=now-1h&to=now&width=800&height=400
```
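Building that render URL is mostly string assembly. A small sketch, using the dashboard UID and panel ID from the example above (the base URL is a placeholder):

```python
from urllib.parse import urlencode

def grafana_render_url(base: str, dashboard_uid: str, slug: str,
                       panel_id: int, time_range: str = "now-1h") -> str:
    """Build a Grafana rendered-panel URL (requires the Grafana
    image renderer to be available on the server)."""
    params = urlencode({"panelId": panel_id, "from": time_range,
                        "to": "now", "width": 800, "height": 400})
    return f"{base}/render/d-solo/{dashboard_uid}/{slug}?{params}"

print(grafana_render_url("http://grafana:3000", "abc123", "cluster", 2))
```

Note that `/render/` endpoints only work when Grafana's image renderer plugin (or the separate renderer service) is installed.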
Auto-Remediation
For known issues, OpenClaw can fix things automatically:
```yaml
# Auto-remediation rules (in an OpenClaw skill)
rules:
  - alert: DiskSpaceLow
    condition: disk_usage > 90%
    action: |
      1. Find and delete files in /tmp older than 7 days
      2. Clear the Docker image cache
      3. Report what was cleaned and the new disk usage
  - alert: PodCrashLooping
    condition: restart_count > 5
    action: |
      1. Collect pod logs (last 50 lines)
      2. Analyze the error pattern
      3. If OOMKilled: suggest a memory limit increase
      4. If CrashLoopBackOff: report logs for human review
```
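The dispatch behind rules like these can be sketched as a lookup from alert name and context to a list of remediation steps. This is an illustration of the rule structure above, not OpenClaw's actual skill API:

```python
def remediate(alert_name: str, context: dict) -> list[str]:
    """Map an alert to remediation steps per the rules above (sketch).
    A real skill would execute commands and report results."""
    if alert_name == "DiskSpaceLow" and context.get("disk_usage", 0) > 90:
        return ["delete /tmp files older than 7 days",
                "clear the Docker image cache",
                "report what was cleaned and the new disk usage"]
    if alert_name == "PodCrashLooping" and context.get("restart_count", 0) > 5:
        steps = ["collect pod logs (last 50 lines)",
                 "analyze the error pattern"]
        if context.get("reason") == "OOMKilled":
            steps.append("suggest a memory limit increase")
        else:
            steps.append("report logs for human review")
        return steps
    return []  # no matching rule: notify only, no auto-remediation

print(remediate("DiskSpaceLow", {"disk_usage": 94}))
```

Keeping a catch-all empty result is deliberate: anything without an explicit rule falls back to plain alerting, so automation never acts on an alert it doesn't recognize.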
Cost Comparison
PagerDuty: $21/user/month
Opsgenie: $9/user/month
OpenClaw + Pi: $10/month (Copilot Pro)
Plus, OpenClaw does far more than alerting: it's a full AI assistant that also handles monitoring.