Nobody likes being woken up at 3 AM for a pager alert that could have waited. That is exactly why I was so excited to run into JJ Tang, co-founder and CEO of Rootly, while walking the floor at KubeCon in Amsterdam.

AI Agents That Go On-Call For You
We had a brilliant conversation about how they are completely reimagining incident management and the traditional "on-call" experience.
Rootly is building an AI SRE platform that actually acts as your first line of defense. Instead of immediately paging an engineer when something breaks, their AI agents go on-call for you. They automatically:
- Triage the issue: analyze logs, metrics, and alerts to understand what happened
- Determine impact and severity: assess blast radius and business impact
- Attempt automated remediation: run predefined playbooks for known failure modes
- Only wake up a human when truly necessary: escalate intelligently, not reflexively
This is fundamentally different from traditional alerting, which treats every threshold breach as equally urgent and pages whoever is next on the rotation.
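To make that flow concrete, here is a minimal Python sketch of how I picture the triage-then-escalate logic. It is not Rootly's API or implementation; the `Alert` fields, the `user_threshold` default, and the playbook registry are all hypothetical names I made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Severity(Enum):
    LOW = 1
    HIGH = 2


@dataclass
class Alert:
    service: str
    failure_mode: str    # hypothetical label produced by the triage step
    affected_users: int  # hypothetical blast-radius estimate


# Hypothetical playbook type: takes an alert, returns True if the fix worked.
Playbook = Callable[[Alert], bool]


def assess_severity(alert: Alert, user_threshold: int = 1000) -> Severity:
    """Classify impact from the estimated number of affected users."""
    if alert.affected_users >= user_threshold:
        return Severity.HIGH
    return Severity.LOW


def handle_alert(alert: Alert, playbooks: Dict[str, Playbook]) -> str:
    """Try automated remediation first; page a human only as a last resort."""
    severity = assess_severity(alert)
    playbook = playbooks.get(alert.failure_mode)
    if playbook is not None and playbook(alert):
        return "remediated"
    if severity is Severity.HIGH:
        return "page_human"
    # Low-impact and unresolved: file a ticket instead of waking anyone up.
    return "ticket"
```

The design choice that matters is the ordering: a known failure mode gets its playbook before anyone is paged, and an unresolved low-severity alert becomes a ticket rather than a 3 AM page.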
Who Trusts Rootly
It is no surprise that engineering teams at DoorDash, NVIDIA, and LinkedIn are already trusting them to keep their cloud-native environments running. These are organizations with massive-scale distributed systems where:
- Incidents happen frequently across thousands of microservices
- Context switching costs are enormous for on-call engineers
- Alert fatigue is a real cause of burnout and turnover
- Mean time to resolution directly impacts revenue
Why This Matters
For me, this is the kind of innovation that makes our industry better. It is not just about building smarter tools or using AI for the sake of it; it is about genuinely improving the quality of life for developers and SREs so they do not burn out.
The on-call problem is one of the most persistent quality-of-life issues in engineering. Every SRE I know has stories about false alarms at 3 AM, cascading pages for non-critical issues, and the constant low-grade anxiety of being on rotation. If AI agents can absorb the initial triage (the "is this actually a problem?" phase), that changes the entire experience.
For teams running AI workloads on Kubernetes, this becomes even more critical. GPU inference pipelines, multi-node model serving, and disaggregated serving with Dynamo introduce new failure modes that traditional runbooks do not cover. Having an AI agent that can triage GPU-specific issues such as OOM kills, NCCL timeouts, and KV cache exhaustion before paging a human is a significant operational improvement.
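As a rough illustration of what a first pass at GPU-specific triage could look like, the sketch below tags a log line with one of the three failure modes named above. The regular expressions are my own illustrative guesses, not the actual log formats of any particular runtime, and the function name is hypothetical.

```python
import re
from typing import Optional

# Illustrative patterns only; real log formats vary by runtime and version.
GPU_FAILURE_PATTERNS = {
    "oom_kill": re.compile(r"OOMKilled|out of memory", re.IGNORECASE),
    "nccl_timeout": re.compile(r"NCCL.*timed? ?out", re.IGNORECASE),
    "kv_cache_exhaustion": re.compile(r"KV cache.*(exhausted|full)", re.IGNORECASE),
}


def classify_log_line(line: str) -> Optional[str]:
    """Return the first matching failure-mode label, or None if nothing matches."""
    for name, pattern in GPU_FAILURE_PATTERNS.items():
        if pattern.search(line):
            return name
    return None
```

A label like this would only be the input to triage; deciding whether to remediate or escalate still depends on impact and on whether a playbook exists for that failure mode.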
Learn More
If you want to see how they are solving the on-call nightmare: rootly.com
About the Author
I am Luca Berton, AI and Cloud Advisor. I help enterprises build reliable, observable AI platforms that do not burn out their teams.