
Rootly at KubeCon EU 2026: AI SRE Agents That Go On-Call

I met JJ Tang, co-founder and CEO of Rootly, at KubeCon EU 2026 in Amsterdam. Rootly is an AI SRE platform that triages incidents automatically and pages humans only when truly necessary.

April 10, 2026 · 2 min read

Nobody likes being woken up at 3 AM for a pager alert that could have waited. That is exactly why I was so excited to run into JJ Tang, co-founder and CEO of Rootly, while walking the floor at KubeCon in Amsterdam.

Luca with JJ Tang, co-founder and CEO of Rootly, at KubeCon EU 2026

AI Agents That Go On-Call For You

We had a great conversation about how Rootly is rethinking incident management and the traditional “on-call” experience.

Rootly is building an AI SRE platform that actually acts as your first line of defense. Instead of immediately paging an engineer when something breaks, their AI agents go on-call for you. They automatically:

Triage the issue — analyze logs, metrics, and alerts to understand what happened
Determine impact and severity — assess blast radius and business impact
Attempt automated remediation — run predefined playbooks for known failure modes
Only wake up a human when truly necessary — escalate intelligently, not reflexively

This is fundamentally different from traditional alerting, which treats every threshold breach as equally urgent and pages whoever is next on the rotation.
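To make the triage-first flow concrete, here is a minimal Python sketch of the four steps listed above. It is my own illustration, not Rootly's implementation: the `Alert` fields, the severity thresholds, the playbook registry, and the returned action names are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Alert:
    service: str
    failure_mode: str    # hypothetical label produced by the triage step
    affected_users: int  # rough stand-in for blast radius


# Hypothetical playbook: takes an alert, returns True if the fix worked.
Playbook = Callable[[Alert], bool]


def assess_severity(alert: Alert) -> Severity:
    """Map blast radius to a severity level (thresholds are made up)."""
    if alert.affected_users >= 10_000:
        return Severity.HIGH
    if alert.affected_users >= 100:
        return Severity.MEDIUM
    return Severity.LOW


def handle_alert(alert: Alert, playbooks: Dict[str, Playbook]) -> str:
    """Try automated remediation first; page a human only if it is needed."""
    severity = assess_severity(alert)

    # Known failure mode: run its predefined playbook.
    playbook = playbooks.get(alert.failure_mode)
    if playbook is not None and playbook(alert):
        return "resolved"

    # No playbook, or the playbook failed: escalate based on severity.
    if severity >= Severity.MEDIUM:
        return "page_human"
    return "ticket"


if __name__ == "__main__":
    playbooks = {"disk_full": lambda alert: True}
    print(handle_alert(Alert("api", "disk_full", 50_000), playbooks))
    print(handle_alert(Alert("api", "unknown", 50_000), playbooks))
    print(handle_alert(Alert("api", "unknown", 5), playbooks))
```

The contrast with threshold-based alerting is in `handle_alert`: paging is the last branch, reached only when no playbook fixed the problem and the impact is high enough to justify waking someone up.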

Who Trusts Rootly

Engineering teams at DoorDash, NVIDIA, and LinkedIn already trust Rootly to keep their cloud-native environments running. These are organizations running distributed systems at massive scale, where:

Incidents happen frequently across thousands of microservices
Context switching costs are enormous for on-call engineers
Alert fatigue is a real cause of burnout and turnover
Mean time to resolution directly impacts revenue

Why This Matters

For me, this is the kind of innovation that makes our industry better. It is not just about building smarter tools or using AI for the sake of it — it is about genuinely improving the quality of life for developers and SREs so they do not burn out.

The on-call problem is one of the most persistent quality-of-life issues in engineering. Every SRE I know has stories about false alarms at 3 AM, cascading pages for non-critical issues, and the constant low-grade anxiety of being on rotation. If AI agents can absorb the initial triage — the “is this actually a problem?” phase — that changes the entire experience.

For teams running AI workloads on Kubernetes, this becomes even more critical. GPU inference pipelines, multi-node model serving, and disaggregated serving with NVIDIA Dynamo introduce new failure modes that traditional runbooks do not cover. Having an AI agent that can triage GPU-specific issues (OOM kills, NCCL timeouts, KV cache exhaustion) before paging a human is a significant operational improvement.
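As a rough idea of what a first pass at GPU-specific triage could look like, the sketch below labels a log line with one of the three failure modes mentioned above. The regular expressions are illustrative assumptions, not patterns taken from Rootly or from any specific serving framework. Real log formats vary by framework and version, and the KV cache pattern in particular is hypothetical.

```python
import re
from typing import Optional

# Illustrative patterns only; adjust them to the logs your stack actually emits.
GPU_FAILURE_PATTERNS = {
    # Kubernetes reports "OOMKilled" as a container termination reason;
    # "CUDA out of memory" is a common GPU-side out-of-memory message.
    "oom_kill": re.compile(r"OOMKilled|CUDA out of memory", re.IGNORECASE),
    # Any line mentioning both NCCL and a timeout.
    "nccl_timeout": re.compile(r"NCCL.*timeout|timeout.*NCCL", re.IGNORECASE),
    # Hypothetical wording for a serving engine that runs out of KV cache.
    "kv_cache_exhaustion": re.compile(
        r"kv[ _-]?cache.*(exhaust|full|no free)", re.IGNORECASE
    ),
}


def classify_gpu_failure(log_line: str) -> Optional[str]:
    """Return the first matching failure label, or None if nothing matches."""
    for label, pattern in GPU_FAILURE_PATTERNS.items():
        if pattern.search(log_line):
            return label
    return None


if __name__ == "__main__":
    samples = [
        "Last State: Terminated, Reason: OOMKilled",
        "NCCL watchdog: collective operation timeout",
        "KV cache is full, no free blocks",
        "pod started",
    ]
    for line in samples:
        print(line, "->", classify_gpu_failure(line))
```

A label like this would only be the input to triage. Deciding whether to restart a pod, shed load, or page someone still depends on the impact and on which playbooks exist.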

Learn More

If you want to see how they are solving the on-call nightmare: rootly.com

About the Author

I am Luca Berton, AI and Cloud Advisor. I help enterprises build reliable, observable AI platforms that do not burn out their teams.
