The Kubernetes operator pattern, a custom controller that watches resources and reconciles state, is a natural fit for AI-powered agents. Instead of hardcoded reconciliation logic, the controller consults an LLM to decide what action to take.
Why Operators Make Good Agents
Kubernetes operators already implement the core agent loop:
1. Watch: observe cluster events via the API server
2. Analyze: determine if current state matches desired state
3. Act: create, update, or delete resources to reconcile
Adding an LLM to step 2 transforms a rigid operator into an adaptive agent that can handle situations the developer did not anticipate.
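The loop described above can be sketched without any framework. This is a minimal illustration, not how kopf or a production operator is structured: `Observation`, `rule_based_analyze`, `reconcile`, and the action names are hypothetical, and the "act" step only records what it would do. The analyzer is a plain function so that an LLM-backed one with the same signature could be swapped in.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Observation:
    """A simplified view of one watched resource (hypothetical shape)."""
    name: str
    desired_replicas: int
    ready_replicas: int


# An analyzer maps an observation to an action name, or None if nothing to do.
Analyzer = Callable[[Observation], Optional[str]]


def rule_based_analyze(obs: Observation) -> Optional[str]:
    """Hardcoded reconciliation logic: the rigid-operator case."""
    if obs.ready_replicas < obs.desired_replicas:
        return "scale_up"
    if obs.ready_replicas > obs.desired_replicas:
        return "scale_down"
    return None


def reconcile(events: Iterable[Observation], analyze: Analyzer) -> List[str]:
    """Run watch -> analyze -> act over a stream of observations."""
    performed = []
    for obs in events:              # 1. Watch
        action = analyze(obs)       # 2. Analyze (rules here, an LLM in an agent)
        if action is not None:      # 3. Act (recorded only, no API calls)
            performed.append(f"{action}:{obs.name}")
    return performed


if __name__ == "__main__":
    stream = [
        Observation("web", desired_replicas=3, ready_replicas=1),
        Observation("api", desired_replicas=2, ready_replicas=2),
    ]
    print(reconcile(stream, rule_based_analyze))
```

Passing a different `analyze` function changes the decisions without touching the watch or act steps, which is the only part of the loop the LLM needs to replace.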
Architecture
    ┌────────────────────────────────────────────┐
    │             AI Agent Operator              │
    │                                            │
    │  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
    │  │ Watcher  │→ │   LLM    │→ │ Executor │  │
    │  │ (K8s API)│  │ Reasoner │  │ (kubectl)│  │
    │  └──────────┘  └──────────┘  └──────────┘  │
    │                     │                      │
    │                ┌──────────┐                │
    │                │ Context  │                │
    │                │ Gatherer │                │
    │                └──────────┘                │
    └────────────────────────────────────────────┘

Implementation Example
Using Python with kopf (Kubernetes Operator Framework):

    import json

    import kopf
    from kubernetes import client
    from openai import AsyncOpenAI

    # Async OpenAI client; reads OPENAI_API_KEY from the environment.
    llm = AsyncOpenAI()

    # is_unhealthy, get_recent_events, get_pod_logs, get_node_metrics,
    # execute_action, notify_human, ALLOWED_ACTIONS, and AGENT_SYSTEM_PROMPT
    # are assumed to be defined elsewhere in the operator.


    @kopf.on.event('', 'v1', 'pods')
    async def handle_pod_event(event, logger, **kwargs):
        pod = event['object']

        # Only act on unhealthy pods
        if not is_unhealthy(pod):
            return

        # Gather context
        context = {
            "pod_name": pod['metadata']['name'],
            "namespace": pod['metadata']['namespace'],
            "status": pod['status'],
            "events": get_recent_events(pod),
            "logs": get_pod_logs(pod, tail=50),
            "node_metrics": get_node_metrics(pod),
        }

        # Ask LLM for diagnosis and action
        response = await llm.chat.completions.create(
            # JSON mode (response_format) needs a model that supports it;
            # the base "gpt-4" model does not.
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": AGENT_SYSTEM_PROMPT},
                {
                    "role": "user",
                    "content": f"Diagnose this pod issue: {json.dumps(context, default=str)}",
                },
            ],
            response_format={"type": "json_object"},
        )
        diagnosis = json.loads(response.choices[0].message.content)

        # Execute within guardrails
        if diagnosis.get('action') in ALLOWED_ACTIONS:
            await execute_action(diagnosis)
            logger.info(f"Executed: {diagnosis['action']}")
        else:
            await notify_human(diagnosis)

The System Prompt
The system prompt defines the agent's personality and constraints:

    AGENT_SYSTEM_PROMPT = """
    You are a Kubernetes SRE agent. You diagnose pod issues and recommend actions.

    ALLOWED ACTIONS:
    - restart_pod: Restart a single pod
    - scale_deployment: Scale replicas (2-20 range only)
    - cordon_node: Mark node as unschedulable
    - create_alert: Send alert to ops team

    NEVER:
    - Delete namespaces or persistent volumes
    - Modify RBAC or network policies
    - Act on multiple pods simultaneously

    Respond in JSON:
    {
      "diagnosis": "string",
      "confidence": 0.0-1.0,
      "action": "string",
      "parameters": {},
      "reasoning": "string"
    }
    """

Custom Resource Definition
Define a CRD to track agent decisions:
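
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: agentdecisions.ai.example.com
    spec:
      group: ai.example.com
      # scope and names are required fields. The names are inferred from
      # metadata.name (<plural>.<group>); Namespaced scope is an assumption.
      scope: Namespaced
      names:
        plural: agentdecisions
        singular: agentdecision
        kind: AgentDecision
      versions:
        - name: v1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    observation:
                      type: string
                    diagnosis:
                      type: string
                    action:
                      type: string
                    confidence:
                      type: number
                    approved:
                      type: boolean
                    executed:
                      type: boolean

Production Considerations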
- Rate limit LLM calls: cache common diagnoses and batch observations
- Fallback logic: if the LLM is unavailable, fall back to rule-based decisions
- Cost management: GPT-4 calls add up; use smaller models for common patterns
- Testing: replay historical incidents to validate agent behavior
- Observability: log every decision with full context for audit