Building AI Agents as Kubernetes Operators

Combine the Kubernetes operator pattern with LLM-powered decision making. Build agents that observe cluster state and take corrective action.

Luca Berton · 1 min read

The Kubernetes operator pattern, a custom controller that watches resources and reconciles state, is a natural fit for AI-powered agents. Instead of hardcoded reconciliation logic, the controller consults an LLM to decide what action to take.

Why Operators Make Good Agents

Kubernetes operators already implement the core agent loop:

  1. Watch: observe cluster events via the API server
  2. Analyze: determine if current state matches desired state
  3. Act: create, update, or delete resources to reconcile

Adding an LLM to step 2 transforms a rigid operator into an adaptive agent that can handle situations the developer did not anticipate.
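
The loop above can be sketched in plain Python with no cluster or model attached. The names `reconcile`, `rule_based_analyzer`, and the observation dictionaries are hypothetical, chosen only to show that the analyze step is the pluggable piece; an LLM-backed analyzer would accept the same observation and return an action in the same way.

```python
from typing import Callable, Iterable

# Hypothetical types for illustration: an observation is a dict describing
# one resource, and an action is a short string naming what to do.
Analyzer = Callable[[dict], str]


def rule_based_analyzer(observation: dict) -> str:
    """Hardcoded step 2: one fixed rule, nothing else is handled."""
    if observation.get("restarts", 0) > 5:
        return "restart_pod"
    return "none"


def reconcile(events: Iterable[dict], analyze: Analyzer) -> list[tuple[str, str]]:
    """Run watch -> analyze -> act over a stream of observations."""
    taken = []
    for observation in events:          # 1. Watch
        action = analyze(observation)   # 2. Analyze (rules or an LLM)
        if action != "none":            # 3. Act
            taken.append((observation["name"], action))
    return taken


events = [
    {"name": "web-1", "restarts": 9},
    {"name": "web-2", "restarts": 0},
]
print(reconcile(events, rule_based_analyzer))
```

Swapping `rule_based_analyzer` for a function that calls a model changes only step 2; the watch and act steps stay as they are.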

Architecture

┌─────────────────────────────────────────────┐
│  AI Agent Operator                          │
│                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ Watcher  │→ │ LLM      │→ │ Executor │   │
│  │ (K8s API)│  │ Reasoner │  │ (kubectl)│   │
│  └──────────┘  └──────────┘  └──────────┘   │
│                     ↑                       │
│                ┌──────────┐                 │
│                │ Context  │                 │
│                │ Gatherer │                 │
│                └──────────┘                 │
└─────────────────────────────────────────────┘

Implementation Example

Using Python with kopf (Kubernetes Operator Pythonic Framework):

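import json

import kopf
from openai import AsyncOpenAI

# Async client, so the call can be awaited inside the handler.
# Reads OPENAI_API_KEY from the environment.
llm = AsyncOpenAI()

# Must mirror the actions listed in AGENT_SYSTEM_PROMPT
ALLOWED_ACTIONS = {"restart_pod", "scale_deployment", "cordon_node", "create_alert"}

# is_unhealthy, get_recent_events, get_pod_logs, get_node_metrics,
# execute_action and notify_human are application helpers not shown here.


@kopf.on.event('', 'v1', 'pods')
async def handle_pod_event(event, logger, **kwargs):
    pod = event['object']

    # Only act on unhealthy pods
    if not is_unhealthy(pod):
        return

    # Gather context
    context = {
        "pod_name": pod['metadata']['name'],
        "namespace": pod['metadata']['namespace'],
        "status": pod['status'],
        "events": get_recent_events(pod),
        "logs": get_pod_logs(pod, tail=50),
        "node_metrics": get_node_metrics(pod),
    }

    # Ask LLM for diagnosis and action.
    # JSON mode needs a model that supports response_format;
    # adjust the model name if yours does not.
    response = await llm.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "system",
            "content": AGENT_SYSTEM_PROMPT
        }, {
            "role": "user",
            # default=str keeps non-JSON values (e.g. datetimes) from raising
            "content": f"Diagnose this pod issue: {json.dumps(context, default=str)}"
        }],
        response_format={"type": "json_object"}
    )

    diagnosis = json.loads(response.choices[0].message.content)

    # Execute within guardrails; a missing or unknown action goes to a human
    if diagnosis.get('action') in ALLOWED_ACTIONS:
        await execute_action(diagnosis)
        logger.info(f"Executed: {diagnosis['action']}")
    else:
        await notify_human(diagnosis)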
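
The handler calls `is_unhealthy` without defining it. One possible version, offered as an assumption and not as the author's implementation, treats a pod as unhealthy when its phase is `Failed` or when any container is waiting in a known error state. The set of waiting reasons is illustrative and not exhaustive.

```python
# Waiting reasons that commonly indicate a stuck container.
# Illustrative only; extend for your own workloads.
BAD_WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def is_unhealthy(pod: dict) -> bool:
    """Return True if the pod dict (as delivered by the API) looks broken."""
    status = pod.get("status") or {}
    if status.get("phase") == "Failed":
        return True
    for container in status.get("containerStatuses") or []:
        waiting = (container.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in BAD_WAITING_REASONS:
            return True
    return False


crashing = {"status": {"phase": "Running", "containerStatuses": [
    {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}}
healthy = {"status": {"phase": "Running", "containerStatuses": [
    {"state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}}}]}}
print(is_unhealthy(crashing), is_unhealthy(healthy))
```

Filtering this early keeps healthy pods from ever reaching the LLM, which matters because the event handler fires on every pod event in the cluster.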

The System Prompt

The system prompt defines the agent's role and constraints:

AGENT_SYSTEM_PROMPT = """
You are a Kubernetes SRE agent. You diagnose pod issues and recommend actions.

ALLOWED ACTIONS:
- restart_pod: Restart a single pod
- scale_deployment: Scale replicas (2-20 range only)  
- cordon_node: Mark node as unschedulable
- create_alert: Send alert to ops team

NEVER:
- Delete namespaces or persistent volumes
- Modify RBAC or network policies
- Act on multiple pods simultaneously

Respond in JSON:
{
  "diagnosis": "string",
  "confidence": 0.0-1.0,
  "action": "string", 
  "parameters": {},
  "reasoning": "string"
}
"""
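
A prompt alone does not enforce these constraints, since the model can still return an action or parameter outside the list. A minimal sketch of a validation step that re-checks the response in code follows. The function name `validate_diagnosis`, the `replicas` parameter key, and the 0.8 confidence threshold are assumptions for illustration.

```python
ALLOWED_ACTIONS = {"restart_pod", "scale_deployment", "cordon_node", "create_alert"}
MIN_CONFIDENCE = 0.8  # assumed threshold, not part of the prompt


def validate_diagnosis(diagnosis: dict) -> tuple[bool, str]:
    """Return (ok, reason). Only responses with ok=True should be executed."""
    action = diagnosis.get("action")
    if action not in ALLOWED_ACTIONS:
        return False, f"action not allowed: {action!r}"

    confidence = diagnosis.get("confidence")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return False, "confidence missing or not a number"
    if not 0.0 <= confidence <= 1.0:
        return False, "confidence out of range"
    if confidence < MIN_CONFIDENCE:
        return False, "confidence below threshold"

    if action == "scale_deployment":
        params = diagnosis.get("parameters")
        replicas = params.get("replicas") if isinstance(params, dict) else None
        if isinstance(replicas, bool) or not isinstance(replicas, int):
            return False, "replicas must be an integer between 2 and 20"
        if not 2 <= replicas <= 20:
            return False, "replicas must be an integer between 2 and 20"

    return True, "ok"


print(validate_diagnosis(
    {"action": "scale_deployment", "confidence": 0.9, "parameters": {"replicas": 50}}
))
```

Anything that fails validation can be routed to the same human-notification path used for disallowed actions.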

Custom Resource Definition

Define a CRD to track agent decisions:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: agentdecisions.ai.example.com
spec:
  group: ai.example.com
  scope: Namespaced
  names:
    plural: agentdecisions
    singular: agentdecision
    kind: AgentDecision
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                observation:
                  type: string
                diagnosis:
                  type: string
                action:
                  type: string
                confidence:
                  type: number
                approved:
                  type: boolean
                executed:
                  type: boolean
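
To show how a decision might be recorded against this schema, here is a sketch that builds the object body and, separately, writes it with the official `kubernetes` Python client. The kind `AgentDecision` and plural `agentdecisions` are assumed from the CRD's metadata name, and `build_agent_decision` and `record_decision` are hypothetical helper names.

```python
import json


def build_agent_decision(name: str, observation: str, diagnosis: dict) -> dict:
    """Build an AgentDecision body whose spec matches the CRD schema."""
    return {
        "apiVersion": "ai.example.com/v1",
        "kind": "AgentDecision",  # assumed kind name
        "metadata": {"name": name},
        "spec": {
            "observation": observation,
            "diagnosis": diagnosis.get("diagnosis", ""),
            "action": diagnosis.get("action", ""),
            "confidence": float(diagnosis.get("confidence", 0.0)),
            "approved": False,   # set later by a human or a policy
            "executed": False,   # set after the action has run
        },
    }


def record_decision(namespace: str, body: dict) -> None:
    """Persist the decision; needs the kubernetes package and cluster access."""
    from kubernetes import client, config

    config.load_incluster_config()
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="ai.example.com",
        version="v1",
        namespace=namespace,
        plural="agentdecisions",  # assumed plural name
        body=body,
    )


body = build_agent_decision(
    "web-1-20240101",
    "pod web-1 in CrashLoopBackOff",
    {"diagnosis": "OOMKilled", "action": "restart_pod", "confidence": 0.92},
)
print(json.dumps(body["spec"], indent=2))
```

Keeping `approved` and `executed` as separate fields lets a human approval step sit between the diagnosis and the action.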

Production Considerations

  1. Rate limit LLM calls: cache common diagnoses, batch observations
  2. Fallback logic: if the LLM is unavailable, fall back to rule-based decisions
  3. Cost management: GPT-4 calls add up; use smaller models for common patterns
  4. Testing: replay historical incidents to validate agent behavior
  5. Observability: log every decision with full context for audit
