Platform + SRE + ML Teams: The Production

Part 7 of a 10-part series on running AI workloads on Kubernetes in production.

The ownership problem

When inference latency spikes at 2 AM, who gets paged?

Platform team says: “The cluster is healthy, must be the model.”
ML team says: “The model works fine in dev, must be the infrastructure.”
SRE says: “Nobody told me this workload existed.”

This is the most common failure mode in AI production. Not a technical failure — an organizational one. The fix is not better tools. It is a clear contract.

The contract

Platform team owns the paved road

Cluster infrastructure, node pools, GPU scheduling policy
Security controls: RBAC, network policies, image scanning
Tenancy model: namespaces, quotas, priorities
Base images and standard deployment patterns
Self-service onboarding for new teams and workloads

The platform team’s success metric: how fast can a new ML workload go from nothing to production-ready?

SRE owns reliability

SLO definitions and measurement
Incident response procedures and escalation paths
Capacity planning and the signals that inform it
Production guardrails: circuit breakers, rate limits, rollback automation
On-call rotation that includes AI-specific runbooks

SRE’s success metric: what percentage of the time are production AI services meeting their SLOs?
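
One way to make that metric concrete is to count the share of measurement windows in which a service met every target. The sketch below is a minimal illustration, not a prescribed method: the `Window` type, the window contents, and the thresholds are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """One measurement window (for example, 5 minutes) for a service."""
    p99_latency_ms: float
    success_ratio: float  # fraction of requests that succeeded, 0.0 to 1.0


def slo_attainment(windows, max_p99_ms, min_success_ratio):
    """Return the fraction of windows in which both SLO targets were met."""
    if not windows:
        raise ValueError("need at least one measurement window")
    good = sum(
        1
        for w in windows
        if w.p99_latency_ms < max_p99_ms and w.success_ratio >= min_success_ratio
    )
    return good / len(windows)


# Hypothetical measurements, purely for illustration.
windows = [
    Window(150.0, 0.9995),
    Window(180.0, 0.9991),
    Window(240.0, 0.9993),  # latency target missed
    Window(120.0, 0.9950),  # availability target missed
]
attainment = slo_attainment(windows, max_p99_ms=200.0, min_success_ratio=0.999)
print(f"SLO attainment: {attainment:.0%}")
```

Counting windows rather than raw requests keeps the number easy to explain during a review, at the cost of treating a brief breach and a sustained one within the same window identically.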

ML team owns workload behavior

Model quality: training, evaluation, validation
Model versions and promotion decisions
Inference requirements: latency targets, throughput expectations, resource profiles
Data pipelines and feature management
Experiment lifecycle and cleanup

ML team’s success metric: is the model delivering business value in production?

Where teams fail

Ownership by tool, not by outcome

“Platform team owns Kubernetes, ML team owns Kubeflow, SRE owns PagerDuty.” This creates gaps. When a Kubeflow pipeline fails because of a Kubernetes scheduling issue that SRE did not know about, nobody owns the problem.

Better: Own by outcome. Platform owns “workloads can deploy reliably.” SRE owns “production services meet SLOs.” ML owns “models deliver accurate predictions.”

Missing handoff points

The deployment pipeline crosses all three teams. If there is no defined handoff — where platform responsibility ends and ML responsibility begins — every deployment is a negotiation.

Define explicit handoff points:

ML team produces a validated model artifact with defined resource requirements
Platform team provides a deployment template that meets those requirements within cluster policy
SRE validates that the deployment meets production readiness criteria (observability, alerts, rollback)
ML team promotes the model and monitors quality
SRE monitors reliability and pages the right team based on signal type
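
The last handoff, paging by signal type, can be written down as a small routing table. This is a sketch under stated assumptions: the signal names are hypothetical, and sending unknown signals to SRE is one possible default, chosen here so that no alert is left without an owner.

```python
from enum import Enum


class Team(Enum):
    PLATFORM = "platform"
    SRE = "sre"
    ML = "ml"


# Hypothetical signal names; a real system would use its own alert labels.
ROUTES = {
    "latency_slo_breach": Team.SRE,
    "availability_slo_breach": Team.SRE,
    "accuracy_drift": Team.ML,
    "gpu_oom": Team.PLATFORM,
    "pod_unschedulable": Team.PLATFORM,
}


def route(signal_type: str) -> Team:
    """Return the team to page for a signal.

    Unknown signals fall back to SRE (an assumption of this sketch),
    so every alert has an owner even before it is classified.
    """
    return ROUTES.get(signal_type, Team.SRE)


for signal in ("gpu_oom", "accuracy_drift", "unclassified_signal"):
    print(f"{signal} -> {route(signal).value}")
```

Keeping the table in version control gives all three teams one place to review and change who is paged for what.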

No shared on-call

If the ML team is not on call for their models in production, they will never feel the pain of bad deployments, missing observability, or unclear runbooks. Shared on-call — where ML team members rotate into production support — creates the feedback loop that drives quality.

The interface document

Every AI workload in production should have a short interface document:

## Workload: fraud-detection-model-v3

### Owner
ML team: fraud-detection-squad
Platform contact: platform-gpu-team
SRE contact: sre-ai-oncall

### SLOs
- P99 inference latency: <200ms
- Availability: 99.9%
- Max cold start: 120 seconds

### Resources
- GPU: 1x A100 40GB per replica
- Replicas: 2 (min) — 6 (max)
- Model size: 12GB in GPU memory

### Alerts
- Latency > 200ms P99 over 5 min → pages SRE
- Accuracy drift > 5% → pages ML team
- GPU OOM → pages platform team

### Rollback
- Automated: revert to previous model version on accuracy drift
- Manual: ML team decision for quality issues, SRE for availability

This document takes 15 minutes to write and prevents hours of confusion during incidents.
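
If the interface documents live in a repository, a small check can flag any that are missing a section. The sketch below assumes the documents use the `###` headings shown above; the function name and the example document are hypothetical.

```python
import re

# Section headings taken from the interface document layout above.
REQUIRED_SECTIONS = ("Owner", "SLOs", "Resources", "Alerts", "Rollback")


def missing_sections(doc_text: str) -> list[str]:
    """Return the required section names that have no '### <name>' heading."""
    found = set(re.findall(r"^###\s+(.+?)\s*$", doc_text, flags=re.MULTILINE))
    return [name for name in REQUIRED_SECTIONS if name not in found]


# Hypothetical document with no Rollback section.
example = """\
## Workload: example-model

### Owner
ML team: example-squad

### SLOs
- Availability: 99.9%

### Resources
- Replicas: 2

### Alerts
- GPU OOM -> pages platform team
"""

print(missing_sections(example))
```

Run as part of a merge check, something like this would catch an incomplete document before the workload reaches production rather than during an incident.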

Joint accountability, explicit boundaries

The goal is not to eliminate boundaries between teams. Boundaries are healthy — they create focus and accountability. The goal is to make those boundaries explicit so that nothing falls into the gaps.

When the contract is clear, teams move faster because they know exactly what they own, what they can expect from others, and where to go when something goes wrong.
