Part 7 of a 10-part series on running AI workloads on Kubernetes in production.
The ownership problem
When inference latency spikes at 2 AM, who pages?
- Platform team says: “The cluster is healthy, must be the model”
- ML team says: “The model works fine in dev, must be the infrastructure”
- SRE says: “Nobody told me this workload existed”
This is the most common failure mode in AI production. Not a technical failure — an organizational one. The fix is not better tools. It is a clear contract.
The contract
Platform team owns the paved road
- Cluster infrastructure, node pools, GPU scheduling policy
- Security controls: RBAC, network policies, image scanning
- Tenancy model: namespaces, quotas, priorities
- Base images and standard deployment patterns
- Self-service onboarding for new teams and workloads
The platform team’s success metric: how fast can a new ML workload go from nothing to production-ready?
SRE owns reliability
- SLO definitions and measurement
- Incident response procedures and escalation paths
- Capacity planning and the signals that inform it
- Production guardrails: circuit breakers, rate limits, rollback automation
- On-call rotation that includes AI-specific runbooks
SRE’s success metric: what percentage of time are production AI services meeting their SLOs?
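One rough way to put a number on that metric is to split time into fixed evaluation windows and count the share in which the SLO held. The sketch below assumes each window has already been judged as met or missed. The function name `slo_attainment` and the window model are illustrative assumptions, not something this series prescribes.

```python
def slo_attainment(window_results: list[bool]) -> float:
    """Percentage of evaluation windows in which the SLO was met.

    Each entry is True if the service met its SLO during that window
    (for example, one 5-minute window), False otherwise.
    """
    if not window_results:
        raise ValueError("need at least one evaluation window")
    return 100.0 * sum(window_results) / len(window_results)


# Hypothetical data: four windows, one of them missed.
attainment = slo_attainment([True, True, True, False])
```

Counting windows rather than individual requests keeps the number comparable across services with very different traffic volumes, at the cost of hiding how badly a missed window was missed.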
ML team owns workload behavior
- Model quality: training, evaluation, validation
- Model versions and promotion decisions
- Inference requirements: latency targets, throughput expectations, resource profiles
- Data pipelines and feature management
- Experiment lifecycle and cleanup
ML team’s success metric: is the model delivering business value in production?
Where teams fail
Ownership by tool, not by outcome
“Platform team owns Kubernetes, ML team owns Kubeflow, SRE owns PagerDuty.” This creates gaps. When a Kubeflow pipeline fails because of a Kubernetes scheduling issue that SRE did not know about, nobody owns the problem.
Better: Own by outcome. Platform owns “workloads can deploy reliably.” SRE owns “production services meet SLOs.” ML owns “models deliver accurate predictions.”
Missing handoff points
The deployment pipeline crosses all three teams. If there is no defined handoff — where platform responsibility ends and ML responsibility begins — every deployment is a negotiation.
Define explicit handoff points:
- ML team produces a validated model artifact with defined resource requirements
- Platform team provides a deployment template that meets those requirements within cluster policy
- SRE validates the deployment meets production readiness criteria (observability, alerts, rollback)
- ML team promotes the model and monitors quality
- SRE monitors reliability and pages the right team based on signal type
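The handoff points above can be turned into a small gate that names the owning team for every gap instead of returning a bare pass or fail. This is a minimal sketch under stated assumptions: `DeploymentRequest`, its fields, and the `max_gpus_per_replica` policy limit are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeploymentRequest:
    """Hypothetical handoff record that travels with a model through the pipeline."""
    model_artifact: str        # reference to the validated model artifact (ML team)
    gpus_per_replica: int      # resource requirement declared by the ML team
    min_replicas: int
    max_replicas: int
    has_dashboards: bool = False   # production readiness criteria (SRE)
    has_alerts: bool = False
    has_rollback: bool = False


def readiness_gaps(
    req: DeploymentRequest, max_gpus_per_replica: int = 8
) -> list[tuple[str, str]]:
    """Return (owning_team, problem) pairs; an empty list means ready to promote."""
    gaps = []
    if not req.model_artifact:
        gaps.append(("ml", "no validated model artifact"))
    if req.min_replicas < 1 or req.min_replicas > req.max_replicas:
        gaps.append(("ml", "replica range is invalid"))
    if req.gpus_per_replica > max_gpus_per_replica:
        gaps.append(("platform", "GPU request exceeds cluster policy"))
    if not req.has_dashboards:
        gaps.append(("sre", "observability is missing"))
    if not req.has_alerts:
        gaps.append(("sre", "alerts are missing"))
    if not req.has_rollback:
        gaps.append(("sre", "rollback path is missing"))
    return gaps


# Hypothetical request that satisfies every check.
ready = readiness_gaps(DeploymentRequest("models/fraud-v3", 1, 2, 6, True, True, True))
```

Because each gap carries an owner, a failed check goes straight to the team that can fix it and the deployment stops being a negotiation.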
No shared on-call
If the ML team is not on call for their models in production, they will never feel the pain of bad deployments, missing observability, or unclear runbooks. Shared on-call — where ML team members rotate into production support — creates the feedback loop that drives quality.
The interface document
Every AI workload in production should have a short interface document:
## Workload: fraud-detection-model-v3
### Owner
ML team: fraud-detection-squad
Platform contact: platform-gpu-team
SRE contact: sre-ai-oncall
### SLOs
- P99 inference latency: <200ms
- Availability: 99.9%
- Max cold start: 120 seconds
### Resources
- GPU: 1x A100 40GB per replica
- Replicas: 2 (min) — 6 (max)
- Model size: 12GB in GPU memory
### Alerts
- Latency > 200ms P99 over 5 min → pages SRE
- Accuracy drift > 5% → pages ML team
- GPU OOM → pages platform team
### Rollback
- Automated: revert to previous model version on accuracy drift
- Manual: ML team decision for quality issues, SRE for availability
This document takes 15 minutes to write and prevents hours of confusion during incidents.
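The Alerts section of the example document is effectively a routing table, so it can be encoded directly. The sketch below reuses the thresholds and contacts from the example. The metric keys (`latency_p99_ms`, `accuracy_drift_pct`, `gpu_oom_events`) and the `teams_to_page` helper are assumptions made for illustration, and a real setup would express this in its alerting system rather than in application code.

```python
from typing import Callable, NamedTuple


class AlertRule(NamedTuple):
    signal: str                        # metric key (hypothetical naming)
    breached: Callable[[float], bool]  # threshold test for that metric
    pages: str                         # contact taken from the interface document


# Rules mirror the Alerts section of the example interface document.
RULES = [
    AlertRule("latency_p99_ms", lambda v: v > 200, "sre-ai-oncall"),
    AlertRule("accuracy_drift_pct", lambda v: v > 5, "fraud-detection-squad"),
    AlertRule("gpu_oom_events", lambda v: v > 0, "platform-gpu-team"),
]


def teams_to_page(metrics: dict[str, float]) -> list[str]:
    """Return the contacts to page for the given metric snapshot, in rule order."""
    return [
        rule.pages
        for rule in RULES
        if rule.signal in metrics and rule.breached(metrics[rule.signal])
    ]


# Hypothetical snapshot: latency is over budget, everything else is healthy.
paged = teams_to_page(
    {"latency_p99_ms": 250.0, "accuracy_drift_pct": 1.0, "gpu_oom_events": 0}
)
```

Here only `sre-ai-oncall` is paged, which is exactly the answer to "who pages at 2 AM" that the interface document is meant to settle in advance.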
Joint accountability, explicit boundaries
The goal is not to eliminate boundaries between teams. Boundaries are healthy — they create focus and accountability. The goal is to make those boundaries explicit so that nothing falls into the gaps.
When the contract is clear, teams move faster because they know exactly what they own, what they can expect from others, and where to go when something goes wrong.