Part 10 of a 10-part series on running AI workloads on Kubernetes in production.
The biggest risk is building too much too early
Teams starting their AI platform journey face a paradox: the decisions they make early are the hardest to reverse, yet that is when they have the least information to make them well.
The solution is not to plan for six months. It is to be pragmatic: build the minimum path to production, learn from it, then expand.
Here is the 90-day roadmap.
Days 1-30: Define the platform contract
This is not about technology. It is about alignment.
Pick the initial use cases
Do not try to support every AI workload from day one. Pick 1-2 real use cases that:
- Have a clear business sponsor
- Are representative of your future workload mix
- Have a team ready to work with the platform team
- Are not so critical that failure is career-ending
Define tenants
Who will use this platform? Even if it is just one team initially, define the tenancy model as if there will be ten teams. It is much easier to relax isolation than to add it later.
Agree on success metrics
What does "done" look like at 90 days? I suggest:
- One workload running in production on the platform
- Cost visibility at the tenant level
- Defined SLOs with automated measurement
- Clear ownership model between platform, SRE, and ML teams
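To make "defined SLOs with automated measurement" concrete, here is a minimal sketch of a latency SLO check. Everything in it is illustrative: the `LatencySLO` class, the `slo_report` function, and the 500 ms / 99% numbers are hypothetical, and in a real platform the samples would come from your metrics system rather than a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical SLO definition: a request is 'good' if it is fast enough."""
    name: str
    threshold_ms: float  # latency at or below this counts as good
    target: float        # required fraction of good requests, e.g. 0.99


def slo_report(slo: LatencySLO, latencies_ms: list[float]) -> dict:
    """Compare observed latencies with the SLO target."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    good = sum(1 for ms in latencies_ms if ms <= slo.threshold_ms)
    compliance = good / len(latencies_ms)
    return {
        "slo": slo.name,
        "compliance": compliance,
        "target": slo.target,
        "met": compliance >= slo.target,
    }


if __name__ == "__main__":
    slo = LatencySLO(name="inference-latency", threshold_ms=500, target=0.99)
    samples = [120.0] * 97 + [650.0] * 3  # made-up sample data
    print(slo_report(slo, samples))
```

The point of writing the SLO down as data is that "done" becomes a yes/no answer the pipeline can compute, not a debate in a review meeting.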
Set non-negotiables
Before touching any technology, agree on:
- Security boundaries: what isolation level is required
- Quotas: what each tenant gets, guaranteed and burstable
- Environments: dev, staging, production separation
- Compliance requirements: audit logging, access controls, data residency
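One way to keep the quota part of the contract honest is to write it down as data and check it against cluster capacity. The sketch below assumes a simple model in which each tenant has a guaranteed GPU count and a burst ceiling that includes the guaranteed share. The `TenantQuota` and `validate_quotas` names and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantQuota:
    tenant: str
    guaranteed_gpus: int  # always available to the tenant
    burst_gpus: int       # ceiling when spare capacity exists (includes guaranteed)


def validate_quotas(quotas: list[TenantQuota], cluster_gpus: int) -> list[str]:
    """Return a list of contract problems; an empty list means the quotas are consistent."""
    problems = []
    total_guaranteed = sum(q.guaranteed_gpus for q in quotas)
    if total_guaranteed > cluster_gpus:
        problems.append(
            f"guaranteed GPUs ({total_guaranteed}) exceed cluster capacity ({cluster_gpus})"
        )
    for q in quotas:
        if q.burst_gpus < q.guaranteed_gpus:
            problems.append(f"{q.tenant}: burst ceiling is below the guaranteed share")
        if q.burst_gpus > cluster_gpus:
            problems.append(f"{q.tenant}: burst ceiling exceeds cluster capacity")
    return problems


if __name__ == "__main__":
    quotas = [
        TenantQuota("search-ranking", guaranteed_gpus=4, burst_gpus=8),
        TenantQuota("fraud-detection", guaranteed_gpus=2, burst_gpus=4),
    ]
    print(validate_quotas(quotas, cluster_gpus=8) or "quota contract is consistent")
```

Guaranteed shares must fit within capacity, while burst ceilings are allowed to overlap because they only apply when capacity is idle.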
Days 31-60: Build the paved road
Now build, but only what the first use case needs.
Cluster baseline
- Node pools with GPU scheduling configured
- GPU partitioning strategy selected and implemented
- Network policies and security controls in place
- RBAC configured per the tenancy model
CI/CD path
- Model artifact registry (or container registry with model artifacts)
- Deployment pipeline from artifact to staging to production
- Rollback procedure tested and documented
- Promotion gates defined (automated tests, manual approval, or both)
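A promotion gate can start as a very small piece of logic: collect the result of each check and block promotion if any required check failed. The following is a sketch under that assumption. `GateResult`, `can_promote`, and the gate names are hypothetical and not tied to any particular CI/CD tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    required: bool = True  # optional gates are reported but never block


def can_promote(results: list[GateResult]) -> tuple[bool, list[str]]:
    """Return (allowed, names of required gates that failed)."""
    blocking = [r.name for r in results if r.required and not r.passed]
    return (len(blocking) == 0, blocking)


if __name__ == "__main__":
    results = [
        GateResult("automated-tests", passed=True),
        GateResult("manual-approval", passed=False),
        GateResult("load-test", passed=False, required=False),
    ]
    allowed, blocking = can_promote(results)
    print("promote" if allowed else f"blocked by: {', '.join(blocking)}")
```

Modeling manual approval as just another gate keeps the "automated tests, manual approval, or both" decision in configuration, not in pipeline code.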
Observability baseline
- Three-layer observability: platform signals, ML runtime signals, outcome signals
- Dashboards for the pilot team
- Alerting on SLO violations
- Cost tracking per tenant
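Per-tenant cost tracking does not need a dedicated tool on day one. As a rough sketch, assuming you can export GPU-hours per tenant from your scheduler or metrics and that a single blended hourly rate is good enough to start with, the aggregation looks like this. The record format, the tenant names, and the rate are all made up for illustration.

```python
from collections import defaultdict


def cost_per_tenant(usage_records: list[dict], gpu_hour_rate: float) -> dict[str, float]:
    """Sum GPU-hours per tenant and convert them to cost with a flat hourly rate."""
    totals: dict[str, float] = defaultdict(float)
    for record in usage_records:
        totals[record["tenant"]] += record["gpu_hours"] * gpu_hour_rate
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical usage export: one record per workload per day.
    usage = [
        {"tenant": "search-ranking", "gpu_hours": 10.0},
        {"tenant": "search-ranking", "gpu_hours": 5.0},
        {"tenant": "fraud-detection", "gpu_hours": 3.0},
    ]
    for tenant, cost in cost_per_tenant(usage, gpu_hour_rate=2.0).items():
        print(f"{tenant}: {cost:.2f}")
```

A flat rate ignores storage, networking, and idle capacity, so treat the result as a first approximation that makes tenants visible, not as an invoice.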
Standard workload templates
- Helm charts or Kustomize overlays for common patterns (training job, inference deployment, batch pipeline)
- Resource request templates with sensible defaults
- Documentation that a new team can follow without hand-holding
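As an illustration of "resource request templates with sensible defaults", the sketch below generates the `resources` section of a Kubernetes container spec from a workload pattern. It assumes NVIDIA GPUs exposed through the device plugin's `nvidia.com/gpu` resource name. The `DEFAULTS` table, its values, and the `container_resources` helper are hypothetical and should be tuned to your own hardware.

```python
# Hypothetical defaults per workload pattern; the numbers are placeholders.
DEFAULTS = {
    "inference": {"cpu": "2", "memory": "8Gi", "gpus": 1},
    "training": {"cpu": "8", "memory": "64Gi", "gpus": 4},
    "batch": {"cpu": "4", "memory": "16Gi", "gpus": 0},
}


def container_resources(pattern: str, **overrides) -> dict:
    """Build the `resources` section of a container spec for a workload pattern."""
    if pattern not in DEFAULTS:
        raise ValueError(f"unknown pattern: {pattern}")
    unknown = set(overrides) - {"cpu", "memory", "gpus"}
    if unknown:
        raise ValueError(f"unknown overrides: {sorted(unknown)}")
    spec = {**DEFAULTS[pattern], **overrides}
    resources = {
        "requests": {"cpu": spec["cpu"], "memory": spec["memory"]},
        "limits": {"memory": spec["memory"]},
    }
    if spec["gpus"] > 0:
        # Extended resources such as GPUs cannot be overcommitted,
        # so the request and the limit are set to the same value.
        resources["requests"]["nvidia.com/gpu"] = spec["gpus"]
        resources["limits"]["nvidia.com/gpu"] = spec["gpus"]
    return resources


if __name__ == "__main__":
    print(container_resources("inference"))
    print(container_resources("training", gpus=2))
```

The same defaults could live in Helm values or a Kustomize overlay; the idea is that a new team picks a pattern and overrides only what it must.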
Days 61-90: Production-proof one real workload
This is where the rubber meets the road. Take one real workload end-to-end and force the organization to deal with production reality.
Deploy to production
Not staging. Not a "production-like" environment. Actual production, serving actual users or processing actual data.
Test failure modes
- What happens when the GPU node goes down?
- What happens when inference latency exceeds the SLO?
- What happens when the model needs a rollback?
- What happens when the tenant exhausts their quota?
Resolve ownership gaps
Production will expose every gap in the team contract. Document what broke, who was confused, and what the fix is. Update the contract.
Measure costs
At 90 days, you should be able to answer:
- What does this workload cost per month?
- What is the cost per inference/training run?
- Where is the waste?
- What would it cost at 10x scale?
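These four questions reduce to simple arithmetic once the monthly bill, the request volume, and average GPU utilization are known. The sketch below assumes that idle GPU time is the main source of waste and that cost scales linearly with load. Both are simplifications, and the function name and input numbers are invented for illustration.

```python
def cost_summary(
    monthly_cost: float,
    inferences_per_month: int,
    avg_gpu_utilization: float,  # between 0.0 and 1.0
    scale_factor: int = 10,
) -> dict:
    """Answer the four cost questions with deliberately naive arithmetic."""
    if inferences_per_month <= 0:
        raise ValueError("inferences_per_month must be positive")
    if not 0.0 <= avg_gpu_utilization <= 1.0:
        raise ValueError("avg_gpu_utilization must be between 0 and 1")
    return {
        "monthly_cost": monthly_cost,
        "cost_per_inference": monthly_cost / inferences_per_month,
        # Assumption: spend on idle GPU time is treated as waste.
        "estimated_waste": monthly_cost * (1.0 - avg_gpu_utilization),
        # Assumption: linear scaling, with no volume discounts or efficiency gains.
        "naive_cost_at_scale": monthly_cost * scale_factor,
    }


if __name__ == "__main__":
    summary = cost_summary(
        monthly_cost=10_000.0,
        inferences_per_month=2_000_000,
        avg_gpu_utilization=0.5,
    )
    for key, value in summary.items():
        print(f"{key}: {value}")
```

The linear projection is only a starting point: batching, better bin-packing, or reserved capacity can all change the number, but having any number at all is what makes capacity planning possible.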
After 90 days: Expand
With one workload in production, you have:
- A proven platform path that works
- Real cost data to plan capacity
- A team contract tested under pressure
- Documentation and templates for the next team
Now expand: onboard the second team, add the second use case, refine the platform based on what you learned.
What not to do
- Do not build a multi-cloud AI platform in 90 days: start with one cluster
- Do not support every ML framework: pick one inference server and standardize
- Do not automate everything: automate the critical path; manual steps are fine for edge cases initially
- Do not skip the contract: technology without organizational alignment is just expensive experimentation
- Do not wait for perfect: a working platform with known limitations beats a perfect design that never ships
The series in review
Over this 10-part series, we covered the full stack of running AI workloads on Kubernetes in production:
- What Breaks First: PoC to Production
- Biggest Mistakes Scaling AI/ML
- Multi-Tenant GPU Platform Operating Model
- Autoscaling AI Inference
- AI Observability: Three Layers
- Hidden Cost Drivers
- Platform + SRE + ML Team Contract
- Architecture Decisions Hardest to Reverse
- KubeCon 2026: AI Industrialization
- AI Platform First 90 Days (this post)
The throughline: the technology works. What teams need is the operating model, the team contracts, and the discipline to start pragmatically and expand from production experience.
This is the final post in the series.