AI Platform First 90 Days on Kubernetes: A Pragmatic Roadmap

Part 10 of a 10-part series on running AI workloads on Kubernetes in production.

The biggest risk is building too much too early

Teams starting their AI platform journey face a paradox: the decisions they make early are the hardest to reverse, yet they make them with the least information.

The solution is not to plan for six months. It is to be pragmatic: build the minimum path to production, learn from it, then expand.

Here is the 90-day roadmap.

Days 1-30: Define the platform contract

This is not about technology. It is about alignment.

Pick the initial use cases

Do not try to support every AI workload from day one. Pick 1-2 real use cases that:

Have a clear business sponsor
Are representative of your future workload mix
Have a team ready to work with the platform team
Are not so critical that failure is career-ending

Define tenants

Who will use this platform? Even if it is just one team initially, define the tenancy model as if there will be ten teams. It is much easier to relax isolation than to add it later.

Agree on success metrics

What does “done” look like at 90 days? I suggest:

One workload running in production on the platform
Cost visibility at the tenant level
Defined SLOs with automated measurement
Clear ownership model between platform, SRE, and ML teams

Set non-negotiables

Before touching any technology, agree on:

Security boundaries — what isolation level is required
Quotas — what each tenant gets, guaranteed and burstable
Environments — dev, staging, production separation
Compliance requirements — audit logging, access controls, data residency
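To make the quota item concrete, here is a minimal sketch of how "guaranteed and burstable" could be written down as a Kubernetes ResourceQuota. It assumes one namespace per tenant and NVIDIA GPUs exposed as the nvidia.com/gpu resource; the tenant name and all numbers are hypothetical. The helper tenant_quota is my own illustration, not part of any library.

```python
import json


def tenant_quota(namespace: str, cpu_guaranteed: str, cpu_burst: str,
                 mem_guaranteed: str, mem_burst: str, gpus: int) -> dict:
    """Build a ResourceQuota manifest for one tenant namespace.

    Guaranteed capacity is capped via `requests.*`, burst via `limits.*`.
    GPUs are an extended resource, which Kubernetes only quotas through the
    `requests.` prefix, so the GPU figure is a hard allocation with no burst.
    """
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu_guaranteed,
                "limits.cpu": cpu_burst,
                "requests.memory": mem_guaranteed,
                "limits.memory": mem_burst,
                "requests.nvidia.com/gpu": str(gpus),
            }
        },
    }


# Hypothetical tenant and numbers, for illustration only.
quota = tenant_quota("team-vision", "32", "64", "128Gi", "256Gi", gpus=4)
print(json.dumps(quota, indent=2))
```

The JSON output can be applied with kubectl like any other manifest. Agreeing on these numbers in the first 30 days matters more than the format you store them in.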

Days 31-60: Build the paved road

Now build — but only what the first use case needs.

Cluster baseline

Node pools with GPU scheduling configured
GPU partitioning strategy selected and implemented
Network policies and security controls in place
RBAC configured per the tenancy model
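As a rough sketch of the last two items, a tenant namespace could start from a default-deny NetworkPolicy and a namespace-scoped RoleBinding. This assumes one namespace per tenant and a CNI plugin that enforces NetworkPolicy; the function names, namespace, and group name are hypothetical.

```python
import json


def default_deny(namespace: str) -> dict:
    """NetworkPolicy that selects every pod in the namespace and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # an empty selector matches all pods in the namespace
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def tenant_edit_binding(namespace: str, group: str) -> dict:
    """RoleBinding granting a group the built-in `edit` ClusterRole in one namespace only."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "tenant-edit", "namespace": namespace},
        "subjects": [
            {"kind": "Group", "name": group, "apiGroup": "rbac.authorization.k8s.io"}
        ],
        "roleRef": {
            "kind": "ClusterRole",
            "name": "edit",
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }


# Hypothetical tenant namespace and identity-provider group.
policy = default_deny("team-vision")
binding = tenant_edit_binding("team-vision", "ml-vision-engineers")
print(json.dumps([policy, binding], indent=2))
```

A policy this strict also blocks DNS egress, so in practice you would add explicit allow rules on top of it. Starting from deny-all follows the earlier point that isolation is easier to relax than to add.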

CI/CD path

Model artifact registry (or container registry with model artifacts)
Deployment pipeline from artifact to staging to production
Rollback procedure tested and documented
Promotion gates defined (automated tests, manual approval, or both)
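Promotion gates and rollback are easier to agree on once they are written as explicit rules. The sketch below is one possible policy, not a recommendation: staging requires green automated tests, production additionally requires a named approver. All names (Candidate, can_promote, rollback_target, the artifact tags) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    artifact: str               # e.g. an image digest or model version tag
    tests_passed: bool          # outcome of the automated test stage
    approved_by: Optional[str]  # human approver, or None if nobody signed off


def can_promote(candidate: Candidate, target: str) -> bool:
    """Staging needs green tests; production also needs a named approver."""
    if target not in ("staging", "production"):
        raise ValueError(f"unknown environment: {target}")
    if not candidate.tests_passed:
        return False
    if target == "production":
        return candidate.approved_by is not None
    return True


def rollback_target(history: List[str]) -> Optional[str]:
    """Return the artifact deployed before the current one, if there is one."""
    return history[-2] if len(history) >= 2 else None


# Hypothetical artifact names, for illustration only.
rc = Candidate(artifact="fraud-model:1.4.0", tests_passed=True, approved_by=None)
print(can_promote(rc, "staging"))      # True
print(can_promote(rc, "production"))   # False: no approver yet
print(rollback_target(["fraud-model:1.3.2", "fraud-model:1.4.0"]))  # fraud-model:1.3.2
```

Whatever tool ends up enforcing the gate, the point is that the rule is decided before the first production deploy, and the rollback target is known without anyone having to search for it.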

Observability baseline

Three-layer observability: platform signals, ML runtime signals, outcome signals
Dashboards for the pilot team
Alerting on SLO violations
Cost tracking per tenant
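For alerting on SLO violations, one common approach is to track how much of the error budget is left rather than alerting on individual bad requests. The sketch below shows the arithmetic only. It assumes you already count "good" and total events per window (for example, requests answered under a latency threshold); the function names and numbers are hypothetical.

```python
def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the window's error budget still unspent (negative once overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic in the window, so nothing was spent
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def should_alert(good: int, total: int, slo_target: float,
                 min_remaining: float = 0.0) -> bool:
    """Alert once the remaining budget drops to or below the chosen floor."""
    return error_budget_remaining(good, total, slo_target) <= min_remaining


# Hypothetical window: 100,000 requests against a 99.9% target.
# 50 bad requests out of an allowed 100 leaves about half the budget.
print(round(error_budget_remaining(99_950, 100_000, 0.999), 3))
# 200 bad requests overspends the budget, so this alerts.
print(should_alert(99_800, 100_000, 0.999))
```

In a real setup this calculation would usually live in your monitoring system's rule language rather than in application code. The sketch is only meant to pin down what "SLO violation" means numerically.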

Standard workload templates

Helm charts or Kustomize overlays for common patterns (training job, inference deployment, batch pipeline)
Resource request templates with sensible defaults
Documentation that a new team can follow without hand-holding
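To illustrate what "resource request templates with sensible defaults" could look like, here is a minimal sketch that renders an inference Deployment from a handful of parameters. In practice this logic would more likely live in a Helm chart or Kustomize overlay. The defaults, names, and registry URL are hypothetical, and the nvidia.com/gpu resource name assumes the NVIDIA device plugin.

```python
import json

# Hypothetical defaults; tune them from what the pilot workload actually uses.
DEFAULTS = {"replicas": 2, "cpu": "4", "memory": "16Gi", "gpus": 1}


def inference_deployment(name: str, namespace: str, image: str, **overrides) -> dict:
    """Render a Deployment manifest, letting teams override only known settings."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unsupported overrides: {sorted(unknown)}")
    cfg = {**DEFAULTS, **overrides}
    labels = {"app": name}
    container = {
        "name": "server",
        "image": image,
        "resources": {
            "requests": {"cpu": cfg["cpu"], "memory": cfg["memory"]},
            # GPUs go under limits; Kubernetes uses the limit as the request.
            "limits": {"memory": cfg["memory"], "nvidia.com/gpu": cfg["gpus"]},
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": cfg["replicas"],
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


manifest = inference_deployment(
    "sentiment-api", "team-nlp", "registry.example.com/sentiment:1.0.0", gpus=2
)
print(json.dumps(manifest, indent=2))
```

Rejecting unknown overrides is a deliberate choice in this sketch: a narrow template keeps the paved road narrow, and every new knob becomes an explicit platform decision.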

Days 61-90: Production-proof one real workload

This is where the rubber meets the road. Take one real workload end-to-end and force the organization to deal with production reality.

Deploy to production

Not staging. Not a “production-like” environment. Actual production, serving actual users or processing actual data.

Test failure modes

What happens when the GPU node goes down?
What happens when inference latency exceeds the SLO?
What happens when the model needs a rollback?
What happens when the tenant exhausts their quota?
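Some of these questions can be checked on paper before you run the actual drill. As one example, the sketch below takes a replica placement and reports whether losing each single GPU node would leave enough replicas to keep serving. It only models the moment right after the failure and ignores rescheduling; the node names, placement, and minimum are hypothetical.

```python
from typing import Dict


def survives_node_loss(replicas_per_node: Dict[str, int],
                       min_replicas: int) -> Dict[str, bool]:
    """For each node, report whether enough replicas remain if that node is lost."""
    total = sum(replicas_per_node.values())
    return {
        node: (total - count) >= min_replicas
        for node, count in replicas_per_node.items()
    }


# Hypothetical placement: three replicas packed on one GPU node, one on another.
placement = {"gpu-node-a": 3, "gpu-node-b": 1}
result = survives_node_loss(placement, min_replicas=2)
print(result)  # {'gpu-node-a': False, 'gpu-node-b': True}
```

A False here does not replace the real test. It tells you which node to pull first when you run it.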

Resolve ownership gaps

Production will expose every gap in the team contract. Document what broke, who was confused, and what the fix is. Update the contract.

Measure costs

At 90 days, you should be able to answer:

What does this workload cost per month?
What is the cost per inference request or per training run?
Where is the waste?
What would it cost at 10x scale?
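These four questions reduce to simple arithmetic once you have a few inputs. The sketch below is a back-of-the-envelope model, not a billing tool: it counts GPU cost only, treats idle GPU time as waste, and assumes GPU demand grows linearly with traffic, which is a simplification. Every number and name in it is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class WorkloadCost:
    gpu_hourly_rate: float      # cost of one GPU for one hour, in your currency
    gpus: int                   # GPUs currently allocated to the workload
    hours_per_month: float      # hours the GPUs are held per month
    avg_gpu_utilization: float  # measured average, between 0.0 and 1.0
    inferences_per_month: int

    def monthly_cost(self) -> float:
        return self.gpu_hourly_rate * self.gpus * self.hours_per_month

    def cost_per_inference(self) -> float:
        return self.monthly_cost() / self.inferences_per_month

    def idle_waste(self) -> float:
        """Share of the monthly bill spent on GPU time that did no work."""
        return self.monthly_cost() * (1.0 - self.avg_gpu_utilization)

    def projected_monthly_cost(self, factor: float, target_utilization: float) -> float:
        """Cost at `factor` times the traffic if GPUs ran at `target_utilization`.

        Assumes GPU demand scales linearly with traffic.
        """
        busy_gpus = self.gpus * self.avg_gpu_utilization * factor
        gpus_needed = math.ceil(round(busy_gpus / target_utilization, 6))
        return self.gpu_hourly_rate * gpus_needed * self.hours_per_month


# Hypothetical pilot workload.
cost = WorkloadCost(gpu_hourly_rate=2.50, gpus=4, hours_per_month=730,
                    avg_gpu_utilization=0.25, inferences_per_month=12_000_000)
print(cost.monthly_cost())                    # 7300.0
print(cost.idle_waste())                      # 5475.0
print(cost.projected_monthly_cost(10, 0.5))   # 36500.0
print(f"{cost.cost_per_inference():.6f}")
```

Note what the example numbers show: at 25% utilization, 10x the traffic costs about 5x as much, not 10x, if utilization can be doubled along the way. That is the kind of answer real cost data makes possible.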

After 90 days: Expand

With one workload in production, you have:

A proven platform path that works
Real cost data to plan capacity
A team contract tested under pressure
Documentation and templates for the next team

Now expand: onboard the second team, add the second use case, refine the platform based on what you learned.

What not to do

Do not build a multi-cloud AI platform in 90 days — start with one cluster
Do not support every ML framework — pick one inference server and standardize
Do not automate everything; automate the critical path and leave edge cases manual at first
Do not skip the contract — technology without organizational alignment is just expensive experimentation
Do not wait for perfect — a working platform with known limitations beats a perfect design that never ships

The series in review

Over this 10-part series, we covered the full stack of running AI workloads on Kubernetes in production.

The throughline: the technology works. What teams need is the operating model, the team contracts, and the discipline to start pragmatically and expand from production experience.

This is the final post in the series. Check out my KubeCon 2026 talk slides on multi-tenant GPUs on bare metal.