Part 8 of a 10-part series on running AI workloads on Kubernetes in production.
Tools are replaceable. Architecture is not.
You can swap Prometheus for Datadog. You can migrate from Jenkins to Argo. You can replace your inference server with a better one. Tool changes are painful but achievable.
What you cannot easily reverse are the decisions that shape how teams consume the platform every day. These decisions get embedded into pipelines, team habits, governance processes, and organizational expectations. Unwinding them means unwinding everything built on top.
The five hardest decisions to reverse
1. Tenancy and isolation model
How you define tenants (team-level, project-level, or environment-level) determines:
- How resources are allocated and accounted for
- How security boundaries are enforced
- How teams interact with each other on shared infrastructure
- How cost attribution works
If you start with a flat namespace model and later need strong isolation, you are migrating every team, every pipeline, every RBAC rule, and every monitoring configuration. If you start with over-isolation and later need sharing, you are fighting the security boundaries you built.
Get this right early: Start with one tenant model, but design the abstraction so you can refine boundaries without rebuilding the platform.
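One way to keep tenant boundaries refinable is to have pipelines ask the platform for a namespace instead of hard-coding one. The sketch below is a minimal illustration of that idea, not a prescribed design: the `Tenant` type, the `TenancyModel` options, and the naming scheme are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class TenancyModel(Enum):
    """Hypothetical granularities at which tenant boundaries can be drawn."""
    TEAM = "team"
    PROJECT = "project"
    ENVIRONMENT = "environment"


@dataclass(frozen=True)
class Tenant:
    team: str
    project: str
    environment: str


def namespace_for(tenant: Tenant, model: TenancyModel) -> str:
    """Resolve the namespace name for a tenant under the active model.

    Consumers call this instead of hard-coding names, so the choice of
    boundary lives in one place. The naming scheme is an assumption.
    """
    if model is TenancyModel.TEAM:
        parts = [tenant.team]
    elif model is TenancyModel.PROJECT:
        parts = [tenant.team, tenant.project]
    else:
        parts = [tenant.team, tenant.project, tenant.environment]
    return "-".join(parts)


tenant = Tenant(team="fraud", project="scoring", environment="prod")
print(namespace_for(tenant, TenancyModel.TEAM))
print(namespace_for(tenant, TenancyModel.ENVIRONMENT))
```

Refining the boundary still requires migrating existing workloads, but RBAC templates, quotas, and dashboards that derive their names from one resolver have far fewer places to change.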
2. GPU partitioning and scheduling
How you partition GPUs (MIG, MPS, time-slicing, or dedicated devices) becomes embedded in every workload's resource request, every scheduling rule, and every capacity planning model.
Switching from time-slicing to MIG later means:
- Reconfiguring every GPU node
- Updating every workload's resource requests
- Rewriting scheduling policies
- Re-baselining all capacity and cost models
Get this right early: Understand your workload mix (training vs. inference, batch vs. interactive) and choose the partitioning strategy that fits the majority. Plan for exceptions rather than trying to support everything from day one.
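To limit how deeply the partitioning choice leaks into workloads, teams can request a logical GPU class and let the platform translate it into concrete resource names. The following is a rough sketch under stated assumptions: the class names and the mapping are hypothetical, and the resource names (`nvidia.com/gpu`, `nvidia.com/mig-1g.5gb`) follow the NVIDIA device plugin's naming conventions but depend on how a given cluster is configured.

```python
from typing import Dict

# Hypothetical logical GPU classes that workloads ask for.
# The resource names follow NVIDIA device plugin conventions, but what a
# cluster actually advertises depends on its configuration, so treat
# these values as assumptions.
GPU_CLASSES: Dict[str, Dict[str, str]] = {
    "full": {"nvidia.com/gpu": "1"},
    "small-inference": {"nvidia.com/mig-1g.5gb": "1"},
}


def resource_limits(gpu_class: str) -> Dict[str, str]:
    """Translate a logical GPU class into container resource limits."""
    if gpu_class not in GPU_CLASSES:
        raise ValueError(f"unknown GPU class: {gpu_class!r}")
    # Return a copy so callers cannot mutate the shared mapping.
    return dict(GPU_CLASSES[gpu_class])


def container_spec(name: str, image: str, gpu_class: str) -> dict:
    """Build a minimal container spec fragment as a plain dictionary."""
    return {
        "name": name,
        "image": image,
        "resources": {"limits": resource_limits(gpu_class)},
    }


spec = container_spec("ranker", "registry.example/ranker:1.0", "small-inference")
print(spec["resources"]["limits"])
```

If the platform later moves from one partitioning strategy to another, the mapping changes in one place while workload definitions keep requesting the same class names.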
3. The data plane
Storage, feature access, and artifact movement form the data plane. This includes:
- Where models are stored and how they are served
- Where training data lives and how pipelines access it
- How features are computed, cached, and served
- How checkpoints and experiment artifacts are managed
The data plane touches every workload, every pipeline, and every team. Changing the storage backend, the artifact registry, or the feature store later means migrating data, rewriting pipelines, and retraining teams.
Get this right early: Invest in a clear data architecture from day one. Even if you start simple, make the abstraction layers clean so you can evolve the implementation without breaking consumers.
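A clean abstraction layer here can be as small as an interface that pipelines depend on instead of a concrete storage client. The sketch below assumes a hypothetical `ArtifactStore` interface and key layout; the in-memory implementation stands in for whatever backend is actually used.

```python
from typing import Dict, Protocol


class ArtifactStore(Protocol):
    """Hypothetical interface that pipelines code against."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryArtifactStore:
    """Simple stand-in backend; an object store client could replace it."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def save_checkpoint(store: ArtifactStore, run_id: str, step: int, weights: bytes) -> str:
    """Write a checkpoint through the interface and return its key.

    The key layout is an assumption made for this example.
    """
    key = f"checkpoints/{run_id}/step-{step:08d}"
    store.put(key, weights)
    return key


store = InMemoryArtifactStore()
key = save_checkpoint(store, "run-1", 42, b"weights")
print(key)
```

Because `save_checkpoint` only knows about `put` and `get`, swapping the backend means writing one new class rather than rewriting every pipeline, although the stored data itself still has to be migrated.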
4. CI/CD and model promotion
How models move from experiment to staging to production defines the developer experience:
- Manual promotion vs. automated pipelines
- Approval gates and who can override them
- Rollback procedures and speed
- Canary/shadow deployment patterns
Teams build muscle memory around these workflows. Changing them later means retraining everyone, migrating every pipeline, and rebuilding trust in the promotion process.
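The promotion path described above can be pictured as a small state machine with approval gates and a rollback step. This is a simplified sketch, not a real deployment tool: the stage names, the `REQUIRES_APPROVAL` gate, and the `promote` and `rollback` helpers are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class Stage(Enum):
    EXPERIMENT = 1
    STAGING = 2
    PRODUCTION = 3


# Hypothetical gate: stages that need a named approver to enter.
REQUIRES_APPROVAL: Set[Stage] = {Stage.PRODUCTION}


@dataclass
class ModelVersion:
    name: str
    version: str
    stage: Stage = Stage.EXPERIMENT
    history: List[Stage] = field(default_factory=list)


def promote(model: ModelVersion, approved_by: str = "") -> ModelVersion:
    """Move a model one stage forward, enforcing the approval gate."""
    if model.stage is Stage.PRODUCTION:
        raise ValueError("model is already in production")
    target = Stage(model.stage.value + 1)
    if target in REQUIRES_APPROVAL and not approved_by:
        raise PermissionError(f"promotion to {target.name} needs an approver")
    model.history.append(model.stage)
    model.stage = target
    return model


def rollback(model: ModelVersion) -> ModelVersion:
    """Return the model to the stage it held before the last promotion."""
    if not model.history:
        raise ValueError("no earlier stage to roll back to")
    model.stage = model.history.pop()
    return model


model = ModelVersion(name="fraud-scorer", version="1.0")
promote(model)
promote(model, approved_by="alice")
print(model.stage.name)
```

Whatever form the real pipeline takes, writing down the stages, the gates, and who may override them makes the workflow explicit before teams build habits around it.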
5. Identity, policy, and audit architecture
Authentication, authorization, policy enforcement, and audit logging are the foundation of governance:
- How users authenticate to the platform
- How access is controlled (RBAC, ABAC, or a policy engine such as OPA)
- What policies are enforced at admission time
- What audit trail exists for compliance
These decisions are especially hard to reverse in regulated industries. Financial services, healthcare, and government environments require audit trails that are continuous and complete. If you bolt on governance later, you have a gap in your audit history that regulators will question.
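One technique that makes gaps in an audit trail detectable is to chain each record to the hash of the one before it. The sketch below illustrates that idea only; it is an assumption about how such a log could be built, not a description of any particular compliance product, and the event fields are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

# Placeholder "previous hash" for the first entry in the chain.
GENESIS = "0" * 64


def _digest(prev_hash: str, event: Dict[str, str]) -> str:
    """Hash an event together with the hash of the preceding entry."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


def append_event(log: List[dict], event: Dict[str, str]) -> None:
    """Append an event, linking it to the current tail of the log."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify(log: List[dict]) -> bool:
    """Return True if no entry was altered or removed from the middle."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log: List[dict] = []
append_event(log, {"user": "alice", "action": "deploy", "model": "fraud-scorer"})
append_event(log, {"user": "bob", "action": "approve", "model": "fraud-scorer"})
print(verify(log))
```

A chain like this detects edits and deletions inside the log, but not truncation at the tail or records that were never written, so it complements rather than replaces enabling audit logging from the first day.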
How to make these decisions
For each of these five areas, I recommend:
- Document the decision: not just what you chose, but what you considered and why you rejected alternatives.
- Define the blast radius: if this decision is wrong, what has to change?
- Build abstraction layers: even if the implementation is simple, keep the interface clean so you can swap implementations later.
- Get cross-team input: these decisions affect platform, SRE, ML, and security teams. No single team should make them alone.
- Time-box the decision: do not spend months deciding. Make the best decision you can with current information and invest in the ability to evolve.
The meta-lesson
The hardest part of architecture is knowing which decisions matter. Most decisions can be reversed cheaply: choose a tool, try it, and swap it if it does not work. But the five listed above create structural constraints that compound over time. Invest your best thinking, your best engineers, and your cross-functional alignment effort in these five areas. For everything else, move fast and iterate.
Next: KubeCon 2026: AI on Kubernetes Has Reached Industrialization. Previous: Platform + SRE + ML Team Contract.