The Architecture Decisions That Are Hardest to Reverse

Part 8 of a 10-part series on running AI workloads on Kubernetes in production.

Tools are replaceable. Architecture is not.

You can swap Prometheus for Datadog. You can migrate from Jenkins to Argo. You can replace your inference server with a better one. Tool changes are painful but achievable.

What you cannot easily reverse are the decisions that shape how teams consume the platform every day. These decisions get embedded into pipelines, team habits, governance processes, and organizational expectations. Unwinding them means unwinding everything built on top.

The five hardest decisions to reverse

1. Tenancy and isolation model

How you define tenants — team-level, project-level, environment-level — determines:

How resources are allocated and accounted for
How security boundaries are enforced
How teams interact with each other on shared infrastructure
How cost attribution works

If you start with a flat namespace model and later need strong isolation, you are migrating every team, every pipeline, every RBAC rule, and every monitoring configuration. If you start with over-isolation and later need sharing, you are fighting the security boundaries you built.

Get this right early: Start with one tenant model, but design the abstraction so you can refine boundaries without rebuilding the platform.
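
One way to keep tenant boundaries refinable is to have pipelines ask a small resolver for their namespace instead of hard-coding it. The sketch below is illustrative only: `Tenant`, `TenancyModel`, `namespace_for`, and the naming scheme are hypothetical names I am assuming for the example, not part of any Kubernetes API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class TenancyModel(Enum):
    """Hypothetical tenant granularities; the names are illustrative."""
    TEAM = "team"
    PROJECT = "project"
    ENVIRONMENT = "environment"


@dataclass(frozen=True)
class Tenant:
    team: str
    project: str
    environment: str


def namespace_for(tenant: Tenant, model: TenancyModel) -> str:
    """Map a tenant to a namespace name under the chosen tenancy model.

    Callers never build namespace names themselves, so tightening the
    boundary (team -> project -> environment) is a change in one place.
    """
    parts = [tenant.team]
    if model in (TenancyModel.PROJECT, TenancyModel.ENVIRONMENT):
        parts.append(tenant.project)
    if model is TenancyModel.ENVIRONMENT:
        parts.append(tenant.environment)
    return "-".join(parts)


def cost_labels(tenant: Tenant) -> Dict[str, str]:
    """Labels for cost attribution; identical under every tenancy model."""
    return {
        "team": tenant.team,
        "project": tenant.project,
        "environment": tenant.environment,
    }
```

This does not remove the migration work when boundaries change, but it confines the naming and labeling logic to one module instead of spreading it across every pipeline.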

2. GPU partitioning and scheduling

How you share or partition GPUs (MIG, MPS, time-slicing, or dedicated whole-GPU allocation) becomes embedded in every workload's resource requests, every scheduling rule, and every capacity planning model.

Switching from time-slicing to MIG later means:

Reconfiguring every GPU node
Updating every workload’s resource requests
Rewriting scheduling policies
Re-baselining all capacity and cost models

Get this right early: Understand your workload mix (training vs. inference, batch vs. interactive) and choose the partitioning strategy that fits the majority. Plan for exceptions rather than trying to support everything from day one.
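
To limit how far a partitioning choice leaks into workloads, one option is to let workloads name a platform-defined profile and keep the mapping to concrete resource names in a single table. This is a minimal sketch under assumptions: the profile names are hypothetical, and the resource names follow the NVIDIA device plugin's conventions (`nvidia.com/gpu`, and `nvidia.com/mig-1g.5gb` under the "mixed" MIG strategy), which you should verify against your own cluster.

```python
from typing import Dict

# Hypothetical platform-level profiles. Workloads reference a profile name;
# only these tables know which extended resource backs it.
TIME_SLICED: Dict[str, Dict[str, int]] = {
    "small-inference": {"nvidia.com/gpu": 1},  # assumed to be a time-sliced replica
    "large-training": {"nvidia.com/gpu": 1},   # assumed to land on dedicated nodes
}
MIG_BACKED: Dict[str, Dict[str, int]] = {
    "small-inference": {"nvidia.com/mig-1g.5gb": 1},  # assumed MIG profile name
    "large-training": {"nvidia.com/gpu": 1},
}


def container_resources(
    profile: str, table: Dict[str, Dict[str, int]]
) -> Dict[str, Dict[str, int]]:
    """Return the `resources` stanza for a container using the given profile."""
    if profile not in table:
        raise KeyError(f"unknown GPU profile: {profile!r}")
    limits = dict(table[profile])  # copy so callers cannot mutate the table
    # GPUs are extended resources: if requests are set, they must equal limits.
    return {"limits": limits, "requests": dict(limits)}
```

With this indirection, moving `small-inference` from time-slicing to MIG is a change to the table rather than to every workload manifest. Node reconfiguration and capacity re-baselining still have to happen.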

3. The data plane

Storage, feature access, and artifact movement form the data plane. This includes:

Where models are stored and how they are served
Where training data lives and how pipelines access it
How features are computed, cached, and served
How checkpoints and experiment artifacts are managed

The data plane touches every workload, every pipeline, and every team. Changing the storage backend, the artifact registry, or the feature store later means migrating data, rewriting pipelines, and retraining teams.

Get this right early: Invest in a clear data architecture from day one. Even if you start simple, make the abstraction layers clean so you can evolve the implementation without breaking consumers.
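
A clean abstraction layer here can be as small as one interface that pipelines depend on, with the backend hidden behind it. The following is a sketch, not a recommendation of a specific storage design: `ArtifactStore`, both implementations, `save_checkpoint`, and the key layout are hypothetical names assumed for illustration.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict


class ArtifactStore(ABC):
    """Hypothetical interface that pipelines depend on instead of a backend."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryArtifactStore(ArtifactStore):
    """Useful for tests; keeps artifacts in a dict."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class LocalArtifactStore(ArtifactStore):
    """Simple starting point: artifacts as files under a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root.resolve()

    def _path(self, key: str) -> Path:
        path = (self.root / key).resolve()
        if self.root not in path.parents:  # reject keys that escape the root
            raise ValueError(f"invalid artifact key: {key!r}")
        return path

    def put(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()


def save_checkpoint(store: ArtifactStore, run_id: str, step: int, data: bytes) -> str:
    """Consumer code: knows the key layout, not the storage backend."""
    key = f"checkpoints/{run_id}/step-{step:06d}.bin"
    store.put(key, data)
    return key
```

Because `save_checkpoint` only sees `ArtifactStore`, an object-store implementation could be added later without touching pipeline code. Moving the existing data remains a separate migration.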

4. CI/CD and model promotion

How models move from experiment to staging to production defines the developer experience:

Manual promotion vs. automated pipelines
Approval gates and who can override them
Rollback procedures and speed
Canary/shadow deployment patterns

Teams build muscle memory around these workflows. Changing them later means retraining everyone, migrating every pipeline, and rebuilding trust in the promotion process.
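
To make the moving parts concrete, here is a minimal sketch of a promotion flow with an approval gate and rollback. It is an in-memory toy under assumed rules (production requires a named approver and a version that is already live in staging); `PromotionRegistry` and its methods are hypothetical and do not mirror any particular model registry's API.

```python
from typing import Dict, List, Optional

STAGES = ["staging", "production"]
APPROVAL_REQUIRED = {"production"}  # assumed policy for this sketch


class PromotionRegistry:
    """Tracks which model version is live in each stage, with history."""

    def __init__(self) -> None:
        # Per-stage stack of versions; the last element is the live one.
        self._live: Dict[str, List[str]] = {stage: [] for stage in STAGES}

    def current(self, stage: str) -> Optional[str]:
        stack = self._live[stage]
        return stack[-1] if stack else None

    def promote(self, version: str, to_stage: str, approved_by: Optional[str] = None) -> None:
        if to_stage not in self._live:
            raise ValueError(f"unknown stage: {to_stage!r}")
        if to_stage in APPROVAL_REQUIRED and not approved_by:
            raise PermissionError(f"promotion to {to_stage} requires an approver")
        idx = STAGES.index(to_stage)
        if idx > 0 and self.current(STAGES[idx - 1]) != version:
            raise ValueError(f"{version} must be live in {STAGES[idx - 1]} first")
        self._live[to_stage].append(version)

    def rollback(self, stage: str) -> str:
        """Drop the live version and return the one that becomes live again."""
        stack = self._live[stage]
        if len(stack) < 2:
            raise ValueError(f"no previous version to roll back to in {stage}")
        stack.pop()
        return stack[-1]
```

Even in this toy version, the gate, the ordering rule, and the rollback semantics are all things teams would come to rely on, which is why changing them later is expensive.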

5. Identity, policy, and audit architecture

Authentication, authorization, policy enforcement, and audit logging are the foundation of governance:

How users authenticate to the platform
How access is controlled (RBAC, ABAC, or a policy engine such as OPA)
What policies are enforced at admission time
What audit trail exists for compliance

These decisions are especially hard to reverse in regulated industries. Financial services, healthcare, and government environments typically require audit trails that are continuous and complete. If you bolt on governance later, your audit history has a gap covering everything that ran before it, and regulators will question that gap.
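
The sketch below illustrates two of these ideas together: a policy checked at admission time, and an audit trail in which a missing record is detectable. The hash chaining is my own illustrative choice, not something a specific regulator or tool prescribes, and `REQUIRED_LABELS`, `admit`, and `AuditLog` are hypothetical names. In a real cluster the check would typically live in a validating admission webhook or a policy engine.

```python
import hashlib
import json
from typing import Dict, List

REQUIRED_LABELS = ("team", "cost-center")  # hypothetical admission policy


def admit(manifest: Dict) -> List[str]:
    """Return policy violations for a manifest; an empty list means admit."""
    labels = (manifest.get("metadata") or {}).get("labels") or {}
    return [f"missing label: {name}" for name in REQUIRED_LABELS if name not in labels]


def _digest(body: Dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only log; each entry carries the hash of the previous entry."""

    def __init__(self) -> None:
        self.entries: List[Dict] = []

    def record(self, user: str, action: str, violations: List[str]) -> Dict:
        prev = self.entries[-1]["hash"] if self.entries else ""
        body = {
            "user": user,
            "action": action,
            "allowed": not violations,
            "violations": violations,
            "prev": prev,
        }
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """True only if no entry was altered or removed from the chain."""
        prev = ""
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True
```

Both allowed and denied requests are recorded, so the log reflects every decision the platform made rather than only the successful ones.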

How to make these decisions

For each of these five areas, I recommend:

Document the decision — not just what you chose, but what you considered and why you rejected alternatives
Define the blast radius — if this decision is wrong, what has to change
Build abstraction layers — even if the implementation is simple, keep the interface clean so you can swap implementations later
Get cross-team input — these decisions affect platform, SRE, ML, and security teams. No single team should make them alone
Time-box the decision — do not spend months deciding. Make the best decision you can with current information and invest in the ability to evolve
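
If it helps to make this checklist enforceable, the five recommendations can be captured as a structured record that flags what is still missing. This is a hypothetical sketch: `DecisionRecord`, its fields, and the default team list are assumptions for illustration, not an established ADR format.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Sequence


@dataclass
class DecisionRecord:
    title: str
    chosen: str
    rejected: List[str] = field(default_factory=list)      # alternatives and why not
    blast_radius: List[str] = field(default_factory=list)  # what changes if this is wrong
    reviewers: List[str] = field(default_factory=list)     # teams that gave input
    decide_by: Optional[date] = None                        # the time-box

    def gaps(
        self, required_teams: Sequence[str] = ("platform", "sre", "ml", "security")
    ) -> List[str]:
        """List the recommendations this record does not yet satisfy."""
        problems = []
        if not self.rejected:
            problems.append("no rejected alternatives documented")
        if not self.blast_radius:
            problems.append("blast radius not defined")
        missing = [team for team in required_teams if team not in self.reviewers]
        if missing:
            problems.append("missing input from: " + ", ".join(missing))
        if self.decide_by is None:
            problems.append("no decision deadline set")
        return problems
```

The interface-level recommendation (build abstraction layers) is deliberately not a field here, since it is checked in code review rather than in the record.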

The meta-lesson

The hardest part of architecture is knowing which decisions matter. Most decisions can be reversed cheaply: choose a tool, try it, and swap it if it does not work. But the five listed above create structural constraints that compound over time. Invest your best thinking, your best engineers, and your cross-functional alignment effort in these five areas. For everything else, move fast and iterate.
