Part 2 of a 10-part series on running AI workloads on Kubernetes in production.
Mistake zero: treating AI like microservices
The biggest mistake is treating AI workloads like slightly bigger microservices. They are not.
Microservices are typically stateless, horizontally scalable, and relatively uniform. AI workloads are stateful (models, checkpoints, datasets), vertically demanding (GPUs, high-memory nodes), and wildly heterogeneous (a 7B-parameter fine-tune looks nothing like a batch inference pipeline).
Teams that apply microservice patterns to AI workloads end up with expensive clusters, poor utilization, and constant friction between platform and data science teams.
The five mistakes that kill at scale
1. Underestimating GPU sharing complexity
GPU sharing is not like CPU sharing. You cannot slice a GPU into arbitrary fractions and expect workloads to play nicely. NVIDIA's three main mechanisms, Multi-Instance GPU (MIG), Multi-Process Service (MPS), and time-slicing, each have trade-offs:
- MIG gives hardware isolation but requires specific GPU models and fixed partition sizes
- MPS enables concurrent access but shares fault domains
- Time-slicing is simple but provides no memory isolation
Most teams pick one without understanding the trade-offs, then discover the limitations in production under load.
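One way to force the trade-off conversation before production is to write the decision down as code. The sketch below is deliberately conservative and is not a recommendation engine: the function `pick_sharing_mode` and its three inputs are hypothetical, and it assumes that any isolation requirement rules out MPS and time-slicing.

```python
from enum import Enum


class SharingMode(Enum):
    MIG = "mig"
    MPS = "mps"
    TIME_SLICING = "time-slicing"
    DEDICATED = "dedicated"  # no sharing: one workload per GPU


def pick_sharing_mode(gpu_supports_mig: bool,
                      needs_isolation: bool,
                      wants_concurrency: bool) -> SharingMode:
    """Map coarse requirements to a sharing mode (hypothetical helper)."""
    if needs_isolation:
        # MIG is the only option here with hardware isolation, and it is
        # only available on specific GPU models. Assumption: without MIG,
        # fall back to a whole GPU rather than a weaker sharing mode.
        return SharingMode.MIG if gpu_supports_mig else SharingMode.DEDICATED
    if wants_concurrency:
        # MPS lets processes use the GPU concurrently, but they share
        # a fault domain.
        return SharingMode.MPS
    # Simplest option; provides no memory isolation between workloads.
    return SharingMode.TIME_SLICING


if __name__ == "__main__":
    print(pick_sharing_mode(gpu_supports_mig=False,
                            needs_isolation=True,
                            wants_concurrency=False))
```

A real decision would also weigh partition sizes, driver support, and the actual workload mix. The point is that the limitations are spelled out before load testing, not discovered after it.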
2. Ignoring model artifact lifecycle
Models are not container images. They have versions, lineage, evaluation metrics, and approval workflows. Without a proper artifact registry and promotion pipeline, teams end up with:
- Model versions scattered across S3 buckets and local directories
- No audit trail for what's running in production
- Rollback procedures that involve "ask the data scientist which checkpoint was good"
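To make the contrast concrete, here is a minimal in-memory sketch of what a registry with promotion, audit trail, and rollback has to track. Everything in it is hypothetical (the class names, the fields, the example bucket in the usage below). A real setup would sit on a proper model registry and durable storage, not a Python dict.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: str
    artifact_uri: str          # one canonical location for the weights
    parent: Optional[str]      # lineage: the version this one derives from
    metrics: Dict[str, float]  # evaluation results recorded at registration


@dataclass
class ModelRegistry:
    versions: Dict[Tuple[str, str], ModelVersion] = field(default_factory=dict)
    # Model name -> promoted versions, newest last.
    production: Dict[str, List[str]] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)

    def _audit(self, actor: str, action: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append(f"{stamp} {actor} {action}")

    def register(self, mv: ModelVersion, actor: str) -> None:
        self.versions[(mv.name, mv.version)] = mv
        self._audit(actor, f"registered {mv.name}:{mv.version}")

    def promote(self, name: str, version: str, approved_by: str) -> None:
        if (name, version) not in self.versions:
            raise KeyError(f"{name}:{version} is not registered")
        self.production.setdefault(name, []).append(version)
        self._audit(approved_by, f"promoted {name}:{version}")

    def current(self, name: str) -> Optional[str]:
        history = self.production.get(name, [])
        return history[-1] if history else None

    def rollback(self, name: str, actor: str) -> str:
        history = self.production.get(name, [])
        if len(history) < 2:
            raise RuntimeError(f"no earlier production version of {name}")
        bad = history.pop()
        self._audit(actor, f"rolled back {name}:{bad} -> {history[-1]}")
        return history[-1]
```

With this shape, a rollback is a lookup in the promotion history, and the audit log answers who promoted what and when.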
3. Neglecting data locality
Moving terabytes of training data across network boundaries is expensive and slow. Teams that design their storage strategy around convenience rather than data locality pay for it in training time, network costs, and pipeline reliability.
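A quick back-of-the-envelope calculation shows why. The helper below is a sketch under simple assumptions: decimal terabytes, a single link, and an `efficiency` factor (a hypothetical knob) for protocol overhead and contention. The dataset size and link speed in the example are illustrative, not measurements.

```python
def transfer_hours(dataset_tb: float, link_gbps: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move dataset_tb terabytes over a link_gbps link."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = dataset_tb * 1e12 * 8                      # TB -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)   # bits / (bits per second)
    return seconds / 3600


if __name__ == "__main__":
    # 10 TB over a 10 Gbit/s link at full line rate: about 2.2 hours.
    print(f"{transfer_hours(10, 10):.1f} h")
    # The same link at 50% effective throughput: about 4.4 hours.
    print(f"{transfer_hours(10, 10, efficiency=0.5):.1f} h")
```

If a pipeline re-reads the dataset on every run, that cost is paid every time, which is why locality belongs in the storage design rather than in a later optimization pass.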
4. Experiment sprawl
Data scientists experiment. That is their job. But without guardrails, experiment sprawl consumes GPU hours, storage, and cluster capacity in ways that are invisible until the bill arrives. Every abandoned notebook server with a reserved GPU is money burning.
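A guardrail for this can be very small. The sketch below flags GPU-holding notebook servers that are either idle for too long or past a fixed lifetime. The `NotebookServer` record, the function name, and both thresholds (4 hours idle, 7 days TTL) are hypothetical. Where the activity timestamps come from depends on your notebook stack and is not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class NotebookServer:
    name: str
    owner: str
    gpus: int
    created_at: datetime
    last_activity: datetime


def find_reclaimable(servers: Iterable[NotebookServer],
                     now: datetime,
                     max_idle: timedelta = timedelta(hours=4),
                     ttl: timedelta = timedelta(days=7)) -> List[str]:
    """Return names of GPU-holding servers that are idle or past their TTL."""
    reclaim = []
    for s in servers:
        if s.gpus == 0:
            continue  # only reserved GPUs are in scope for this check
        idle_too_long = now - s.last_activity > max_idle
        expired = now - s.created_at > ttl
        if idle_too_long or expired:
            reclaim.append(s.name)
    return reclaim
```

Whether a flagged server is stopped outright or its owner is notified first is a policy choice. The important part is that the check runs automatically, not when the bill arrives.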
5. Optimizing for utilization before governance
This is the subtlest mistake. Teams see 30% GPU utilization and immediately try to pack more workloads onto the cluster. But without governance (quotas, priorities, tenancy boundaries, approval paths), higher utilization means more contention, more noisy-neighbor problems, and more political fights about who gets access.
The winning pattern is to standardize platform primitives early: quotas, base images, pipelines, observability, and approval paths. Then optimize utilization within those guardrails.
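As an illustration of governance first, packing second, here is a minimal quota check that runs before any workload is placed. It is a sketch of the logic only: the names `GpuRequest` and `admit` and the per-team integer quotas are hypothetical. On Kubernetes this role is usually played by built-in objects such as ResourceQuota, not by custom Python.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class GpuRequest:
    team: str
    gpus: int


def admit(request: GpuRequest,
          quotas: Dict[str, int],
          in_use: Dict[str, int]) -> bool:
    """Admit a request only if the team stays within its GPU quota."""
    if request.gpus <= 0:
        raise ValueError("a GPU request must ask for at least one GPU")
    quota = quotas.get(request.team, 0)  # teams without a quota get nothing
    used = in_use.get(request.team, 0)
    return used + request.gpus <= quota


if __name__ == "__main__":
    quotas = {"nlp": 8, "vision": 4}
    in_use = {"nlp": 6, "vision": 4}
    print(admit(GpuRequest("nlp", 2), quotas, in_use))     # True: exactly at quota
    print(admit(GpuRequest("vision", 1), quotas, in_use))  # False: over quota
```

Once a check like this exists, bin-packing and sharing can push utilization up without also pushing up the number of arguments about whose job got evicted.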
What works instead
From my experience building multi-tenant GPU platforms and running large-scale infrastructure:
- Treat the platform as a product, with defined consumers, an API, and an SLA
- Make experiment cleanup automatic: TTLs, resource limits, idle detection
- Version everything: models, data, configs, infrastructure
- Separate scheduling policy from scheduling implementation, so you can change tools without rewriting contracts
- Measure cost per team, per workload, per experiment, because visibility drives behavior
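The last point is the easiest to start on. Below is a minimal sketch of cost attribution, assuming usage is already collected as records tagged with team, workload, and experiment. The `UsageRecord` shape, the `cost_by` helper, and the flat `rate_per_gpu_hour` are all hypothetical simplifications. Real billing would also cover storage, network, and different GPU types.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class UsageRecord(NamedTuple):
    team: str
    workload: str
    experiment: str
    gpu_hours: float


def cost_by(records: Iterable[UsageRecord], key: str,
            rate_per_gpu_hour: float) -> Dict[str, float]:
    """Sum GPU cost grouped by 'team', 'workload', or 'experiment'."""
    if key not in ("team", "workload", "experiment"):
        raise ValueError(f"cannot group by {key!r}")
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        totals[getattr(r, key)] += r.gpu_hours * rate_per_gpu_hour
    return dict(totals)


if __name__ == "__main__":
    records = [
        UsageRecord("nlp", "train", "exp-1", 10.0),
        UsageRecord("nlp", "infer", "exp-2", 5.0),
        UsageRecord("vision", "train", "exp-3", 2.5),
    ]
    # Illustrative rate only, not a real price.
    print(cost_by(records, "team", rate_per_gpu_hour=2.0))
```

The same records answer all three questions (per team, per workload, per experiment), which is the argument for tagging workloads consistently from day one.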
The cost of getting it wrong
The cost is not just money (though it is a lot of money). The real cost is velocity. When the platform is unreliable, teams work around it. Shadow clusters appear. Manual processes multiply. Trust erodes. And the platform team spends all its time fighting fires instead of building capabilities.
Next in series: Building a Multi-Tenant GPU Platform: The Operating Model. Previous: What Breaks First: PoC to Production.