
The Biggest Mistakes Teams Make Scaling AI/ML

Treating AI workloads like bigger microservices is the fastest way to fail. This part covers the mistakes that follow from it: GPU sharing, experiment sprawl, and premature optimization.

Luca Berton · 3 min read

Part 2 of a 10-part series on running AI workloads on Kubernetes in production.

Mistake zero: treating AI like microservices

The biggest mistake is treating AI workloads like slightly bigger microservices. They are not.

Microservices are typically stateless, horizontally scalable, and relatively uniform. AI workloads are stateful (models, checkpoints, datasets), vertically demanding (GPUs, high-memory nodes), and wildly heterogeneous (a 7B-parameter fine-tune looks nothing like a batch inference pipeline).

Teams that apply microservice patterns to AI workloads end up with expensive clusters, poor utilization, and constant friction between platform and data science teams.

The five mistakes that kill at scale

1. Underestimating GPU sharing complexity

GPU sharing is not like CPU sharing. You cannot just slice a GPU into arbitrary fractions and expect workloads to play nicely. NVIDIA's MIG (Multi-Instance GPU), MPS (Multi-Process Service), and time-slicing each have trade-offs:

  • MIG gives hardware isolation but requires specific GPU models and fixed partition sizes
  • MPS enables concurrent access but shares fault domains
  • Time-slicing is simple but provides no memory isolation

Most teams pick one without understanding the trade-offs, then discover the limitations in production under load.
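
One way to force the trade-off discussion before production is to write the decision down as code. The sketch below is a minimal illustration in Python, not a scheduler: the `Workload` fields and the `pick_sharing` rule are hypothetical names, and the rule only encodes the three trade-offs listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Sharing(Enum):
    MIG = "mig"
    MPS = "mps"
    TIME_SLICING = "time-slicing"
    WHOLE_GPU = "whole-gpu"  # no sharing at all


@dataclass(frozen=True)
class Workload:
    name: str
    needs_isolation: bool    # must a neighbour's fault or memory use never affect it?
    runs_concurrently: bool  # does it benefit from using the GPU at the same instant as others?
    gpu_supports_mig: bool   # capability of the target node, not of the workload


def pick_sharing(w: Workload) -> Sharing:
    """Map stated requirements onto the trade-offs listed above (illustrative rule)."""
    if w.needs_isolation:
        # Only MIG gives hardware isolation; without it, do not share the GPU.
        return Sharing.MIG if w.gpu_supports_mig else Sharing.WHOLE_GPU
    if w.runs_concurrently:
        # Concurrent access, accepting a shared fault domain.
        return Sharing.MPS
    # Simplest option, accepting that there is no memory isolation.
    return Sharing.TIME_SLICING


if __name__ == "__main__":
    notebook = Workload("dev-notebook", needs_isolation=False,
                        runs_concurrently=False, gpu_supports_mig=False)
    print(pick_sharing(notebook).value)
```

The value is less in the function than in the fields: if nobody can say whether a workload needs isolation, the team is not ready to pick a sharing mode for it.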

2. Ignoring model artifact lifecycle

Models are not container images. They have versions, lineage, evaluation metrics, and approval workflows. Without a proper artifact registry and promotion pipeline, teams end up with:

  • Model versions scattered across S3 buckets and local directories
  • No audit trail for what's running in production
  • Rollback procedures that involve "ask the data scientist which checkpoint was good"
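
To make "versions, lineage, evaluation metrics, and approval workflows" concrete, here is a minimal in-memory sketch in Python. It is an assumption-heavy illustration, not a real registry: `ModelVersion`, `Registry`, and the example bucket path are all hypothetical, and a production system would persist the log somewhere durable.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: str
    artifact_uri: str               # hypothetical location, e.g. an object-store path
    parent_version: Optional[str]   # lineage: what this version was derived from
    metrics: Dict[str, float]       # evaluation results recorded at registration
    approved_by: Optional[str] = None


@dataclass
class Registry:
    # Append-only deployment log: (timestamp, version, reason).
    log: List[Tuple[datetime, ModelVersion, str]] = field(default_factory=list)

    def promote(self, mv: ModelVersion, reason: str = "promotion") -> None:
        if mv.approved_by is None:
            raise PermissionError(f"{mv.name}:{mv.version} has no approver")
        self.log.append((datetime.now(timezone.utc), mv, reason))

    def in_production(self) -> Optional[ModelVersion]:
        return self.log[-1][1] if self.log else None

    def rollback(self) -> ModelVersion:
        """Re-promote the most recently deployed version that differs from the current one."""
        current = self.in_production()
        if current is not None:
            for _, mv, _ in reversed(self.log):
                if mv.version != current.version:
                    # The rollback is itself logged, so the audit trail stays complete.
                    self.promote(mv, reason=f"rollback from {current.version}")
                    return mv
        raise LookupError("no earlier version to roll back to")


if __name__ == "__main__":
    reg = Registry()
    reg.promote(ModelVersion("ranker", "1", "s3://example-bucket/ranker/1",
                             None, {"auc": 0.81}, approved_by="alice"))
    reg.promote(ModelVersion("ranker", "2", "s3://example-bucket/ranker/2",
                             "1", {"auc": 0.84}, approved_by="bob"))
    print(reg.rollback().version)
```

With a log like this, "which checkpoint was good" is a lookup, not a conversation.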

3. Neglecting data locality

Moving terabytes of training data across network boundaries is expensive and slow. Teams that design their storage strategy around convenience rather than data locality pay for it in training time, network costs, and pipeline reliability.
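
A back-of-the-envelope calculation shows why. The Python sketch below is plain arithmetic under stated assumptions: terabytes are decimal (10^12 bytes), `efficiency` is an assumed fraction of nominal link bandwidth actually achieved, and `usd_per_gb` is a placeholder you would replace with your own provider's transfer price.

```python
def transfer_hours(dataset_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to move dataset_tb terabytes over a link of link_gbps gigabits/s."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = dataset_tb * 1e12 * 8                      # bytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)   # effective throughput in bits/s
    return seconds / 3600


def transfer_cost_usd(dataset_tb: float, usd_per_gb: float) -> float:
    """Transfer cost for one full copy of the dataset at an assumed per-GB price."""
    return dataset_tb * 1000 * usd_per_gb


if __name__ == "__main__":
    # 10 TB over a 10 Gbps link at the assumed 70% efficiency.
    print(round(transfer_hours(10, 10), 1), "hours")
```

At full line rate, 10 TB over 10 Gbps takes 8,000 seconds, a little over two hours; at the assumed 70% efficiency it is closer to three. Multiply by every retraining run that re-reads the data remotely and the storage layout stops being a convenience decision.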

4. Experiment sprawl

Data scientists experiment. That is their job. But without guardrails, experiment sprawl consumes GPU hours, storage, and cluster capacity in ways that are invisible until the bill arrives. Every abandoned notebook server with a reserved GPU is money burning.
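
A guardrail for this can be very small. The sketch below, in Python, shows the decision logic only, under assumptions: `NotebookServer`, the 7-day TTL, and the 4-hour idle limit are hypothetical, and in practice the timestamps would come from your own metrics and the result would feed whatever actually stops the server.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class NotebookServer:
    owner: str
    gpus: int
    created_at: datetime
    last_gpu_activity: datetime


def reclaim_reason(
    server: NotebookServer,
    now: datetime,
    ttl: timedelta = timedelta(days=7),          # assumed default lifetime
    idle_limit: timedelta = timedelta(hours=4),  # assumed idle threshold
) -> Optional[str]:
    """Return why the server should be reclaimed, or None to leave it alone."""
    if now - server.created_at > ttl:
        return "ttl-expired"
    if server.gpus > 0 and now - server.last_gpu_activity > idle_limit:
        return "idle-gpu"
    return None


if __name__ == "__main__":
    now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
    abandoned = NotebookServer("dana", 1,
                               created_at=now - timedelta(days=2),
                               last_gpu_activity=now - timedelta(hours=30))
    print(reclaim_reason(abandoned, now))
```

Returning a reason instead of a boolean matters: the owner should be told why the GPU was taken back, or the cleanup will be experienced as the platform being unreliable.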

5. Optimizing for utilization before governance

This is the subtlest mistake. Teams see 30% GPU utilization and immediately try to pack more workloads onto the cluster. But without governance (quotas, priorities, tenancy boundaries, approval paths), higher utilization means more contention, more noisy-neighbor problems, and more political fights about who gets access.

The winning pattern is to standardize platform primitives early: quotas, base images, pipelines, observability, and approval paths. Then optimize utilization within those guardrails.
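
As a rough illustration of "governance first", the Python sketch below puts a quota check in front of any packing decision. The `TeamQuota` type and `admit` function are hypothetical, and a real cluster would enforce this through its own quota mechanism, not application code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TeamQuota:
    gpu_limit: int
    gpu_in_use: int = 0


def admit(quotas: Dict[str, TeamQuota], team: str, gpus_requested: int) -> bool:
    """Admit a request only if the team has a quota and stays within it."""
    quota = quotas.get(team)
    if quota is None:
        # No tenancy boundary defined for this team: reject, do not guess.
        return False
    if gpus_requested <= 0 or quota.gpu_in_use + gpus_requested > quota.gpu_limit:
        return False
    quota.gpu_in_use += gpus_requested
    return True


if __name__ == "__main__":
    quotas = {"search": TeamQuota(gpu_limit=8), "ads": TeamQuota(gpu_limit=4)}
    print(admit(quotas, "search", 6))   # within quota
    print(admit(quotas, "search", 4))   # would exceed the team's limit
    print(admit(quotas, "unknown", 1))  # no quota defined
```

Only requests that pass a check like this should reach the bin-packing stage. That ordering is what keeps higher utilization from turning into contention.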

What works instead

From my experience building multi-tenant GPU platforms and running large-scale infrastructure:

  1. Treat the platform as a product, with defined consumers, an API, and an SLA
  2. Make experiment cleanup automatic: TTLs, resource limits, idle detection
  3. Version everything: models, data, configs, infrastructure
  4. Separate scheduling policy from scheduling implementation, so you can change tools without rewriting contracts
  5. Measure cost per team, per workload, per experiment: visibility drives behavior
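
The last point needs surprisingly little machinery once usage is labeled. The Python sketch below assumes each usage record already carries team, workload, and experiment labels; `UsageRecord`, `cost_by`, and the flat `usd_per_gpu_hour` rate are hypothetical simplifications.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class UsageRecord:
    team: str
    workload: str
    experiment: str
    gpu_hours: float


def cost_by(records: Iterable[UsageRecord], key: str, usd_per_gpu_hour: float) -> Dict[str, float]:
    """Sum GPU cost grouped by 'team', 'workload', or 'experiment'."""
    if key not in ("team", "workload", "experiment"):
        raise ValueError(f"unsupported grouping key: {key}")
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.gpu_hours * usd_per_gpu_hour
    return dict(totals)


if __name__ == "__main__":
    usage = [
        UsageRecord("search", "train", "exp-1", 10.0),
        UsageRecord("search", "train", "exp-2", 5.0),
        UsageRecord("ads", "infer", "exp-3", 2.5),
    ]
    # Assumed flat rate of 2 USD per GPU-hour, for illustration only.
    print(cost_by(usage, "team", usd_per_gpu_hour=2.0))
```

The hard part is not the aggregation but the labels: if workloads can run without a team and experiment attached, no report will recover that information later.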

The cost of getting it wrong

The cost is not just money (though it is a lot of money). The real cost is velocity. When the platform is unreliable, teams work around it. Shadow clusters appear. Manual processes multiply. Trust erodes. And the platform team spends all its time fighting fires instead of building capabilities.


Next in series: Building a Multi-Tenant GPU Platform: The Operating Model. Previous: What Breaks First: PoC to Production.
