This post previews my KubeCon + CloudNativeCon Europe 2026 talk. If you’re attending in Amsterdam on March 24 at 16:15 CET, come say hello at RAI Amsterdam — Hall 8, Room F. View session on Sched →
Every organisation deploying AI/ML workloads today faces the same tension: GPUs are the scarcest, most expensive resource in the cluster, yet most teams treat them as single-tenant luxuries. A single NVIDIA H200 card costs more than an entire rack of CPU nodes — and if only one team can use it at a time, you’re burning money.
The question isn’t whether to share GPUs across teams. It’s how to do it safely — without one team’s runaway training job starving another team’s real-time inference endpoint.
I’ve spent the past several months building and operating a multi-tenant GPU platform on Red Hat OpenShift AI with the NVIDIA KAI Scheduler on G/H200 hardware. Along the way I collected hard-won lessons — things that worked brilliantly and things that failed spectacularly. This talk is my chance to share them so you don’t have to learn the hard way.
The presentation is structured around seven key lessons from running production GPU workloads across multiple teams. Here’s a taste of what to expect:
Kubernetes namespaces and ResourceQuota are a good start, but GPU scheduling has subtleties that CPU scheduling doesn’t. I’ll explain why you need layered isolation — combining quotas, priority classes, node pools, and taints — to truly protect tenants from each other.
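To make the layering concrete, here’s a minimal sketch of the quota and taint/toleration pieces. The names, numbers, priority class, and image are placeholders rather than our production values:

```yaml
# Hedged example: cap team-a at four GPUs via ResourceQuota.
# Extended resources are quota'd with the requests.<resource-name> syntax.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-gpu-quota
  namespace: team-a          # hypothetical tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"
---
# GPU nodes carry a NoSchedule taint (e.g. key nvidia.com/gpu), so only pods
# that explicitly tolerate it can land there; combined with a priority class,
# batch work can be preempted before it starves inference.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  namespace: team-a
spec:
  priorityClassName: team-a-batch        # hypothetical PriorityClass
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.08-py3   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1
```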
NVIDIA Multi-Instance GPU (MIG) lets you partition a single GPU into isolated slices. It sounds perfect for multi-tenancy, but it comes with real trade-offs around NVLink, reconfiguration downtime, and large-model training. I’ll share how we designed heterogeneous node pools so the right workload lands on the right GPU configuration every time.
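For a sense of what that looks like in practice: with the GPU Operator’s mixed MIG strategy, each slice is exposed as its own extended resource, so a small inference pod can request a single slice while full-GPU training jobs land on un-partitioned cards. A hedged sketch (the profile name, pool label, and image are illustrative, and the available profiles depend on the GPU model and MIG configuration):

```yaml
# Hedged example: a small inference pod pinned to a single MIG slice.
# With the GPU Operator's "mixed" MIG strategy each profile is exposed as its
# own extended resource; the profile name and pool label below are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
  namespace: team-b
spec:
  nodeSelector:
    gpu.example.com/pool: mig-inference        # hypothetical node-pool label
  containers:
    - name: server
      image: nvcr.io/nvidia/tritonserver:24.08-py3   # illustrative image
      resources:
        limits:
          nvidia.com/mig-1g.18gb: 1            # one isolated slice, not a full GPU
```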
The default Kubernetes scheduler is GPU-unaware. NVIDIA KAI changes the game with topology-aware placement that understands NVLink domains, MIG slices, and GPU health. I’ll cover the wins, like a 30–40% throughput improvement on distributed training, and the surprises along the way.
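If you want to experiment before the talk, opting a workload into KAI is mostly a matter of pointing it at the scheduler and a team queue. A minimal sketch, assuming the queue label key from the upstream quickstart (verify it against the KAI Scheduler version you run, and note the queue itself must exist with quota assigned):

```yaml
# Hedged example: hand a training pod to the KAI Scheduler and a team queue.
# The runai/queue label key follows the upstream quickstart and may differ
# in your installation; the queue name is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: distributed-trainer-0
  namespace: team-a
  labels:
    runai/queue: team-a-training   # assumed queue label key and queue name
spec:
  schedulerName: kai-scheduler     # bypass the default kube-scheduler
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.08-py3
      resources:
        limits:
          nvidia.com/gpu: 8        # keep the job inside one NVLink domain
```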
This was our most painful lesson. Driver version mismatches can silently corrupt training results. Firmware updates require cold reboots. I’ll walk through our canary-based driver lifecycle strategy, which minimises downtime and protects training results from corruption.
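I won’t spoil the whole strategy here, but the core building block is running more than one driver version side by side and promoting it pool by pool. A rough sketch using the GPU Operator’s NVIDIADriver resource (the API version, field names, and version strings are assumptions; check the operator release you’re on):

```yaml
# Hedged sketch: two NVIDIADriver resources pin different driver versions to
# different node pools, so a candidate driver soaks on a small canary pool
# before it touches the fleet. Field names depend on the GPU Operator version.
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: gpu-driver-stable
spec:
  driverType: gpu
  version: "550.x"                              # placeholder stable version
  nodeSelector:
    gpu.example.com/driver-channel: stable      # hypothetical pool label
---
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: gpu-driver-canary
spec:
  driverType: gpu
  version: "560.x"                              # placeholder candidate version
  nodeSelector:
    gpu.example.com/driver-channel: canary      # small, labelled canary pool
```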
Standard Kubernetes metrics tell you almost nothing about GPU health. I’ll show which DCGM Exporter metrics actually matter and the alerts you should set up on day one — especially the one that saved us from silent data corruption.
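As a starting point, here’s the shape of the day-one alerts, sketched as a PrometheusRule. The metric names are standard DCGM Exporter fields (make sure they’re enabled in your exporter’s metric list); the thresholds, namespace, and routing labels are placeholders:

```yaml
# Hedged sketch: alert on GPU XID errors and uncorrectable (double-bit) ECC
# errors. DCGM_FI_DEV_XID_ERRORS and DCGM_FI_DEV_ECC_DBE_VOL_TOTAL are DCGM
# Exporter metrics; severities and the namespace here are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-health-alerts
  namespace: gpu-monitoring
spec:
  groups:
    - name: gpu-health
      rules:
        - alert: GPUXidError
          expr: DCGM_FI_DEV_XID_ERRORS > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} reported an XID error"
        - alert: GPUDoubleBitECCErrors
          expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "Uncorrectable ECC errors on GPU {{ $labels.gpu }}; results may be corrupted"
```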
Serving multiple models to multiple teams requires more than a basic ingress. I’ll discuss how we used per-team rate limits, weighted canary rollouts, and health-based routing to keep multi-tenant inference stable under pressure.
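Two of those pieces map directly onto Traefik’s CRDs: a rate-limit Middleware per team and a weighted TraefikService for canary rollouts. A minimal sketch, with service names, ports, and numbers as placeholders:

```yaml
# Hedged sketch: per-team rate limiting plus a weighted canary split in front
# of a model server. Names, ports, and weights are placeholders.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: team-a-ratelimit
  namespace: inference
spec:
  rateLimit:
    average: 100      # sustained requests per second for this team
    burst: 200
---
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: llm-canary
  namespace: inference
spec:
  weighted:
    services:
      - name: llm-v1
        port: 8080
        weight: 90    # stable model version
      - name: llm-v2
        port: 8080
        weight: 10    # canary model version
```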
Multi-tenancy only works politically if teams can see what they’re spending. I’ll share the chargeback model we built — and how financial visibility changed team behaviour faster than any technical guardrail ever could.
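The plumbing behind it is smaller than you might expect. With pod attribution enabled in DCGM Exporter, a couple of recording rules per namespace are enough to feed a monthly report; the label names below depend on how the exporter is scraped, so treat them as assumptions:

```yaml
# Hedged sketch: record per-namespace GPU allocation and utilisation so a
# chargeback report becomes a simple range query over these series.
# The namespace label (namespace vs exported_namespace) depends on scrape config.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-chargeback
  namespace: gpu-monitoring
spec:
  groups:
    - name: gpu-chargeback
      rules:
        - record: namespace:gpu_allocated:count
          expr: count by (exported_namespace) (DCGM_FI_DEV_GPU_UTIL)
        - record: namespace:gpu_utilisation:avg
          expr: avg by (exported_namespace) (DCGM_FI_DEV_GPU_UTIL)
```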
This talk is designed for anyone running (or planning to run) shared GPU infrastructure on Kubernetes.
The session is rated Intermediate — you should be comfortable with Kubernetes concepts like namespaces, scheduling, and operators, but you don’t need prior GPU orchestration experience.
To give you a sense of scope, here’s the production stack the lessons are drawn from:
| Layer | Technology |
|---|---|
| Hardware | Dell PowerEdge servers with NVIDIA G/H200 GPUs |
| Kubernetes | Red Hat OpenShift 4.x |
| AI Platform | Red Hat OpenShift AI |
| GPU Scheduling | NVIDIA GPU Operator + KAI Scheduler |
| Inference Routing | Traefik Proxy |
| Observability | Prometheus + Grafana + DCGM Exporter |
| Storage | Dell PowerScale for model artifacts |
This isn’t a theoretical talk or a vendor demo — every recommendation comes from running real workloads on real hardware.
I’ll be presenting this material — with demos and war stories that don’t fit in a blog post — at KubeCon + CloudNativeCon Europe 2026:
| Detail | Information |
|---|---|
| Date | Tuesday, March 24, 2026 |
| Time | 16:15 – 16:45 CET |
| Room | RAI Amsterdam — Hall 8, Room F |
| Track | AI + ML |
I’m also speaking on a related topic at Red Hat Summit 2026 in Atlanta — details here.
If you can’t make it to Amsterdam, I’ll publish a detailed follow-up post after the conference with the full technical content, manifests, and diagrams from the presentation.
Have questions about GPU multi-tenancy on OpenShift AI? Reach out on LinkedIn or come find me at KubeCon — I’m always happy to talk shop!