This post previews my KubeCon + CloudNativeCon Europe 2026 talk. If you’re attending in Amsterdam on March 24 at 16:15 CET, come say hello at RAI Amsterdam — Hall 8, Room F. View session on Sched →
Every organisation deploying AI/ML workloads today faces the same tension: GPUs are the scarcest, most expensive resource in the cluster, yet most teams treat them as single-tenant luxuries. A single NVIDIA H200 card costs more than an entire rack of CPU nodes — and if only one team can use it at a time, you’re burning money.
The question isn’t whether to share GPUs across teams. It’s how to do it safely — without one team’s runaway training job starving another team’s real-time inference endpoint.
I’ve spent the past several months building and operating a multi-tenant GPU platform on Red Hat OpenShift AI with the NVIDIA KAI Scheduler on G/H200 hardware. Along the way I collected hard-won lessons — things that worked brilliantly and things that failed spectacularly. This talk is my chance to share them so you don’t have to learn the hard way.
The presentation is structured around seven key lessons from running production GPU workloads across multiple teams. Here’s a taste of what to expect:
Kubernetes namespaces and ResourceQuota are a good start, but GPU scheduling has subtleties that CPU scheduling doesn’t. I’ll explain why you need layered isolation — combining quotas, priority classes, node pools, and taints — to truly protect tenants from each other.
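To make the layering concrete, here’s a minimal sketch of the quota and taint/toleration pieces. The names, numbers, priority class, and image are placeholders rather than our production values:

```yaml
# Hedged example: cap team-a at four GPUs via ResourceQuota.
# Extended resources are quota'd with the requests.<resource-name> syntax.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-gpu-quota
  namespace: team-a          # hypothetical tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"
---
# GPU nodes carry a NoSchedule taint (e.g. key nvidia.com/gpu), so only pods
# that explicitly tolerate it can land there; combined with a priority class,
# batch work can be preempted before it starves inference.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  namespace: team-a
spec:
  priorityClassName: team-a-batch        # hypothetical PriorityClass
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.08-py3   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1
```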
NVIDIA Multi-Instance GPU (MIG) lets you partition a single GPU into isolated slices. It sounds perfect for multi-tenancy, but it comes with real trade-offs around NVLink, reconfiguration downtime, and large-model training. I’ll share how we designed heterogeneous node pools so the right workload lands on the right GPU configuration every time.
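For a sense of what that looks like in practice: with the GPU Operator’s mixed MIG strategy, each slice is exposed as its own extended resource, so a small inference pod can request a single slice while full-GPU training jobs land on un-partitioned cards. A hedged sketch (the profile name, pool label, and image are illustrative, and the available profiles depend on the GPU model and MIG configuration):

```yaml
# Hedged example: a small inference pod pinned to a single MIG slice.
# With the GPU Operator's "mixed" MIG strategy each profile is exposed as its
# own extended resource; the profile name and pool label below are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
  namespace: team-b
spec:
  nodeSelector:
    gpu.example.com/pool: mig-inference        # hypothetical node-pool label
  containers:
    - name: server
      image: nvcr.io/nvidia/tritonserver:24.08-py3   # illustrative image
      resources:
        limits:
          nvidia.com/mig-1g.18gb: 1            # one isolated slice, not a full GPU
```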
The default Kubernetes scheduler is GPU-unaware. NVIDIA KAI changes the game with topology-aware placement that understands NVLink domains, MIG slices, and GPU health. I’ll cover the wins, like a 30–40% throughput improvement on distributed training, and the surprises along the way.
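If you want to experiment before the talk, opting a workload into KAI is mostly a matter of pointing it at the scheduler and a team queue. A minimal sketch, assuming the queue label key from the upstream quickstart (verify it against the KAI Scheduler version you run, and note the queue itself must exist with quota assigned):

```yaml
# Hedged example: hand a training pod to the KAI Scheduler and a team queue.
# The runai/queue label key follows the upstream quickstart and may differ
# in your installation; the queue name is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: distributed-trainer-0
  namespace: team-a
  labels:
    runai/queue: team-a-training   # assumed queue label key and queue name
spec:
  schedulerName: kai-scheduler     # bypass the default kube-scheduler
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.08-py3
      resources:
        limits:
          nvidia.com/gpu: 8        # keep the job inside one NVLink domain
```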
This was our most painful lesson. Driver version mismatches can silently corrupt training results. Firmware updates require cold reboots. I’ll walk through our canary-based driver lifecycle strategy, which minimises downtime and protects training results from corruption.
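I won’t spoil the whole strategy here, but the core building block is running more than one driver version side by side and promoting it pool by pool. A rough sketch using the GPU Operator’s NVIDIADriver resource (the API version, field names, and version strings are assumptions; check the operator release you’re on):

```yaml
# Hedged sketch: two NVIDIADriver resources pin different driver versions to
# different node pools, so a candidate driver soaks on a small canary pool
# before it touches the fleet. Field names depend on the GPU Operator version.
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: gpu-driver-stable
spec:
  driverType: gpu
  version: "550.x"                              # placeholder stable version
  nodeSelector:
    gpu.example.com/driver-channel: stable      # hypothetical pool label
---
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: gpu-driver-canary
spec:
  driverType: gpu
  version: "560.x"                              # placeholder candidate version
  nodeSelector:
    gpu.example.com/driver-channel: canary      # small, labelled canary pool
```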
Standard Kubernetes metrics tell you almost nothing about GPU health. I’ll show which DCGM Exporter metrics actually matter and the alerts you should set up on day one — especially the one that saved us from silent data corruption.
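As a starting point, here’s the shape of the day-one alerts, sketched as a PrometheusRule. The metric names are standard DCGM Exporter fields (make sure they’re enabled in your exporter’s metric list); the thresholds, namespace, and routing labels are placeholders:

```yaml
# Hedged sketch: alert on GPU XID errors and uncorrectable (double-bit) ECC
# errors. DCGM_FI_DEV_XID_ERRORS and DCGM_FI_DEV_ECC_DBE_VOL_TOTAL are DCGM
# Exporter metrics; severities and the namespace here are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-health-alerts
  namespace: gpu-monitoring
spec:
  groups:
    - name: gpu-health
      rules:
        - alert: GPUXidError
          expr: DCGM_FI_DEV_XID_ERRORS > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} reported an XID error"
        - alert: GPUDoubleBitECCErrors
          expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "Uncorrectable ECC errors on GPU {{ $labels.gpu }}; results may be corrupted"
```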
Serving multiple models to multiple teams requires more than a basic ingress. I’ll discuss how we used per-team rate limits, weighted canary rollouts, and health-based routing to keep multi-tenant inference stable under pressure.
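Two of those pieces map directly onto Traefik’s CRDs: a rate-limit Middleware per team and a weighted TraefikService for canary rollouts. A minimal sketch, with service names, ports, and numbers as placeholders:

```yaml
# Hedged sketch: per-team rate limiting plus a weighted canary split in front
# of a model server. Names, ports, and weights are placeholders.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: team-a-ratelimit
  namespace: inference
spec:
  rateLimit:
    average: 100      # sustained requests per second for this team
    burst: 200
---
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: llm-canary
  namespace: inference
spec:
  weighted:
    services:
      - name: llm-v1
        port: 8080
        weight: 90    # stable model version
      - name: llm-v2
        port: 8080
        weight: 10    # canary model version
```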
Multi-tenancy only works politically if teams can see what they’re spending. I’ll share the chargeback model we built — and how financial visibility changed team behaviour faster than any technical guardrail ever could.
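The plumbing behind it is smaller than you might expect. With pod attribution enabled in DCGM Exporter, a couple of recording rules per namespace are enough to feed a monthly report; the label names below depend on how the exporter is scraped, so treat them as assumptions:

```yaml
# Hedged sketch: record per-namespace GPU allocation and utilisation so a
# chargeback report becomes a simple range query over these series.
# The namespace label (namespace vs exported_namespace) depends on scrape config.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-chargeback
  namespace: gpu-monitoring
spec:
  groups:
    - name: gpu-chargeback
      rules:
        - record: namespace:gpu_allocated:count
          expr: count by (exported_namespace) (DCGM_FI_DEV_GPU_UTIL)
        - record: namespace:gpu_utilisation:avg
          expr: avg by (exported_namespace) (DCGM_FI_DEV_GPU_UTIL)
```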
This talk is designed for anyone running (or planning to run) shared GPU infrastructure on Kubernetes.
The session is rated Intermediate — you should be comfortable with Kubernetes concepts like namespaces, scheduling, and operators, but you don’t need prior GPU orchestration experience.
To give you a sense of scope, here’s the production stack the lessons are drawn from:
| Layer | Technology |
|---|---|
| Hardware | Dell PowerEdge servers with NVIDIA G/H200 GPUs |
| Kubernetes | Red Hat OpenShift 4.x |
| AI Platform | Red Hat OpenShift AI |
| GPU Scheduling | NVIDIA GPU Operator + KAI Scheduler |
| Inference Routing | Traefik Proxy |
| Observability | Prometheus + Grafana + DCGM Exporter |
| Storage | Dell PowerScale for model artifacts |
This isn’t a theoretical talk or a vendor demo — every recommendation comes from running real workloads on real hardware.
I’ll be presenting this material — with demos and war stories that don’t fit in a blog post — at KubeCon + CloudNativeCon Europe 2026:
| Detail | Information |
|---|---|
| Date | Tuesday, March 24, 2026 |
| Time | 16:15 – 16:45 CET |
| Room | RAI Amsterdam — Hall 8, Room F |
| Track | AI + ML |
I’m also speaking on a related topic at Red Hat Summit 2026 in Atlanta — details here.
If you can’t make it to Amsterdam, I’ll publish a detailed follow-up post after the conference with the full technical content, manifests, and diagrams from the presentation.
Have questions about GPU multi-tenancy on OpenShift AI? Reach out on LinkedIn or come find me at KubeCon — I’m always happy to talk shop!