AMA Recap: Scaling AI on Kubernetes

The workshop

On March 28, 2026, Packt Publishing hosted “Build & Scale AI Workloads on Kubernetes” — a 4.5-hour online workshop focused on what actually breaks when you run AI on Kubernetes, and how to prevent it before it becomes an incident.

The lineup included Shadab Hussain, Sandeep Raghuwanshi, Nicolas Vermandé, and Derek Ashmore — each covering different aspects of production AI infrastructure. The sessions walked through GPU scheduling, autoscaling patterns, observability for inference workloads, and a capstone incident simulation.

As a special part of the event, I joined for a live AMA (Ask Me Anything) focused on practical lessons from scaling AI/ML workloads on Kubernetes, designing GPU platforms, and takeaways from KubeCon Europe 2026 that platform teams can apply directly.

What the workshop covered

The workshop was structured around a simple premise: AI workloads do not behave like microservices. Inference services, agents, and GPU-backed runtimes create load patterns that can break autoscaling, hide failures, and burn budgets fast.

The core tracks:

GPU scheduling and node pool design — how to prevent contention and collapse under load
Autoscaling patterns for inference — HPA, VPA, KEDA, and Cluster Autoscaler working together
Observability for AI workloads — what to measure, where failures hide, and which metrics actually matter
Guardrails — quotas, policies, and secure-by-default deployment with OPA/Kyverno
Incident response — what happens when traffic spikes or agents behave unpredictably
Capstone scenario — a simulated production incident with traffic spikes and service failure

The AMA: What people asked

The AMA covered ground across GPU infrastructure, platform engineering, and the gap between demos and production. Here are the themes that came up most.

GPU multi-tenancy in practice

The most common question: how do you share GPUs across teams without everything breaking?

This is exactly what I presented at KubeCon Europe 2026: multi-tenant GPUs on bare metal using a GitOps blueprint. The short answer: you need a combination of NVIDIA MIG, MPS, and time-slicing, chosen per workload profile (MIG gives hardware-isolated partitions, MPS lets processes share a GPU concurrently, and time-slicing interleaves workloads without memory isolation), plus clear team contracts that define who gets what and when.

The mistake most teams make is treating GPU scheduling like CPU scheduling. The two are not alike. GPU memory is not swappable, and over-commitment does not degrade gracefully: the workload typically fails with an out-of-memory error instead of slowing down.
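To make the over-commitment point concrete, here is a minimal sketch of a hard admission check for a shared GPU. It is an illustration, not the blueprint from the talk: the `Workload` type, the `admit` helper, the workload names, and the 1 GiB headroom are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    gpu_memory_gib: float  # peak GPU memory the workload needs (assumed known)


def fits_on_gpu(workloads, gpu_memory_gib, headroom_gib=1.0):
    """Return True only if the summed peak demand fits, with headroom.

    There is no swap to fall back on, so this is a hard sum of peaks
    rather than an average-utilization estimate.
    """
    required = sum(w.gpu_memory_gib for w in workloads)
    return required + headroom_gib <= gpu_memory_gib


def admit(new, running, gpu_memory_gib, headroom_gib=1.0):
    """Decide up front whether a new workload may share the GPU."""
    return fits_on_gpu(running + [new], gpu_memory_gib, headroom_gib)


# Hypothetical tenants on a 40 GiB device.
running = [Workload("embedder", 10.0), Workload("reranker", 12.0)]
print(admit(Workload("chat-small", 16.0), running, gpu_memory_gib=40.0))  # True
print(admit(Workload("chat-large", 28.0), running, gpu_memory_gib=40.0))  # False
```

The design choice is to reject at admission time: with CPU you can often accept the pod and let it run slower, but here the cheaper failure is the one that happens before anything is scheduled.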

Autoscaling inference workloads

Several questions focused on why HPA alone is not enough for inference. The core issue is that inference latency and throughput do not correlate linearly with pod count, and CPU or memory utilization says little about either. You need custom metrics (tokens per second, queue depth, time-to-first-token) feeding into KEDA, or into HPA through the custom metrics API.

I shared the pattern we use: KEDA for queue-based scaling (watching inference request queues), HPA for latency-based scaling (P99 response time), and Cluster Autoscaler for node provisioning — all three working together with different reaction times.
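As a rough illustration of how two signals combine, the sketch below applies the scaling rule documented for the Kubernetes HPA, ceil(currentReplicas * currentValue / targetValue) with a 10% tolerance band, to a queue-depth signal and a latency signal, then keeps the larger proposal. The metric names, targets, and replica bounds are assumptions for the example, not the values we run in production, and the real controllers add stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """HPA-style rule: ceil(current_replicas * current / target).

    If the ratio is within the tolerance band, keep the current count.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def combined_replicas(current_replicas, signals, min_replicas=1, max_replicas=20):
    """Evaluate every (current, target) signal and keep the largest proposal."""
    proposals = [
        desired_replicas(current_replicas, current, target)
        for current, target in signals.values()
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


# Hypothetical readings: the queue is 3x over target, latency 1.2x over target.
signals = {
    "queue_depth_per_replica": (45.0, 15.0),
    "p99_latency_seconds": (1.8, 1.5),
}
print(combined_replicas(4, signals))  # 12
```

Taking the maximum means the most stressed signal wins, which is why a backed-up queue can drive scale-out even while latency still looks tolerable.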

The PoC-to-production gap

A recurring theme: teams that can run a model in a notebook but struggle to make it survive in production. This maps directly to the PoC-to-production gap I have been writing about — the distance between “it works on my laptop” and “it handles 10,000 requests per second with three-nines uptime.”

The biggest gaps are usually not technical. They are organizational: no clear ownership between ML teams and platform teams, no SLOs defined for inference, and no runbooks for GPU-specific failures.
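One way to start the SLO conversation is to turn a target like three nines into an error budget. The sketch below does that arithmetic for a 30-day window at the 10,000 requests per second used as the example above; the function name and the window length are assumptions for illustration.

```python
def error_budget(availability, requests_per_second, window_days=30):
    """Translate an availability target into allowed downtime and failures."""
    window_seconds = window_days * 24 * 60 * 60
    allowed_fraction = 1.0 - availability
    return {
        "downtime_minutes": window_seconds * allowed_fraction / 60,
        "failed_requests": requests_per_second * window_seconds * allowed_fraction,
    }


budget = error_budget(0.999, 10_000)
print(round(budget["downtime_minutes"], 1))  # 43.2
print(round(budget["failed_requests"]))      # 25920000
```

Roughly 43 minutes a month is not much room when a single GPU node replacement can take a good part of that, which is why the runbooks matter as much as the target.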

KubeCon takeaways for platform teams

People wanted to know what was different at KubeCon Europe 2026 compared to previous years. The answer: AI has moved from “cool demos on stage” to “how do we actually run this in production without going bankrupt.”

The shift is from experimentation to industrialization — and that means platform engineering teams are now the bottleneck, not data scientists. If your platform cannot serve GPUs reliably, your AI strategy is blocked.

Hidden cost drivers

Several questions asked why AI workloads are so expensive on Kubernetes. I pointed people to the hidden cost analysis: the biggest offenders are idle GPU time (reserved but unused), over-provisioned node pools, and missing spot/preemptible instance strategies for non-critical workloads.

One rule of thumb I shared: if your GPU utilization is below 60%, you are almost certainly wasting money. And most teams I work with are at 20-30%.
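To put a number on that rule of thumb, here is a small sketch that estimates how much of a monthly GPU bill pays for idle capacity. The GPU count, the hourly price, and the 730-hour month are hypothetical inputs, and average utilization is treated as a blunt proxy for useful work.

```python
def idle_gpu_cost(gpu_count, hourly_price, utilization, hours=730):
    """Return (total_cost, idle_cost) for a reserved GPU pool.

    utilization is the average fraction of time the GPUs do useful work.
    """
    total = gpu_count * hourly_price * hours
    idle = total * (1.0 - utilization)
    return total, idle


# Hypothetical pool: 8 GPUs at 2.50 per GPU-hour.
for utilization in (0.25, 0.60):
    total, idle = idle_gpu_cost(8, 2.50, utilization)
    print(f"{utilization:.0%} utilized: {round(idle, 2)} of {round(total, 2)} is idle spend")
```

With these assumed inputs, moving from 25% to 60% utilization cuts idle spend from three quarters of the bill to two fifths without adding a single GPU.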

The capstone incident

The workshop ended with a simulated production incident — traffic spikes hitting an inference service, autoscaling reacting too slowly, and cascading failures across the cluster. Participants had to diagnose what broke, identify which signals mattered, and propose fixes.

This is exactly the kind of scenario I recommend teams practice before it happens in production. Having a first-90-days plan that includes incident simulation is the difference between a team that panics and a team that responds.

Key takeaways

If you missed the workshop, here is what I would want you to walk away with:

AI workloads are not microservices. Stop treating them that way: they need different scheduling, different autoscaling, and different observability.
GPU multi-tenancy requires explicit contracts. Define who gets what, enforce it with quotas and policies, and monitor utilization obsessively.
Autoscaling needs custom metrics. CPU and memory are not enough. Measure tokens/second, queue depth, and time-to-first-token.
The platform team is the bottleneck. If your platform cannot serve GPUs reliably, your AI strategy is blocked. Invest in platform engineering.
Practice incidents before they happen. Run chaos engineering exercises specific to GPU workloads. Know what happens when a GPU node goes down.

The lineup

The full speaker lineup delivered a comprehensive view of production AI on Kubernetes:

Shadab Hussain — AI workloads and agent mental models on Kubernetes
Sandeep Raghuwanshi — deploying AI services and GPU scheduling strategies
Nicolas Vermandé — observability and monitoring for inference workloads
Derek Ashmore — CI/CD, GitOps integration, and deployment patterns
Luca Berton — live AMA on GPU platforms, multi-tenancy, and KubeCon lessons

If the workshop topics resonated, here is the deep-dive reading list:

Multi-Tenant GPU Platform Operating Model — the full architecture
Autoscaling AI Inference on Kubernetes — KEDA, HPA, and token economics
Three-Layer Observability for AI — what to measure
Irreversible Architecture Decisions — what you cannot undo later
KubeCon 2026 Talk Recap: Packed Room — the KubeCon presentation

Want to discuss AI on Kubernetes for your organization? Book a consultation. For more on GPU infrastructure and platform engineering, explore the full AI on Kubernetes in Production series.
