
llm-d: Kubernetes-Native Distributed Inference at Scale

Red Hat's llm-d project enters CNCF Sandbox, bringing disaggregated xPyD serving, tiered prefix caching, and multi-accelerator support to.

Luca Berton · 3 min read

At the CNCF Press Conference during KubeCon Europe 2026, Brian Stevens, SVP and AI CTO at Red Hat, announced llm-d as a new CNCF project. llm-d is a Kubernetes-native distributed inference engine designed to solve the problems that current inference serving stacks leave unaddressed.

llm-d is more than another wrapper around vLLM. It builds on vLLM as its inference engine and rethinks how large language model inference should work in distributed Kubernetes environments.

What llm-d Does

llm-d addresses four specific challenges that plague production LLM inference:

1. Kubernetes-Native Inference

llm-d is built for Kubernetes from the start, not adapted from a standalone serving framework. It uses Kubernetes primitives for scheduling, scaling, and health management. No sidecar hacks, no custom operators bolted on after the fact.

2. Disaggregated (xPyD) Serving

This is the headline feature. Traditional inference serving couples the prefill phase (processing the input prompt) with the decode phase (generating tokens). llm-d separates these into independent, scalable components.

The xPyD architecture means you can run X prefill workers and Y decode workers independently:

  • Scale prefill workers for throughput (batch processing of long prompts)
  • Scale decode workers for latency (fast token generation)
  • Match hardware to workload characteristics (prefill is compute-bound, decode is memory-bandwidth-bound)

This disaggregation is critical for production workloads because prefill and decode have fundamentally different resource profiles. Running them on the same GPU wastes either compute or memory bandwidth depending on the traffic pattern.
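A back-of-the-envelope roofline check makes the different resource profiles concrete. The sketch below uses a deliberately simplified model: roughly 2 FLOPs per parameter per token, weights read from memory once per forward pass, and KV-cache and activation traffic ignored. The accelerator figures are hypothetical round numbers, not the specifications of any real device.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Simplified model: ~2 FLOPs per parameter per token, and every weight is
    read from memory once per pass (16-bit weights by default).
    """
    flops_per_param = 2 * tokens_per_pass
    return flops_per_param / bytes_per_param


def bound_by(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a forward pass as compute-bound or memory-bound."""
    ridge = peak_flops / mem_bandwidth  # FLOPs/byte where both limits meet
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute"
    return "memory"


# Hypothetical accelerator: 1 PFLOP/s peak compute, 3 TB/s memory bandwidth.
PEAK_FLOPS = 1.0e15
MEM_BANDWIDTH = 3.0e12

# Prefill processes the whole prompt in one pass; decode processes one new
# token per sequence per step (here: a batch of 8 sequences).
prefill = bound_by(2048, PEAK_FLOPS, MEM_BANDWIDTH)
decode = bound_by(8, PEAK_FLOPS, MEM_BANDWIDTH)
print(f"prefill: {prefill}-bound, decode: {decode}-bound")
```

Under these assumptions a 2048-token prefill sits well above the ridge point while a small decode batch sits far below it, which is why co-locating the two phases leaves one resource idle.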

3. Tiered Prefix Caching

llm-d implements multi-tier prefix caching:

  • L1 (GPU memory): Hot KV cache entries for active conversations
  • L2 (CPU memory): Warm cache for recently used prefixes
  • L3 (Distributed storage): Cold cache for shared system prompts and common prefixes

This matters enormously for production deployments where many users share similar system prompts or conversation starters. Instead of recomputing the same prefix thousands of times, llm-d caches and reuses it across the cluster.
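To illustrate the idea, here is a minimal sketch of a three-tier lookup keyed by chained hashes of token blocks. It is not llm-d's implementation: the class name, the block size, and the promote-on-hit policy are all assumptions made for the example, and eviction is left out entirely.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 4  # tokens per cache block (hypothetical; kept tiny for readability)


def block_keys(tokens: List[int]) -> List[str]:
    """Chain-hash full blocks so each key identifies the entire prefix so far."""
    keys: List[str] = []
    parent = ""
    full_len = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for start in range(0, full_len, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(f"{parent}:{block}".encode()).hexdigest()
        keys.append(parent)
    return keys


class TieredPrefixCache:
    """Hypothetical three-tier cache: GPU (L1), CPU (L2), distributed storage (L3)."""

    TIERS = ("gpu", "cpu", "storage")

    def __init__(self) -> None:
        self.tiers: Dict[str, Dict[str, bytes]] = {name: {} for name in self.TIERS}

    def put(self, key: str, kv: bytes, tier: str = "gpu") -> None:
        self.tiers[tier][key] = kv

    def get(self, key: str) -> Optional[Tuple[str, bytes]]:
        """Search L1, then L2, then L3; promote the entry to the GPU tier on a hit."""
        for name in self.TIERS:
            if key in self.tiers[name]:
                kv = self.tiers[name][key]
                self.tiers["gpu"][key] = kv
                return name, kv
        return None

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        """Number of leading tokens whose KV entries are already cached."""
        cached = 0
        for key in block_keys(tokens):
            if self.get(key) is None:
                break
            cached += BLOCK_SIZE
        return cached


cache = TieredPrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
for key in block_keys(system_prompt):
    cache.put(key, b"kv-bytes", tier="storage")  # shared prefix starts out cold

request = system_prompt + [9, 10, 11]
reused = cache.longest_cached_prefix(request)
print(f"{reused} of {len(request)} prompt tokens can skip prefill")
```

Because each key hashes the whole prefix up to that block, two requests only share a cache entry when everything before the block matches as well, which is what makes shared system prompts such a good fit.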

4. Multi-Accelerator Support

llm-d is not GPU-only. It supports multiple accelerator types including NVIDIA GPUs, AMD GPUs, and Intel Gaudi. This aligns with the Kubernetes AI Conformance initiative: workloads should be portable across accelerators without code changes.
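At the Kubernetes level, portability largely comes down to which extended resource a pod requests. The sketch below shows that mechanism in general terms; it does not describe llm-d's own configuration format. The resource names are the ones the vendors' device plugins advertise to my knowledge, and the function name and container image are placeholders.

```python
from typing import Dict

# Extended resource names advertised by each vendor's Kubernetes device plugin
# (assumed here; check the plugin installed in your cluster).
ACCELERATOR_RESOURCES: Dict[str, str] = {
    "nvidia": "nvidia.com/gpu",
    "amd": "amd.com/gpu",
    "intel-gaudi": "habana.ai/gaudi",
}


def container_spec(name: str, image: str, accelerator: str, count: int = 1) -> dict:
    """Build a pod container spec that requests `count` devices of one type."""
    resource = ACCELERATOR_RESOURCES[accelerator]
    return {
        "name": name,
        "image": image,
        "resources": {"limits": {resource: count}},
    }


# Same workload definition, different accelerator: only the resource key changes.
for accel in ("nvidia", "amd", "intel-gaudi"):
    spec = container_spec("decode", "example.com/inference-server:latest", accel)
    print(accel, spec["resources"]["limits"])
```

The rest of the pod definition stays identical across vendors, which is the property the conformance effort is aiming for.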

Why This Matters

The current inference serving landscape is fragmented. vLLM, TGI, Triton Inference Server, and various custom solutions each solve parts of the problem but none addresses the full production lifecycle on Kubernetes.

llm-d's contribution is architectural:

  • Disaggregation enables independent scaling: You are no longer constrained to scaling entire inference pods. Scale the bottleneck, not everything.
  • Prefix caching reduces costs: Shared caches across pods eliminate redundant computation. For agentic workloads with long system prompts, this can reduce compute costs by 40-60%.
  • Multi-accelerator portability: As GPU supply constraints continue, the ability to run inference on available accelerators (NVIDIA, AMD, Intel) without rewriting your serving stack is a strategic advantage.
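A rough capacity calculation shows how the first two points interact. The sketch below sizes the prefill pool (X) and the decode pool (Y) separately and treats prefix-cache hits as prompt tokens that never reach prefill. Every number in it is hypothetical, including the per-worker throughputs, so it illustrates the shape of the trade-off rather than validating any particular savings figure.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Workload:
    requests_per_s: float
    prompt_tokens: int       # average tokens per prompt
    output_tokens: int       # average tokens generated per request
    prefix_hit_ratio: float  # fraction of prompt tokens served from the prefix cache


def size_pools(w: Workload, prefill_tok_per_s: float, decode_tok_per_s: float) -> Tuple[int, int]:
    """Return (X prefill workers, Y decode workers) for a steady-state load."""
    prefill_load = w.requests_per_s * w.prompt_tokens * (1.0 - w.prefix_hit_ratio)
    decode_load = w.requests_per_s * w.output_tokens
    x = max(1, math.ceil(prefill_load / prefill_tok_per_s))
    y = max(1, math.ceil(decode_load / decode_tok_per_s))
    return x, y


# Hypothetical per-worker throughput, in tokens per second.
PREFILL_TOK_PER_S = 20_000
DECODE_TOK_PER_S = 1_000

cold = Workload(requests_per_s=20, prompt_tokens=4000, output_tokens=250, prefix_hit_ratio=0.0)
warm = Workload(requests_per_s=20, prompt_tokens=4000, output_tokens=250, prefix_hit_ratio=0.5)

print("no cache hits :", size_pools(cold, PREFILL_TOK_PER_S, DECODE_TOK_PER_S))
print("50% cache hits:", size_pools(warm, PREFILL_TOK_PER_S, DECODE_TOK_PER_S))
```

In this toy setup a higher hit ratio shrinks only the prefill pool. The decode pool is unchanged, which is exactly the kind of adjustment a monolithic pod cannot make on its own.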

How It Fits the Ecosystem

llm-d joins a growing CNCF AI ecosystem:

Project | Role
llm-d | Distributed inference serving
Kyverno | Policy governance for AI workloads
Gateway API | Inference traffic routing
KAR | AI cluster conformance

Brian Stevens emphasized that llm-d is designed to be composable: it works alongside existing CNCF projects rather than replacing them.

Getting Started

The project is at the CNCF Sandbox level, meaning it is early-stage but under active development:

# Clone the project
git clone https://github.com/llm-d/llm-d.git

# Deploy on Kubernetes
kubectl apply -f deploy/

For teams already running inference on Kubernetes, llm-d is worth evaluating if you are hitting scaling bottlenecks with monolithic serving architectures. The disaggregated approach adds operational complexity but delivers significantly better resource utilization.
