At the CNCF Press Conference during KubeCon Europe 2026, Brian Stevens, SVP and AI CTO at Red Hat, announced llm-d as a new CNCF project. llm-d is a Kubernetes-native distributed inference engine designed to solve the problems that current inference serving stacks leave unsolved.
This is not another wrapper around vLLM. It is a ground-up rethinking of how large language model inference should work in distributed Kubernetes environments.
What llm-d Does
llm-d addresses four specific challenges that plague production LLM inference:
1. Kubernetes-Native Inference
llm-d is built for Kubernetes from the start, not adapted from a standalone serving framework. It uses Kubernetes primitives for scheduling, scaling, and health management. No sidecar hacks, no custom operators bolted on after the fact.
2. Disaggregated (xPyD) Serving
This is the headline feature. Traditional inference serving couples the prefill phase (processing the input prompt) with the decode phase (generating tokens). llm-d separates these into independent, scalable components.
The xPyD architecture means you can run X prefill workers and Y decode workers independently:
- Scale prefill workers for throughput (batch processing of long prompts)
- Scale decode workers for latency (fast token generation)
- Match hardware to workload characteristics (prefill is compute-bound, decode is memory-bandwidth-bound)
This disaggregation is critical for production workloads because prefill and decode have fundamentally different resource profiles. Running them on the same GPU wastes either compute or memory bandwidth depending on the traffic pattern.
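As a rough illustration of why independent scaling helps, the sketch below sizes the prefill pool (X) and the decode pool (Y) separately from their own load. The function `plan_pools`, the `PoolPlan` type, and all throughput figures are hypothetical and chosen for illustration; they are not part of llm-d's API or measured numbers.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolPlan:
    prefill_workers: int  # X in xPyD
    decode_workers: int   # Y in xPyD


def plan_pools(prompt_tokens_per_s, output_tokens_per_s,
               prefill_capacity, decode_capacity):
    """Size each pool from its own load, independently of the other.

    Capacities are tokens per second that one worker can sustain
    (hypothetical values, not llm-d measurements).
    """
    x = max(1, math.ceil(prompt_tokens_per_s / prefill_capacity))
    y = max(1, math.ceil(output_tokens_per_s / decode_capacity))
    return PoolPlan(prefill_workers=x, decode_workers=y)


# Long prompts, short answers (for example, summarization).
summarize = plan_pools(200_000, 4_000,
                       prefill_capacity=50_000, decode_capacity=2_000)

# Short prompts, long answers (for example, chat).
chat = plan_pools(20_000, 12_000,
                  prefill_capacity=50_000, decode_capacity=2_000)

print(summarize)
print(chat)
```

Under these assumed numbers the summarization traffic needs more prefill workers than decode workers, while the chat traffic needs the opposite. A coupled deployment would have to provision every replica for both phases at once.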
3. Tiered Prefix Caching
llm-d implements multi-tier prefix caching:
- L1 (GPU memory): Hot KV cache entries for active conversations
- L2 (CPU memory): Warm cache for recently used prefixes
- L3 (Distributed storage): Cold cache for shared system prompts and common prefixes
This matters enormously for production deployments where many users share similar system prompts or conversation starters. Instead of recomputing the same prefix thousands of times, llm-d caches and reuses it across the cluster.
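The tier lookup described above can be sketched as a toy in-process cache. This is a minimal illustration under simplifying assumptions, not llm-d's implementation: the class `TieredPrefixCache` is hypothetical, plain strings stand in for prefixes and KV blocks, eviction is simple least-recently-used, and the L3 tier is an ordinary dictionary rather than a real distributed store.

```python
from collections import OrderedDict


class TieredPrefixCache:
    """Toy three-tier lookup: L1 (GPU), L2 (CPU), L3 (distributed store)."""

    def __init__(self, l1_capacity, l2_capacity):
        self.l1 = OrderedDict()  # hot tier, LRU ordered
        self.l2 = OrderedDict()  # warm tier, LRU ordered
        self.l3 = {}             # cold tier, unbounded in this sketch
        self.l1_capacity = l1_capacity
        self.l2_capacity = l2_capacity

    def put(self, prefix, kv):
        """Insert into L1, demoting least-recently-used entries downward."""
        # Ensure the prefix lives in exactly one tier.
        self.l2.pop(prefix, None)
        self.l3.pop(prefix, None)
        self.l1[prefix] = kv
        self.l1.move_to_end(prefix)
        while len(self.l1) > self.l1_capacity:
            old, old_kv = self.l1.popitem(last=False)
            self.l2[old] = old_kv
        while len(self.l2) > self.l2_capacity:
            old, old_kv = self.l2.popitem(last=False)
            self.l3[old] = old_kv

    def get(self, prefix):
        """Return (tier, kv) for a hit, or (None, None) for a miss."""
        if prefix in self.l1:
            self.l1.move_to_end(prefix)
            return "L1", self.l1[prefix]
        if prefix in self.l2:
            kv = self.l2.pop(prefix)
            self.put(prefix, kv)  # promote to L1
            return "L2", kv
        if prefix in self.l3:
            kv = self.l3.pop(prefix)
            self.put(prefix, kv)  # promote to L1
            return "L3", kv
        return None, None


cache = TieredPrefixCache(l1_capacity=1, l2_capacity=1)
for name in ("sys-a", "sys-b", "sys-c"):
    cache.put(name, f"kv-{name}")

first = cache.get("sys-a")   # oldest entry, found in the cold tier
second = cache.get("sys-a")  # promoted by the previous lookup
miss = cache.get("unknown")
print(first, second, miss)
```

A hit in a lower tier still avoids recomputing the prefix; it only costs the transfer back to GPU memory. Production systems typically key on hashes of token blocks rather than whole strings, which this sketch leaves out.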
4. Multi-Accelerator Support
llm-d is not GPU-only. It supports multiple accelerator types including NVIDIA GPUs, AMD GPUs, and Intel Gaudi. This aligns with the Kubernetes AI Conformance initiative: workloads should be portable across accelerators without code changes.
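On Kubernetes, accelerators are usually requested as extended resources exposed by each vendor's device plugin. The sketch below shows how one workload definition could switch vendors by changing only the resource name. The helper `container_resources` is hypothetical, the resource names are those commonly published by the NVIDIA, AMD, and Intel Gaudi device plugins and should be checked against your cluster, and whether llm-d's own manifests use them this way is an assumption.

```python
# Extended resource names as commonly exposed by vendor device plugins.
# Verify against the plugins installed in your cluster.
ACCELERATOR_RESOURCES = {
    "nvidia": "nvidia.com/gpu",
    "amd": "amd.com/gpu",
    "gaudi": "habana.ai/gaudi",
}


def container_resources(vendor, count=1):
    """Build the `resources` section of a container spec for one vendor."""
    try:
        resource_name = ACCELERATOR_RESOURCES[vendor]
    except KeyError:
        raise ValueError(f"unsupported accelerator vendor: {vendor!r}") from None
    # Extended resources are requested under `limits`.
    return {"limits": {resource_name: count}}


print(container_resources("nvidia"))
print(container_resources("gaudi", count=2))
```

Only the resource key differs between vendors here; the rest of the pod spec can stay the same, which is the kind of portability the conformance initiative aims for.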
Why This Matters
The current inference serving landscape is fragmented. vLLM, TGI, Triton Inference Server, and various custom solutions each solve parts of the problem but none addresses the full production lifecycle on Kubernetes.
llm-d's contribution is architectural:
- Disaggregation enables independent scaling. You are no longer constrained to scaling entire inference pods. Scale the bottleneck, not everything.
- Prefix caching reduces costs. Shared caches across pods eliminate redundant computation. For agentic workloads with long system prompts, this can reduce compute costs by 40-60%.
- Multi-accelerator portability. As GPU supply constraints continue, the ability to run inference on available accelerators (NVIDIA, AMD, Intel) without rewriting your serving stack is a strategic advantage.
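How much prefix caching saves depends on how much of each prompt is shared and how often the cache hits. The back-of-the-envelope sketch below estimates the fraction of prefill tokens that no longer need recomputing. The function name and the example figures are hypothetical; the result covers prefill work only and does not reproduce or validate any specific savings figure.

```python
def prefill_tokens_saved(shared_prefix_tokens, unique_tokens, hit_rate):
    """Fraction of prefill tokens skipped thanks to cached prefixes.

    shared_prefix_tokens: tokens in the reusable prefix (e.g. system prompt)
    unique_tokens:        tokens specific to each request
    hit_rate:             share of requests whose prefix is found in cache
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    total = shared_prefix_tokens + unique_tokens
    if total == 0:
        return 0.0
    return hit_rate * shared_prefix_tokens / total


# Hypothetical agentic workload: 3,000-token system prompt,
# 1,000 request-specific tokens, 80% cache hit rate.
saved = prefill_tokens_saved(3_000, 1_000, hit_rate=0.8)
print(f"prefill tokens saved: {saved:.0%}")
```

Because decode work is unaffected, the reduction in total compute is smaller than this fraction and depends on how prefill-heavy the workload is.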
How It Fits the Ecosystem
llm-d joins a growing CNCF AI ecosystem:
| Project | Role |
|---|---|
| llm-d | Distributed inference serving |
| Kyverno | Policy governance for AI workloads |
| Gateway API | Inference traffic routing |
| KAR | AI cluster conformance |
Brian Stevens emphasized that llm-d is designed to be composable: it works alongside existing CNCF projects rather than replacing them.
Getting Started
The project is available at the CNCF Sandbox level, meaning it is early-stage but under active development:
# Clone the project
git clone https://github.com/llm-d/llm-d.git
# Deploy on Kubernetes
kubectl apply -f deploy/
For teams already running inference on Kubernetes, llm-d is worth evaluating if you are hitting scaling bottlenecks with monolithic serving architectures. The disaggregated approach adds operational complexity but delivers significantly better resource utilization.