
llm-d Joins the CNCF: Kubernetes-Native

llm-d is the CNCF's new sandbox project for distributed LLM inference on Kubernetes. Prefill/decode disaggregation, KV cache routing, and.

Luca Berton

The CNCF just accepted llm-d as a sandbox project, and it might be the most important AI infrastructure project of 2026. Built by IBM Research, Google, Red Hat, CoreWeave, and NVIDIA, llm-d turns Kubernetes into a state-of-the-art LLM inference platform.

I saw this announced at KubeCon Europe 2026 and immediately understood why it matters.

What Problem Does llm-d Solve?

LLM inference is not like serving a REST API. It is:

  • Highly stateful: KV caches must be preserved across requests for efficiency
  • Latency-sensitive: users wait in real time for token generation
  • Asymmetric: prompt processing (prefill) is compute-heavy; token generation (decode) is memory-bound
  • Expensive: GPU utilization directly translates to cost per token

Traditional Kubernetes load balancing (round-robin, least-connections) ignores all of this. It picks pods without regard to cache contents or hardware, which fragments KV caches, misaligns workloads with hardware, and causes unpredictable latency.
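
To make the fragmentation concrete, here is a minimal sketch in Python. It is a toy simulation, not llm-d code: the pod names, the prompt labels, and the `hit_rate`, `round_robin`, and `prefix_affinity` helpers are all hypothetical, and each pod's cache is assumed to be unbounded. It replays the same three prompt prefixes over four pods and counts how often a request lands on a pod that has already seen its prefix.

```python
import hashlib
from itertools import cycle

def hit_rate(requests, pods, pick):
    """Fraction of requests that land on a pod already holding their prefix."""
    cached = {pod: set() for pod in pods}  # prefixes resident on each pod
    hits = 0
    for prefix in requests:
        pod = pick(prefix)
        if prefix in cached[pod]:
            hits += 1
        cached[pod].add(prefix)
    return hits / len(requests)

PODS = ["pod-0", "pod-1", "pod-2", "pod-3"]        # hypothetical pod names
REQUESTS = ["prompt-A", "prompt-B", "prompt-C"] * 4  # 12 requests, 3 shared prefixes

_rr = cycle(PODS)

def round_robin(_prefix):
    # Ignores the request entirely, like a plain Service load balancer.
    return next(_rr)

def prefix_affinity(prefix):
    # Always maps the same prefix to the same pod via a stable hash.
    digest = hashlib.sha256(prefix.encode()).digest()
    return PODS[int.from_bytes(digest[:8], "big") % len(PODS)]

round_robin_rate = hit_rate(REQUESTS, PODS, round_robin)
affinity_rate = hit_rate(REQUESTS, PODS, prefix_affinity)
print(f"round-robin hit rate:     {round_robin_rate:.2f}")
print(f"prefix-affinity hit rate: {affinity_rate:.2f}")
```

In this toy trace round-robin never reuses a cache (every prefix visits every pod once before repeating), while prefix affinity misses only on the first request for each prefix, so 9 of 12 requests hit. Pure hashing is a simplification: a real picker also has to weigh load, which is why cache locality is usually one signal among several.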

llm-d fixes this by making Kubernetes inference-aware.

Key Capabilities

1. Prefill/Decode Disaggregation

The breakthrough architecture: separate prompt processing from token generation into independently scalable pods.

Why this matters:

  • Prefill is compute-bound (needs GPU FLOPS)
  • Decode is memory-bound (needs GPU memory bandwidth)
  • The same hardware configuration rarely serves both phases optimally
  • Disaggregation lets you scale each phase independently

Request → [Prefill Pods] → KV Cache Transfer → [Decode Pods] → Response
           (GPU compute)                        (GPU memory)

This is the same architecture that Mistral AI uses internally. They are contributing a DisaggregatedSet operator for LeaderWorkerSet (LWS) back to the project.
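
The compute-bound versus memory-bound split can be checked with back-of-the-envelope arithmetic. The sketch below is a simplified roofline estimate, not a measurement: it assumes a dense model with 2-byte weights, roughly 2 FLOPs per parameter per token, and weights read from memory once per forward pass, and it ignores KV cache and activation traffic. The accelerator figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`) are round illustrative numbers, not the specification of any product.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    FLOPs ~ 2 * params * tokens, bytes ~ params * bytes_per_param,
    so the parameter count cancels out.
    """
    return 2.0 * tokens_per_pass / bytes_per_param

def classify(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare the workload's intensity with the hardware's ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte where the two limits meet
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute-bound"
    return "memory-bound"

PEAK_FLOPS = 1.0e15     # assumed: 1000 TFLOP/s
MEM_BANDWIDTH = 3.0e12  # assumed: 3 TB/s, so the ridge is about 333 FLOPs/byte

prefill = classify(2048, PEAK_FLOPS, MEM_BANDWIDTH)  # one 2048-token prompt
decode = classify(1, PEAK_FLOPS, MEM_BANDWIDTH)      # one new token per step
print("prefill:", prefill)
print("decode: ", decode)
```

Under these assumptions a 2048-token prefill pass does about 2048 FLOPs per byte of weights and sits well above the ridge, while a single-token decode step does about 1 and sits far below it. Batching decode requests raises the intensity, but it takes hundreds of concurrent sequences to reach the ridge in this model, which is the practical argument for sizing the two phases separately.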

2. Inference-Aware Traffic Routing

llm-d implements the Kubernetes Gateway API Inference Extension (GAIE) with an Endpoint Picker (EPP) that understands:

  • Prefix cache locality: route requests to pods that already have relevant KV cache entries
  • Model readiness: not just container health, but model loading state
  • Queue depth: actual inference queue, not just TCP connections
  • Hardware topology: route to pods on matching accelerator types
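
One way to picture how these four signals combine is a filter-then-score picker. The sketch below is an illustration only and does not reproduce the EPP's actual algorithm or API: the `Pod` class, the `pick_endpoint` function, the block identifiers, and both weights are hypothetical. Readiness and accelerator type act as hard filters; prefix cache overlap and queue depth feed a weighted score.

```python
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    model_ready: bool          # model fully loaded, not just container up
    queue_depth: int           # requests waiting in the inference queue
    accelerator: str           # e.g. "h100", "mi300x"
    cached_blocks: set = field(default_factory=set)

def prefix_hit_ratio(pod: Pod, request_blocks: list) -> float:
    """Fraction of the request's leading blocks already cached on the pod."""
    matched = 0
    for block in request_blocks:
        if block not in pod.cached_blocks:
            break  # a prefix match stops at the first missing block
        matched += 1
    return matched / len(request_blocks) if request_blocks else 0.0

def pick_endpoint(pods, request_blocks, accelerator,
                  cache_weight=2.0, queue_weight=0.1):
    """Filter on readiness and hardware, then score cache locality against load."""
    candidates = [p for p in pods if p.model_ready and p.accelerator == accelerator]
    if not candidates:
        return None

    def score(p: Pod) -> float:
        return cache_weight * prefix_hit_ratio(p, request_blocks) - queue_weight * p.queue_depth

    return max(candidates, key=score)

blocks = ["b0", "b1", "b2", "b3"]
pods = [
    Pod("warm", True, 4, "h100", {"b0", "b1", "b2"}),
    Pod("cold", True, 0, "h100"),
    Pod("loading", False, 0, "h100", set(blocks)),
    Pod("other-hw", True, 0, "mi300x", set(blocks)),
]
chosen = pick_endpoint(pods, blocks, "h100")
print(chosen.name)
```

With these weights the picker selects `warm`: its 3-of-4 prefix match outweighs a queue of four. If that queue grew to 20, the load penalty would exceed the cache bonus and the request would go to `cold` instead, which is the trade-off a cache-aware router has to make continuously.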

3. Hierarchical KV Cache Management

KV cache offloading across multiple tiers:

GPU HBM (fastest) → CPU RAM → NVMe SSD → Network Storage (cheapest)

Hot caches stay on GPU. Warm caches spill to CPU. Cold caches go to disk. This maximizes GPU utilization while keeping frequently used contexts accessible.
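
The hot/warm/cold behaviour can be sketched as a chain of LRU tiers. This is a minimal model under stated assumptions, not llm-d's implementation: the `TieredKVCache` class, the tier names, and the entry-count capacities are hypothetical, and real systems track bytes and move tensors asynchronously rather than Python objects. An entry evicted from one tier is demoted to the next, and any read promotes the entry back to the fastest tier.

```python
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), fastest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: the entry is dropped
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)  # mark as most recently used
        while len(store) > cap:
            old_key, old_value = store.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_value)     # demote one tier down

    def put(self, key, value):
        for _, _, store in self.tiers:
            store.pop(key, None)  # drop any stale copy in any tier
        self._insert(0, key, value)

    def get(self, key):
        for _, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)  # promote back to the fastest tier
                return value
        return None

    def tier_of(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None

cache = TieredKVCache([("gpu", 2), ("cpu", 2), ("disk", 2)])
for session in ("s1", "s2", "s3"):
    cache.put(session, f"kv-for-{session}")
print(cache.tier_of("s1"))  # s1 was least recently used, so it spilled to cpu
cache.get("s1")             # reading s1 promotes it and demotes s2
print(cache.tier_of("s1"), cache.tier_of("s2"))
```

Plain LRU is the simplest possible policy here. A production cache would likely also account for entry size and the cost of recomputing a prefix versus fetching it from a slower tier.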

4. Any Model, Any Accelerator, Any Cloud

llm-d is hardware-agnostic by design:

  • NVIDIA GPUs (A100, H100, B200)
  • AMD GPUs (MI300X)
  • Intel Gaudi accelerators (originally developed by Habana Labs)
  • Google TPUs

The routing layer adapts request placement based on hardware characteristics: a request that benefits from high memory bandwidth goes to different hardware than one that needs compute throughput.

Architecture on Kubernetes

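# llm-d uses LeaderWorkerSet for multi-node inference.
# Image names and GPU counts are as given in this example; check the
# project's registry and your hardware before applying.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-inference
spec:
  replicas: 3            # number of leader+worker groups
  leaderWorkerTemplate:
    size: 2              # pods per group (1 leader + 1 worker); the default of 1 creates no workers
    leaderTemplate:
      spec:
        containers:
          - name: prefill-engine
            image: ghcr.io/llm-d/prefill:latest
            resources:
              limits:
                nvidia.com/gpu: "4"
    workerTemplate:
      spec:
        containers:
          - name: decode-engine
            image: ghcr.io/llm-d/decode:latest
            resources:
              limits:
                nvidia.com/gpu: "2"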

Who Is Behind It

The founding consortium is impressive:

  • IBM Research: core architecture and KV cache management
  • Google Cloud: Gateway API integration and GKE optimization
  • Red Hat: OpenShift integration and operator development
  • CoreWeave: bare-metal GPU infrastructure expertise
  • NVIDIA: hardware optimization and accelerator support
  • AMD, Cisco, Intel, Lambda: multi-accelerator support
  • Hugging Face: model ecosystem integration
  • Mistral AI: disaggregated serving operator contributions
  • UC Berkeley and University of Chicago: research

How It Relates to Existing CNCF Projects

llm-d bridges the gap between:

  • KServe (high-level model serving control plane) and
  • vLLM (low-level inference engine)

It also integrates with the Kubernetes AI Conformance Program to ensure disaggregated serving is interoperable across platforms.

My Prediction

llm-d will become the standard way to serve LLMs on Kubernetes within 18 months. The combination of CNCF governance, multi-vendor backing, and solving a real technical problem (inference-aware scheduling) makes it inevitable.

For platform teams currently running vLLM behind basic Kubernetes services: watch this project closely. The efficiency gains from prefill/decode disaggregation and cache-aware routing are substantial: I have seen 2-3x improvement in tokens per second per GPU in similar architectures.

About the Author

I am Luca Berton, AI and Cloud Advisor. I presented on GPU scheduling at KubeCon EU 2026 and help enterprises deploy inference infrastructure. Book a consultation to design your LLM serving architecture.
