At Red Hat Tech Day Netherlands (June 2026), the second half of the vLLM deep dive shifted to what comes after optimized inference: llm-d for distributed serving and the full Model-as-a-Service platform that makes enterprise AI consumable.
This post covers the production platform layer β from the problems llm-d solves through the live demo of RHOAIβs Gen AI Studio with API key management and subscription tiers.

The Problem: LLM Routing Today
Current LLM serving has fundamental limitations:
- Round-robin routing β No awareness of GPU load or cache state
- No KV cache reuse β Every request starts cold
- Static model selection β Cannot route based on request complexity
- CPU/memory autoscaling β Meaningless for token-based workloads
The result: underutilized GPUs, poor latency under load, and architectures that do not scale with demand.
What is llm-d?
llm-d is an open-source Kubernetes-native distributed LLM inference system, jointly developed by:
- Red Hat
- NVIDIA
- Hugging Face
It extends vLLM with a disaggregated architecture where prefill, decode, and KV cache run as independent microservices.
llm-d Architecture
Request β Inference Gateway β Validation + Prompt Logging
β
βββββββββββββββββββββββΌββββββββββββββββββββββ
β β β
Prefill Scheduler Decode
β β β
ββββββββββββββββ KV Cache βββββββββββββββββββ
β
Distributed Nodes/GPUsHow llm-d Delivers
- Microservice-based: Prefill, decode, and KV cache run independently
- Integrated observability: Metrics for each component separately
- Smart autoscaling: Based on token throughput and SLOs (not CPU/memory)
- Independent scaling: Scale prefill, decode, or cache phases separately
- Efficient GPU use: Better utilization across variable workloads
βllm-d transforms LLM inference into a composable Kubernetes-native architecture, making AI as manageable as any microservice.β

Model-as-a-Service in OpenShift AI
The Platform Engineering vision for AI:
- IT serves common models centrally β Generative AI focus, applicable to any model
- Centralized pool of hardware β Shared GPU resources managed by ITOps
- Models available through RHOAI console β Self-service for developers
- Developers consume models, build AI applications β For end users (assistants) or to improve products
- Shared resource business model β Keeps costs down across the organization
The MaaS Stack
| Layer | Managed By | |
