Red Hat AI Model-as-a-Service with llm-d

At Red Hat Tech Day Netherlands (June 2026), the second half of the vLLM deep dive shifted to what comes after optimized inference: llm-d for distributed serving and the full Model-as-a-Service platform that makes enterprise AI consumable.

This post covers the production platform layer — from the problems llm-d solves through the live demo of RHOAI’s Gen AI Studio with API key management and subscription tiers.

Red Hat OpenShift AI Playground - live model interaction

The Problem: LLM Routing Today

Current LLM serving has fundamental limitations:

Round-robin routing — No awareness of GPU load or cache state
No KV cache reuse — Every request starts cold
Static model selection — Cannot route based on request complexity
CPU/memory autoscaling — Meaningless for token-based workloads

The result: underutilized GPUs, poor latency under load, and architectures that do not scale with demand.

What is llm-d?

llm-d is an open-source Kubernetes-native distributed LLM inference system, jointly developed by:

Red Hat
Google
NVIDIA
Hugging Face

It extends vLLM with a disaggregated architecture where prefill, decode, and KV cache run as independent microservices.

llm-d Architecture

Request → Inference Gateway → Validation + Prompt Logging
                                    ↓
              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
           Prefill            Scheduler              Decode
              │                     │                     │
              └─────────────── KV Cache ──────────────────┘
                                    ↓
                          Distributed Nodes/GPUs

How llm-d Delivers

Microservice-based: Prefill, decode, and KV cache run independently
Integrated observability: Metrics for each component separately
Smart autoscaling: Based on token throughput and SLOs (not CPU/memory)
Independent scaling: Scale prefill, decode, or cache phases separately
Efficient GPU use: Better utilization across variable workloads

“llm-d transforms LLM inference into a composable Kubernetes-native architecture, making AI as manageable as any microservice.”

Model deployments showing llm-d distributed inference

Model-as-a-Service in OpenShift AI

The Platform Engineering vision for AI:

IT serves common models centrally — Generative AI focus, applicable to any model
Centralized pool of hardware — Shared GPU resources managed by ITOps
Models available through RHOAI console — Self-service for developers
Developers consume models, build AI applications — For end users (assistants) or to improve products
Shared resource business model — Keeps costs down across the organization

The MaaS Stack

| Layer | Managed By | |

Red Hat AI Model-as-a-Service with llm-d

The Problem: LLM Routing Today

What is llm-d?

llm-d Architecture

How llm-d Delivers

Model-as-a-Service in OpenShift AI

The MaaS Stack

Related Articles

LinkedIn Has the Most AI Slop. That's Actually an Opportunity.

What 'Agent Engineering Platform' Actually Means for Production AI

The Spec Layer: Why AI Agents Need Structured Intent, Not Vibes

Google's AI Evolution: Maps, Photos, Chrome, and Project Genie