Red Hat OpenShift AI Gen AI Studio Playground demo at Tech Day Netherlands 2026
Red Hat AI Model-as-a-Service with llm-d: Enterprise GenAI Inference Platform

How Red Hat's llm-d transforms LLM inference into a composable Kubernetes-native architecture: disaggregated serving, smart autoscaling on token throughput, Model-as-a-Service with subscription tiers, and the new Enterprise GenAI Inference Platform running on OpenShift and xKS/Linux.

Luca Berton
· 2 min read

At Red Hat Tech Day Netherlands (June 2026), the second half of the vLLM deep dive shifted to what comes after optimized inference: llm-d for distributed serving and the full Model-as-a-Service platform that makes enterprise AI consumable.

This post covers the production platform layer, from the problems llm-d solves through the live demo of RHOAI's Gen AI Studio with API key management and subscription tiers.

Red Hat OpenShift AI Playground - live model interaction

The Problem: LLM Routing Today

Current LLM serving has fundamental limitations:

  • Round-robin routing: No awareness of GPU load or cache state
  • No KV cache reuse: Every request starts cold
  • Static model selection: Cannot route based on request complexity
  • CPU/memory autoscaling: Meaningless for token-based workloads

The result: underutilized GPUs, poor latency under load, and architectures that do not scale with demand.
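
To make the contrast with round-robin concrete, here is a minimal Python sketch of load-aware and cache-aware replica selection. It is an illustration only, not llm-d's scheduler: the `Replica` class, the `pick_replica` function, and the `prefix_len` and `cache_bonus` values are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Replica:
    # Hypothetical view of one model server as seen by a router.
    name: str
    queued_requests: int = 0
    cached_prefixes: set = field(default_factory=set)


def pick_replica(replicas, prompt, prefix_len=16, cache_bonus=4):
    """Pick the replica with the lowest score.

    Score = queue depth, minus a bonus when the replica already holds the
    prompt's prefix. prefix_len and cache_bonus are illustrative values.
    """
    prefix = prompt[:prefix_len]

    def score(replica):
        bonus = cache_bonus if prefix in replica.cached_prefixes else 0
        return replica.queued_requests - bonus

    chosen = min(replicas, key=score)
    chosen.queued_requests += 1          # the request now occupies this replica
    chosen.cached_prefixes.add(prefix)   # its prefix is warm from now on
    return chosen


replicas = [Replica("gpu-a", queued_requests=3), Replica("gpu-b", queued_requests=1)]

# Round-robin baseline: the first pick is gpu-a, the busiest replica.
baseline = next(cycle(replicas))

system_prompt = "You are a helpful assistant. "
first = pick_replica(replicas, system_prompt + "Summarize this ticket.")
second = pick_replica(replicas, system_prompt + "Translate this email.")
print(baseline.name, first.name, second.name)
```

Both routed requests share the same prompt prefix, so the second one lands on the replica that the first one warmed up, while round-robin would have started on the most loaded replica.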

What is llm-d?

llm-d is an open-source Kubernetes-native distributed LLM inference system, jointly developed by:

  • Red Hat
  • Google
  • NVIDIA
  • Hugging Face

It extends vLLM with a disaggregated architecture where prefill, decode, and KV cache run as independent microservices.
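
The separation of phases can be pictured with a toy Python sketch. It is an assumption-laden illustration, not llm-d or vLLM code: the `prefill` and `decode` functions, the `prompt_key` helper, and the plain dict standing in for the KV cache service are all hypothetical, and the "KV state" is just a word list where a real engine would hold attention tensors.

```python
import hashlib


def prompt_key(prompt):
    """Stable key for a prompt so both phases can find the same cache entry."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def prefill(prompt, cache):
    """Prefill phase: process the whole prompt once and publish its state.

    Returns the cache key and whether an existing entry was reused.
    """
    key = prompt_key(prompt)
    reused = key in cache
    if not reused:
        cache[key] = prompt.split()  # toy stand-in for real KV tensors
    return key, reused


def decode(key, cache, max_new_tokens=3):
    """Decode phase: read the published state and emit placeholder tokens."""
    context = cache[key]
    return [f"token-{len(context) + i}" for i in range(max_new_tokens)]


kv_cache = {}  # stand-in for a shared KV cache service
key, reused_first = prefill("Explain KV cache reuse", kv_cache)
tokens = decode(key, kv_cache)
_, reused_second = prefill("Explain KV cache reuse", kv_cache)
print(tokens, reused_first, reused_second)
```

Because the two phases only meet at the cache, each can in principle run on different hardware and scale on its own, which is the point of the disaggregated design.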

llm-d Architecture

Request → Inference Gateway → Validation + Prompt Logging
                                    ↓
              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
           Prefill            Scheduler              Decode
              │                     │                     │
              └─────────────── KV Cache ──────────────────┘
                                    ↓
                          Distributed Nodes/GPUs

How llm-d Delivers

  1. Microservice-based: Prefill, decode, and KV cache run independently
  2. Integrated observability: Metrics for each component separately
  3. Smart autoscaling: Based on token throughput and SLOs (not CPU/memory)
  4. Independent scaling: Scale prefill, decode, or cache phases separately
  5. Efficient GPU use: Better utilization across variable workloads
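
As a rough illustration of autoscaling on token throughput rather than CPU or memory, the following Python sketch computes a replica count from observed tokens per second. The `desired_replicas` function, its parameters, and every number below are assumptions for the example, not llm-d's autoscaler or its configuration.

```python
import math


def desired_replicas(observed_tokens_per_sec, tokens_per_sec_per_replica,
                     target_utilization=0.8, min_replicas=1, max_replicas=8):
    """Replicas needed to serve the observed token rate with some headroom.

    All parameters are hypothetical knobs chosen for illustration.
    """
    if tokens_per_sec_per_replica <= 0:
        raise ValueError("per-replica throughput must be positive")
    usable = tokens_per_sec_per_replica * target_utilization
    needed = math.ceil(observed_tokens_per_sec / usable)
    # Clamp to the allowed range so the pool never scales to zero or runs away.
    return max(min_replicas, min(max_replicas, needed))


# Prefill and decode pools are sized independently from their own metrics.
plan = {
    "prefill": desired_replicas(9000, 2500),
    "decode": desired_replicas(1500, 500),
}
print(plan)
```

With these made-up figures the prefill pool needs 5 replicas and the decode pool 4, even though both might show similar CPU usage.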

“llm-d transforms LLM inference into a composable Kubernetes-native architecture, making AI as manageable as any microservice.”

Model deployments showing llm-d distributed inference

Model-as-a-Service in OpenShift AI

The Platform Engineering vision for AI:

  • IT serves common models centrally: Generative AI focus, applicable to any model
  • Centralized pool of hardware: Shared GPU resources managed by ITOps
  • Models available through RHOAI console: Self-service for developers
  • Developers consume models, build AI applications: For end users (assistants) or to improve products
  • Shared resource business model: Keeps costs down across the organization
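
The shared-resource model with API keys and subscription tiers can be sketched as a simple quota check. This is a hypothetical Python illustration, not the RHOAI API: the tier names, the token quotas, the `ApiKey` class, and the `authorize` function are all invented for the example.

```python
from dataclasses import dataclass

# Hypothetical daily token quotas per subscription tier.
TIER_QUOTAS = {"free": 10_000, "standard": 100_000, "premium": 1_000_000}


@dataclass
class ApiKey:
    key_id: str
    tier: str
    tokens_used: int = 0


def authorize(api_key, requested_tokens):
    """Allow the request only if it fits in the key's remaining quota."""
    quota = TIER_QUOTAS[api_key.tier]
    if api_key.tokens_used + requested_tokens > quota:
        return False  # over quota: reject without consuming anything
    api_key.tokens_used += requested_tokens
    return True


key = ApiKey("dev-team-1", "free")
results = [authorize(key, 8_000), authorize(key, 3_000), authorize(key, 2_000)]
print(results, key.tokens_used)
```

The second request is rejected because it would exceed the free tier's quota, while the smaller third request still fits. Metering in tokens rather than requests matches how the shared GPU pool is actually consumed.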

The MaaS Stack

| Layer | Managed By | |