
Running LLMs on OpenShift AI: A Complete Deployment Guide

Luca Berton 1 min read
#openshift #ai #llm #vllm #gpu

## 🚀 LLMs on OpenShift AI

OpenShift AI provides an enterprise-grade platform for deploying large language models. Here’s the complete guide from model selection to production serving.

## Platform Setup

### Prerequisites

```bash
# Install the OpenShift AI (RHOAI) operator
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: stable
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

# Install the NVIDIA GPU Operator
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator
  namespace: nvidia-gpu-operator
spec:
  channel: v24.9
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
```
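Note that OLM will not act on a Subscription unless its target namespace and an OperatorGroup in that namespace already exist. A sketch for the GPU operator (the RHOAI operator needs the same treatment for `redhat-ods-operator`):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
```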

## Model Serving with vLLM

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite-34b
  namespace: ai-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-runtime
      storageUri: s3://models/granite-34b-code-instruct
      resources:          # resources belong on the model container, not the predictor
        limits:
          nvidia.com/gpu: "2"
          memory: "80Gi"
        requests:
          cpu: "8"
          memory: "64Gi"
```
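Once the InferenceService reports ready, it is worth smoke-testing the endpoint. vLLM exposes an OpenAI-compatible API; the hostname below is a placeholder for whatever route exists in your cluster, and `build_completion_request` is a small helper written for this post, not part of any SDK:

```python
import json
import urllib.request

def build_completion_request(url: str, prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build an OpenAI-style /v1/completions POST request for the vLLM server."""
    body = json.dumps({
        "model": "/mnt/models",   # matches the --model arg in the ServingRuntime
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(url, data=body,
                                  headers={"Content-Type": "application/json"})

# Placeholder route -- substitute the one your cluster creates:
# req = build_completion_request(
#     "https://granite-34b-ai-serving.apps.example.com/v1/completions",
#     "Write a Python function that reverses a string.",
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```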

### Custom vLLM Runtime

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-runtime
spec:
  containers:
  - name: vllm
    image: quay.io/modh/vllm:latest
    args:
    - "--model=/mnt/models"
    - "--max-model-len=8192"
    - "--tensor-parallel-size=2"
    - "--gpu-memory-utilization=0.9"
    - "--enable-chunked-prefill"
    ports:
    - containerPort: 8000
      protocol: TCP
    volumeMounts:
    - name: shm
      mountPath: /dev/shm
  volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 12Gi
  supportedModelFormats:
  - name: vLLM
    autoSelect: true
```

## Model Selection Guide

| Model | Size | GPU Requirement | Best For |
|---|---|---|---|
| Granite 8B | 8B | 1x A100 40GB | Code generation, general tasks |
| Granite 34B | 34B | 2x A100 80GB | Complex reasoning, RAG |
| Llama 3.1 70B | 70B | 4x A100 80GB | Maximum capability |
| Mistral 7B | 7B | 1x T4/A10 | Cost-effective inference |
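The GPU column follows from simple arithmetic: at fp16/bf16 each parameter costs 2 bytes for the weights alone, and the KV cache, activations, and CUDA overhead come on top. A back-of-the-envelope sketch:

```python
def weight_memory_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate GPU memory for model weights alone.

    fp16/bf16 uses 2 bytes per parameter; KV cache, activations,
    and CUDA context add a substantial margin on top of this.
    """
    return params_billion * 1e9 * bytes_per_param / 2**30

# Granite 34B in bf16: ~63 GiB of weights alone, which is why the
# table pairs it with 2x A100 80GB rather than a single card.
print(round(weight_memory_gib(34)))   # -> 63
print(round(weight_memory_gib(7)))    # -> 13
```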

## Autoscaling Configuration

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: granite-34b-predictor
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting
      target:
        type: AverageValue
        averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
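One caveat: a Pods-type metric only works if something bridges vLLM's Prometheus metrics into the Kubernetes custom-metrics API (prometheus-adapter or KEDA, for example). A hedged prometheus-adapter rule sketch, assuming vLLM publishes the gauge as `vllm:num_requests_waiting` (the exact series name depends on your vLLM version) and renaming it to what the HPA expects:

```yaml
rules:
- seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "vllm:num_requests_waiting"
    as: "vllm_num_requests_waiting"
  metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```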

## Production Checklist

- GPU node pools with proper taints/tolerations
- Model stored in S3/ODF with fast retrieval
- vLLM configured with tensor parallelism for large models
- HPA based on queue depth (not CPU)
- Prometheus monitoring for token throughput and latency
- Rate limiting per user/team
- API key authentication
- Response caching for common queries
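Per-user rate limiting from the checklist typically lives in an API gateway or a thin proxy in front of the model server. A minimal token-bucket sketch of the idea (hypothetical helper, not part of OpenShift AI):

```python
import time

class TokenBucket:
    """Per-key token bucket: refills at `rate` tokens/sec, up to `burst`."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}  # one bucket per API key / team

def allow_request(api_key: str, rate: float = 5.0, burst: int = 10) -> bool:
    """Admit or reject one request for the given API key."""
    bucket = buckets.setdefault(api_key, TokenBucket(rate, burst))
    return bucket.allow()
```

In production the same shape is usually delegated to a gateway (e.g. an Envoy or 3scale rate-limit policy) rather than application code, but the accounting is the same.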

Deploying LLMs on OpenShift? I help organizations build production AI serving platforms. Get in touch.


Luca Berton

AI & Cloud Advisor with 18+ years experience. Author of 8 technical books, creator of Ansible Pilot, and instructor at CopyPasteLearn Academy. Speaker at KubeCon EU & Red Hat Summit 2026.
