Deploying large language models in production requires more than just downloading weights and running inference. You need efficient memory management, request batching, and GPU utilization that justifies the hardware cost. vLLM on OpenShift AI gives you all of this with a production-grade serving stack.
## Why vLLM
vLLM introduced PagedAttention, which manages the KV cache like virtual memory pages instead of allocating contiguous GPU memory blocks per request. This cuts KV cache waste from fragmentation to a few percent and, per the vLLM paper's benchmarks, improves throughput by 2-4x over serving systems that preallocate contiguous KV memory.
Key capabilities:
- PagedAttention: efficient KV cache management, near-zero memory waste
- Continuous batching: new requests are added to running batches without waiting for the current batch to complete
- Tensor parallelism: split models across multiple GPUs automatically
- OpenAI-compatible API: drop-in replacement for OpenAI endpoints
- Quantization support: AWQ, GPTQ, and FP8 for running larger models on smaller GPUs
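The block-table mechanism behind PagedAttention can be illustrated with a toy sketch: the cache is carved into fixed-size blocks, and each sequence maps logical token positions to physical blocks allocated on demand. This is a simplified illustration of the idea, not vLLM's actual implementation; names and the block size are chosen for clarity.

```python
# Toy sketch of the PagedAttention block-table idea (NOT vLLM's real code):
# KV cache memory is split into fixed-size blocks, and each sequence keeps a
# block table mapping logical positions to physical blocks, allocated lazily.

BLOCK_SIZE = 16  # tokens per KV cache block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, position: int) -> int:
        """Return the physical block holding a token, allocating on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block = position // BLOCK_SIZE
        if logical_block == len(table):      # crossed into a new block
            table.append(self.free_blocks.pop())
        return table[logical_block]

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                 # a 40-token sequence
    cache.append_token(seq_id=0, position=pos)
print(len(cache.block_tables[0]))     # 3 blocks for 40 tokens (block size 16)
```

Because blocks are only allocated as tokens arrive, short sequences never reserve memory for the maximum context length, which is where the "near-zero memory waste" claim comes from.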
## Prerequisites
Before deploying on OpenShift AI, you need:
- OpenShift 4.14+ with the OpenShift AI operator installed
- NVIDIA GPU Operator configured with your GPU nodes
- At least one node with NVIDIA A100, H100, or L40S GPUs
- A model stored in a PersistentVolume or accessible via S3-compatible storage
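Before deploying, it is worth confirming that the GPU Operator is actually advertising schedulable GPUs. One way is to parse `oc get nodes -o json`; the sketch below runs the parsing logic against an inlined sample document (the function and node names are illustrative, not from the original article):

```python
# Check which nodes advertise allocatable nvidia.com/gpu resources.
# In practice the JSON would come from: oc get nodes -o json
import json

def gpu_capacity(nodes_json: str) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count."""
    nodes = json.loads(nodes_json)["items"]
    return {
        n["metadata"]["name"]: int(n["status"]["allocatable"]["nvidia.com/gpu"])
        for n in nodes
        if "nvidia.com/gpu" in n["status"].get("allocatable", {})
    }

sample = json.dumps({"items": [
    {"metadata": {"name": "gpu-node-1"},
     "status": {"allocatable": {"nvidia.com/gpu": "2", "cpu": "32"}}},
    {"metadata": {"name": "worker-1"},
     "status": {"allocatable": {"cpu": "16"}}},
]})
print(gpu_capacity(sample))  # {'gpu-node-1': 2}
```

If the map comes back empty, the GPU Operator has not labeled the nodes yet and the vLLM pod will stay `Pending` on the `nvidia.com/gpu` request.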
## Deploying vLLM on OpenShift AI

### Create the ServingRuntime

OpenShift AI uses ServingRuntime custom resources to define inference engines:
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-runtime
  namespace: my-ai-project
spec:
  supportedModelFormats:
    - name: vllm
      version: "1"
      autoSelect: true
  multiModel: false
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:latest
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - --model=/mnt/models
        - --tensor-parallel-size=2
        - --max-model-len=4096
        - --gpu-memory-utilization=0.9
        - --dtype=float16
      resources:
        requests:
          nvidia.com/gpu: 2
          memory: 48Gi
          cpu: 8
        limits:
          nvidia.com/gpu: 2
          memory: 64Gi
          cpu: 16
      ports:
        - containerPort: 8000
          protocol: TCP
      volumeMounts:
        - name: model-storage
          mountPath: /mnt/models
```

### Create the InferenceService
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-service
  namespace: my-ai-project
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: pvc://model-pvc/llama-3-8b
```

### Verify the Deployment
```bash
# Check pod status
oc get pods -n my-ai-project

# Check the inference service
oc get inferenceservice llama-3-service -n my-ai-project

# Test the endpoint
curl -X POST \
  "https://llama-3-service-my-ai-project.apps.cluster.example.com/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "prompt": "Explain Kubernetes in one paragraph:",
    "max_tokens": 200,
    "temperature": 0.7
  }'
```

## GPU Memory Planning
Choosing the right configuration depends on your model size and GPU hardware:
| Model | Parameters | FP16 Memory | Minimum GPUs (A100 80GB) |
|---|---|---|---|
| Llama 3 8B | 8B | ~16GB | 1 |
| Mistral 7B | 7B | ~14GB | 1 |
| Llama 3 70B | 70B | ~140GB | 2 |
| Mixtral 8x7B | 47B | ~94GB | 2 |
| Llama 3.1 405B | 405B | ~810GB | 10+ |
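The FP16 column follows from simple arithmetic: two bytes per parameter for the weights, plus headroom for activations and KV cache. A small rule-of-thumb estimator makes the table reproducible (the 0.9 utilization factor is an assumption mirroring `--gpu-memory-utilization`, not an exact model of vLLM's accounting):

```python
import math

def weight_memory_gb(params_billions: float, bits_per_param: int = 16) -> float:
    """Approximate memory for model weights alone (no KV cache/activations)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

def min_gpus(params_billions: float, gpu_memory_gb: float,
             bits_per_param: int = 16, utilization: float = 0.9) -> int:
    """Rule-of-thumb GPU count just to hold the weights, with headroom."""
    return math.ceil(weight_memory_gb(params_billions, bits_per_param)
                     / (gpu_memory_gb * utilization))

print(weight_memory_gb(8))    # 16.0 -> matches the ~16GB Llama 3 8B row
print(min_gpus(70, 80))       # 2   -> matches the Llama 3 70B row (A100 80GB)
print(weight_memory_gb(8, bits_per_param=4))  # 4.0 -> why AWQ fits a 24GB GPU
```

Note that this only sizes the weights; longer contexts and higher concurrency push KV cache demand up, which is why the 70B-on-2-GPU configuration leaves little room for batching.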
For smaller GPUs like the L40S (48GB) or A10G (24GB), use quantized models:

```bash
# An AWQ-quantized Llama 3 8B (roughly 4-5GB of weights) fits on a single 24GB GPU
# Substitute the AWQ checkpoint you actually use for the repo shown here
--quantization awq --model TheBloke/Llama-3-8B-AWQ
```

## Performance Tuning
### Continuous Batching Configuration

```bash
# Increase max concurrent sequences for higher throughput
--max-num-seqs=256

# Set maximum number of batched tokens per scheduler iteration
--max-num-batched-tokens=8192
```

### KV Cache Optimization
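These limits interact with KV cache capacity: per token, the cache holds a key and a value for every layer and KV head. A sizing sketch using Llama 3 8B's published dimensions (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16); the 20GB budget is an assumed figure for illustration:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache footprint of one token: a K and a V vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_full_sequences(kv_budget_gb: float, max_model_len: int,
                       layers: int, kv_heads: int, head_dim: int) -> int:
    """How many max-length sequences fit in a given KV cache budget."""
    per_seq = kv_bytes_per_token(layers, kv_heads, head_dim) * max_model_len
    return int(kv_budget_gb * 1e9 // per_seq)

# Llama 3 8B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16
print(kv_bytes_per_token(32, 8, 128))            # 131072 bytes (~128 KiB/token)
print(max_full_sequences(20, 4096, 32, 8, 128))  # 37 full-length sequences
```

So with roughly 20GB left for cache, only ~37 sequences at the full 4096-token context fit at once; `--max-num-seqs=256` still helps because most real requests are far shorter than `--max-model-len`.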
```bash
# Fraction of GPU memory vLLM may claim for weights, activations,
# and KV cache (default is 0.9)
--gpu-memory-utilization=0.9

# Enable prefix caching for repeated prompts
--enable-prefix-caching
```

## Monitoring with Prometheus
vLLM exposes Prometheus metrics out of the box. On OpenShift, connect them to the built-in monitoring stack with a ServiceMonitor (user workload monitoring must be enabled):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
    - port: metrics
      interval: 15s
```

Key metrics to watch:
- `vllm:num_requests_running`: active requests
- `vllm:num_requests_waiting`: queued requests
- `vllm:gpu_cache_usage_perc`: KV cache utilization
- `vllm:avg_generation_throughput_toks_per_s`: tokens per second
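Outside Prometheus, the same gauges can be read straight from the service's `/metrics` endpoint, which serves standard Prometheus text format. A minimal parsing sketch, run here against an inlined sample payload (in a real deployment you would fetch the route URL, and a proper Prometheus client library handles edge cases this sketch ignores):

```python
def parse_gauges(metrics_text: str, prefix: str = "vllm:") -> dict:
    """Extract vLLM gauge values from Prometheus text-format output."""
    gauges = {}
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.startswith(prefix):
            continue  # skip HELP/TYPE comments and non-vLLM metrics
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop the label set
        gauges[name] = float(value)
    return gauges

sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="llama-3-8b"} 4.0
vllm:num_requests_waiting{model_name="llama-3-8b"} 12.0
vllm:gpu_cache_usage_perc{model_name="llama-3-8b"} 0.83
"""
print(parse_gauges(sample)["vllm:num_requests_waiting"])  # 12.0
```

A sustained gap between running and waiting requests, or cache usage pinned near 1.0, is the signal to scale out.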
## Autoscaling
Configure horizontal pod autoscaling based on GPU utilization or request queue depth. Note that exposing a vLLM metric such as `vllm_num_requests_waiting` to the HPA requires a custom metrics adapter, for example prometheus-adapter:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-3-service-predictor
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "10"
```

## Automating Deployment with Ansible
For teams managing multiple models across environments, Ansible can automate the entire deployment lifecycle:
```yaml
---
- name: Deploy vLLM model on OpenShift AI
  hosts: localhost
  vars:
    model_name: llama-3-8b
    namespace: my-ai-project
    gpu_count: 2
    model_pvc: model-pvc
  tasks:
    - name: Apply ServingRuntime
      kubernetes.core.k8s:
        state: present
        template: templates/vllm-runtime.yaml  # template, so the vars above are rendered

    - name: Apply InferenceService
      kubernetes.core.k8s:
        state: present
        template: templates/inference-service.yaml

    - name: Wait for deployment
      kubernetes.core.k8s_info:
        api_version: serving.kserve.io/v1beta1
        kind: InferenceService
        name: "{{ model_name }}-service"
        namespace: "{{ namespace }}"
      register: isvc
      until: >-
        isvc.resources | length > 0 and
        isvc.resources[0].status.conditions | default([])
        | selectattr('type', 'equalto', 'Ready')
        | selectattr('status', 'equalto', 'True')
        | list | length > 0
      retries: 30
      delay: 10
```

## Final Thoughts
vLLM on OpenShift AI is a production-grade path for serving LLMs at scale. PagedAttention alone justifies the choice over naive serving approaches. Combined with OpenShift's GPU scheduling, monitoring, and autoscaling, you get an enterprise-ready inference platform.
Start with a single GPU node for development, validate your throughput requirements, then scale horizontally with the HPA configuration. The OpenAI-compatible API means your application code does not need to change as you scale the infrastructure underneath.

