
vLLM Inference Optimization on RHEL AI

Luca Berton
#rhel-ai #vllm #inference #openai-api #tensor-parallelism #latency-optimization #model-serving #production-deployment

📘 Book Reference: This article is based on Chapter 3: Core Components of Practical RHEL AI, covering vLLM deployment and optimization for enterprise inference workloads.

Introduction

When it comes to serving large language models in production, vLLM stands out as the inference engine of choice for RHEL AI. Its innovative PagedAttention algorithm delivers exceptional throughput while maintaining low latency—critical for enterprise applications.

Practical RHEL AI dedicates significant coverage to vLLM configuration, optimization, and production deployment patterns that I’ll share in this article.

Why vLLM for Enterprise Inference?

Performance Comparison

Inference Engine             Throughput      Latency (P95)   Memory Efficiency
vLLM                         24x baseline    ~40ms           95%
Text Generation Inference    12x baseline    ~60ms           85%
Native Transformers          1x baseline     ~200ms          60%

Key Features

- PagedAttention for efficient KV-cache memory management
- OpenAI-compatible REST API for drop-in client integration
- Tensor parallelism for multi-GPU serving of larger models
- Speculative decoding and prefix caching to reduce latency
- Built-in Prometheus metrics for production monitoring

Installing vLLM on RHEL AI

# vLLM is included in RHEL AI
sudo dnf install -y rhel-ai-vllm

# Verify installation
vllm --version

# Check GPU availability
python -c "import torch; print(f'GPUs: {torch.cuda.device_count()}')"

Basic Model Serving

Starting the vLLM Server

# Serve Granite model with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
  --model ibm-granite/granite-3b-code-instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1

Client Integration

from openai import OpenAI

# Connect to vLLM server (OpenAI-compatible)
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require API key by default
)

# Generate completion
response = client.chat.completions.create(
    model="ibm-granite/granite-3b-code-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain RHEL AI in one paragraph."}
    ],
    temperature=0.7,
    max_tokens=256
)

print(response.choices[0].message.content)
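
Because the endpoint is OpenAI-compatible, streaming works with the same client by passing stream=True, which lowers perceived latency for interactive use since tokens are printed as they arrive. A short sketch reusing the client object from above:

# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="ibm-granite/granite-3b-code-instruct",
    messages=[{"role": "user", "content": "Summarize vLLM in two sentences."}],
    max_tokens=128,
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()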

Advanced Configuration

Multi-GPU Tensor Parallelism

For larger models, distribute across multiple GPUs:

# Serve 7B model across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model ibm-granite/granite-7b-instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096
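
Choosing --tensor-parallel-size is largely a memory exercise: the sharded model weights must fit inside the fraction of GPU memory allowed by --gpu-memory-utilization, and whatever remains becomes KV cache. The back-of-the-envelope sketch below is only an estimate; the 80 GiB GPU size, 16-bit weights, and parameter count are assumptions, and it ignores activations and framework overhead:

# Rough memory check for picking --tensor-parallel-size (illustrative only)
def estimate_tp_fit(params_billion, num_gpus, gpu_mem_gib=80.0,
                    utilization=0.9, bytes_per_param=2):
    weights_gib = params_billion * 1e9 * bytes_per_param / (1024 ** 3)
    weights_per_gpu = weights_gib / num_gpus        # weights are sharded across GPUs
    budget_per_gpu = gpu_mem_gib * utilization      # mirrors --gpu-memory-utilization 0.9
    kv_headroom = budget_per_gpu - weights_per_gpu  # memory left over for the KV cache
    return weights_per_gpu, kv_headroom

weights, headroom = estimate_tp_fit(params_billion=7, num_gpus=2)
print(f"~{weights:.1f} GiB of weights per GPU, ~{headroom:.1f} GiB left for KV cache")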

Configuration File

# vllm-config.yaml
model: ibm-granite/granite-7b-instruct
tensor_parallel_size: 2
gpu_memory_utilization: 0.9
max_model_len: 8192
dtype: float16
quantization: null
enforce_eager: false
max_num_seqs: 256
max_num_batched_tokens: 32768

# Start the server with the config file
vllm serve --config vllm-config.yaml
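
If these YAML files live in version control, a small pre-flight check catches typos before the server starts. The helper below is hypothetical (not part of vLLM); it simply loads the file and sanity-checks a couple of values:

# check_vllm_config.py - hypothetical pre-flight check for vllm-config.yaml
import yaml

with open("vllm-config.yaml") as f:
    cfg = yaml.safe_load(f)

assert 0.0 < cfg["gpu_memory_utilization"] <= 1.0, "gpu_memory_utilization must be in (0, 1]"
assert cfg["tensor_parallel_size"] >= 1, "tensor_parallel_size must be at least 1"
print(f"Serving {cfg['model']} on {cfg['tensor_parallel_size']} GPU(s), "
      f"max_model_len={cfg['max_model_len']}")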

Achieving P95 ≤ 80ms Latency

The book’s target SLO of P95 latency ≤ 80ms requires careful optimization:

1. Right-Size Your Model

# Smaller models = lower latency
model_latency_comparison = {
    "granite-3b": "~25ms P95",
    "granite-7b": "~45ms P95", 
    "granite-13b": "~80ms P95",
    "granite-34b": "~150ms P95 (needs optimization)"
}

2. Enable Speculative Decoding

python -m vllm.entrypoints.openai.api_server \
  --model ibm-granite/granite-7b-instruct \
  --speculative-model ibm-granite/granite-3b-instruct \
  --num-speculative-tokens 5 \
  --use-v2-block-manager

3. Optimize Batch Settings

# Production-optimized serving config
serving_config = {
    "max_num_seqs": 128,           # Limit concurrent sequences
    "max_num_batched_tokens": 8192, # Control batch size
    "enable_prefix_caching": True,  # Cache common prefixes
    "disable_log_requests": True,   # Reduce I/O overhead
}
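
To confirm you are actually hitting the P95 ≤ 80ms target, measure it from the client side. The sketch below times repeated short completions against the server started earlier and reports the 95th percentile; the request count, prompt, and max_tokens are assumptions you should adjust to mirror real traffic. Note that it measures whole-request latency for short completions, so streaming or long-generation workloads need a different harness.

# p95_benchmark.py - measure client-side request latency against the vLLM server
import time
import statistics
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    client.chat.completions.create(
        model="ibm-granite/granite-3b-code-instruct",
        messages=[{"role": "user", "content": "Reply with a single word: ready?"}],
        max_tokens=8,
        temperature=0.0,
    )
    latencies_ms.append((time.perf_counter() - start) * 1000)

# statistics.quantiles with n=20 returns 19 cut points; index 18 is the 95th percentile
p95 = statistics.quantiles(latencies_ms, n=20)[18]
print(f"P95 latency: {p95:.1f} ms over {len(latencies_ms)} requests")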

Containerized Deployment with Podman

# Pull vLLM container
podman pull registry.redhat.io/rhel-ai/vllm-server:latest

# Run with GPU support
podman run -d \
  --name vllm-server \
  --device nvidia.com/gpu=all \
  -p 8000:8000 \
  -v /models:/models:ro \
  registry.redhat.io/rhel-ai/vllm-server:latest \
  --model /models/granite-7b-instruct \
  --tensor-parallel-size 2
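
The container needs time to pull weights and warm up, so wait for readiness before routing traffic to it. vLLM's OpenAI-compatible server exposes a /health endpoint that returns 200 once the engine is loaded; the polling interval and timeout below are arbitrary choices:

# wait_for_vllm.py - poll the server's /health endpoint until it responds
import time
import requests

def wait_for_ready(url="http://localhost:8000/health", timeout_s=600, interval_s=5):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                print("vLLM server is ready")
                return
        except requests.RequestException:
            pass  # server not accepting connections yet
        time.sleep(interval_s)
    raise TimeoutError(f"vLLM did not become ready within {timeout_s}s")

wait_for_ready()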

Systemd Service

# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM Inference Server
After=network.target

[Service]
Type=simple
User=vllm
ExecStart=/usr/bin/podman run --rm \
  --name vllm-server \
  --device nvidia.com/gpu=all \
  -p 8000:8000 \
  registry.redhat.io/rhel-ai/vllm-server:latest \
  --model ibm-granite/granite-7b-instruct
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Monitoring vLLM Performance

Prometheus Metrics

vLLM exposes metrics at /metrics:

# prometheus.yml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: /metrics
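
Before wiring up Prometheus, you can confirm the exporter is working by fetching the endpoint directly and filtering for the vLLM series. A quick sketch:

# show_vllm_metrics.py - fetch the Prometheus exposition text and filter vLLM series
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text

# Print only the vLLM metric samples, skipping HELP/TYPE comment lines
for line in text.splitlines():
    if line.startswith("vllm:"):
        print(line)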

Key Metrics to Track

Metric                          Description                  Alert Threshold
vllm:request_latency_seconds    Request latency histogram    P95 > 80ms
vllm:num_requests_running       Active requests              > max_num_seqs
vllm:gpu_cache_usage_perc       KV cache utilization         > 95%
vllm:num_preemptions_total      Request preemptions          > 0/min

Grafana Dashboard Query

# P95 latency over time
histogram_quantile(0.95, 
  sum(rate(vllm:request_latency_seconds_bucket[5m])) by (le)
)

Load Balancing Multiple Instances

For high availability and throughput:

# nginx.conf
upstream vllm_cluster {
    least_conn;
    server vllm-node-1:8000 weight=1;
    server vllm-node-2:8000 weight=1;
    server vllm-node-3:8000 weight=1;
}

server {
    listen 80;
    
    location /v1 {
        proxy_pass http://vllm_cluster;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_connect_timeout 60s;
        proxy_read_timeout 300s;
    }
}
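
Clients do not change when nginx sits in front of the pool; they simply point at the load balancer instead of an individual node. The hostname below is a placeholder:

from openai import OpenAI

# Same OpenAI-compatible client, pointed at the nginx front end (placeholder hostname)
client = OpenAI(base_url="http://vllm-lb.example.com/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="ibm-granite/granite-7b-instruct",
    messages=[{"role": "user", "content": "Give one benefit of load-balanced inference."}],
    max_tokens=64,
)
print(response.choices[0].message.content)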

This article covers material from Chapter 3: Core Components of Practical RHEL AI, which walks through vLLM deployment, configuration, and production optimization in depth.


📚 Get the Complete Guide

Ready to master vLLM and enterprise AI deployment?

Practical RHEL AI provides comprehensive coverage of inference optimization, from vLLM configuration and latency tuning to containerized deployment, monitoring, and scaling.

🚀 Pre-Order Now - Available March 2026

Get Practical RHEL AI from Apress and start deploying production-ready AI on Red Hat Enterprise Linux today.
