
Scaling AI Inference with vLLM on RHEL AI: Multi-Node Deployments

Luca Berton
#rhel-ai #vllm #inference #scaling #multi-node #tensor-parallelism #load-balancing #kubernetes #high-availability


As AI adoption accelerates across the enterprise, the demand for low-latency, high-throughput inference services grows exponentially. A single GPU server might handle development and testing, but production workloads—serving thousands of concurrent users—require a distributed inference architecture. This guide, based on Practical RHEL AI, walks through scaling vLLM inference across multiple nodes.

Understanding vLLM Architecture

vLLM is an open source inference engine optimized for serving large language models (LLMs) at scale. Key features that make it well suited to enterprise deployments include:

- PagedAttention, which manages the KV cache in fixed-size blocks to minimize memory fragmentation
- Continuous batching, which admits new requests as running ones finish to keep GPUs saturated
- Tensor and pipeline parallelism for models too large for a single GPU or node
- An OpenAI-compatible API server, so existing clients and SDKs work without changes
- Built-in Prometheus metrics for monitoring and autoscaling
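
For example, once any of the deployments below is serving a model, standard OpenAI-style clients can call it directly. A minimal smoke test with curl, assuming a server listening on localhost:8000 and the model path used later in this guide:

curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/granite-7b", "prompt": "What is RHEL AI?", "max_tokens": 64}'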

Deployment Topology Options

Before scaling, understand your deployment options:

| Topology | Use Case | Complexity |
|---|---|---|
| Single GPU | Development, small models | Low |
| Multi-GPU (Single Node) | Medium models, moderate traffic | Medium |
| Multi-Node Tensor Parallel | Large models (70B+) | High |
| Load-Balanced Replicas | High availability, horizontal scaling | Medium |

Prerequisites

For this multi-node deployment, you’ll need:

- Two or more RHEL AI nodes, each with NVIDIA GPUs and current drivers
- Shared storage (NFS, for example) mounted on every node, so all workers read the same model weights
- A low-latency network between nodes, with the Ray ports open (the examples below use 6379 on the head node)
- For Method 3, an OpenShift cluster with GPU worker nodes and the NVIDIA GPU Operator installed
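
A quick sanity check to run on every node before moving on; the shared-storage path matches the example used in Method 2, and the vLLM import assumes you run it inside the serving environment’s Python:

# Run on every node
nvidia-smi -L                                        # GPUs visible to the driver
ls /shared-storage/models                            # shared model storage is mounted
python3 -c "import vllm; print(vllm.__version__)"    # vLLM is importable in this environment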

Method 1: Multi-GPU Tensor Parallelism (Single Node)

For models that exceed a single GPU’s memory, use tensor parallelism within a node:

# Serve a 70B model across 4 GPUs
ilab model serve \
    --model-path /models/granite-70b \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 4096

Verify GPU utilization:

nvidia-smi dmon -s u -d 1

Method 2: Multi-Node Deployment with Ray

For truly large models or higher throughput, deploy across multiple nodes using Ray:

Step 1: Configure the Head Node

# On the head node (node1)
ray start --head --port=6379 --dashboard-host=0.0.0.0

# Note the cluster address that Ray prints for workers to join
# Example: 192.168.1.100:6379

Step 2: Join Worker Nodes

# On each worker node (node2, node3, etc.)
ray start --address='192.168.1.100:6379'
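
Before deploying the model, confirm that all workers have joined and that the cluster sees the expected total number of GPUs:

# On any node in the cluster
ray status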

Step 3: Deploy vLLM with Ray Backend

#!/usr/bin/env python3
"""Multi-node vLLM deployment with Ray"""

from vllm import LLM, SamplingParams
import ray

# Initialize Ray cluster
ray.init(address="auto")

# Configure the model with tensor parallelism across nodes
llm = LLM(
    model="/shared-storage/models/granite-70b",
    tensor_parallel_size=4,              # split each layer across the 4 GPUs in a node
    pipeline_parallel_size=2,            # split the layer stack across the 2 nodes (4 × 2 = 8 GPUs)
    distributed_executor_backend="ray",  # run the workers on the Ray cluster started above
    trust_remote_code=True,
    max_model_len=4096,
)

# Test inference
prompts = ["Explain RHEL AI in one paragraph:"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
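
The script above performs offline (batch) inference. To expose the same multi-node configuration as a network service, one option is to launch vLLM’s OpenAI-compatible API server from the head node with matching parallelism settings:

python -m vllm.entrypoints.openai.api_server \
    --model /shared-storage/models/granite-70b \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 \
    --port 8000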

Method 3: Kubernetes/OpenShift Deployment

For production environments, deploy vLLM on OpenShift:

Step 1: Create the vLLM Deployment

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: rhel-ai
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
      - name: vllm
        image: registry.redhat.io/rhel-ai/vllm-runtime:latest
        args:
          - "--model=/models/granite-7b"
          - "--host=0.0.0.0"
          - "--port=8000"
          - "--tensor-parallel-size=1"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "48Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
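
Apply the manifest and watch the pods schedule onto GPU nodes (namespace and label as defined above):

oc apply -f vllm-deployment.yaml
oc get pods -n rhel-ai -l app=vllm-inference -w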

Step 2: Create the Service and Route

# vllm-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: rhel-ai
spec:
  selector:
    app: vllm-inference
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: vllm-route
  namespace: rhel-ai
spec:
  to:
    kind: Service
    name: vllm-service
  port:
    targetPort: 8000
  tls:
    termination: edge
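
With the Service and Route created, a quick smoke test from outside the cluster confirms end-to-end connectivity; /v1/models is served by vLLM’s OpenAI-compatible API:

oc apply -f vllm-service.yaml
ROUTE_HOST=$(oc get route vllm-route -n rhel-ai -o jsonpath='{.spec.host}')
curl -sk "https://${ROUTE_HOST}/v1/models"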

Step 3: Configure Horizontal Pod Autoscaler

# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: rhel-ai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: vllm_requests_running
      target:
        type: AverageValue
        averageValue: "50"
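
The CPU target works out of the box, while the vllm_requests_running pods metric is only available if a custom-metrics adapter (Prometheus Adapter, for example) exposes vLLM’s request gauge to the Kubernetes metrics API. Apply and verify:

oc apply -f vllm-hpa.yaml
oc get hpa vllm-hpa -n rhel-ai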

Performance Optimization

1. Continuous Batching Configuration

Optimize throughput with batching parameters:

ilab model serve \
    --model-path /models/granite-7b \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    --scheduler-delay-factor 0.1

2. KV Cache Optimization

Configure KV cache for your memory constraints:

# Let vLLM use up to 90% of GPU memory (model weights plus KV cache)
ilab model serve \
    --model-path /models/granite-7b \
    --gpu-memory-utilization 0.90 \
    --block-size 16

3. Quantization for Memory Efficiency

Use AWQ or GPTQ quantization for larger models:

# Serve a quantized model
ilab model serve \
    --model-path /models/granite-70b-awq \
    --quantization awq \
    --tensor-parallel-size 2

Load Balancing Strategies

HAProxy Configuration for vLLM:

# /etc/haproxy/haproxy.cfg
frontend vllm_frontend
    bind *:8080
    default_backend vllm_servers

backend vllm_servers
    balance leastconn
    option httpchk GET /health
    http-check expect status 200
    
    server vllm1 192.168.1.101:8000 check inter 5s fall 3 rise 2
    server vllm2 192.168.1.102:8000 check inter 5s fall 3 rise 2
    server vllm3 192.168.1.103:8000 check inter 5s fall 3 rise 2
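
vLLM serves the /health endpoint that the httpchk probe above relies on. Validate the configuration and reload HAProxy without dropping active connections:

haproxy -c -f /etc/haproxy/haproxy.cfg
sudo systemctl reload haproxy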

Monitoring and Observability

vLLM exposes Prometheus metrics. Configure scraping:

# prometheus-scrape-config.yaml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm1:8000', 'vllm2:8000', 'vllm3:8000']
    metrics_path: /metrics
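
Before building dashboards, confirm each replica is actually exporting metrics (hostnames as in the scrape config above):

curl -s http://vllm1:8000/metrics | grep -E '^vllm:'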

Key metrics to monitor:

| Metric | Description | Alert Threshold |
|---|---|---|
| vllm:num_requests_running | Active requests | > 100 |
| vllm:num_requests_waiting | Queued requests | > 50 |
| vllm:gpu_cache_usage_perc | KV cache utilization | > 95% |
| vllm:avg_generation_throughput | Tokens/second | < baseline |

Benchmarking Your Deployment

Validate performance with the vLLM benchmark tool:

# Run a serving benchmark against the running endpoint
# (benchmark_serving.py ships in the benchmarks/ directory of the vLLM source repository)
python benchmarks/benchmark_serving.py \
    --backend openai \
    --base-url http://localhost:8000 \
    --model /models/granite-7b \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 128 \
    --num-prompts 1000 \
    --request-rate 10

Record the reported throughput and latency percentiles as the baseline for your own hardware; a 4×A100 node serving Granite-7B is a representative reference configuration, and the alert thresholds above should be tuned against the numbers you measure.

Conclusion

Scaling AI inference with vLLM on RHEL AI transforms a single-server prototype into a production-grade service capable of handling enterprise workloads. Whether you choose tensor parallelism for large models or horizontal scaling for high availability, RHEL AI’s integrated toolchain simplifies deployment and operations.

For advanced topics including A/B model deployments, canary releases, and inference cost optimization, refer to Chapters 10-12 of Practical RHEL AI.

Get Practical RHEL AI on Amazon
