Scaling AI Inference with vLLM on RHEL AI: Multi-Node Deployments
As AI adoption accelerates across the enterprise, demand for low-latency, high-throughput inference grows with it. A single GPU server can handle development and testing, but production workloads that serve thousands of concurrent users require a distributed inference architecture. This guide, based on Practical RHEL AI, walks through scaling vLLM inference across multiple GPUs and nodes.
Understanding vLLM Architecture
vLLM is an open-source inference engine optimized for serving LLMs at scale. Key features that make it well suited to enterprise deployments include:
- PagedAttention: Manages the KV cache in fixed-size blocks, keeping wasted KV cache memory to a few percent (the vLLM project reports under 4% waste and up to 24x higher throughput than Hugging Face Transformers)
- Continuous Batching: Dynamic request batching for optimal throughput
- Tensor Parallelism: Split large models across multiple GPUs
- Pipeline Parallelism: Distribute model layers across nodes
- OpenAI-Compatible API: Drop-in replacement for existing integrations
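To make the PagedAttention idea concrete, the following minimal sketch compares the KV cache slots reserved by contiguous preallocation (every request reserves the maximum sequence length up front) with block-based paging (every request holds only the fixed-size blocks it has started to fill). It illustrates the bookkeeping only and is not vLLM's implementation; the function names, the sample sequence lengths, and the 16-token block size are assumptions chosen for the example.

```python
import math


def contiguous_slots(seq_lens, max_model_len):
    """Token slots reserved when every request preallocates max_model_len."""
    return len(seq_lens) * max_model_len


def paged_slots(seq_lens, block_size=16):
    """Token slots reserved when each request holds only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def waste_fraction(reserved, seq_lens):
    """Fraction of reserved slots that hold no token."""
    return (reserved - sum(seq_lens)) / reserved


if __name__ == "__main__":
    lens = [100, 700, 33, 2048]  # hypothetical current sequence lengths
    for name, reserved in [
        ("contiguous", contiguous_slots(lens, max_model_len=4096)),
        ("paged", paged_slots(lens)),
    ]:
        print(f"{name}: {reserved} slots, {waste_fraction(reserved, lens):.1%} unused")
```

With paging, waste is limited to the partially filled last block of each sequence, which is why it stays small regardless of how far a request is from the maximum length.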
Deployment Topology Options
Before scaling, understand your deployment options:
| Topology | Use Case | Complexity |
|---|---|---|
| Single GPU | Development, small models | Low |
| Multi-GPU (Single Node) | Medium models, moderate traffic | Medium |
| Multi-Node Tensor Parallel | Large models (70B+) | High |
| Load-Balanced Replicas | High availability, horizontal scaling | Medium |
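One rough way to pick a row from this table is to estimate the memory the model weights need and compare it with the GPUs available. The sketch below assumes 2 bytes per parameter (FP16/BF16) and lets weights occupy only part of each GPU so that room remains for the KV cache and activations. The function names, the 0.7 weight fraction, the 80 GiB GPU size, and the 4 GPUs per node are all assumptions for illustration, not RHEL AI sizing guidance.

```python
import math


def weight_gib(params_billion, bytes_per_param=2.0):
    """Approximate memory needed for model weights, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 2**30


def gpus_needed(params_billion, gpu_gib=80.0, weight_fraction=0.7, bytes_per_param=2.0):
    """GPUs required if only weight_fraction of each GPU may hold weights.

    Rounded up to a power of two, because the tensor-parallel size has to
    divide the model's attention head count evenly.
    """
    raw = math.ceil(weight_gib(params_billion, bytes_per_param) / (gpu_gib * weight_fraction))
    return 1 << (raw - 1).bit_length()


def suggest_topology(params_billion, gpus_per_node=4, **kwargs):
    """Map the GPU estimate onto the topology names used in the table."""
    n = gpus_needed(params_billion, **kwargs)
    if n == 1:
        return "Single GPU"
    if n <= gpus_per_node:
        return "Multi-GPU (Single Node)"
    return "Multi-Node Tensor Parallel"


if __name__ == "__main__":
    for size in (7, 70):
        print(f"{size}B -> {gpus_needed(size)} GPU(s): {suggest_topology(size)}")
```

Load-balanced replicas are an orthogonal choice: once one copy of the model fits, additional replicas add throughput and availability rather than capacity for a larger model.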
Prerequisites
For this multi-node deployment, you’ll need:
- 2+ RHEL AI nodes with GPU support
- High-speed interconnect (InfiniBand or 100GbE recommended)
- Shared storage (NFS or parallel filesystem)
- Container orchestration (Kubernetes/OpenShift or Podman)
Method 1: Multi-GPU Tensor Parallelism (Single Node)
For models that exceed single GPU memory, use tensor parallelism within a node:
# Serve a 70B model across 4 GPUs
ilab model serve \
  --model-path /models/granite-70b \
  --tensor-parallel-size 4 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096
Verify GPU utilization:
nvidia-smi dmon -s u -d 1
Method 2: Multi-Node Deployment with Ray
For truly large models or higher throughput, deploy across multiple nodes using Ray:
Step 1: Configure the Head Node
# On the head node (node1)
ray start --head --port=6379 --dashboard-host=0.0.0.0
# Note the address that Ray prints for joining worker nodes
# Example: 192.168.1.100:6379
Step 2: Join Worker Nodes
# On each worker node (node2, node3, etc.)
ray start --address='192.168.1.100:6379'
Step 3: Deploy vLLM with Ray Backend
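Before loading a large model, it can help to confirm that the cluster actually offers as many GPUs as the parallelism settings require, since vLLM starts one worker per GPU and needs tensor-parallel size times pipeline-parallel size workers in total. The helper below is a hypothetical pre-flight check, not part of vLLM or Ray; it takes a dictionary shaped like the result of `ray.cluster_resources()` so it can be exercised without a live cluster.

```python
def required_gpus(tensor_parallel_size, pipeline_parallel_size=1):
    """Total workers (one per GPU) for a given parallelism configuration."""
    return tensor_parallel_size * pipeline_parallel_size


def check_cluster(resources, tensor_parallel_size, pipeline_parallel_size=1):
    """Raise ValueError if the cluster reports fewer GPUs than needed.

    `resources` mirrors ray.cluster_resources(), e.g. {"CPU": 128.0, "GPU": 8.0}.
    """
    available = int(resources.get("GPU", 0))
    needed = required_gpus(tensor_parallel_size, pipeline_parallel_size)
    if needed > available:
        raise ValueError(f"need {needed} GPUs, cluster reports {available}")
    return needed


if __name__ == "__main__":
    # On a real cluster: resources = ray.cluster_resources() after ray.init(address="auto")
    resources = {"CPU": 128.0, "GPU": 8.0}  # hypothetical two-node cluster
    print(check_cluster(resources, tensor_parallel_size=8))
```

For two nodes with four GPUs each, an 8-way tensor-parallel configuration passes, while adding a pipeline-parallel size of 2 on top would require 16 GPUs and fail the check.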
#!/usr/bin/env python3
"""Multi-node vLLM deployment with Ray"""
import ray
from vllm import LLM, SamplingParams

# Connect to the Ray cluster started in the previous steps
ray.init(address="auto")

# Configure the model with tensor parallelism across nodes.
# vLLM uses tensor_parallel_size × pipeline_parallel_size GPUs in total,
# so 8-way tensor parallelism on 8 GPUs leaves pipeline parallelism at its default of 1.
llm = LLM(
    model="/shared-storage/models/granite-70b",
    tensor_parallel_size=8,  # 4 GPUs per node × 2 nodes
    distributed_executor_backend="ray",
    trust_remote_code=True,
    max_model_len=4096,
)

# Test inference
prompts = ["Explain RHEL AI in one paragraph:"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Method 3: Kubernetes/OpenShift Deployment
For production environments, deploy vLLM on OpenShift:
Step 1: Create the vLLM Deployment
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: rhel-ai
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
        - name: vllm
          image: registry.redhat.io/rhel-ai/vllm-runtime:latest
          args:
            - "--model=/models/granite-7b"
            - "--host=0.0.0.0"
            - "--port=8000"
            - "--tensor-parallel-size=1"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "48Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "32Gi"
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
Step 2: Create the Service and Route
# vllm-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: rhel-ai
spec:
  selector:
    app: vllm-inference
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: vllm-route
  namespace: rhel-ai
spec:
  to:
    kind: Service
    name: vllm-service
  port:
    targetPort: 8000
  tls:
    termination: edge
Step 3: Configure Horizontal Pod Autoscaler
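The autoscaler's behavior is easier to reason about with its core calculation in hand. The sketch below is a simplified version of the documented Kubernetes HPA formula, desired = ceil(current replicas × current metric / target metric), with the largest proposal winning when several metrics are configured. It ignores pod readiness, missing metrics, and stabilization windows; the function names are hypothetical, and the 0.1 tolerance is the Kubernetes default.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count proposed by one metric, clamped to the HPA bounds."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def desired_for_metrics(current_replicas, metrics, **bounds):
    """metrics: iterable of (current_value, target_value) pairs.

    With several metrics, the HPA uses the largest proposed replica count.
    """
    return max(desired_replicas(current_replicas, c, t, **bounds) for c, t in metrics)


if __name__ == "__main__":
    # 3 replicas, CPU at 35% (target 70), 80 running requests per pod (target 50)
    print(desired_for_metrics(3, [(35, 70), (80, 50)], min_replicas=2, max_replicas=10))
```

In this example the CPU metric alone would allow scaling down to the minimum, but the per-pod request metric asks for more replicas, so the request metric determines the outcome.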
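# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: rhel-ai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # CPU utilization is computed against the container's CPU request,
    # so the Deployment must set one for this metric to work
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Custom per-pod metric; requires a metrics adapter (for example,
    # Prometheus Adapter) that exposes it under this name
    - type: Pods
      pods:
        metric:
          name: vllm_requests_running
        target:
          type: AverageValue
          averageValue: "50"
Performance Optimization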
1. Continuous Batching Configuration
Optimize throughput with batching parameters:
ilab model serve \
  --model-path /models/granite-7b \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --scheduler-delay-factor 0.1
2. KV Cache Optimization
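A back-of-the-envelope estimate helps when choosing the memory flags in this section. Each cached token stores one key and one value vector per layer, so the cost per token is 2 × layers × KV heads × head dimension × bytes per element; whatever memory remains after weights and runtime overhead is divided into blocks of `--block-size` tokens. The sketch below is an approximation under stated assumptions: the model shape (32 layers, 32 KV heads, head dimension 128), the 13 GiB of weights, and the 2 GiB overhead are hypothetical values, not measured Granite figures.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_blocks(gpu_gib, gpu_memory_utilization, weights_gib, overhead_gib,
                    bytes_per_token, block_size=16):
    """Whole KV cache blocks that fit in the memory left after weights and overhead."""
    budget = (gpu_gib * gpu_memory_utilization - weights_gib - overhead_gib) * 2**30
    if budget <= 0:
        return 0
    return int(budget // (bytes_per_token * block_size))


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
    blocks = kv_cache_blocks(80, 0.90, weights_gib=13, overhead_gib=2,
                             bytes_per_token=per_token)
    print(f"{per_token} bytes/token, {blocks} blocks, {blocks * 16} cacheable tokens")
```

The estimate shows why raising the memory utilization fraction or shrinking the weights (for example, through quantization) directly increases the number of tokens that can be cached concurrently.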
Configure KV cache for your memory constraints:
# Let vLLM use up to 90% of GPU memory (weights, activations, and KV cache)
ilab model serve \
  --model-path /models/granite-7b \
  --gpu-memory-utilization 0.90 \
  --block-size 16
3. Quantization for Memory Efficiency
Use AWQ or GPTQ quantization for larger models:
# Serve a quantized model
ilab model serve \
  --model-path /models/granite-70b-awq \
  --quantization awq \
  --tensor-parallel-size 2
Load Balancing Strategies
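Least-connections balancing suits LLM serving because request durations vary widely with output length, so the backend with the fewest in-flight requests is usually the one with spare capacity. The sketch below models that selection together with health-check thresholds in the style of `fall 3 rise 2` (three consecutive failures mark a server down, two consecutive successes bring it back). It is a simplified illustration, not HAProxy's actual algorithm, which also accounts for weights and queueing; the class and function names are hypothetical.

```python
class Backend:
    """One inference server with in-flight request count and health state."""

    def __init__(self, name, fall=3, rise=2):
        self.name = name
        self.active = 0        # in-flight requests
        self.healthy = True
        self.fall = fall       # consecutive failures before marking down
        self.rise = rise       # consecutive successes before marking up
        self._streak = 0       # consecutive checks contradicting current state

    def record_check(self, ok):
        """Apply one health-check result and flip state when a threshold is hit."""
        if ok == self.healthy:
            self._streak = 0
            return
        self._streak += 1
        threshold = self.rise if ok else self.fall
        if self._streak >= threshold:
            self.healthy = ok
            self._streak = 0


def pick_leastconn(backends):
    """Return the healthy backend with the fewest in-flight requests."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy backends")
    return min(candidates, key=lambda b: b.active)


if __name__ == "__main__":
    servers = [Backend("vllm1"), Backend("vllm2"), Backend("vllm3")]
    servers[0].active, servers[1].active, servers[2].active = 5, 2, 7
    print(pick_leastconn(servers).name)
```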
HAProxy Configuration for vLLM:
# /etc/haproxy/haproxy.cfg
frontend vllm_frontend
    bind *:8080
    default_backend vllm_servers
backend vllm_servers
    balance leastconn
    option httpchk GET /health
    http-check expect status 200
    server vllm1 192.168.1.101:8000 check inter 5s fall 3 rise 2
    server vllm2 192.168.1.102:8000 check inter 5s fall 3 rise 2
    server vllm3 192.168.1.103:8000 check inter 5s fall 3 rise 2
Monitoring and Observability
vLLM exposes Prometheus metrics. Configure scraping:
# prometheus-scrape-config.yaml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm1:8000', 'vllm2:8000', 'vllm3:8000']
    metrics_path: /metrics
Key metrics to monitor:
| Metric | Description | Alert Threshold |
|---|---|---|
| vllm:num_requests_running | Active requests | > 100 |
| vllm:num_requests_waiting | Queued requests | > 50 |
| vllm:gpu_cache_usage_perc | KV cache utilization (reported as a fraction, 1 = 100%) | > 0.95 |
| vllm:avg_generation_throughput_toks_per_s | Generation throughput (tokens/second) | < baseline |
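For a quick check outside of Prometheus, the thresholds in this table can be applied directly to the text returned by a server's `/metrics` endpoint. The sketch below parses the Prometheus text exposition format with the standard library only and reports which thresholds are breached. It assumes samples carry no trailing timestamp and that the cache gauge is a fraction between 0 and 1; the throughput metric is left out because its baseline is deployment-specific. The sample payload, model label, and function names are illustrative.

```python
THRESHOLDS = {
    "vllm:num_requests_running": lambda v: v > 100,
    "vllm:num_requests_waiting": lambda v: v > 50,
    "vllm:gpu_cache_usage_perc": lambda v: v > 0.95,
}

SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="granite-7b"} 12.0
vllm:num_requests_waiting{model_name="granite-7b"} 64.0
vllm:gpu_cache_usage_perc{model_name="granite-7b"} 0.41
"""


def parse_metrics(text):
    """Parse exposition text into {metric_name: [values]}, ignoring labels."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            values.setdefault(name, []).append(float(value))
        except ValueError:
            continue  # not a sample line
    return values


def firing_alerts(text, thresholds=THRESHOLDS):
    """Sorted names of metrics with at least one sample past its threshold."""
    values = parse_metrics(text)
    return sorted(
        name for name, breached in thresholds.items()
        if any(breached(v) for v in values.get(name, []))
    )


if __name__ == "__main__":
    print(firing_alerts(SAMPLE))
```

Against a live server, the same functions could be fed the body of an HTTP GET to the metrics endpoint in place of the sample string.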
Benchmarking Your Deployment
Validate performance with the vLLM benchmark tool:
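# Run the serving benchmark against a running vLLM server
# (script from the benchmarks/ directory of the vLLM repository)
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/granite-7b \
  --dataset-name random \
  --random-input-len 128 \
  --random-output-len 128 \
  --num-prompts 1000 \
  --request-rate 10
Expected results for a 4×A100 deployment serving Granite-7B: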
- Throughput: 2,000+ tokens/second
- Latency (P50): < 50ms time-to-first-token
- Latency (P99): < 200ms time-to-first-token
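To compare a benchmark run against these targets, the P50 and P99 values can be computed from the raw time-to-first-token samples. The sketch below uses the nearest-rank percentile definition; benchmark tools may interpolate instead, so small differences from their reported figures are expected. The function names are hypothetical, and the default targets are the latency figures listed above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_targets(ttft_ms, p50_target=50.0, p99_target=200.0):
    """True when both time-to-first-token percentiles are under their targets."""
    return percentile(ttft_ms, 50) < p50_target and percentile(ttft_ms, 99) < p99_target


if __name__ == "__main__":
    ttft_ms = [20.0] * 99 + [150.0]  # hypothetical measurements in milliseconds
    print(percentile(ttft_ms, 50), percentile(ttft_ms, 99), meets_targets(ttft_ms))
```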
Conclusion
Scaling AI inference with vLLM on RHEL AI transforms a single-server prototype into a production-grade service capable of handling enterprise workloads. Whether you choose tensor parallelism for large models or horizontal scaling for high availability, RHEL AI’s integrated toolchain simplifies deployment and operations.
For advanced topics including A/B model deployments, canary releases, and inference cost optimization, refer to Chapters 10-12 of Practical RHEL AI.