
Running LLMs on OpenShift AI: A Complete Deployment Guide

Luca Berton 1 min read
#openshift #ai #llm #vllm #gpu

## 🚀 LLMs on OpenShift AI

OpenShift AI provides an enterprise-grade platform for deploying large language models. Here’s the complete guide from model selection to production serving.

## Platform Setup

### Prerequisites

```bash
# Install the OpenShift AI (RHOAI) operator
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: stable
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

# Install the NVIDIA GPU Operator
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator
  namespace: nvidia-gpu-operator
spec:
  channel: v24.9
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
```
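Note that OLM will not act on a Subscription unless its target namespace and an OperatorGroup in that namespace already exist. A sketch for the GPU operator (the RHOAI operator needs the same treatment for `redhat-ods-operator`):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
```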

## Model Serving with vLLM

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite-34b
  namespace: ai-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-runtime
      storageUri: s3://models/granite-34b-code-instruct
      resources:          # resources belong on the model container, not the predictor
        limits:
          nvidia.com/gpu: "2"
          memory: "80Gi"
        requests:
          cpu: "8"
          memory: "64Gi"
```
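Once the InferenceService reports ready, it is worth smoke-testing the endpoint. vLLM exposes an OpenAI-compatible API; the hostname below is a placeholder for whatever route exists in your cluster, and `build_completion_request` is a small helper written for this post, not part of any SDK:

```python
import json
import urllib.request

def build_completion_request(url: str, prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build an OpenAI-style /v1/completions POST request for the vLLM server."""
    body = json.dumps({
        "model": "/mnt/models",   # matches the --model arg in the ServingRuntime
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(url, data=body,
                                  headers={"Content-Type": "application/json"})

# Placeholder route -- substitute the one your cluster creates:
# req = build_completion_request(
#     "https://granite-34b-ai-serving.apps.example.com/v1/completions",
#     "Write a Python function that reverses a string.",
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```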

### Custom vLLM Runtime

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-runtime
spec:
  containers:
  - name: vllm
    image: quay.io/modh/vllm:latest
    args:
    - "--model=/mnt/models"
    - "--max-model-len=8192"
    - "--tensor-parallel-size=2"
    - "--gpu-memory-utilization=0.9"
    - "--enable-chunked-prefill"
    ports:
    - containerPort: 8000
      protocol: TCP
    volumeMounts:
    - name: shm
      mountPath: /dev/shm
  volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 12Gi
  supportedModelFormats:
  - name: vLLM
    autoSelect: true
```

## Model Selection Guide

| Model | Size | GPU Requirement | Best For |
|---|---|---|---|
| Granite 8B | 8B | 1x A100 40GB | Code generation, general tasks |
| Granite 34B | 34B | 2x A100 80GB | Complex reasoning, RAG |
| Llama 3.1 70B | 70B | 4x A100 80GB | Maximum capability |
| Mistral 7B | 7B | 1x T4/A10 | Cost-effective inference |
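The GPU column follows from simple arithmetic: at fp16/bf16 each parameter costs 2 bytes for the weights alone, and the KV cache, activations, and CUDA overhead come on top. A back-of-the-envelope sketch:

```python
def weight_memory_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate GPU memory for model weights alone.

    fp16/bf16 uses 2 bytes per parameter; KV cache, activations,
    and CUDA context add a substantial margin on top of this.
    """
    return params_billion * 1e9 * bytes_per_param / 2**30

# Granite 34B in bf16: ~63 GiB of weights alone, which is why the
# table pairs it with 2x A100 80GB rather than a single card.
print(round(weight_memory_gib(34)))   # -> 63
print(round(weight_memory_gib(7)))    # -> 13
```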

## Autoscaling Configuration

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: granite-34b-predictor
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting
      target:
        type: AverageValue
        averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
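One caveat: a Pods-type metric only works if something bridges vLLM's Prometheus metrics into the Kubernetes custom-metrics API (prometheus-adapter or KEDA, for example). A hedged prometheus-adapter rule sketch, assuming vLLM publishes the gauge as `vllm:num_requests_waiting` (the exact series name depends on your vLLM version) and renaming it to what the HPA expects:

```yaml
rules:
- seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "vllm:num_requests_waiting"
    as: "vllm_num_requests_waiting"
  metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```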

## Production Checklist

- GPU node pools with proper taints/tolerations
- Model stored in S3/ODF with fast retrieval
- vLLM configured with tensor parallelism for large models
- HPA based on queue depth (not CPU)
- Prometheus monitoring for token throughput and latency
- Rate limiting per user/team
- API key authentication
- Response caching for common queries
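Per-user rate limiting from the checklist typically lives in an API gateway or a thin proxy in front of the model server. A minimal token-bucket sketch of the idea (hypothetical helper, not part of OpenShift AI):

```python
import time

class TokenBucket:
    """Per-key token bucket: refills at `rate` tokens/sec, up to `burst`."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}  # one bucket per API key / team

def allow_request(api_key: str, rate: float = 5.0, burst: int = 10) -> bool:
    """Admit or reject one request for the given API key."""
    bucket = buckets.setdefault(api_key, TokenBucket(rate, burst))
    return bucket.allow()
```

In production the same shape is usually delegated to a gateway (e.g. an Envoy or 3scale rate-limit policy) rather than application code, but the accounting is the same.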

Deploying LLMs on OpenShift? I help organizations build production AI serving platforms. Get in touch.


Luca Berton

AI & Cloud Advisor with 18+ years experience. Author of 8 technical books, creator of Ansible Pilot, and instructor at CopyPasteLearn Academy. Speaker at KubeCon EU & Red Hat Summit 2026.
