Book Reference: This article is based on Chapter 2: Setup and Chapter 5: Custom Applications of Practical RHEL AI, covering container-based AI deployments using Podman.
Podman is the container runtime of choice for RHEL AI deployments. Unlike Docker, Podman runs daemonless and supports rootless containers out of the box, both critical features for security-conscious enterprise AI deployments.
Practical RHEL AI recommends Podman for all containerized AI workloads, and this article shows you how.
| Feature | Podman | Docker |
|---|---|---|
| Daemonless | Yes | No |
| Rootless Containers | Native | Limited |
| SELinux Integration | Full | Partial |
| Systemd Integration | Native | Workarounds |
| OCI Compliant | Yes | Yes |
| GPU Support | CDI | nvidia-docker |
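Once Podman is installed (next step), you can confirm the daemonless and rootless behavior from the table directly on your host. A quick sketch, assuming a recent Podman release (the info field names can shift between versions):

# Confirm Podman runs rootless for your user (prints "true" when it does)
podman info --format '{{.Host.Security.Rootless}}'
# There is no daemon to inspect; every podman command runs as an ordinary process under your UID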
# Install Podman and GPU tools
sudo dnf install -y podman podman-plugins nvidia-container-toolkit
# Configure NVIDIA Container Toolkit for Podman
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Verify CDI configuration
nvidia-ctk cdi list
# Test GPU access in container
podman run --rm --device nvidia.com/gpu=all \
docker.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi

# Pull the RHEL AI vLLM image
podman pull registry.redhat.io/rhel-ai/vllm-server:latest
# Run with GPU access
podman run -d \
--name vllm-inference \
--device nvidia.com/gpu=all \
-p 8000:8000 \
-v /opt/models:/models:ro,Z \
registry.redhat.io/rhel-ai/vllm-server:latest \
--model /models/granite-7b-instruct \
--host 0.0.0.0 \
--port 8000
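With the container running, it's worth a quick smoke test before wiring anything else to it. This assumes the image serves vLLM's OpenAI-compatible API on the published port (the /v1/models and /v1/completions routes):

# Smoke-test the inference endpoint (assumes vLLM's OpenAI-compatible API)
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "/models/granite-7b-instruct", "prompt": "Hello", "max_tokens": 32}'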
# Run as non-root user (no sudo!)
podman run -d \
--name vllm-rootless \
--device nvidia.com/gpu=all \
-p 8000:8000 \
-v ~/models:/models:ro,Z \
registry.redhat.io/rhel-ai/vllm-server:latest \
--model /models/granite-7b-instruct
# Verify running as non-root
podman top vllm-rootless user
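If the rootless run fails to start, the usual culprit is missing user namespace mappings rather than Podman itself. Check that your user has subordinate UID/GID ranges assigned:

# Rootless containers require subuid/subgid ranges for the invoking user
grep "^$USER:" /etc/subuid /etc/subgid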
# Containerfile.inference
FROM registry.redhat.io/rhel-ai/vllm-server:latest
LABEL maintainer="[email protected]"
LABEL version="1.0"
LABEL description="Custom fine-tuned Granite model"
# Copy fine-tuned model weights
COPY --chown=1001:1001 ./model-weights /opt/model
# Copy custom configuration
COPY vllm-config.yaml /opt/vllm/config.yaml
# Set environment variables
ENV MODEL_PATH=/opt/model
ENV VLLM_CONFIG=/opt/vllm/config.yaml
# Health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
# Run vLLM server
CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
"--config", "/opt/vllm/config.yaml"]# Build the container
# Build the container
podman build -t my-registry.com/ai-models/granite-custom:v1.0 \
-f Containerfile.inference .
# Test locally
podman run --rm -d \
--name test-inference \
--device nvidia.com/gpu=all \
-p 8000:8000 \
my-registry.com/ai-models/granite-custom:v1.0
# Push to registry
podman push my-registry.com/ai-models/granite-custom:v1.0

# Create pod with shared network
podman pod create \
--name ai-stack \
-p 8000:8000 \
-p 9090:9090 \
-p 3000:3000
# Add vLLM inference server
podman run -d \
--pod ai-stack \
--name vllm \
--device nvidia.com/gpu=all \
-v /opt/models:/models:ro,Z \
registry.redhat.io/rhel-ai/vllm-server:latest \
--model /models/granite-7b-instruct
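The Prometheus container in the next step mounts a ./prometheus.yml that isn't otherwise shown. A minimal sketch, assuming the vLLM server exposes Prometheus metrics at /metrics (containers in the same pod share a network namespace, so localhost works):

# Minimal ./prometheus.yml: scrape vLLM over the pod's shared localhost
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: vllm
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:8000']
EOF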
# Add Prometheus for monitoring
podman run -d \
--pod ai-stack \
--name prometheus \
-v ./prometheus.yml:/etc/prometheus/prometheus.yml:ro,Z \
prom/prometheus:latest
# Add Grafana for visualization
podman run -d \
--pod ai-stack \
--name grafana \
-v grafana-data:/var/lib/grafana:Z \
grafana/grafana:latest
# Check pod status
podman pod ps
podman ps --pod

# ai-stack-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai-stack
  labels:
    app: rhel-ai
spec:
  containers:
  - name: vllm
    image: registry.redhat.io/rhel-ai/vllm-server:latest
    args:
    - "--model"
    - "/models/granite-7b-instruct"
    ports:
    - containerPort: 8000
    volumeMounts:
    - name: models
      mountPath: /models
      readOnly: true
    resources:
      limits:
        nvidia.com/gpu: 1
  - name: prometheus
    image: prom/prometheus:latest
    ports:
    - containerPort: 9090
    volumeMounts:
    - name: prometheus-config
      mountPath: /etc/prometheus
  - name: grafana
    image: grafana/grafana:latest
    ports:
    - containerPort: 3000
  volumes:
  - name: models
    hostPath:
      path: /opt/models
  - name: prometheus-config
    hostPath:
      path: /opt/prometheus

# Deploy from YAML
podman play kube ai-stack-pod.yaml
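You can also go the other way: Podman will generate this kind of YAML from a running pod, and the same file can tear the stack back down (newer releases spell these as podman kube generate/play/down, so adjust to your version):

# Generate Kubernetes YAML from the running pod
podman generate kube ai-stack > ai-stack-pod.yaml
# Tear down everything created from the YAML
podman play kube --down ai-stack-pod.yaml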
# Generate service file from running container
podman generate systemd --new --name vllm-inference \
> ~/.config/systemd/user/vllm-inference.service
# Or for system-wide use (requires root; use tee because the shell redirect itself runs unprivileged)
sudo podman generate systemd --new --name vllm-inference \
| sudo tee /etc/systemd/system/vllm-inference.service
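For the rootless variant, the generated unit is managed per user; enabling lingering keeps it running after you log out (the unit name matches the filename used above):

# Enable the rootless unit for the current user
systemctl --user daemon-reload
systemctl --user enable --now vllm-inference.service
# Keep user services running after logout
loginctl enable-linger $USER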
# /etc/systemd/system/rhel-ai-vllm.service
[Unit]
Description=RHEL AI vLLM Inference Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
Restart=always
RestartSec=10
TimeoutStartSec=300
ExecStartPre=-/usr/bin/podman stop vllm-inference
ExecStartPre=-/usr/bin/podman rm vllm-inference
ExecStartPre=/usr/bin/podman pull registry.redhat.io/rhel-ai/vllm-server:latest
ExecStart=/usr/bin/podman run \
--name vllm-inference \
--device nvidia.com/gpu=all \
--publish 8000:8000 \
--volume /opt/models:/models:ro,Z \
registry.redhat.io/rhel-ai/vllm-server:latest \
--model /models/granite-7b-instruct
ExecStop=/usr/bin/podman stop vllm-inference
ExecStopPost=/usr/bin/podman rm -f vllm-inference
[Install]
WantedBy=multi-user.target

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable --now rhel-ai-vllm.service
sudo systemctl status rhel-ai-vllm.service
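If the service fails to come up, GPU access and model paths are the usual suspects; the journal carries the container's stdout and stderr:

# Follow the service logs and confirm the container is running
sudo journalctl -u rhel-ai-vllm.service -f
sudo podman ps --filter name=vllm-inference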
# Ensure proper SELinux labels on model directory
sudo semanage fcontext -a -t container_file_t '/opt/models(/.*)?'
sudo restorecon -Rv /opt/models
# Verify SELinux context
ls -laZ /opt/models

# Run with memory and CPU limits
podman run -d \
--name vllm-limited \
--device nvidia.com/gpu=all \
--memory 64g \
--cpus 16 \
--pids-limit 1000 \
-v /opt/models:/models:ro,Z \
-p 8000:8000 \
registry.redhat.io/rhel-ai/vllm-server:latest \
--model /models/granite-7b-instruct

# Enhanced security with read-only root
podman run -d \
--name vllm-secure \
--device nvidia.com/gpu=all \
--read-only \
--tmpfs /tmp:rw,noexec,nosuid \
-v /opt/models:/models:ro,Z \
-p 8000:8000 \
registry.redhat.io/rhel-ai/vllm-server:latest

# Create dedicated network
podman network create ai-network
# Run containers on custom network
podman run -d \
--name vllm \
--network ai-network \
--device nvidia.com/gpu=all \
registry.redhat.io/rhel-ai/vllm-server:latest
podman run -d \
--name api-gateway \
--network ai-network \
-p 80:80 \
nginx:latest
# Containers can communicate by name
# api-gateway can reach vllm at http://vllm:8000
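To make that name-based connectivity concrete, here is an illustrative reverse-proxy config for the api-gateway container; the file name is arbitrary, and you would mount it into the nginx container (for example with -v ./vllm-proxy.conf:/etc/nginx/conf.d/default.conf:ro,Z):

# Illustrative nginx config: proxy the gateway to the vLLM container by name
cat > vllm-proxy.conf <<'EOF'
server {
    listen 80;
    location / {
        proxy_pass http://vllm:8000;
        proxy_set_header Host $host;
    }
}
EOF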
# Real-time stats
podman stats vllm-inference
# Output:
# ID NAME CPU % MEM USAGE / LIMIT NET I/O BLOCK I/O
# abc123def456 vllm-inference 45.2% 48.5GiB / 64GiB 1.2GB / 500MB 50MB / 0B
# GPU monitoring inside container
podman exec vllm-inference nvidia-smi dmon -s mu

This article covers material from Chapter 2: Setup and Chapter 5: Custom Applications of Practical RHEL AI.
Ready to containerize your AI infrastructure?
Practical RHEL AI provides complete container deployment guidance and shows you how to build secure, scalable, containerized AI infrastructure with Podman.