
NVIDIA GPU Operator on Kubernetes: Complete Setup

Step-by-step guide to deploying the NVIDIA GPU Operator on Kubernetes for automated management of GPU drivers, the device plugin, and the container toolkit.

Luca Berton · 2 min read

Managing GPU nodes in Kubernetes used to mean manually installing drivers, container toolkits, and device plugins on every node. The NVIDIA GPU Operator automates all of this through Kubernetes-native operators and custom resources.

What the GPU Operator Manages

The GPU Operator deploys and manages the entire GPU software stack as a set of Kubernetes pods:

  • NVIDIA Driver (or pre-installed driver detection)
  • NVIDIA Container Toolkit (nvidia-container-runtime)
  • NVIDIA Device Plugin (exposes GPUs as schedulable resources)
  • NVIDIA DCGM (GPU monitoring and diagnostics)
  • NVIDIA DCGM Exporter (Prometheus metrics)
  • NVIDIA MIG Manager (Multi-Instance GPU partitioning)
  • NVIDIA GDS Driver (GPUDirect Storage, covered in a separate article)
  • Node Feature Discovery (NFD, for GPU node labeling)

Prerequisites

Before installing the GPU Operator:

  • Kubernetes 1.27+ cluster with GPU nodes
  • Nodes with NVIDIA GPUs (A100, H100, L40S, T4, etc.)
  • Helm 3 installed
  • No pre-existing NVIDIA drivers on GPU nodes (unless using pre-installed driver mode)
# Verify GPU hardware is detected
lspci | grep -i nvidia
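
A couple of extra checks help avoid surprises; these assume shell access to the GPU nodes and a workstation where Helm and kubectl are already configured:

# Confirm Helm 3 and kubectl are available
helm version
kubectl version

# On each GPU node: confirm no NVIDIA kernel module is already loaded
# (skip this check if you plan to use pre-installed driver mode)
lsmod | grep nvidia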

Installation with Helm

# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false \
  --set gds.enabled=false
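
The command above installs the latest chart version. For reproducible installs you can list the published chart versions and pin one with --version; the version shown here is only an example:

# List available chart versions
helm search repo nvidia/gpu-operator --versions

# Pin a specific chart version (example value)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --version v24.9.0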

Verifying the Installation

# Check all GPU Operator pods are running
kubectl get pods -n gpu-operator

# Expected output shows pods for each component:
# gpu-operator-xxxx                    Running
# nvidia-driver-daemonset-xxxx         Running
# nvidia-container-toolkit-xxxx        Running
# nvidia-device-plugin-xxxx            Running
# nvidia-dcgm-exporter-xxxx            Running
# gpu-feature-discovery-xxxx           Running

Verify GPUs are schedulable:

kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
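
You can also inspect a single node; the node name below is a placeholder:

# Show capacity and allocatable resources for one node
kubectl describe node <gpu-node-name> | grep -A 6 "Allocatable"

# Or query the GPU count directly
kubectl get node <gpu-node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'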

Running a GPU Workload

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-test
      image: nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

# Save the manifest as gpu-test.yaml, then:
kubectl apply -f gpu-test.yaml
kubectl logs gpu-test
# Should show nvidia-smi output with your GPU details
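
Once you have confirmed the output, clean up the test pod:

kubectl delete pod gpu-test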

ClusterPolicy Configuration

The GPU Operator is configured through a ClusterPolicy custom resource:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: true
    version: "550.127.05"
    repository: nvcr.io/nvidia
    image: driver
    manager:
      env:
        - name: ENABLE_AUTO_DRAIN
          value: "true"
  toolkit:
    enabled: true
    version: v1.16.2-ubuntu20.04
  devicePlugin:
    enabled: true
    version: v0.16.2
    config:
      name: device-plugin-config
      default: default
  dcgmExporter:
    enabled: true
    version: 3.3.8-3.6.0-ubuntu22.04
  migManager:
    enabled: false
  gds:
    enabled: false
  nodeStatusExporter:
    enabled: true
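
The Helm chart creates this ClusterPolicy for you, so in practice you rarely write it from scratch. To toggle a component after installation you can patch the resource in place; the example below enables the MIG Manager and assumes the default resource name cluster-policy:

# Enable the MIG Manager on an existing installation
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec": {"migManager": {"enabled": true}}}'

Keep in mind that values managed through Helm may overwrite such patches on the next helm upgrade, so prefer updating your Helm values for permanent changes.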

Driver Version Management

Pin a specific driver version to ensure consistency across your cluster:

helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.version="550.127.05"

For air-gapped environments, mirror the driver image to your internal registry:

skopeo copy \
  docker://nvcr.io/nvidia/driver:550.127.05-ubuntu22.04 \
  docker://registry.internal/nvidia/driver:550.127.05-ubuntu22.04
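
After mirroring, point the operator at the internal registry; registry.internal is a placeholder for your own registry hostname:

helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.repository=registry.internal/nvidia \
  --set driver.version="550.127.05"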

Pre-Installed Drivers

If your nodes already have NVIDIA drivers installed (common with cloud providers like GKE, EKS, AKS):

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false

The operator will detect the pre-installed drivers and configure the rest of the stack accordingly.
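
Before disabling the driver container, it is worth confirming on the node itself that the pre-installed driver is loaded and healthy:

# Run on the GPU node
nvidia-smi
lsmod | grep ^nvidia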

Monitoring with Prometheus

The DCGM Exporter provides GPU metrics for Prometheus:

# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      interval: 15s

Key metrics:

  • DCGM_FI_DEV_GPU_UTIL – GPU utilization percentage
  • DCGM_FI_DEV_FB_USED – GPU framebuffer memory used (MiB)
  • DCGM_FI_DEV_FB_FREE – GPU framebuffer memory free (MiB)
  • DCGM_FI_DEV_GPU_TEMP – GPU temperature (°C)
  • DCGM_FI_DEV_POWER_USAGE – Power consumption (watts)
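
With these metrics flowing into Prometheus, you can build alerts on top of them. The PrometheusRule below is a sketch; the alert name, threshold, and durations are example values to adapt to your environment:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nvidia-gpu-alerts
  namespace: gpu-operator
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUHighUtilization
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[10m]) > 90
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "GPU utilization above 90% for 15 minutes"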

Automating with Ansible

For teams managing multiple Kubernetes clusters, Ansible can automate the GPU Operator deployment:

---
- name: Deploy NVIDIA GPU Operator
  hosts: localhost
  vars:
    gpu_operator_version: "v24.9.0"
    driver_version: "550.127.05"
  tasks:
    - name: Add NVIDIA Helm repo
      kubernetes.core.helm_repository:
        name: nvidia
        repo_url: https://helm.ngc.nvidia.com/nvidia

    - name: Install GPU Operator
      kubernetes.core.helm:
        name: gpu-operator
        chart_ref: nvidia/gpu-operator
        release_namespace: gpu-operator
        create_namespace: true
        values:
          driver:
            enabled: true
            version: "{{ driver_version }}"
          toolkit:
            enabled: true
          devicePlugin:
            enabled: true
          dcgmExporter:
            enabled: true

    - name: Wait for GPU Operator pods
      kubernetes.core.k8s_info:
        kind: Pod
        namespace: gpu-operator
        label_selectors:
          - app=gpu-operator
      register: pods
      until: pods.resources | selectattr('status.phase', 'equalto', 'Running') | list | length > 0
      retries: 30
      delay: 10
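
The playbook depends on the kubernetes.core collection and the Python Kubernetes client on the control node; a typical invocation looks like this (the playbook filename is just an example):

# Install the dependencies and run the playbook
ansible-galaxy collection install kubernetes.core
pip install kubernetes
ansible-playbook deploy-gpu-operator.yml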

Troubleshooting

Driver Pod Stuck in Init

# Check driver pod logs
kubectl logs -n gpu-operator nvidia-driver-daemonset-xxxx -c nvidia-driver-ctr

# Common causes:
# - Secure Boot enabled (disable in BIOS or use signed drivers)
# - Kernel headers missing (install kernel-devel package)
# - Conflicting nouveau driver (blacklist it)

GPUs Not Visible

# Check node labels
kubectl get nodes --show-labels | grep nvidia

# Check device plugin logs
kubectl logs -n gpu-operator nvidia-device-plugin-xxxx
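
The operator also ships validator pods that run checks after each component comes up; their logs usually point at the failing layer. The label selector below follows the operator's default naming, so adjust it if your deployment differs:

# Check the operator validator logs
kubectl logs -n gpu-operator -l app=nvidia-operator-validator --all-containers

# Confirm the node carries the nvidia.com labels applied by GPU Feature Discovery
kubectl describe node <gpu-node-name> | grep "nvidia.com"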

Final Thoughts

The GPU Operator transforms GPU node management from a manual, error-prone process into a Kubernetes-native, declarative workflow. Install it once, and every GPU node in your cluster is automatically configured with the right drivers, toolkit, and monitoring. Combined with the GPU Cost Calculator for capacity planning, you have a complete GPU infrastructure management solution.
