The CNCF Kubernetes AI Conformance Program defines what "AI-ready" means. But what does a platform engineer actually need to implement? Here is the technical checklist.
Accelerator Management
Device Plugin Support
```bash
# Verify the GPU device plugin is running
# (the namespace and label depend on how the plugin was installed)
kubectl get pods -n kube-system -l app=nvidia-device-plugin

# Check allocatable GPUs on each node
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
```
Requirements:
- NVIDIA device plugin (or equivalent for AMD, Intel, Habana)
- GPU resource requests and limits in pod specs
- GPU health monitoring and automatic draining of unhealthy nodes
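The allocatable check can also be scripted, which makes it easy to flag nodes that advertise no GPUs. The following is a minimal Python sketch, assuming the output of `kubectl get nodes -o json` is available as a parsed document. The function name `allocatable_gpus` and the sample node data are hypothetical; only the `nvidia.com/gpu` resource name comes from the commands in this section.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"  # resource name advertised by the NVIDIA device plugin


def allocatable_gpus(node_list: dict) -> dict:
    """Map node name -> allocatable GPU count from a `kubectl get nodes -o json` document."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities are strings in the API; nodes without a device plugin omit the key.
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


# Hypothetical sample standing in for real kubectl output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node-1"}, "status": {"allocatable": {"nvidia.com/gpu": "8"}}},
  {"metadata": {"name": "cpu-node-1"}, "status": {"allocatable": {"cpu": "32"}}}
]}
""")

counts = allocatable_gpus(sample)
missing = [name for name, n in counts.items() if n == 0]
print(counts, missing)
```

A node that is expected to carry GPUs but appears in `missing` usually points to a device plugin pod that is not running or a driver problem on that node.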
Dynamic Resource Allocation (DRA)
DRA is the next-generation alternative to device plugins:
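```yaml
# v1alpha3 is the version served by Kubernetes 1.31; later releases serve newer resource.k8s.io versions
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        count: 2
```
For a deep dive on GPU management, see NVIDIA GPU Operator for Kubernetes.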
Scheduling
Topology-Aware Scheduling
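AI training jobs need GPUs that are physically close (same node, same NVLink fabric). GPUs requested by a single pod always come from one node; pod affinity keeps the other pods of the job in the same topology domain (the zone in this example; the pod name, label, and image are placeholders):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
  labels:
    job: training
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              job: training
          topologyKey: topology.kubernetes.io/zone
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest
      resources:
        limits:
          nvidia.com/gpu: 4
```
Gang Scheduling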
Training jobs need all pods scheduled simultaneously or not at all:
```yaml
# Volcano PodGroup; the Coscheduling plugin defines its own PodGroup API group
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-job
spec:
  minMember: 8
  queue: default
```
Priority and Preemption
Inference workloads should preempt batch training when capacity is tight:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-critical
value: 1000000
preemptionPolicy: PreemptLowerPriority
description: "Production inference workloads"
```
Networking
High-Bandwidth Interconnect
Distributed training needs RDMA or equivalent:
- InfiniBand / RoCE v2 support
- SR-IOV for network device passthrough
- Network policies that do not block training traffic
- MTU 9000 (jumbo frames) for training networks
For SR-IOV details, see SR-IOV NIC Cluster Policy for Kubernetes.
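The jumbo-frame item is easier to justify with a little arithmetic. The sketch below estimates how much of each on-wire frame carries payload at a given MTU. It assumes plain TCP over IPv4 on Ethernet with no header options; RDMA transports such as RoCE v2 use different headers, so treat the numbers as an illustration of the trend rather than a measurement of training traffic. The function name `payload_efficiency` is hypothetical.

```python
ETHERNET_OVERHEAD = 38  # bytes: preamble 8 + header 14 + FCS 4 + inter-frame gap 12
IP_TCP_HEADERS = 40     # bytes: IPv4 20 + TCP 20, no options (assumption)


def payload_efficiency(mtu: int) -> float:
    """Fraction of on-wire bytes that carry application payload for one full frame."""
    payload = mtu - IP_TCP_HEADERS
    on_wire = mtu + ETHERNET_OVERHEAD
    return payload / on_wire


for mtu in (1500, 9000):
    print(mtu, f"{payload_efficiency(mtu):.4f}")
```

Under these assumptions the payload share rises from roughly 95% at MTU 1500 to roughly 99% at MTU 9000, and the number of packets per transferred gigabyte drops by about a factor of six, which reduces per-packet processing on hosts and switches.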
Model-Aware Load Balancing
Inference endpoints need routing that understands model readiness:
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  rules:
    - matches:
        - path:
            value: /v1/completions
      backendRefs:
        - name: vllm-service
          port: 8000
```
Storage
Model Artifact Storage
- ReadWriteMany (RWX) volumes for shared model access
- CSI drivers with snapshot support for model versioning
- At least 1 TB capacity for large model weights
- SSD-backed storage for model loading performance
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti   # sized to the 1 TB minimum listed above
  storageClassName: fast-rwx
```
Checkpoint Storage
Training checkpoints need durable, fast storage:
- Periodic checkpoint writes (every N steps)
- Fast restore for job preemption recovery
- Lifecycle policies for checkpoint retention
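The three checkpoint requirements can be sketched together in a few lines of Python: write every N steps, keep only the newest few, and restore the latest one on restart. This is an illustration under stated assumptions, not a replacement for a training framework's own checkpointing. The file naming scheme, the JSON payload, `CHECKPOINT_EVERY`, and the function names are all hypothetical, and a temporary directory stands in for the mounted checkpoint volume.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(directory: Path, step: int, state: dict, keep_last: int = 3) -> None:
    """Write a checkpoint for `step`, then delete all but the newest `keep_last`."""
    path = directory / f"ckpt-{step:08d}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"step": step, "state": state}))
    tmp.replace(path)  # rename into place so a preempted job never leaves a partial file
    checkpoints = sorted(directory.glob("ckpt-*.json"))
    for old in checkpoints[:-keep_last]:
        old.unlink()  # retention policy: drop everything older than the newest keep_last


def restore_latest(directory: Path):
    """Return the newest checkpoint payload, or None when starting fresh."""
    checkpoints = sorted(directory.glob("ckpt-*.json"))
    if not checkpoints:
        return None
    return json.loads(checkpoints[-1].read_text())


CHECKPOINT_EVERY = 100  # the "N" in "every N steps"; hypothetical value

with tempfile.TemporaryDirectory() as d:
    ckpt_dir = Path(d)
    for step in range(1, 501):
        if step % CHECKPOINT_EVERY == 0:
            save_checkpoint(ckpt_dir, step, {"loss": 1.0 / step})
    kept = sorted(p.name for p in ckpt_dir.glob("ckpt-*.json"))
    resumed = restore_latest(ckpt_dir)

print(kept, resumed["step"])
```

Zero-padded step numbers make lexical order match step order, so both the retention pass and the restore can rely on a plain sort. On a real cluster the retention step is often delegated to the storage layer's lifecycle rules instead.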
Autoscaling
Inference Autoscaling
```yaml
# KEDA ScaledObject driven by inference queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_num_requests_waiting
        # vLLM exports this gauge with a colon-separated prefix
        query: sum(vllm:num_requests_waiting)
        threshold: "10"
```
For a comparison of standard HPA and KEDA, see KEDA vs HPA.
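To see what the threshold means in practice, the sketch below reproduces the core replica calculation in Python. It assumes the trigger uses KEDA's default `AverageValue` metric type, where the autoscaler aims for `threshold` waiting requests per replica, and it deliberately ignores the HPA tolerance band, stabilization windows, and activation thresholds. The function name `desired_replicas` and the sample queue depths are hypothetical.

```python
import math


def desired_replicas(metric_total: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count for an AverageValue target: ceil(total / threshold), clamped."""
    raw = math.ceil(metric_total / threshold)
    return max(min_replicas, min(max_replicas, raw))


# Values from the ScaledObject above: threshold 10, between 1 and 10 replicas.
for waiting in (0, 8, 45, 250):
    print(waiting, desired_replicas(waiting, 10, 1, 10))
```

With 45 requests waiting the target works out to 5 replicas, while a burst of 250 is capped at `maxReplicaCount`. If the cap is hit regularly, either the ceiling or the per-replica threshold deserves another look.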
Scale-to-Zero
Cost optimization for inference endpoints with variable traffic:
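```yaml
# KEDA with scale-to-zero
spec:
  minReplicaCount: 0    # scale to zero when no trigger is active
  cooldownPeriod: 300   # seconds to wait after the last active trigger
  # idleReplicaCount is omitted: KEDA requires it to be lower than minReplicaCount
```
Agentic Workload Support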
The newest conformance category. Requirements are still evolving but include:
- Durable execution: workflow engines (Argo Workflows, Temporal) for multi-step agents
- Tool calling: service mesh or API gateway for external tool integration
- State persistence: PVCs or external state stores for agent memory
- Timeout management: configurable timeouts for non-deterministic LLM calls
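Timeout management is the most self-contained item in this list, so it makes a convenient example. The sketch below wraps an asynchronous call with a per-attempt timeout and a bounded number of retries using Python's standard `asyncio.wait_for`. The helper name `call_with_timeout` and the two stand-in coroutines are hypothetical; a real agent would replace them with its model client, and would likely add backoff between attempts.

```python
import asyncio


async def call_with_timeout(make_call, timeout_s: float, retries: int = 2):
    """Await make_call() with a per-attempt timeout; retry on timeout, then re-raise."""
    last_error = None
    for _attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc  # wait_for has already cancelled the timed-out attempt
    raise last_error


async def fast_llm_call():
    await asyncio.sleep(0.01)  # stands in for a model call that answers quickly
    return "ok"


async def slow_llm_call():
    await asyncio.sleep(1.0)  # stands in for a call that hangs past its budget
    return "too late"


async def demo():
    fast = await call_with_timeout(fast_llm_call, timeout_s=0.5)
    try:
        await call_with_timeout(slow_llm_call, timeout_s=0.05, retries=1)
        slow = "completed"
    except asyncio.TimeoutError:
        slow = "timed out"
    return fast, slow


result = asyncio.run(demo())
print(result)
```

Keeping the timeout and retry count as parameters lets the platform expose them as per-agent configuration, which is what "configurable" implies in the requirement.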
Implementation Checklist
| Category | Requirement | Priority |
|---|---|---|
| GPU | Device plugin installed and healthy | Critical |
| GPU | DRA support (K8s 1.31+) | Recommended |
| GPU | MIG or time-slicing configured | Recommended |
| Scheduling | Topology-aware placement | Critical |
| Scheduling | Gang scheduling plugin | Critical |
| Scheduling | Priority classes for inference | Important |
| Network | RDMA/RoCE for training | Important |
| Network | SR-IOV device plugin | Recommended |
| Storage | RWX volumes for models | Critical |
| Storage | Checkpoint storage with snapshots | Important |
| Autoscaling | GPU-aware HPA or KEDA | Critical |
| Autoscaling | Scale-to-zero for inference | Recommended |
| Monitoring | GPU metrics in Prometheus | Critical |
| Security | GPU isolation between tenants | Critical |
Contribute to the Program
The conformance requirements are open source:
- GitHub: cncf/k8s-ai-conformance
- Project: kubernetes-sigs/ai-conformance
- Meetings: SIG Architecture AI Conformance meetings
Agentic workload requirements need the most input. If you are building AI agent platforms on Kubernetes, your experience is valuable.
Related Resources
- Kubernetes AI Conformance Program Overview
- Google at KubeCon EU 2026
- GPU Kubernetes Guide
- NVIDIA GPU Operator
- Kubernetes Monitoring with Prometheus
- Multi-Tenant GPUs on Bare Metal
About the Author
I am Luca Berton, AI and Cloud Advisor. I presented on multi-tenant GPU scheduling at KubeCon EU 2026 and help platform teams build AI-ready Kubernetes clusters.


