Kubernetes AI Conformance: Platform Engineer Checklist

Practical checklist for Kubernetes AI Conformance. GPU scheduling, DRA, network performance, storage, autoscaling, and agentic workload requirements.

Luca Berton
· 3 min read

The CNCF Kubernetes AI Conformance Program defines what "AI-ready" means. But what does a platform engineer actually need to implement? Here is the technical checklist.

Accelerator Management

Device Plugin Support

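# Verify the GPU device plugin is running (the label and namespace depend on how it was installed)
kubectl get pods -n kube-system -l app=nvidia-device-plugin

# Check allocatable GPUs per node (nodes without GPUs print 0)
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name)\t\(.status.allocatable["nvidia.com/gpu"] // "0")"'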

Requirements:

  • NVIDIA device plugin (or equivalent for AMD, Intel, Habana)
  • GPU resource requests and limits in pod specs
  • GPU health monitoring and automatic draining of unhealthy nodes
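
The second requirement, GPU requests and limits in pod specs, could look like the minimal sketch below. The pod name and the image tag are assumptions for illustration, not values from the conformance program. Extended resources such as nvidia.com/gpu are set under limits; if requests is also given, it has to equal limits.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag; substitute your own
      command: ["nvidia-smi"]     # prints the GPUs visible inside the container
      resources:
        limits:
          nvidia.com/gpu: 1       # extended resources are whole integers
```

If the pod stays Pending, the scheduler found no node with a free nvidia.com/gpu, which usually points back to the device plugin checks above.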

Dynamic Resource Allocation (DRA)

DRA is the next-generation alternative to device plugins:

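# v1alpha3 is the DRA API version in Kubernetes 1.31; newer releases serve later versions
# (check with: kubectl api-versions | grep resource.k8s.io)
apiVersion: resource.k8s.io/v1alpha3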
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        count: 2
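
A claim only reserves devices; a pod still has to reference it. The sketch below shows one way a pod might consume gpu-claim, assuming the Kubernetes 1.31+ pod fields for DRA. The pod name, the pod-local claim name, and the image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dra-consumer              # hypothetical name
spec:
  resourceClaims:
    - name: gpus                  # pod-local name for the claim
      resourceClaimName: gpu-claim   # the ResourceClaim shown above
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # placeholder image
      resources:
        claims:
          - name: gpus            # gives this container access to the claimed devices
```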

For a deep dive on GPU management, see NVIDIA GPU Operator for Kubernetes.

Scheduling

Topology-Aware Scheduling

AI training jobs need GPUs that are physically close (same node, same NVLink fabric):

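apiVersion: v1
kind: Pod
metadata:
  name: training-worker
  labels:
    app: training
spec:
  # Spreads matching pods evenly across zones; to pack workers together, use podAffinity instead
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels: {app: training}
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest  # placeholder
      resources:
        limits:
          nvidia.com/gpu: 4  # GPUs for one container always come from one node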
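
When the goal is to keep the workers of one job close together rather than spread out, required pod affinity is one option. This is a sketch under assumptions: the job-name label value and the image are hypothetical, and the zone key is used only because it is a well-known node label. Packing by rack or NVLink domain would need a node label that your platform provides.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0         # hypothetical name
  labels:
    job-name: llm-training        # hypothetical label shared by all workers
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              job-name: llm-training
          # All workers must land in the same zone as each other
          topologyKey: topology.kubernetes.io/zone
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 4       # one container's GPUs are always on one node
```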

Gang Scheduling

Training jobs need all pods scheduled simultaneously or not at all:

# Volcano PodGroup; the scheduler-plugins Coscheduling plugin defines its own PodGroup CRD
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-job
spec:
  minMember: 8
  queue: default
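
In practice the gang is often declared through Volcano's own Job kind, which manages a PodGroup on your behalf. The sketch below assumes Volcano is installed with its default queue. The job name and the image are placeholders.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training      # hypothetical name
spec:
  schedulerName: volcano
  minAvailable: 8                 # gang size: start only when 8 pods can be placed
  queue: default
  tasks:
    - name: worker
      replicas: 8
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: registry.example.com/trainer:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```

With minAvailable equal to replicas, no worker starts until the whole job fits, which avoids partially scheduled jobs holding GPUs while they wait.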

Priority and Preemption

Inference workloads should preempt batch training when capacity is tight:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-critical
value: 1000000
preemptionPolicy: PreemptLowerPriority
description: "Production inference workloads"
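
A priority class has no effect until workloads reference it. The sketch below pairs a lower-priority class for training with an inference Deployment that uses inference-critical. The batch-training class, its value, the labels, and the image are assumptions for illustration, and model arguments are omitted.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-training            # hypothetical lower-priority class
value: 1000
preemptionPolicy: Never           # training pods wait in the queue rather than evict others
description: "Preemptible batch training workloads"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      priorityClassName: inference-critical   # class defined above
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest      # assumed image
          resources:
            limits:
              nvidia.com/gpu: 1
```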

Networking

High-Bandwidth Interconnect

Distributed training needs RDMA or equivalent:

  • InfiniBand / RoCE v2 support
  • SR-IOV for network device passthrough
  • Network policies that do not block training traffic
  • MTU 9000 (jumbo frames) for training networks
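
For the network policy item, one approach is to explicitly allow traffic between the workers of a job. The names, namespace, and label below are hypothetical. Once a pod is selected by an Egress policy, anything not allowed is dropped, so DNS and storage traffic would need their own rules. Standard NetworkPolicy typically applies only to the primary pod network, not to secondary SR-IOV or RDMA interfaces.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-training-peers      # hypothetical name
  namespace: training             # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      job-name: llm-training      # hypothetical label on all workers
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              job-name: llm-training   # accept traffic from peer workers
  egress:
    - to:
        - podSelector:
            matchLabels:
              job-name: llm-training   # allow traffic to peer workers
```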

For SR-IOV details, see SR-IOV NIC Cluster Policy for Kubernetes.

Model-Aware Load Balancing

Inference endpoints need routing that understands model readiness:

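apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
    - name: inference-gateway   # placeholder: the Gateway this route attaches to
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/completions
      backendRefs:
        - name: vllm-service
          port: 8000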
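
A route on its own does not know whether a model has finished loading. One simple way to approximate model readiness is to gate Service endpoints with probes, as in this sketch. The /health path, the image, and the timing values are assumptions; check them against the server you run.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumed image
          ports:
            - containerPort: 8000
          startupProbe:
            httpGet:
              path: /health                # assumed health endpoint
              port: 8000
            periodSeconds: 10
            failureThreshold: 60           # allow up to 10 minutes for weights to load
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            failureThreshold: 3            # removed from endpoints after 3 failed checks
```

Pods that fail the readiness probe are removed from the Service endpoints, so the route stops sending them requests until they recover.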

Storage

Model Artifact Storage

  • ReadWriteMany (RWX) volumes for shared model access
  • CSI drivers with snapshot support for model versioning
  • At least 1 TB capacity for large model weights
  • SSD-backed storage for model loading performance

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti  # matches the 1 TB minimum above
  storageClassName: fast-rwx  # placeholder: an RWX-capable, SSD-backed class

Checkpoint Storage

Training checkpoints need durable, fast storage:

  • Periodic checkpoint writes (every N steps)
  • Fast restore for job preemption recovery
  • Lifecycle policies for checkpoint retention
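
If the checkpoint volume sits on a CSI driver with snapshot support, a point-in-time copy can be requested declaratively. The sketch below assumes the external snapshot CRDs and controller are installed. The snapshot name, snapshot class, and PVC name are hypothetical.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: checkpoint-step-1000      # hypothetical name, e.g. one snapshot per milestone
spec:
  volumeSnapshotClassName: csi-snapclass   # hypothetical snapshot class
  source:
    persistentVolumeClaimName: checkpoint-storage   # hypothetical PVC holding checkpoints
```

Retention then becomes a matter of deleting old VolumeSnapshot objects, which a scheduled job or a policy controller could handle.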

Autoscaling

Inference Autoscaling

# KEDA ScaledObject for inference queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 1
  maxReplicaCount: 10
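  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        # vLLM exports this gauge with a colon in its name; metricName is omitted
        # because recent KEDA releases deprecate it for the Prometheus scaler
        query: sum(vllm:num_requests_waiting)
        threshold: "10"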

For standard HPA vs KEDA comparison, see KEDA vs HPA.

Scale-to-Zero

Cost optimization for inference endpoints with variable traffic:

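# KEDA with scale-to-zero
# idleReplicaCount is left out: KEDA requires it to be lower than minReplicaCount
spec:
  minReplicaCount: 0    # lets the workload scale down to zero replicas
  cooldownPeriod: 300   # seconds after the last active trigger before scaling to zero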

Agentic Workload Support

The newest conformance category. Requirements are still evolving but include:

  • Durable execution: workflow engines (Argo Workflows, Temporal) for multi-step agents
  • Tool calling: service mesh or API gateway for external tool integration
  • State persistence: PVCs or external state stores for agent memory
  • Timeout management: configurable timeouts for non-deterministic LLM calls
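
Durable execution and timeout management can be combined in a workflow engine. The following Argo Workflows sketch is illustrative only: the image, the command, and all timing values are assumptions, and the conformance program does not prescribe a particular engine.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: agent-run-
spec:
  entrypoint: call-llm
  activeDeadlineSeconds: 1800       # hard limit for the whole agent run
  templates:
    - name: call-llm
      activeDeadlineSeconds: 120    # per-step timeout for a single LLM call
      retryStrategy:
        limit: "3"                  # retry a failed or timed-out step up to 3 times
        retryPolicy: Always
        backoff:
          duration: "10s"
          factor: "2"               # 10s, 20s, 40s between attempts
      container:
        image: registry.example.com/agent-step:latest   # placeholder image
        command: ["python", "step.py"]                   # hypothetical entry point
```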

Implementation Checklist

| Category | Requirement | Priority |
|---|---|---|
| GPU | Device plugin installed and healthy | Critical |
| GPU | DRA support (K8s 1.31+) | Recommended |
| GPU | MIG or time-slicing configured | Recommended |
| Scheduling | Topology-aware placement | Critical |
| Scheduling | Gang scheduling plugin | Critical |
| Scheduling | Priority classes for inference | Important |
| Network | RDMA/RoCE for training | Important |
| Network | SR-IOV device plugin | Recommended |
| Storage | RWX volumes for models | Critical |
| Storage | Checkpoint storage with snapshots | Important |
| Autoscaling | GPU-aware HPA or KEDA | Critical |
| Autoscaling | Scale-to-zero for inference | Recommended |
| Monitoring | GPU metrics in Prometheus | Critical |
| Security | GPU isolation between tenants | Critical |

Contribute to the Program

The conformance requirements are open source.

Agentic workload requirements need the most input. If you are building AI agent platforms on Kubernetes, your experience is valuable.

About the Author

I am Luca Berton, AI and Cloud Advisor. I presented on multi-tenant GPU scheduling at KubeCon EU 2026 and help platform teams build AI-ready Kubernetes clusters. Book a consultation to assess your AI platform readiness.
