
Run:ai + Dynamo: Gang Scheduling for

NVIDIA Run:ai v2.23 integrates with Dynamo for gang scheduling and topology-aware placement. Deploy disaggregated prefill/decode workloads atomically on.

Luca Berton

Running NVIDIA Dynamo across multiple nodes creates an orchestration problem: prefill workers, decode workers, routers, and control planes must all start together and land on the right hardware. If decode pods run while prefill pods are still pending, you have idle GPUs burning money. If components scatter across racks, cross-node latency kills throughput.

NVIDIA Run:ai v2.23 solves both problems through native Dynamo integration: gang scheduling ensures all components launch atomically, and topology-aware placement co-locates them for minimum latency.

The Coordination Problem

A Dynamo disaggregated inference deployment has multiple tightly coupled components:

Router → Prefill Workers (compute-bound)
       → Decode Workers (memory-bound)
       → Control Plane

Without coordination, standard Kubernetes scheduling creates:

  • Partial deployments: decode pods running while prefill pods are pending (GPUs idle, no inference happening)
  • Resource fragmentation: partially deployed workloads consuming cluster resources while waiting for missing components
  • Poor placement: leaders and workers spread across distant nodes, creating cross-rack communication bottlenecks

Gang Scheduling: All-or-Nothing

Run:ai treats a Dynamo deployment as a single scheduling unit. Either all components (prefill leaders/workers, decode leaders/workers, router) are placed simultaneously, or the entire deployment waits until sufficient resources are available.

Benefits:

  • No partial deployments: eliminates idle GPU waste from half-started workloads
  • Higher cluster utilization: no resource fragmentation from pending components
  • Reduced cold start: entire workloads launch atomically when resources free up
  • Zero configuration: the scheduler manages coordination automatically
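
To make the all-or-nothing idea concrete, here is a minimal sketch in Python. It is not Run:ai's actual algorithm: the function name `gang_schedule`, the first-fit strategy, and the GPU counts are all assumptions made for illustration. The point is only that a failed attempt leaves the cluster untouched.

```python
from typing import Dict, List, Optional, Tuple


def gang_schedule(
    free_gpus: Dict[str, int],
    pods: List[Tuple[str, int]],
) -> Optional[Dict[str, str]]:
    """Place every pod or none of them (illustrative first-fit, not Run:ai's logic).

    free_gpus maps node name -> free GPU count.
    pods is a list of (pod name, GPUs requested).
    Returns pod -> node on success, or None if the gang does not fit.
    """
    remaining = dict(free_gpus)  # work on a copy so a failed attempt changes nothing
    placement: Dict[str, str] = {}
    # Try the largest requests first so small pods do not strand big ones.
    for name, need in sorted(pods, key=lambda p: p[1], reverse=True):
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # one pod cannot fit, so the whole gang stays pending
        remaining[node] -= need
        placement[name] = node
    return placement


# Hypothetical cluster and workload sizes.
cluster = {"gpu-node-01": 4, "gpu-node-02": 2}
gang = [("router", 0), ("prefill", 4), ("decode", 2)]

print(gang_schedule(cluster, gang))                      # every pod gets a node
print(gang_schedule(cluster, gang + [("decode-2", 2)]))  # None: nothing is placed
```

In the second call the extra decode pod does not fit, so none of the pods are placed and no GPUs are reserved for a half-started deployment.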

Topology-Aware Placement: Right GPUs, Right Location

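Administrators define the cluster's physical topology in the Run:ai UI using Kubernetes node labels. The region and zone keys are Kubernetes well-known labels; the rack key shown here is not one of them, so the cluster operator has to apply it: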

topology.kubernetes.io/region: eu-west
topology.kubernetes.io/zone: eu-west-1a
topology.kubernetes.io/rack: rack-42

The scheduler then co-locates Dynamo components at the closest available tier:

  1. Preferred: Try same rack first (lowest latency)
  2. Fallback: Same zone if rack placement is not possible
  3. Last resort: Same region

For disaggregated inference, this is critical: NIXL transfers between prefill and decode GPUs benefit enormously from NVLink or same-rack InfiniBand versus cross-rack Ethernet.
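
The rack, then zone, then region fallback can be sketched as a search for the narrowest domain with enough free GPUs. This is a simplified model, not the scheduler's implementation: it sums free GPUs per domain instead of fitting individual pods, and the function name `closest_tier`, the node inventory, and the labels `rack-43` and `rack-44` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Tiers ordered from closest to farthest, using the label keys shown above.
TIERS = [
    "topology.kubernetes.io/rack",
    "topology.kubernetes.io/zone",
    "topology.kubernetes.io/region",
]


def closest_tier(
    nodes: Dict[str, dict],
    gpus_needed: int,
) -> Optional[Tuple[str, str, List[str]]]:
    """Return (tier key, domain value, node names) for the narrowest domain that fits."""
    for tier in TIERS:
        # Group nodes by their label value at this tier.
        domains: Dict[str, List[str]] = defaultdict(list)
        for name, info in nodes.items():
            value = info["labels"].get(tier)
            if value is not None:
                domains[value].append(name)
        # Accept the first domain whose combined free GPUs cover the request.
        for value, members in sorted(domains.items()):
            if sum(nodes[n]["free_gpus"] for n in members) >= gpus_needed:
                return tier, value, members
    return None


def labels(rack: str, zone: str, region: str = "eu-west") -> dict:
    return {TIERS[0]: rack, TIERS[1]: zone, TIERS[2]: region}


# Hypothetical inventory; rack-43 and rack-44 are made up for the example.
nodes = {
    "gpu-node-01": {"free_gpus": 4, "labels": labels("rack-42", "eu-west-1a")},
    "gpu-node-02": {"free_gpus": 4, "labels": labels("rack-43", "eu-west-1a")},
    "gpu-node-03": {"free_gpus": 2, "labels": labels("rack-44", "eu-west-1b")},
}

for need in (4, 8, 10, 12):
    print(need, closest_tier(nodes, need))
```

With this inventory, a 4-GPU request fits in a single rack, 8 GPUs need the whole zone, 10 GPUs need the region, and 12 GPUs do not fit anywhere.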

Step-by-Step Deployment

Prerequisites

  • Kubernetes cluster with Run:ai v2.23 installed
  • A project (e.g., runai-project-a) initialized
  • Helm installed
  • Hugging Face token stored as a Kubernetes secret:
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN='<your-token>' \
  -n runai-project-a

Step 1: Configure Network Topology

Label your nodes with proximity indicators:

kubectl label node gpu-node-01 topology.kubernetes.io/zone=eu-west-1a
kubectl label node gpu-node-02 topology.kubernetes.io/zone=eu-west-1a
kubectl label node gpu-node-03 topology.kubernetes.io/zone=eu-west-1b

In the Run:ai UI:

  1. Open cluster settings
  2. Add label keys (topology.kubernetes.io/zone, topology.kubernetes.io/region)
  3. Create a topology, ordering the keys from closest to farthest
  4. Attach the topology to the relevant node pool(s)

Step 2: Install Dynamo CRDs and Platform

export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
export NAMESPACE=dynamo-cloud
export RELEASE_VERSION=0.5.1

kubectl create namespace $NAMESPACE

# CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE}

# Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE} \
  --set dynamo-operator.namespaceRestriction.enabled=false

Verify:

kubectl -n $NAMESPACE get pods

Step 3: Deploy Disaggregated vLLM

Download the example YAML from Dynamo and add Run:ai annotations:

metadata:
  namespace: runai-project-a
  annotations:
    kai.scheduler/topology-preferred-placement: "topology.kubernetes.io/zone"
    kai.scheduler/topology: "topology-1"

For strict co-location (all pods must land in the same zone; otherwise the deployment stays pending until they can):

annotations:
  kai.scheduler/topology-required-placement: "topology.kubernetes.io/zone"
  kai.scheduler/topology: "topology-1"
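
If you template manifests in code rather than editing YAML by hand, the same annotations can be injected programmatically. The sketch below works on a plain dict (as you would get from any YAML loader). The helper name `with_topology` and the rule of keeping only one placement mode at a time are my own choices for the example, not something Run:ai prescribes.

```python
import copy

# Annotation keys as used in the snippets above.
PREFERRED = "kai.scheduler/topology-preferred-placement"
REQUIRED = "kai.scheduler/topology-required-placement"
TOPOLOGY = "kai.scheduler/topology"


def with_topology(manifest: dict, topology: str, level: str, required: bool = False) -> dict:
    """Return a copy of manifest carrying the topology annotations."""
    patched = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    annotations = patched.setdefault("metadata", {}).setdefault("annotations", {})
    # This sketch keeps exactly one placement mode set at a time.
    annotations.pop(PREFERRED, None)
    annotations.pop(REQUIRED, None)
    annotations[REQUIRED if required else PREFERRED] = level
    annotations[TOPOLOGY] = topology
    return patched


# Minimal stand-in for the downloaded example manifest.
manifest = {"metadata": {"name": "vllm-disagg", "namespace": "runai-project-a"}}

preferred = with_topology(manifest, "topology-1", "topology.kubernetes.io/zone")
strict = with_topology(preferred, "topology-1", "topology.kubernetes.io/zone", required=True)

print(preferred["metadata"]["annotations"])
print(strict["metadata"]["annotations"])
```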

Apply:

kubectl apply -f disagg.yaml

Step 4: Verify Deployment

All components should be running, with prefill and decode pods scheduled in the same zone:

NAME                                              READY  STATUS   AGE
vllm-disagg-frontend-79f459c95-57fm6              1/1    Running  30m
vllm-disagg-vllmdecodeworker-6c8d64f569-56phf     1/1    Running  30m
vllm-disagg-vllmprefillworker-755cb88fcf-pflb5    1/1    Running  30m
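
For a scripted check, a small parser over the `kubectl get pods` text output is enough to confirm that nothing is left half-started. This is a rough sketch: the function name `not_running` is hypothetical, and it assumes the default column order (NAME, READY, STATUS first).

```python
def not_running(kubectl_output: str) -> list:
    """Return names of pods that are not fully ready and Running."""
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    problems = []
    for line in lines[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        have, want = ready.split("/")
        if status != "Running" or have != want:
            problems.append(name)
    return problems


sample = """
NAME                                              READY  STATUS   AGE
vllm-disagg-frontend-79f459c95-57fm6              1/1    Running  30m
vllm-disagg-vllmdecodeworker-6c8d64f569-56phf     1/1    Running  30m
vllm-disagg-vllmprefillworker-755cb88fcf-pflb5    1/1    Running  30m
"""

print(not_running(sample))
```

An empty list means every component is up. With gang scheduling you would expect either that or every pod still pending, never a mix.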

Step 5: Test

Port-forward and send a request:

kubectl -n runai-project-a port-forward \
  pod/vllm-disagg-frontend-79f459c95-57fm6 8000:8000

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is Kubernetes?"}],
    "max_tokens": 100
  }'

The endpoint is OpenAI-compatible, so clients written against OpenAI's API should work here once pointed at this base URL.
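
The same request can be sent from Python using only the standard library. This is a sketch under the assumption that the port-forward above is active on localhost:8000. The helper names `build_chat_request` and `ask` are made up for the example, and the response parsing follows the usual OpenAI chat completions shape.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build the same POST the curl command above sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the first choice's message text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires the port-forward from the previous command to be running:
# print(ask("http://localhost:8000", "Qwen/Qwen3-0.6B", "What is Kubernetes?"))
```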

Preferred vs Required Placement

Strategy   Annotation                    Behavior
Preferred  topology-preferred-placement  Best-effort co-location; relaxes to broader tiers if needed
Required   topology-required-placement   Strict; deployment waits until all pods fit in the same tier

Recommendation: Start with preferred for production. Use required only when latency requirements are absolute, for example multi-node tensor parallelism over InfiniBand.
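
The difference between the two modes comes down to which tiers the scheduler may consider. Below is a minimal sketch of that decision, assuming you already know at which tiers the whole gang currently fits. The function name `resolve_tier` is hypothetical. This article does not cover what happens under preferred placement when no tier fits at all, so the sketch simply returns None in that case.

```python
from typing import List, Optional, Set


def resolve_tier(tiers: List[str], requested: str, fitting: Set[str],
                 required: bool) -> Optional[str]:
    """Pick the tier a gang lands in, or None if no allowed tier fits.

    tiers is ordered closest to farthest; fitting holds the tiers at which
    the whole gang currently fits.
    """
    start = tiers.index(requested)
    # Required placement considers only the requested tier; preferred also
    # considers every broader tier that follows it.
    candidates = tiers[start:start + 1] if required else tiers[start:]
    return next((t for t in candidates if t in fitting), None)


tiers = ["rack", "zone", "region"]
fits_now = {"region"}  # hypothetical: the gang only fits when spread across the region

print(resolve_tier(tiers, "zone", fits_now, required=False))  # relaxes to the broader tier
print(resolve_tier(tiers, "zone", fits_now, required=True))   # None: the deployment waits
```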

How It All Fits Together

Run:ai v2.23
├── Gang Scheduler → All Dynamo pods launch atomically
├── Topology Awareness → Pods co-located on nearest nodes
└── KAI Scheduler → Open source scheduler engine
         │
    NVIDIA Dynamo
    ├── SLO Planner → Dynamic GPU reallocation
    ├── KV-aware Router → Route to cached data
    ├── NIXL → Low-latency KV cache transfer
    └── Disaggregated Prefill/Decode

Run:ai handles where and when pods run. Dynamo handles how inference executes within those pods. Together: predictable latency, maximum GPU utilization, zero partial deployments.

About the Author

I am Luca Berton, AI and Cloud Advisor. I architect multi-node GPU inference platforms with Run:ai and Dynamo for enterprise workloads.