Running NVIDIA Dynamo across multiple nodes creates an orchestration problem: prefill workers, decode workers, routers, and control planes must all start together and land on the right hardware. If decode pods run while prefill pods are still pending, you have idle GPUs burning money. If components scatter across racks, cross-node latency kills throughput.
NVIDIA Run:ai v2.23 solves both problems through native Dynamo integration: gang scheduling ensures all components launch atomically, and topology-aware placement co-locates them for minimum latency.
The Coordination Problem
A Dynamo disaggregated inference deployment has multiple tightly coupled components:
Router → Prefill Workers (compute-bound)
       → Decode Workers (memory-bound)
       → Control Plane
Without coordination, standard Kubernetes scheduling creates:
- Partial deployments: decode pods running while prefill pods are pending (GPUs idle, no inference happening)
- Resource fragmentation: partially deployed workloads consuming cluster resources while waiting for missing components
- Poor placement: leaders and workers spread across distant nodes, creating cross-rack communication bottlenecks
Gang Scheduling: All-or-Nothing
Run:ai treats a Dynamo deployment as a single scheduling unit. Either all components (prefill leaders/workers, decode leaders/workers, router) are placed simultaneously, or the entire deployment waits until sufficient resources are available.
Benefits:
- No partial deployments: eliminates idle GPU waste from half-started workloads
- Higher cluster utilization: no resource fragmentation from pending components
- Reduced cold start: entire workloads launch atomically when resources free up
- Zero configuration: the scheduler manages coordination automatically
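To make the all-or-nothing rule concrete, here is a toy sketch in shell. It is not Run:ai's scheduling logic, which also has to account for per-node capacity, quotas, and topology; the function name `gang_admit` and the GPU counts are made up for illustration.

```shell
# Toy all-or-nothing admission check (illustrative only, not Run:ai's scheduler).
# Usage: gang_admit FREE_GPUS DEMAND...
# Prints "admit" if the summed GPU demand fits in FREE_GPUS, otherwise "wait".
gang_admit() {
  local free="$1"
  shift
  local total=0
  local demand
  for demand in "$@"; do
    total=$((total + demand))
  done
  if [ "$total" -le "$free" ]; then
    echo "admit"   # every component can be placed at once
  else
    echo "wait"    # place nothing, so no component starts alone
  fi
}
```

For example, `gang_admit 8 4 2 1` admits a gang that needs 7 GPUs when 8 are free, while `gang_admit 4 4 2 1` holds the whole gang back rather than starting only the 4-GPU component.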
Topology-Aware Placement: Right GPUs, Right Location
Administrators define the cluster's physical topology in the Run:ai UI using Kubernetes node labels:
topology.kubernetes.io/region: eu-west
topology.kubernetes.io/zone: eu-west-1a
topology.kubernetes.io/rack: rack-42
The scheduler then co-locates Dynamo components at the closest available tier:
- Preferred: Try same rack first (lowest latency)
- Fallback: Same zone if rack placement is not possible
- Last resort: Same region
For disaggregated inference, this is critical: NIXL transfers the KV cache between prefill and decode GPUs, and those transfers are much faster over NVLink or same-rack InfiniBand than over cross-rack Ethernet.
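The rack, zone, region preference order can be pictured with a small shell sketch. This is only an illustration of the tiering idea, not how the scheduler places pods; the helper name `shared_tier` and its input format (one node per line as region, zone, rack) are assumptions made for the example.

```shell
# Toy illustration: report the narrowest topology tier shared by a set of nodes.
# Reads one node per line on stdin as "<region> <zone> <rack>".
# Prints "rack", "zone", "region", or "none".
shared_tier() {
  awk '
    NR == 1 {
      region = $1; zone = $2; rack = $3
      same_region = same_zone = same_rack = 1
      next
    }
    {
      if ($1 != region) same_region = 0
      if ($2 != zone)   same_zone = 0
      if ($3 != rack)   same_rack = 0
    }
    END {
      if (NR == 0)                                    print "none"
      else if (same_region && same_zone && same_rack) print "rack"
      else if (same_region && same_zone)              print "zone"
      else if (same_region)                           print "region"
      else                                            print "none"
    }'
}
```

Feeding it two nodes in the same zone but different racks prints `zone`, which corresponds to the fallback tier in the list above.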
Step-by-Step Deployment
Prerequisites
- Kubernetes cluster with Run:ai v2.23 installed
- A project (e.g., runai-project-a) initialized
- Helm installed
- HuggingFace token as K8s secret:
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN='<your-token>' \
  -n runai-project-a
Step 1: Configure Network Topology
Label your nodes with proximity indicators:
kubectl label node gpu-node-01 topology.kubernetes.io/zone=eu-west-1a
kubectl label node gpu-node-02 topology.kubernetes.io/zone=eu-west-1a
kubectl label node gpu-node-03 topology.kubernetes.io/zone=eu-west-1b
In the Run:ai UI:
- Open cluster settings
- Add label keys (topology.kubernetes.io/zone, topology.kubernetes.io/region)
- Create a topology, ordering the keys from closest to farthest
- Attach the topology to the relevant node pool(s)
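Before relying on the topology, it can help to confirm that the labels actually landed on the nodes. The two helpers below are a minimal sketch: the function names are made up, while `-L` (label columns) and the `!key` selector are standard kubectl features.

```shell
# Show every node with its region and zone labels as extra columns.
show_topology_labels() {
  kubectl get nodes \
    -L topology.kubernetes.io/region \
    -L topology.kubernetes.io/zone
}

# List nodes that have no zone label at all.
nodes_missing_zone() {
  kubectl get nodes -l '!topology.kubernetes.io/zone' -o name
}
```

An empty result from `nodes_missing_zone` suggests every node carries a zone label; any node it prints would be invisible to zone-level placement.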
Step 2: Install Dynamo CRDs and Platform
export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
export NAMESPACE=dynamo-cloud
export RELEASE_VERSION=0.5.1
kubectl create namespace $NAMESPACE
# CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-$RELEASE_VERSION.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz \
  --namespace dynamo-cloud
# Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$RELEASE_VERSION.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE} \
  --set dynamo-operator.namespaceRestriction.enabled=false
Verify:
kubectl -n $NAMESPACE get pods
Step 3: Deploy Disaggregated vLLM
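Before applying the deployment, it may be worth blocking until the platform pods from Step 2 report Ready instead of polling `get pods` by hand. The helper below is a sketch built on the standard `kubectl wait` command; the function name and the 300s default timeout are arbitrary choices, and pods belonging to completed Jobs never become Ready, so the call can time out in namespaces that contain them.

```shell
# Block until all pods in a namespace are Ready, or the timeout expires.
# Usage: wait_for_pods_ready NAMESPACE [TIMEOUT]
wait_for_pods_ready() {
  kubectl -n "$1" wait --for=condition=Ready pods --all --timeout="${2:-300s}"
}
```

A typical call would be `wait_for_pods_ready "$NAMESPACE"`, followed by the deployment step only if it exits successfully.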
Download the example YAML from Dynamo and add Run:ai annotations:
metadata:
  namespace: runai-project-a
  annotations:
    kai.scheduler/topology-preferred-placement: "topology.kubernetes.io/zone"
    kai.scheduler/topology: "topology-1"
For strict co-location (pods must be in the same zone or deployment fails):
annotations:
  kai.scheduler/topology-required-placement: "topology.kubernetes.io/zone"
  kai.scheduler/topology: "topology-1"
Apply:
kubectl apply -f disagg.yaml
Step 4: Verify Deployment
All components should be running, with prefill and decode pods scheduled in the same zone:
NAME                                             READY   STATUS    AGE
vllm-disagg-frontend-79f459c95-57fm6             1/1     Running   30m
vllm-disagg-vllmdecodeworker-6c8d64f569-56phf    1/1     Running   30m
vllm-disagg-vllmprefillworker-755cb88fcf-pflb5   1/1     Running   30m
Step 5: Test
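Before sending traffic, it is worth confirming that the pods really share a zone, since the pod listing alone does not show it. The sketch below joins each pod to its node's zone label using standard kubectl JSONPath output; the function name `pod_zones` and the output layout are assumptions for illustration.

```shell
# For each pod in a namespace, print "<pod> <node> <zone>".
# Usage: pod_zones NAMESPACE
pod_zones() {
  local ns="$1"
  local pod node zone
  kubectl -n "$ns" get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' |
  while read -r pod node; do
    if [ -z "$node" ]; then
      # Pending pods have no node assigned yet.
      echo "$pod <unscheduled> -"
      continue
    fi
    # Dots inside the label key must be escaped in JSONPath.
    zone=$(kubectl get node "$node" \
      -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
    echo "$pod $node $zone"
  done
}
```

Piping the result through `awk '{print $3}' | sort -u` collapses it to the distinct zones in use; a single line of output means every scheduled pod sits in the same zone.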
Port-forward and send a request:
kubectl -n runai-project-a port-forward \
  pod/vllm-disagg-frontend-79f459c95-57fm6 8000:8000
# In a second terminal (port-forward stays in the foreground):
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is Kubernetes?"}],
    "max_tokens": 100
  }'
The endpoint is OpenAI-compatible: any client that works with OpenAI's API works here.
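The frontend pod name used for port-forwarding carries a generated suffix, so it will differ in every deployment. A small lookup avoids copying it by hand. This is a sketch that assumes the frontend pod's name contains `-frontend-`, as in the sample listing from Step 4; the function name is made up.

```shell
# Print the first pod in a namespace whose name contains "-frontend-",
# in the "pod/<name>" form that kubectl port-forward accepts.
# Usage: frontend_pod NAMESPACE
frontend_pod() {
  kubectl -n "$1" get pods -o name | grep -m 1 -- '-frontend-'
}
```

It can then be substituted directly into the port-forward command: `kubectl -n runai-project-a port-forward "$(frontend_pod runai-project-a)" 8000:8000`.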
Preferred vs Required Placement
| Strategy | Annotation | Behavior |
|---|---|---|
| Preferred | topology-preferred-placement | Best-effort co-location; relaxes to broader tiers if needed |
| Required | topology-required-placement | Strict; deployment waits until all pods fit in the same tier |
Recommendation: Start with preferred for production. Use required only when latency requirements are absolute (for example, multi-node tensor parallelism over InfiniBand).
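Since the two strategies differ only in the annotation key, a manifest can be switched from preferred to required with a one-line substitution. The helper below is a sketch that assumes the manifest contains the annotation exactly as shown in Step 3; the function name is hypothetical, and it writes to stdout so the original file is left untouched.

```shell
# Rewrite the preferred-placement annotation key to the required one.
# Usage: to_required_placement MANIFEST > NEW_MANIFEST
to_required_placement() {
  sed 's|kai\.scheduler/topology-preferred-placement|kai.scheduler/topology-required-placement|' "$1"
}
```

For instance, `to_required_placement disagg.yaml > disagg-required.yaml` produces a strict variant that can be reviewed with `diff` before it is applied.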
How It All Fits Together
Run:ai v2.23
├── Gang Scheduler → All Dynamo pods launch atomically
├── Topology Awareness → Pods co-located on nearest nodes
└── KAI Scheduler → Open source scheduler engine
        ↓
NVIDIA Dynamo
├── SLO Planner → Dynamic GPU reallocation
├── KV-aware Router → Route to cached data
├── NIXL → Low-latency KV cache transfer
└── Disaggregated Prefill/Decode
Run:ai handles where and when pods run. Dynamo handles how inference executes within those pods. Together: predictable latency, maximum GPU utilization, zero partial deployments.
About the Author
I am Luca Berton, AI and Cloud Advisor. I architect multi-node GPU inference platforms with Run:ai and Dynamo for enterprise workloads.