When a distributed workload spans multiple nodes, where those nodes sit in the network determines performance. Pods in the same rack communicate over high-speed switches. Pods across racks hit bandwidth bottlenecks. Pods across zones add milliseconds of latency that compound across thousands of gradient synchronization steps.
NVIDIA Run:ai's topology-aware scheduling solves this by using knowledge of your cluster's physical network layout to place workload pods on nodes that minimize communication overhead.
Why Topology Matters
A modern GPU cluster has a hierarchy:
Region (eu-west)
└── Zone (eu-west-1a)
    └── Block (block-3)
        └── Rack (rack-42)
            └── Node (gpu-node-07)
                └── GPU (NVLink domain)

Each level has different bandwidth and latency characteristics:
| Level | Interconnect | Bandwidth | Latency |
|---|---|---|---|
| Same NVLink domain | NVLink | 900 GB/s (NVL72) | Nanoseconds |
| Same rack | InfiniBand HDR/NDR | 200-400 Gb/s | Microseconds |
| Same block | Spine switch | 100-400 Gb/s | Low milliseconds |
| Cross-rack | Leaf-spine fabric | Shared | Higher milliseconds |
| Cross-zone | WAN or provider fabric | Variable | Significant |
A workload placed entirely within one rack gets 10-100x better inter-node bandwidth than one scattered across racks. For distributed training with all-reduce operations or disaggregated inference with NIXL transfers, this difference is enormous.
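To get a feel for the size of that gap, here is a rough bandwidth-only estimate for a ring all-reduce. It is a sketch under stated assumptions, not a benchmark: the 10 GB payload, the 8 nodes, and the 40 Gb/s effective cross-rack rate are hypothetical numbers chosen for illustration, and latency and compute overlap are ignored.

```python
def ring_allreduce_seconds(payload_bytes: float, nodes: int, link_gbps: float) -> float:
    """Bandwidth-only estimate for one ring all-reduce.

    In a ring all-reduce each node sends (and receives) 2 * (nodes - 1) / nodes
    times the payload size. Latency and overlap with compute are ignored.
    """
    if nodes < 2:
        return 0.0
    bytes_on_wire = 2 * (nodes - 1) / nodes * payload_bytes
    bytes_per_second = link_gbps * 1e9 / 8  # Gb/s -> bytes/s
    return bytes_on_wire / bytes_per_second


payload = 10e9  # hypothetical: 10 GB of gradients per step

# Hypothetical link rates: 400 Gb/s inside a rack, 40 Gb/s effective
# across an oversubscribed cross-rack fabric.
same_rack = ring_allreduce_seconds(payload, nodes=8, link_gbps=400)
cross_rack = ring_allreduce_seconds(payload, nodes=8, link_gbps=40)

print(f"same rack:  {same_rack:.2f} s per all-reduce")
print(f"cross rack: {cross_rack:.2f} s per all-reduce")
```

Under these assumed numbers the cross-rack step takes ten times as long, and that cost repeats on every synchronization step.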
How It Works
1. Label Your Nodes
Topology-aware scheduling relies on Kubernetes node labels that describe physical location. These can be:
- Manual: Apply labels with kubectl
- Automatic (cloud): Cloud providers often set zone/region labels automatically
- Discovery tools: NVIDIA Topograph can detect topology from hardware, optionally integrating with NVIDIA NetQ for on-prem environments
# Manual labeling example
kubectl label node gpu-node-01 \
  topology.kubernetes.io/region=eu-west \
  topology.kubernetes.io/zone=eu-west-1a \
  cloud.provider.com/topology-block=block-3 \
  cloud.provider.com/topology-rack=rack-42

2. Define Network Topology in Run:ai
Create a topology that maps label keys from farthest to closest:
{
  "levels": [
    "topology.kubernetes.io/region",
    "topology.kubernetes.io/zone",
    "cloud.provider.com/topology-block",
    "cloud.provider.com/topology-rack",
    "kubernetes.io/hostname"
  ],
  "name": "default-topology",
  "clusterId": "<CLUSTER_ID>"
}

Order matters: First label = farthest (region), last label = closest (hostname).
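Because the order of the levels is easy to get wrong, a quick sanity check against your node labels can help before creating the topology. The sketch below is an assumption-laden illustration: the helper name misordered_levels and the sample node data are hypothetical, and the check assumes each closer-level value (for example a rack name) belongs to exactly one farther-level value, which does not hold if rack names are reused across blocks.

```python
from collections import defaultdict

REGION = "topology.kubernetes.io/region"
ZONE = "topology.kubernetes.io/zone"
BLOCK = "cloud.provider.com/topology-block"
RACK = "cloud.provider.com/topology-rack"
HOST = "kubernetes.io/hostname"


def misordered_levels(levels, nodes):
    """Return (farther, closer) pairs that look out of order.

    A pair is flagged when one value of the closer level shows up under
    more than one value of the farther level.
    """
    problems = []
    for farther, closer in zip(levels, levels[1:]):
        parents = defaultdict(set)
        for labels in nodes:
            if farther in labels and closer in labels:
                parents[labels[closer]].add(labels[farther])
        if any(len(p) > 1 for p in parents.values()):
            problems.append((farther, closer))
    return problems


# Hypothetical node labels, e.g. collected from the cluster beforehand.
nodes = [
    {REGION: "eu-west", ZONE: "eu-west-1a", BLOCK: "block-3", RACK: "rack-42", HOST: "gpu-node-01"},
    {REGION: "eu-west", ZONE: "eu-west-1a", BLOCK: "block-3", RACK: "rack-42", HOST: "gpu-node-02"},
    {REGION: "eu-west", ZONE: "eu-west-1a", BLOCK: "block-3", RACK: "rack-43", HOST: "gpu-node-03"},
]

good = [REGION, ZONE, BLOCK, RACK, HOST]      # farthest -> closest
swapped = [REGION, ZONE, RACK, BLOCK, HOST]   # rack and block reversed

print(misordered_levels(good, nodes))
print(misordered_levels(swapped, nodes))
```

With the sample data, the correctly ordered list produces no findings, while the swapped list flags the rack/block pair because block-3 appears under two different racks.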
3. Attach Topology to Node Pools
- Single node pool: Attach topology to the default pool
- Multiple pools with same topology: Link the same topology to all pools
- Different hardware, different topology: Each pool gets its own topology definition
4. Automatic Scheduling
Once configured, distributed workloads automatically get topology-aware placement. No annotations needed. Run:ai applies a Preferred constraint at the lowest topology level and escalates upward if placement is not possible:
Try: Same hostname → Same rack → Same block → Same zone → Same region

Fine-Tuning Placement per Workload
Override the automatic behavior with annotations:
Preferred (Soft Constraint)
Best-effort co-location. Scheduler tries the specified level but relaxes if needed:
metadata:
  annotations:
    kai.scheduler/topology: "default-topology"
    kai.scheduler/topology-preferred-placement: "cloud.provider.com/topology-rack"

Required (Hard Constraint)
Strict placement. All pods must be in the same topology level or the workload waits:
metadata:
  annotations:
    kai.scheduler/topology: "default-topology"
    kai.scheduler/topology-required-placement: "topology.kubernetes.io/zone"

Combined (Required + Preferred)
Enforce a broad constraint while preferring a tighter one:
metadata:
  annotations:
    kai.scheduler/topology: "default-topology"
    kai.scheduler/topology-required-placement: "topology.kubernetes.io/zone"
    kai.scheduler/topology-preferred-placement: "cloud.provider.com/topology-rack"

This means: all pods must be in the same zone, and the scheduler tries to put them in the same rack within that zone.
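One way to picture how Required and Preferred combine is a search that starts at the preferred level and widens until it reaches the required level. The following is only a sketch under assumptions: the function pick_domain, the node dictionaries, and the first-fit choice are hypothetical, and the scheduler's real scoring logic is not described here.

```python
from collections import Counter

ZONE = "topology.kubernetes.io/zone"
RACK = "cloud.provider.com/topology-rack"
LEVELS = [ZONE, RACK]  # farthest -> closest, shortened for the example


def pick_domain(free_nodes, levels, pods, preferred, required):
    """Return (level, domain) for the tightest domain that fits all pods.

    Starts at the preferred (closer) level and widens one level at a time,
    never going beyond the required (farther) level. Returns None when
    nothing fits, which corresponds to the workload waiting.
    """
    closest, farthest = levels.index(preferred), levels.index(required)
    for i in range(closest, farthest - 1, -1):
        # Identify a domain by its full label path so that reused rack
        # names in different zones do not collide.
        counts = Counter(tuple(n[k] for k in levels[: i + 1]) for n in free_nodes)
        for domain, count in counts.items():
            if count >= pods:
                return levels[i], domain
    return None


# Hypothetical free nodes: two in each of two racks in the same zone.
free_nodes = [
    {ZONE: "eu-west-1a", RACK: "rack-42"},
    {ZONE: "eu-west-1a", RACK: "rack-42"},
    {ZONE: "eu-west-1a", RACK: "rack-43"},
    {ZONE: "eu-west-1a", RACK: "rack-43"},
]

two_pods = pick_domain(free_nodes, LEVELS, 2, preferred=RACK, required=ZONE)
four_pods = pick_domain(free_nodes, LEVELS, 4, preferred=RACK, required=ZONE)
five_pods = pick_domain(free_nodes, LEVELS, 5, preferred=RACK, required=ZONE)
```

In this sample, two pods fit in a single rack, four pods only fit once the search widens to the zone, and five pods fit nowhere inside the required zone, so the function returns None.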
Important: A Preferred constraint at the same level or higher than Required has no effect. Preferred only works at a lower (more specific) level.
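The rule above can be checked before submitting a workload. This is a minimal sketch that mirrors the rule as stated, assuming the five-level topology defined earlier; the helper name effective_constraints is hypothetical and is not part of Run:ai or the KAI scheduler.

```python
LEVELS = [  # farthest -> closest, as in the topology definition
    "topology.kubernetes.io/region",
    "topology.kubernetes.io/zone",
    "cloud.provider.com/topology-block",
    "cloud.provider.com/topology-rack",
    "kubernetes.io/hostname",
]
ZONE, RACK = LEVELS[1], LEVELS[3]


def effective_constraints(levels, required=None, preferred=None):
    """Return (required, preferred), dropping a Preferred level that has no effect.

    Preferred only matters when it is strictly closer (later in `levels`)
    than Required.
    """
    for key in (required, preferred):
        if key is not None and key not in levels:
            raise ValueError(f"{key!r} is not a level of this topology")
    if required is not None and preferred is not None:
        if levels.index(preferred) <= levels.index(required):
            preferred = None  # same level or broader: ignored
    return required, preferred


print(effective_constraints(LEVELS, required=ZONE, preferred=RACK))
print(effective_constraints(LEVELS, required=RACK, preferred=ZONE))
```

The first call keeps both constraints; the second drops the Preferred level because the zone is broader than the required rack.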
LeaderWorkerSet Behavior
For LeaderWorkerSet workloads (used by NIM multi-node, Dynamo), topology-aware scheduling is applied per replica:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  annotations:
    kai.scheduler/topology: "cluster-topology"
    kai.scheduler/topology-required-placement: "cloud.provider.com/topology-block"
    kai.scheduler/topology-preferred-placement: "cloud.provider.com/topology-rack"
  labels:
    runai/queue: production
  namespace: runai-production
  name: llm-inference
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 2

Each replica (leader + workers) is:
- Gang scheduled: all pods in a replica launch together
- Topology constrained: leader and worker pods land on the same rack/block
With 2 replicas of size 2, you get 4 pods total. Each leader-worker pair is co-located, but the two replicas may land on different racks.
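The per-replica grouping can be sketched as follows. This is illustrative only: the pod names and the rack assignment are hypothetical placeholders, and the helpers replica_groups and is_colocated are not part of any Run:ai or LeaderWorkerSet API.

```python
def replica_groups(name, replicas, size):
    """List the pods of each replica: one leader plus (size - 1) workers."""
    groups = []
    for r in range(replicas):
        leader = f"{name}-{r}"                                   # illustrative name
        workers = [f"{name}-{r}-{w}" for w in range(1, size)]    # illustrative names
        groups.append([leader, *workers])
    return groups


def is_colocated(group, pod_to_rack):
    """True when every pod of one replica sits in the same rack."""
    return len({pod_to_rack[pod] for pod in group}) == 1


groups = replica_groups("llm-inference", replicas=2, size=2)

# Hypothetical placement: each replica is in one rack, the replicas differ.
pod_to_rack = {
    "llm-inference-0": "rack-42",
    "llm-inference-0-1": "rack-42",
    "llm-inference-1": "rack-43",
    "llm-inference-1-1": "rack-43",
}

total_pods = sum(len(g) for g in groups)
all_groups_ok = all(is_colocated(g, pod_to_rack) for g in groups)
```

Here total_pods is 4, and the placement satisfies the per-replica constraint even though the two replicas sit in different racks.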
Pod Affinity vs Topology-Aware Scheduling
Standard Kubernetes pod affinity has a fundamental flaw for distributed workloads: it places pods one by one, checking closeness to already-placed pods without seeing the full picture.
Example: Two racks, some nodes already occupied. A workload needs 6 nodes.
| Approach | Result |
|---|---|
| Pod Affinity | Starts placing in Rack A, runs out of space, splits to Rack B. Cross-rack communication. |
| Topology-Aware | Evaluates the full 6-node requirement upfront, finds a rack with enough free nodes, and places all 6 together. |
Topology-aware scheduling evaluates the entire workload against available resources before making any placement decision. Pod affinity is greedy and local.
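The difference can be reproduced with a toy model. Both functions below are deliberate simplifications written for this comparison, not the actual Kubernetes or Run:ai algorithms, and the free-node counts are hypothetical.

```python
def greedy_place(free, pods, start):
    """Pod-by-pod placement: fill the starting rack, then spill to the others."""
    order = [start] + [rack for rack in free if rack != start]
    remaining = dict(free)
    placement = []
    for _ in range(pods):
        rack = next((r for r in order if remaining[r] > 0), None)
        if rack is None:
            return None  # cluster is full
        remaining[rack] -= 1
        placement.append(rack)
    return placement


def gang_place(free, pods):
    """Whole-workload placement: choose one rack that fits every pod."""
    fits = [rack for rack, count in free.items() if count >= pods]
    if not fits:
        return None  # nothing fits in a single rack; the workload would wait
    best = min(fits, key=lambda rack: free[rack])  # tightest fit
    return [best] * pods


free = {"rack-a": 4, "rack-b": 6}  # hypothetical free nodes per rack

greedy = greedy_place(free, pods=6, start="rack-a")
gang = gang_place(free, pods=6)

print("pod-by-pod:    ", greedy)
print("whole workload:", gang)
```

The pod-by-pod run ends up with four pods in rack-a and two in rack-b, while the whole-workload run places all six in rack-b.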
Practical Recommendations
For Distributed Training
# Strict same-rack, all GPUs need high-bandwidth all-reduce
annotations:
  kai.scheduler/topology: "cluster-topology"
  kai.scheduler/topology-required-placement: "cloud.provider.com/topology-rack"

For Disaggregated Inference (Dynamo)
# Same zone required, same rack preferred
# Prefill and decode need fast NIXL transfers but can tolerate rack-level latency
annotations:
  kai.scheduler/topology: "cluster-topology"
  kai.scheduler/topology-required-placement: "topology.kubernetes.io/zone"
  kai.scheduler/topology-preferred-placement: "cloud.provider.com/topology-rack"

For GB200 NVL72
For multi-node NVLink domains, see the GB200 and Multi-Node NVLink guide. The topology must reflect NVLink domain boundaries.
Known Limitations
- Detaching a topology from a node pool does not affect running workloads (they continue using it)
- Deleting a topology while workloads reference it: running workloads continue, but suspended/unbound workloads become unschedulable (Pending)
- Submitting a workload to multiple node pools with different topologies is not supported; workloads will fail or remain pending
Related Resources
- Run:ai + Dynamo Integration
- NVIDIA Dynamo Framework
- Run:ai Distributed Inference Platform
- NIM Multi-Node Deployment on K8s
- NVIDIA GPU Operator on Kubernetes
- Multi-Tenant GPUs on Bare Metal
- NVIDIA Topograph (GitHub)
- Official Docs: Topology-Aware Scheduling
About the Author
I am Luca Berton, AI and Cloud Advisor. I design GPU cluster topologies and scheduling strategies for distributed AI workloads.