NVIDIA Run:ai Distributed Inference Platform Guide 2026

Run:ai Distributed Inference: Large Model Serving Guide

NVIDIA Run:ai orchestrates distributed inference across GPU nodes with topology-aware scheduling, dynamic autoscaling, NIM support, and Dynamo pipelines.

Luca Berton · 5 min read

Serving a 671B-parameter model is not just a GPU problem. It is a scheduling problem, a networking problem, an autoscaling problem, and an observability problem, all at once.

NVIDIA Run:ai is a GPU orchestration platform that sits on top of Kubernetes and solves all of these for inference workloads. It handles single-node serverless serving through Knative and multi-node distributed inference through Leader-Worker Sets, with topology-aware scheduling that understands NVLink domains, InfiniBand fabrics, and GPU memory hierarchies.

What Run:ai Does for Inference

Run:ai adds an orchestration layer between your Kubernetes cluster and your inference workloads:

Clients
   │
   ▼
Load Balancer → NGINX Ingress → TLS Termination
   │
   ├─ Single-Node: → Kourier → Knative Queue Proxy → Model Pod
   │
   └─ Multi-Node:  → Leader Pod ──→ Worker Pod 0
                         │          → Worker Pod 1
                         │          → Worker Pod N
                         └─ Aggregates results

The platform supports three deployment patterns:

  • NVIDIA NIM: optimized inference microservices from the NGC catalog
  • Hugging Face models: direct deployment from the Hugging Face Hub
  • Custom containers: any inference runtime (vLLM, TGI, Triton, etc.)

Single-Node: Knative Serverless

For models that fit on one node (under ~640 GB of aggregate GPU memory on 8×H100), Run:ai deploys through Knative Serving:

# Run:ai creates this automatically when you submit a single-node inference workload
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llama-70b-inference
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "10"
    spec:
      containers:
        - image: nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
          resources:
            limits:
              nvidia.com/gpu: 8

Knative provides:

  • Scale to zero: no GPU cost when idle
  • Request queuing: maintains SLOs under load
  • Concurrency management: controls requests per replica
  • Revision tracking: rollback to previous model versions
  • Traffic splitting: canary deployments between model versions

The request flow passes through Kourier (Knative's ingress) to a Queue Proxy sidecar that manages concurrency before hitting the model container. This is what makes autoscaling responsive: the queue proxy reports real-time concurrency metrics that drive scaling decisions.
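Whether a model qualifies for the single-node path is mostly a memory estimate. The sketch below is a rough back-of-the-envelope check, not anything Run:ai computes for you: it assumes weights dominate memory use and adds a hypothetical 20% headroom for KV cache and activations. The function names and the headroom factor are my own assumptions.

```python
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


def fits_single_node(params_billions: float, bytes_per_param: float,
                     gpus: int = 8, gb_per_gpu: float = 80.0,
                     headroom: float = 1.2) -> bool:
    """True if weights plus an assumed headroom fit in aggregate GPU memory."""
    needed = weights_gb(params_billions, bytes_per_param) * headroom
    return needed <= gpus * gb_per_gpu


# 70B parameters at 2 bytes each (FP16/BF16) vs 671B at 1 byte each (FP8)
print(fits_single_node(70, 2))
print(fits_single_node(671, 1))
```

A 70B model in 16-bit precision needs about 140 GB for weights and fits comfortably. A 671B model exceeds 640 GB even at one byte per parameter, which is what pushes it onto the multi-node path.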

Multi-Node: Leader-Worker Sets

When a model exceeds single-node capacity, Run:ai deploys using Leader-Worker Sets (LWS), a Kubernetes-native abstraction where one leader pod coordinates multiple worker pods:

Leader Pod (Node 0)
├── Receives client requests
├── Validates authorization
├── Holds model layers 0-39 (TP=8 across 8 GPUs)
├── Coordinates with workers via NCCL
└── Aggregates and returns results

Worker Pod (Node 1)
├── Holds model layers 40-79 (TP=8 across 8 GPUs)
├── Participates in distributed computation
└── No external network exposure
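The layer split in the diagram (0-39 on the leader, 40-79 on the worker) is what an even pipeline-parallel partition of an 80-layer model across two stages looks like. Here is a minimal sketch of that arithmetic. The inference runtime does this internally, so `partition_layers` is a hypothetical helper for illustration only.

```python
def partition_layers(num_layers: int, stages: int) -> list[range]:
    """Split layer indices into contiguous, near-equal pipeline stages."""
    base, extra = divmod(num_layers, stages)
    parts, start = [], 0
    for stage in range(stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts


for rank, layers in enumerate(partition_layers(80, 2)):
    print(f"stage {rank}: layers {layers.start}-{layers.stop - 1}")
```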

Key differences from single-node:

  • No Knative: LWS replaces Knative for multi-node workloads
  • No scale to zero: multi-node models stay loaded (startup cost is too high)
  • Gang scheduling: all pods scheduled together or not at all
  • Direct ingress: requests go straight to the leader pod, no queue proxy

Submitting a Multi-Node Inference Workload

Via the Run:ai CLI:

runai inference submit deepseek-r1 \
  --image nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3 \
  --gpu 8 \
  --num-nodes 2 \
  --distributed \
  --env NIM_TENSOR_PARALLEL_SIZE=8 \
  --env NIM_PIPELINE_PARALLEL_SIZE=2 \
  --large-shm

Or via YAML:

apiVersion: run.ai/v2alpha1
kind: InferenceWorkload
metadata:
  name: deepseek-r1
  namespace: runai-production
spec:
  image:
    value: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
  compute:
    gpuDevicesRequest: 8
  distributed:
    numNodes: 2
  environment:
    items:
      NIM_TENSOR_PARALLEL_SIZE:
        value: "8"
      NIM_PIPELINE_PARALLEL_SIZE:
        value: "2"
  storage:
    largeShm: true

Run:ai translates this into a LeaderWorkerSet with the correct NCCL configuration, shared memory allocation, and network topology placement.
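Before submitting, it is worth checking that the parallelism settings match the GPUs you request. The sketch below assumes the common layout where one replica uses exactly TP × PP GPUs, and where tensor parallelism stays inside a node (that second check would not apply on multi-node NVLink systems). `check_parallelism` is a hypothetical helper, not part of the Run:ai CLI.

```python
def check_parallelism(gpus_per_node: int, num_nodes: int, tp: int, pp: int) -> None:
    """Raise ValueError if the TP/PP layout does not match the GPU allocation."""
    total = gpus_per_node * num_nodes
    if tp * pp != total:
        raise ValueError(f"TP*PP = {tp * pp}, but the workload requests {total} GPUs")
    # Assumption: no multi-node NVLink, so a TP group should not span nodes.
    if tp > gpus_per_node:
        raise ValueError("tensor-parallel group would span nodes (TP > GPUs per node)")


# Values from the example above: 8 GPUs per node, 2 nodes, TP=8, PP=2.
check_parallelism(gpus_per_node=8, num_nodes=2, tp=8, pp=2)
print("layout is consistent")
```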

Topology-Aware Scheduling

This is where Run:ai earns its keep. GPU clusters are not flat; they have hierarchies:

Cluster
├── Rack 0 (InfiniBand switch)
│   ├── Node 0 (8× H100, NVLink domain)
│   └── Node 1 (8× H100, NVLink domain)
├── Rack 1 (InfiniBand switch)
│   ├── Node 2 (8× H100, NVLink domain)
│   └── Node 3 (8× H100, NVLink domain)
└── Spine switch (connects racks)

Intra-node (NVLink): 900 GB/s
Intra-rack (InfiniBand): 50 GB/s
Cross-rack (InfiniBand via spine): 25-50 GB/s with added latency
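To put those bandwidth figures in perspective, here is a small sketch that converts them into ideal transfer times. It ignores latency and protocol overhead, so real numbers will be worse. The dictionary keys and the `transfer_ms` helper are my own names.

```python
BANDWIDTH_GB_S = {
    "nvlink_intra_node": 900.0,
    "infiniband_intra_rack": 50.0,
    "infiniband_cross_rack": 25.0,  # lower bound of the 25-50 GB/s range
}


def transfer_ms(payload_gb: float, link: str) -> float:
    """Ideal time in milliseconds to move payload_gb over a link."""
    return payload_gb / BANDWIDTH_GB_S[link] * 1000.0


for link in BANDWIDTH_GB_S:
    print(f"{link}: {transfer_ms(1.0, link):.2f} ms per GB")
```

Moving 1 GB takes roughly 1.1 ms over NVLink versus 20 ms over intra-rack InfiniBand, which is why the chattiest traffic (tensor parallelism) belongs inside the NVLink domain.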

Run:ai's topology-aware scheduler:

  1. Detects NVLink domains: keeps tensor parallelism within NVLink-connected GPUs
  2. Detects InfiniBand fabric: places pipeline-parallel stages on nodes sharing the same IB switch
  3. Handles MNNVL: for Multi-Node NVLink systems (like NVIDIA GB200 NVL72), schedules across the full NVLink domain spanning multiple nodes
  4. Avoids cross-rack placement when possible, minimizing communication overhead

Without topology awareness, a 2-node job could land on nodes in different racks, so every inter-node transfer crosses the spine switch at lower bandwidth and higher latency. Run:ai avoids this automatically whenever capacity allows.
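The core idea of rack-aware placement can be shown in a few lines. This is a toy illustration under my own assumptions (a flat node-to-rack map and a hypothetical `pick_nodes` helper), not Run:ai's actual scheduling algorithm, which also considers NVLink domains, quotas, and priorities.

```python
from collections import defaultdict


def pick_nodes(free_nodes: dict[str, str], count: int) -> list[str]:
    """Pick `count` free nodes, preferring a single rack.

    free_nodes maps node name -> rack name. Falls back to spanning racks
    only when no single rack has enough free nodes.
    """
    by_rack = defaultdict(list)
    for node, rack in sorted(free_nodes.items()):
        by_rack[rack].append(node)
    for rack in sorted(by_rack):
        if len(by_rack[rack]) >= count:
            return by_rack[rack][:count]
    spread = [n for rack in sorted(by_rack) for n in by_rack[rack]]
    if len(spread) < count:
        raise RuntimeError("not enough free nodes; a gang-scheduled job would stay pending")
    return spread[:count]


free = {"node-0": "rack-0", "node-2": "rack-1", "node-3": "rack-1"}
print(pick_nodes(free, 2))
```

With one free node in rack 0 and two in rack 1, the sketch places both pods in rack 1 rather than splitting the job across the spine.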

NVIDIA Dynamo Integration

Run:ai supports NVIDIA Dynamo, a distributed inference framework that decomposes models into graph-based pipelines:

Traditional: Monolithic model on N GPUs
Dynamo:      Disaggregated pipeline stages

┌──────────────┐    ┌───────────────┐    ┌──────────────┐
│  Prefill     │───▶│   Decode      │───▶│  Post-process│
│  (GPU-heavy) │    │  (memory-     │    │  (CPU/light) │
│              │    │   bound)      │    │              │
└──────────────┘    └───────────────┘    └──────────────┘

Dynamo workloads are deployed as DynamoGraphDeployments: the Dynamo Operator manages the pipeline graph, and Run:ai handles scheduling each stage onto the right hardware.

Benefits over monolithic deployment:

  • Disaggregated prefill/decode: prefill on high-compute GPUs, decode on memory-optimized GPUs
  • Independent scaling: scale decode nodes without adding more prefill capacity
  • Pipeline efficiency: overlap computation across stages
  • Mixed hardware: use different GPU types for different pipeline stages
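Independent scaling is easiest to see with numbers. The sketch below sizes the prefill and decode pools against their own load. The per-worker capacities and traffic figures are invented for illustration, and `pool_sizes` is a hypothetical helper, not a Dynamo or Run:ai API.

```python
import math


def pool_sizes(prompt_tokens_per_s: float, output_tokens_per_s: float,
               prefill_capacity: float, decode_capacity: float) -> dict[str, int]:
    """Workers needed per stage when each stage is sized against its own load."""
    return {
        "prefill": max(1, math.ceil(prompt_tokens_per_s / prefill_capacity)),
        "decode": max(1, math.ceil(output_tokens_per_s / decode_capacity)),
    }


# Hypothetical load: output volume triples while prompt volume stays flat.
before = pool_sizes(40_000, 6_000, prefill_capacity=20_000, decode_capacity=2_000)
after = pool_sizes(40_000, 18_000, prefill_capacity=20_000, decode_capacity=2_000)
print(before)
print(after)
```

In this made-up scenario only the decode pool grows. A monolithic deployment would have to add whole replicas, paying for prefill capacity it does not need.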

Dynamic Autoscaling

Run:ai autoscaling works differently for single-node and multi-node:

Single-Node (Knative-based)

# Autoscaling configuration
autoscaling:
  minReplicas: 0         # Scale to zero when idle
  maxReplicas: 16        # Maximum replicas
  metric: "concurrency"  # or "rps", "latency", "custom"
  target: 10             # Target concurrent requests per replica
  scaleDownDelay: 300    # 5 minutes before scale-down

Metrics that trigger scaling:

  • Concurrency: active requests per replica
  • Requests per second: throughput-based
  • Latency: p95 response time exceeds threshold
  • Custom Prometheus metrics: any metric you expose
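For the concurrency metric, the scaling decision boils down to dividing in-flight requests by the per-replica target and clamping the result. The sketch below uses the values from the configuration above. It is a simplification: the real Knative autoscaler averages over time windows, and the scale-down delay is not modelled here. `desired_replicas` is my own name.

```python
import math


def desired_replicas(in_flight: int, target: int = 10,
                     min_replicas: int = 0, max_replicas: int = 16) -> int:
    """Replicas needed to keep per-replica concurrency at or below the target."""
    wanted = math.ceil(in_flight / target)
    return max(min_replicas, min(max_replicas, wanted))


for load in (0, 7, 45, 500):
    print(load, "->", desired_replicas(load))
```

With no traffic the result is zero replicas (scale to zero), and a burst of 500 concurrent requests is capped at the configured maximum of 16.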

Multi-Node (Replica-based)

Multi-node workloads scale by adding complete replica groups (leader + workers together):

autoscaling:
  minReplicas: 1         # Always keep at least 1 replica group
  maxReplicas: 4         # Up to 4 complete 2-node replica groups
  metric: "custom"
  metricName: "inference_queue_depth"
  target: 20

Each replica is an independent multi-node deployment serving the full model. Scaling adds 16 GPUs (2 nodes) per replica.
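Because replicas come in 16-GPU steps, the autoscaling range translates directly into a GPU quota you need to reserve. A quick sketch of that arithmetic, using the 2-node, 8-GPU layout from this article (`gpu_footprint` is a hypothetical helper):

```python
def gpu_footprint(replicas: int, nodes_per_replica: int = 2, gpus_per_node: int = 8) -> int:
    """Total GPUs held by a multi-node workload at a given replica count."""
    return replicas * nodes_per_replica * gpus_per_node


for replicas in range(1, 5):
    print(f"{replicas} replica group(s): {gpu_footprint(replicas)} GPUs")
```

At maxReplicas: 4 the workload can claim 64 GPUs, so the project quota has to allow for that peak or the extra replica groups will stay pending.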

Observability

Run:ai provides layered metrics for inference workloads:

Infrastructure Metrics (All Workloads)

  • GPU utilization per device
  • GPU memory utilization
  • CPU and system memory usage
  • Network throughput (inter-node NCCL traffic)

Inference Metrics (All Inference Workloads)

  • Request throughput (requests/second)
  • Request latency (p50, p95, p99)
  • Active replica count
  • Queue depth and wait time

NIM-Specific Metrics (NIM Workloads Only)

  • Time to First Token (TTFT): latency until first token generated
  • Inter-Token Latency (ITL): time between consecutive tokens
  • Token throughput: tokens/second generated
  • KV-cache utilization: GPU memory used for attention cache
  • Request concurrency: active concurrent requests
  • Prompt/completion token counts: input vs output token ratios

These metrics feed into Run:ai's built-in dashboards and can be exported to Prometheus/Grafana for custom visualization.
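If you want to sanity-check the dashboard numbers from the client side, TTFT and ITL are easy to derive from the arrival times of streamed tokens. This is a minimal sketch under my own definitions. NIM computes its metrics server-side and may define them slightly differently, and `token_latency` is a hypothetical helper.

```python
from statistics import mean


def token_latency(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, mean ITL and tokens/second from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    # Gaps between consecutive tokens; empty when only one token arrived.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_sent
    return {
        "ttft_s": ttft,
        "mean_itl_s": mean(gaps) if gaps else 0.0,
        "tokens_per_s": len(token_times) / total if total > 0 else 0.0,
    }


# Request sent at t=0, four tokens arriving at the given timestamps.
print(token_latency(0.0, [0.5, 0.75, 1.0, 1.25]))
```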

Access Control

Run:ai enforces authentication on inference endpoints:

Client → Request with Bearer Token → Load Balancer → Ingress
  → Authorization check (Run:ai control plane)
  → Model pod (if authorized)

Access can be scoped to:

  • Public: no authentication required
  • Authenticated users: any valid Run:ai user
  • Specific groups: department or team-level access
  • Service accounts: machine-to-machine with API keys
  • User-specific: individual user restrictions

This is critical for multi-tenant GPU clusters where different teams serve different models on shared infrastructure.
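On the client side, the flow above just means attaching the token as a standard HTTP Authorization header. A minimal sketch using only the Python standard library follows. The endpoint URL, token value, and JSON payload shape are placeholders I made up, so substitute whatever your workload actually exposes.

```python
import json
import urllib.request


def build_request(endpoint: str, token: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a bearer token."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical endpoint and token; replace with the values for your workload.
req = build_request("https://inference.example.com/v1/completions", "TOKEN", "Hello")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; a rejected token typically
# surfaces as an HTTP 401 or 403 error.
```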

Rolling Updates

Update inference workloads without downtime:

# Update model version
runai inference update deepseek-r1 \
  --image nvcr.io/nim/deepseek-ai/deepseek-r1:1.8.0

# Update compute resources
runai inference update deepseek-r1 \
  --gpu 8

# Update scaling policy
runai inference update deepseek-r1 \
  --max-replicas 8

For single-node (Knative), this creates a new revision with gradual traffic shift. For multi-node (LWS), it performs a rolling replacement of replica groups.

Model Catalog Integration

Run:ai integrates with two model sources:

NVIDIA NGC Catalog

  • Dynamic model list updated from NGC
  • Preconfigured NIM containers with optimized serving configs
  • One-click deployment for supported models

Hugging Face Hub

  • Browse and search the full HF catalog from Run:ai UI
  • Gated model support with HF token authentication
  • Automatic download and caching of model weights

# Deploy from Hugging Face
runai inference submit llama-8b \
  --image ghcr.io/huggingface/text-generation-inference:latest \
  --gpu 1 \
  --env MODEL_ID=meta-llama/Llama-3.1-8B-Instruct \
  --env HUGGING_FACE_HUB_TOKEN=hf_xxx

When to Use Run:ai vs DIY Kubernetes

Capability                  DIY Kubernetes                   Run:ai
Basic inference serving     ✅ Manual setup                  ✅ Automated
Topology-aware scheduling   ❌ Not built-in                  ✅ Automatic
Multi-node orchestration    ⚠️ Manual LWS config             ✅ Declarative
GPU sharing/fractional      ⚠️ MIG only                      ✅ Dynamic fractions
Autoscaling to zero         ⚠️ Knative setup                 ✅ Built-in
Multi-tenant quotas         ❌ Basic ResourceQuota           ✅ Hierarchical quotas
NIM-specific metrics        ❌ Manual Prometheus             ✅ Built-in dashboards
Gang scheduling             ❌ Needs Volcano/Coscheduling    ✅ Native
MNNVL support               ❌ Manual                        ✅ Automatic

Run:ai makes sense when you have:

  • Multiple teams sharing GPU clusters
  • A mix of training and inference workloads competing for GPUs
  • Large models requiring multi-node deployment with topology awareness
  • A need for enterprise access control and audit logging

DIY Kubernetes is fine for:

  • Single-team GPU clusters
  • Simple single-node inference
  • Cost-sensitive environments (Run:ai has licensing costs)

About the Author

I am Luca Berton, AI and Cloud Advisor. I design GPU inference platforms for enterprises serving large language models at scale. Book a consultation to discuss your inference architecture.
