Serving a 671B-parameter model is not just a GPU problem. It is a scheduling problem, a networking problem, an autoscaling problem, and an observability problem, all at once.
NVIDIA Run:ai is a GPU orchestration platform that sits on top of Kubernetes and solves all of these for inference workloads. It handles single-node serverless serving through Knative and multi-node distributed inference through Leader-Worker Sets, with topology-aware scheduling that understands NVLink domains, InfiniBand fabrics, and GPU memory hierarchies.
What Run:ai Does for Inference
Run:ai adds an orchestration layer between your Kubernetes cluster and your inference workloads:
Clients
  │
  ▼
Load Balancer → NGINX Ingress → TLS Termination
  │
  ├─ Single-Node: → Kourier → Knative Queue Proxy → Model Pod
  │
  └─ Multi-Node:  → Leader Pod ──→ Worker Pod 0
                       │         → Worker Pod 1
                       │         → Worker Pod N
                       └─ Aggregates results

The platform supports three deployment patterns:
- NVIDIA NIM: optimized inference microservices from the NGC catalog
- Hugging Face models: direct deployment from the Hugging Face Hub
- Custom containers: any inference runtime (vLLM, TGI, Triton, etc.)
Single-Node: Knative Serverless
For models that fit on one node (under ~640 GB for 8×H100), Run:ai deploys through Knative Serving:
# Run:ai creates this automatically when you submit a single-node inference workload
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llama-70b-inference
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "10"
    spec:
      containers:
        - image: nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
          resources:
            limits:
              nvidia.com/gpu: 8

Knative provides:
- Scale to zero: no GPU cost when idle
- Request queuing: maintains SLOs under load
- Concurrency management: controls requests per replica
- Revision tracking: rollback to previous model versions
- Traffic splitting: canary deployments between model versions
The request flow passes through Kourier (Knative's ingress) to a Queue Proxy sidecar that manages concurrency before hitting the model container. This is what makes autoscaling responsive: the queue proxy reports real-time concurrency metrics that drive scaling decisions.
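To make the queue proxy's role concrete, here is a toy model of a per-replica concurrency cap. It is a sketch, not Knative's implementation: `ConcurrencyGate` and its method names are hypothetical, and the real queue proxy works on live HTTP requests rather than string IDs. The point is only that requests beyond the cap wait in a queue, and the in-flight and queued counts are the kind of signal an autoscaler can act on.

```python
from __future__ import annotations

from collections import deque


class ConcurrencyGate:
    """Toy per-replica concurrency limiter (hypothetical, not Knative code)."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.in_flight: set[str] = set()
        self.waiting: deque[str] = deque()

    def arrive(self, request_id: str) -> bool:
        """Admit the request if a slot is free, otherwise queue it.

        Returns True when the request was admitted immediately.
        """
        if len(self.in_flight) < self.limit:
            self.in_flight.add(request_id)
            return True
        self.waiting.append(request_id)
        return False

    def finish(self, request_id: str) -> str | None:
        """Free a slot and promote the oldest queued request, if any."""
        self.in_flight.remove(request_id)
        if self.waiting:
            promoted = self.waiting.popleft()
            self.in_flight.add(promoted)
            return promoted
        return None

    def metrics(self) -> dict[str, int]:
        """Counts an autoscaler could read: active and queued requests."""
        return {"in_flight": len(self.in_flight), "queued": len(self.waiting)}
```

With a limit of 2, a third simultaneous request is queued rather than sent to the model container, and it is promoted as soon as one of the first two finishes.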
Multi-Node: Leader-Worker Sets
When a model exceeds single-node capacity, Run:ai deploys using Leader-Worker Sets (LWS), a Kubernetes-native abstraction where one leader pod coordinates multiple worker pods:
Leader Pod (Node 0)
├── Receives client requests
├── Validates authorization
├── Holds model layers 0-39 (TP=8 across 8 GPUs)
├── Coordinates with workers via NCCL
└── Aggregates and returns results
Worker Pod (Node 1)
├── Holds model layers 40-79 (TP=8 across 8 GPUs)
├── Participates in distributed computation
└── No external network exposure

Key differences from single-node:
- No Knative: LWS replaces Knative for multi-node workloads
- No scale to zero: multi-node models stay loaded (startup cost is too high)
- Gang scheduling: all pods scheduled together or not at all
- Direct ingress: requests go straight to the leader pod, no queue proxy
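Whether a model lands on the Knative path or the LWS path comes down to a memory estimate. The sketch below assumes 80 GB per H100 (which matches the ~640 GB figure for 8×H100) and a hypothetical 10% headroom for KV cache and activations. `nodes_needed` is an illustrative helper, not a Run:ai API, and real sizing also depends on context length and batch size.

```python
import math


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB: 1e9 params times bytes, over 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def nodes_needed(
    params_billion: float,
    bytes_per_param: float,
    gpus_per_node: int = 8,
    gpu_mem_gb: float = 80.0,       # assumption: 80 GB H100
    usable_fraction: float = 0.9,   # assumption: keep 10% for KV cache and activations
) -> int:
    """Minimum node count so that the weights fit in usable GPU memory."""
    node_capacity_gb = gpus_per_node * gpu_mem_gb * usable_fraction
    return math.ceil(weights_gb(params_billion, bytes_per_param) / node_capacity_gb)


if __name__ == "__main__":
    # 70B at 2 bytes per parameter (FP16/BF16) versus 671B at 1 byte (FP8)
    print("70B FP16 nodes:", nodes_needed(70, 2))
    print("671B FP8 nodes:", nodes_needed(671, 1))
```

Under these assumptions a 70B model at 16-bit precision stays on one node, while a 671B model needs at least two even at 8-bit precision.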
Submitting a Multi-Node Inference Workload
Via the Run:ai CLI:
runai inference submit deepseek-r1 \
  --image nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3 \
  --gpu 8 \
  --num-nodes 2 \
  --distributed \
  --env NIM_TENSOR_PARALLEL_SIZE=8 \
  --env NIM_PIPELINE_PARALLEL_SIZE=2 \
  --large-shm

Or via YAML:
apiVersion: run.ai/v2alpha1
kind: InferenceWorkload
metadata:
  name: deepseek-r1
  namespace: runai-production
spec:
  image:
    value: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
  compute:
    gpuDevicesRequest: 8
  distributed:
    numNodes: 2
  environment:
    items:
      NIM_TENSOR_PARALLEL_SIZE:
        value: "8"
      NIM_PIPELINE_PARALLEL_SIZE:
        value: "2"
  storage:
    largeShm: true

Run:ai translates this into a LeaderWorkerSet with the correct NCCL configuration, shared memory allocation, and network topology placement.
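The two NIM variables have to agree with the hardware you request. Assuming no data parallelism inside a replica, tensor-parallel size times pipeline-parallel size should equal GPUs per node times node count. A small pre-submit check might look like this, with `check_parallel_layout` as a hypothetical helper name:

```python
def check_parallel_layout(tp: int, pp: int, gpus_per_node: int, num_nodes: int) -> dict:
    """Validate that TP x PP matches the requested GPUs (assumes no data parallelism)."""
    total_gpus = gpus_per_node * num_nodes
    if tp * pp != total_gpus:
        raise ValueError(
            f"TP({tp}) x PP({pp}) = {tp * pp}, but {num_nodes} node(s) x "
            f"{gpus_per_node} GPU(s) = {total_gpus}"
        )
    return {
        "total_gpus": total_gpus,
        # Tensor parallelism is communication-heavy, so it should stay inside one node.
        "tp_within_node": tp <= gpus_per_node,
    }


if __name__ == "__main__":
    # Values from the deepseek-r1 example: TP=8, PP=2, 8 GPUs per node, 2 nodes
    print(check_parallel_layout(tp=8, pp=2, gpus_per_node=8, num_nodes=2))
```

For the example above, TP=8 and PP=2 account for all 16 GPUs, and each tensor-parallel group fits inside a single node.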
Topology-Aware Scheduling
This is where Run:ai earns its keep. GPU clusters are not flat; they have hierarchies:
Cluster
├── Rack 0 (InfiniBand switch)
│   ├── Node 0 (8× H100, NVLink domain)
│   └── Node 1 (8× H100, NVLink domain)
├── Rack 1 (InfiniBand switch)
│   ├── Node 2 (8× H100, NVLink domain)
│   └── Node 3 (8× H100, NVLink domain)
└── Spine switch (connects racks)

Intra-node (NVLink): 900 GB/s
Intra-rack (InfiniBand): 50 GB/s
Cross-rack (InfiniBand via spine): 25-50 GB/s with added latency
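To get a feel for what these figures mean, the sketch below converts them into ideal transfer times for a given payload. It uses the peak numbers quoted above, takes the low end of the cross-rack range, and ignores latency and protocol overhead, so treat the results as lower bounds. The payload size is a made-up example.

```python
# Peak bandwidths in GB/s, taken from the figures quoted in this article.
LINK_GB_PER_S = {
    "nvlink": 900.0,         # intra-node
    "ib_intra_rack": 50.0,   # same InfiniBand switch
    "ib_cross_rack": 25.0,   # via spine, low end of the 25-50 GB/s range
}


def transfer_ms(payload_gb: float, link: str) -> float:
    """Ideal time in milliseconds to move payload_gb over the named link."""
    return payload_gb / LINK_GB_PER_S[link] * 1000.0


if __name__ == "__main__":
    payload_gb = 0.9  # hypothetical amount of data exchanged in one step
    for link in LINK_GB_PER_S:
        print(f"{link}: {transfer_ms(payload_gb, link):.1f} ms")
```

The ratio matters more than the absolute values: the same payload takes 18 times longer over intra-rack InfiniBand than over NVLink, which is why the heaviest communication should stay inside a node.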
Run:ai's topology-aware scheduler:
- Detects NVLink domains: keeps tensor parallelism within NVLink-connected GPUs
- Detects InfiniBand fabric: places pipeline parallel stages on nodes sharing the same IB switch
- MNNVL awareness: for Multi-Node NVLink systems (like NVIDIA GB200 NVL72), schedules across the full NVLink domain spanning multiple nodes
- Avoids cross-rack placement when possible, minimizing communication overhead
Without topology awareness, a 2-node job could land on nodes in different racks, doubling inter-node latency. Run:ai prevents this automatically.
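Run:ai's scheduler internals are not shown here, so the following is only a sketch of the general idea: prefer a single rack, fall back to as few racks as possible, and refuse partial placement. The `place_gang` function and the node-to-rack map are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def place_gang(free_nodes: dict[str, str], count: int) -> list[str] | None:
    """Pick `count` nodes for one job, or None if the whole gang cannot be placed.

    free_nodes maps a free node's name to its rack name (hypothetical inventory).
    """
    by_rack: dict[str, list[str]] = defaultdict(list)
    for node, rack in sorted(free_nodes.items()):
        by_rack[rack].append(node)

    # Gang scheduling: all pods are placed together or not at all.
    if sum(len(nodes) for nodes in by_rack.values()) < count:
        return None

    # Best fit: the smallest rack that can hold the entire job.
    fitting = [nodes for nodes in by_rack.values() if len(nodes) >= count]
    if fitting:
        return min(fitting, key=len)[:count]

    # Fallback: span racks, drawing from the racks with the most free nodes first.
    chosen: list[str] = []
    for nodes in sorted(by_rack.values(), key=len, reverse=True):
        chosen.extend(nodes[: count - len(chosen)])
        if len(chosen) == count:
            break
    return chosen
```

Given one free node in rack 0 and two in rack 1, a 2-node job lands entirely in rack 1 instead of straddling the spine switch.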
NVIDIA Dynamo Integration
Run:ai supports NVIDIA Dynamo, a distributed inference framework that decomposes models into graph-based pipelines:
Traditional: Monolithic model on N GPUs
Dynamo: Disaggregated pipeline stages
┌─────────────┐    ┌───────────────┐    ┌──────────────┐
│ Prefill     │───▶│ Decode        │───▶│ Post-process │
│ (GPU-heavy) │    │ (memory-      │    │ (CPU/light)  │
│             │    │ bound)        │    │              │
└─────────────┘    └───────────────┘    └──────────────┘

Dynamo workloads are deployed as DynamoGraphDeployments: the Dynamo Operator manages the pipeline graph, and Run:ai handles scheduling each stage onto the right hardware.
Benefits over monolithic deployment:
- Disaggregated prefill/decode: prefill on high-compute GPUs, decode on memory-optimized GPUs
- Independent scaling: scale decode nodes without adding more prefill capacity
- Pipeline efficiency: overlap computation across stages
- Mixed hardware: use different GPU types for different pipeline stages
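Independent scaling is easiest to see with numbers. The sketch below sizes prefill and decode pools separately from request rate and token counts. Every throughput figure in it is a hypothetical placeholder, not a Dynamo benchmark, and `size_pools` is an illustrative helper rather than part of Dynamo or Run:ai.

```python
import math


def size_pools(
    requests_per_s: float,
    avg_prompt_tokens: int,
    avg_output_tokens: int,
    prefill_tokens_per_s_per_worker: float,  # hypothetical capacity
    decode_tokens_per_s_per_worker: float,   # hypothetical capacity
) -> dict:
    """Size prefill and decode worker pools independently of each other."""
    prefill_load = requests_per_s * avg_prompt_tokens   # prompt tokens/s to process
    decode_load = requests_per_s * avg_output_tokens    # output tokens/s to generate
    return {
        "prefill": math.ceil(prefill_load / prefill_tokens_per_s_per_worker),
        "decode": math.ceil(decode_load / decode_tokens_per_s_per_worker),
    }


if __name__ == "__main__":
    baseline = size_pools(10, 2000, 500, 50_000, 2_000)
    longer_answers = size_pools(10, 2000, 1000, 50_000, 2_000)
    print(baseline, longer_answers)
```

When average output length doubles, only the decode pool grows. A monolithic deployment would have to add whole replicas, prefill capacity included.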
Dynamic Autoscaling
Run:ai autoscaling works differently for single-node and multi-node:
Single-Node (Knative-based)
# Autoscaling configuration
autoscaling:
  minReplicas: 0          # Scale to zero when idle
  maxReplicas: 16         # Maximum replicas
  metric: "concurrency"   # or "rps", "latency", "custom"
  target: 10              # Target concurrent requests per replica
  scaleDownDelay: 300     # 5 minutes before scale-down

Metrics that trigger scaling:
- Concurrency: active requests per replica
- Requests per second: throughput-based
- Latency: scales out when p95 response time exceeds a threshold
- Custom Prometheus metrics: any metric you expose
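For the concurrency metric, the core calculation is simple: divide observed concurrency by the per-replica target and clamp the result to the configured bounds. The sketch below shows only that step. Knative's actual autoscaler also averages over stable and panic windows and applies a target utilization percentage, which this omits, and `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(
    observed_concurrency: float,
    target_per_replica: float,
    min_replicas: int = 0,
    max_replicas: int = 16,
) -> int:
    """Simplified concurrency-based replica count, clamped to [min, max]."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(observed_concurrency / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))
```

With the configuration above (target 10, bounds 0 to 16), 35 concurrent requests call for 4 replicas, no traffic allows scale to zero, and a burst of 1,000 is capped at 16.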
Multi-Node (Replica-based)
Multi-node workloads scale by adding complete replica groups (leader + workers together):
autoscaling:
  minReplicas: 1   # Always keep at least 1 replica group
  maxReplicas: 4   # Up to 4 complete 2-node replica groups
  metric: "custom"
  metricName: "inference_queue_depth"
  target: 20

Each replica is an independent multi-node deployment serving the full model. Scaling adds 16 GPUs (2 nodes) per replica.
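The GPU cost of each scaling step follows directly from the replica shape. A minimal sketch, assuming the 2-node, 8-GPU-per-node layout used in this article (`replica_group_footprint` is a hypothetical helper):

```python
def replica_group_footprint(
    replicas: int, nodes_per_replica: int = 2, gpus_per_node: int = 8
) -> dict:
    """Nodes and GPUs consumed by a given number of complete replica groups."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    nodes = replicas * nodes_per_replica
    return {"replicas": replicas, "nodes": nodes, "gpus": nodes * gpus_per_node}


if __name__ == "__main__":
    # Walk the configured range: minReplicas=1 through maxReplicas=4
    for r in range(1, 5):
        print(replica_group_footprint(r))
```

At `maxReplicas: 4` the workload can claim 64 GPUs across 8 nodes, so the project quota needs room for that before the autoscaler can reach its ceiling.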
Observability
Run:ai provides layered metrics for inference workloads:
Infrastructure Metrics (All Workloads)
- GPU utilization per device
- GPU memory utilization
- CPU and system memory usage
- Network throughput (inter-node NCCL traffic)
Inference Metrics (All Inference Workloads)
- Request throughput (requests/second)
- Request latency (p50, p95, p99)
- Active replica count
- Queue depth and wait time
NIM-Specific Metrics (NIM Workloads Only)
- Time to First Token (TTFT): latency until the first token is generated
- Inter-Token Latency (ITL): time between consecutive tokens
- Token throughput: tokens/second generated
- KV-cache utilization: GPU memory used for attention cache
- Request concurrency: active concurrent requests
- Prompt/completion token counts: input vs output token ratios
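If you want to sanity-check these numbers against your own client-side measurements, TTFT, ITL, and token throughput can all be derived from per-token arrival timestamps. The sketch below uses common definitions. NIM's exported metrics may be computed differently (for example, server-side and aggregated across requests), and `token_latency_metrics` is a hypothetical helper.

```python
from __future__ import annotations


def token_latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Derive TTFT, mean ITL, and tokens/second for one streamed response.

    request_start and token_times are timestamps in seconds on the same clock.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    # Gaps between consecutive tokens; empty when only one token was produced.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    duration = token_times[-1] - request_start
    tokens_per_s = len(token_times) / duration if duration > 0 else 0.0
    return {"ttft_s": ttft, "mean_itl_s": mean_itl, "tokens_per_s": tokens_per_s}
```

Note that throughput here is measured end to end, including the wait for the first token. Dividing by decode time only would give a higher figure.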
These metrics feed into Run:ai's built-in dashboards and can be exported to Prometheus/Grafana for custom visualization.
Access Control
Run:ai enforces authentication on inference endpoints:
Client → Request with Bearer Token → Load Balancer → Ingress
  → Authorization check (Run:ai control plane)
  → Model pod (if authorized)

Access can be scoped to:
- Public: no authentication required
- Authenticated users: any valid Run:ai user
- Specific groups: department or team-level access
- Service accounts: machine-to-machine with API keys
- User-specific: individual user restrictions
This is critical for multi-tenant GPU clusters where different teams serve different models on shared infrastructure.
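The scoping rules can be read as a simple decision function. This sketch is not Run:ai's authorization code: the `Caller` and `AccessPolicy` types, the scope names, and the deny-by-default behavior are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    user: str | None = None               # None means no valid token was presented
    groups: frozenset[str] = frozenset()
    is_service_account: bool = False


@dataclass(frozen=True)
class AccessPolicy:
    scope: str                            # hypothetical scope name, see is_authorized
    allowed_groups: frozenset[str] = frozenset()
    allowed_users: frozenset[str] = frozenset()


def is_authorized(policy: AccessPolicy, caller: Caller) -> bool:
    """Return True if the caller may reach the endpoint under this policy."""
    if policy.scope == "public":
        return True
    if caller.user is None:
        return False  # every other scope needs an authenticated identity
    if policy.scope == "authenticated":
        return True
    if policy.scope == "groups":
        return bool(caller.groups & policy.allowed_groups)
    if policy.scope == "service_accounts":
        return caller.is_service_account and caller.user in policy.allowed_users
    if policy.scope == "users":
        return caller.user in policy.allowed_users
    return False  # unknown scope: deny by default
```

On a shared cluster, this is what lets one team's endpoint stay invisible to another team's users even though both sit behind the same ingress.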
Rolling Updates
Update inference workloads without downtime:
# Update model version
runai inference update deepseek-r1 \
  --image nvcr.io/nim/deepseek-ai/deepseek-r1:1.8.0
# Update compute resources
runai inference update deepseek-r1 \
  --gpu 8
# Update scaling policy
runai inference update deepseek-r1 \
  --max-replicas 8

For single-node (Knative), this creates a new revision with gradual traffic shift. For multi-node (LWS), it performs a rolling replacement of replica groups.
Model Catalog Integration
Run:ai integrates with two model sources:
NVIDIA NGC Catalog
- Dynamic model list updated from NGC
- Preconfigured NIM containers with optimized serving configs
- One-click deployment for supported models
Hugging Face Hub
- Browse and search the full HF catalog from Run:ai UI
- Gated model support with HF token authentication
- Automatic download and caching of model weights
# Deploy from Hugging Face
runai inference submit llama-8b \
  --image ghcr.io/huggingface/text-generation-inference:latest \
  --gpu 1 \
  --env MODEL_ID=meta-llama/Llama-3.1-8B-Instruct \
  --env HUGGING_FACE_HUB_TOKEN=hf_xxx

When to Use Run:ai vs DIY Kubernetes
| Capability | DIY Kubernetes | Run:ai |
|---|---|---|
| Basic inference serving | Manual setup | Automated |
| Topology-aware scheduling | Not built-in | Automatic |
| Multi-node orchestration | Manual LWS config | Declarative |
| GPU sharing/fractional | MIG only | Dynamic fractions |
| Autoscaling to zero | Knative setup | Built-in |
| Multi-tenant quotas | Basic ResourceQuota | Hierarchical quotas |
| NIM-specific metrics | Manual Prometheus | Built-in dashboards |
| Gang scheduling | Needs Volcano/Coscheduling | Native |
| MNNVL support | Manual | Automatic |
Run:ai makes sense when you have:
- Multiple teams sharing GPU clusters
- Mix of training and inference workloads competing for GPUs
- Large models that require multi-node deployment with topology-aware placement
- Need for enterprise access control and audit logging
DIY Kubernetes is fine for:
- Single-team GPU clusters
- Simple single-node inference
- Cost-sensitive environments (Run:ai has licensing costs)
About the Author
I am Luca Berton, AI and Cloud Advisor. I design GPU inference platforms for enterprises serving large language models at scale.