When a model is too large for a single node β think Llama 3.1 405B requiring 810 GB of GPU memory β you need multi-node inference. The LeaderWorkerSet (LWS) operator is the Kubernetes-native way to orchestrate these distributed inference deployments.
What is LeaderWorkerSet?
LWS is a Kubernetes operator (graduated from SIG-Apps) designed for workloads that need tightly-coupled multi-pod groups. Each βreplicaβ consists of:
- 1 Leader pod β receives traffic, coordinates inference
- N Worker pods β hold model shards, participate in tensor/pipeline parallelism
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β LeaderWorkerSet "llama-405b" β
β β
β Replica 0: β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β Leader β β Worker 0 β β Worker 1 β β
β β (head) ββββ (layers) ββββ (layers) β β
β β 8Γ GPU β β 8Γ GPU β β 8Γ GPU β β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β² β
β β Service (traffic entry point) β
β β
β Replica 1: (for high availability) β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β Leader β β Worker 0 β β Worker 1 β β
β β 8Γ GPU β β 8Γ GPU β β 8Γ GPU β β
β ββββββββββββ ββββββββββββ ββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββInstalling the LWS Operator
# Install LeaderWorkerSet CRDs and controller
kubectl apply --server-side -f \
https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml
# Verify
kubectl get pods -n lws-systemDeploying Llama 405B with vLLM
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: llama-405b
namespace: ai-inference
spec:
replicas: 1 # Number of independent model instances
leaderWorkerTemplate:
size: 3 # 1 leader + 2 workers = 3 nodes total
restartPolicy: RecreateGroupOnPodRestart
leaderTemplate:
metadata:
labels:
role: leader
spec:
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- name: vllm
image: vllm/vllm-openai:v0.7.3
ports:
- containerPort: 8000
name: http
env:
- name: VLLM_HOST_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
# LWS injects these automatically
- name: LWS_LEADER_ADDRESS
valueFrom:
fieldRef:
fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/leader-address']
args:
- --model=meta-llama/Llama-3.1-405B-Instruct
- --tensor-parallel-size=8
- --pipeline-parallel-size=3
- --gpu-memory-utilization=0.92
- --max-model-len=32768
- --distributed-executor-backend=ray
- --trust-remote-code
resources:
limits:
nvidia.com/gpu: 8
rdma/rdma_shared_device_a: 1
requests:
memory: "200Gi"
cpu: "32"
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 300 # Model loading takes time
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 600
periodSeconds: 30
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-store
- name: shm
emptyDir:
medium: Memory
sizeLimit: "64Gi"
workerTemplate:
spec:
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- name: vllm-worker
image: vllm/vllm-openai:v0.7.3
env:
- name: VLLM_HOST_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
args:
- --model=meta-llama/Llama-3.1-405B-Instruct
- --tensor-parallel-size=8
- --pipeline-parallel-size=3
- --gpu-memory-utilization=0.92
- --distributed-executor-backend=ray
- --trust-remote-code
resources:
limits:
nvidia.com/gpu: 8
rdma/rdma_shared_device_a: 1
requests:
memory: "200Gi"
cpu: "32"
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-store
- name: shm
emptyDir:
medium: Memory
sizeLimit: "64Gi"
---
# Service targets only the leader pods
apiVersion: v1
kind: Service
metadata:
name: llama-405b
namespace: ai-inference
spec:
selector:
leaderworkerset.sigs.k8s.io/name: llama-405b
role: leader
ports:
- port: 8000
targetPort: 8000
name: httpNCCL Configuration for Multi-Node
For the pods to communicate efficiently, configure NCCL environment variables:
env:
# Use InfiniBand for tensor parallel communication
- name: NCCL_IB_HCA
value: "mlx5"
- name: NCCL_IB_GID_INDEX
value: "3"
- name: NCCL_SOCKET_IFNAME
value: "net1"
- name: NCCL_DEBUG
value: "WARN"
# Performance tuning
- name: NCCL_IB_QPS_PER_CONNECTION
value: "4"
- name: NCCL_IB_TC
value: "136"
- name: NCCL_IB_SL
value: "5"Network Requirements
Multi-node inference requires high-bandwidth, low-latency networking:
# Network attachment for InfiniBand/RoCE
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
name: rdma-network
namespace: ai-inference
annotations:
k8s.v1.cni.cncf.io/resourceName: rdma/rdma_shared_device_a
spec:
config: |
{
"cniVersion": "0.3.1",
"type": "host-device",
"device": "net1",
"ipam": {
"type": "whereabouts",
"range": "10.0.100.0/24"
}
}Without RDMA networking, pipeline parallelism adds 10-50ms per pipeline stage β unacceptable for interactive inference.
Health Checks and Failure Recovery
LWSβs RecreateGroupOnPodRestart policy ensures that if any pod in the group fails, the entire group is recreated. This is critical because:
- Model shards are loaded at startup into specific ranks
- If one worker dies, the remaining pods hold stale NCCL communicators
- Partial recovery is impossible β the whole group must restart together
spec:
leaderWorkerTemplate:
restartPolicy: RecreateGroupOnPodRestartStartup time for 405B (3 nodes Γ 8 GPUs):
- Model download from cache: ~2 minutes
- Weight loading + sharding: ~5 minutes
- NCCL ring initialization: ~30 seconds
- Total cold start: ~8 minutes
Autoscaling with LWS
Scale model replicas based on request queue depth:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: llama-405b-hpa
spec:
scaleTargetRef:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
name: llama-405b
minReplicas: 1
maxReplicas: 4 # Up to 4 full model instances (12 nodes Γ 8 GPUs = 96 GPUs)
metrics:
- type: Pods
pods:
metric:
name: vllm_num_requests_waiting
target:
type: AverageValue
averageValue: "10" # Scale up when queue exceeds 10 requests
behavior:
scaleUp:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling (loading time)
scaleDown:
stabilizationWindowSeconds: 600 # Wait 10 min before scaling downLWS vs Alternatives
| Solution | Multi-Node | K8s Native | HA | Autoscaling |
|---|---|---|---|---|
| LWS | β | β (CRD) | β (replicas) | β (HPA) |
| Run:ai | β | β (operator) | β οΈ (manual) | β οΈ (scheduler) |
| Ray Serve | β | β οΈ (needs Ray cluster) | β | β (Ray autoscaler) |
| Manual StatefulSet | β | β | β | β |
| NVIDIA NIM + Helm | β | β | β | β οΈ |
LWS is the lightest-weight option that provides group semantics (all-or-nothing restart) without requiring a full platform like Ray or Run:ai.
Monitoring
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: llama-405b-metrics
spec:
selector:
matchLabels:
leaderworkerset.sigs.k8s.io/name: llama-405b
endpoints:
- port: http
path: /metrics
interval: 15sKey metrics to watch:
vllm_num_requests_runningβ active inference requestsvllm_num_requests_waitingβ queue depth (trigger for scaling)vllm_gpu_cache_usage_percβ KV cache utilizationvllm_avg_generation_throughput_toks_per_sβ output token rate
Related Articles
- Distributed vs Multi-GPU Inference β when to go multi-node
- NVIDIA NIM Multi-Node Deployment β NIM-specific approach
- GenAI-Perf Benchmarking β validating deployment performance
- Fine-Tuning Mistral with FSDP β training side of multi-node
- NCCL Timeout Troubleshooting β debugging NCCL issues
LWS solves the βhow do I keep these pods togetherβ problem that StatefulSets and Deployments werenβt designed for. For multi-node inference, itβs the missing piece between raw Kubernetes and full ML platforms.