Skip to main content
πŸŽ“ Claude Code Masterclass Learn AI-assisted development on Udemy β€” plus the companion book on Leanpub & Amazon. Start Learning
LeaderWorkerSet distributed inference Kubernetes multi-node LLM
Platform Engineering

Multi-Node Distributed Inference with LeaderWorkerSet on

Deploy large language models across multiple Kubernetes nodes using the LeaderWorkerSet (LWS) operator. Covers vLLM pipeline parallelism, NCCL over.

LB
Luca Berton
Β· 2 min read

When a model is too large for a single node β€” think Llama 3.1 405B requiring 810 GB of GPU memory β€” you need multi-node inference. The LeaderWorkerSet (LWS) operator is the Kubernetes-native way to orchestrate these distributed inference deployments.

What is LeaderWorkerSet?

LWS is a Kubernetes operator (graduated from SIG-Apps) designed for workloads that need tightly-coupled multi-pod groups. Each β€œreplica” consists of:

  • 1 Leader pod β€” receives traffic, coordinates inference
  • N Worker pods β€” hold model shards, participate in tensor/pipeline parallelism
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  LeaderWorkerSet "llama-405b"                    β”‚
β”‚                                                   β”‚
β”‚  Replica 0:                                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚  β”‚ Leader   β”‚  β”‚ Worker 0 β”‚  β”‚ Worker 1 β”‚       β”‚
β”‚  β”‚ (head)   │──│ (layers) │──│ (layers) β”‚       β”‚
β”‚  β”‚ 8Γ— GPU   β”‚  β”‚ 8Γ— GPU   β”‚  β”‚ 8Γ— GPU   β”‚       β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
β”‚       β–²                                           β”‚
β”‚       β”‚ Service (traffic entry point)             β”‚
β”‚                                                   β”‚
β”‚  Replica 1: (for high availability)               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚  β”‚ Leader   β”‚  β”‚ Worker 0 β”‚  β”‚ Worker 1 β”‚       β”‚
β”‚  β”‚ 8Γ— GPU   β”‚  β”‚ 8Γ— GPU   β”‚  β”‚ 8Γ— GPU   β”‚       β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Installing the LWS Operator

# Install LeaderWorkerSet CRDs and controller
kubectl apply --server-side -f \
  https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml

# Verify
kubectl get pods -n lws-system

Deploying Llama 405B with vLLM

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llama-405b
  namespace: ai-inference
spec:
  replicas: 1  # Number of independent model instances
  leaderWorkerTemplate:
    size: 3  # 1 leader + 2 workers = 3 nodes total
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
        containers:
          - name: vllm
            image: vllm/vllm-openai:v0.7.3
            ports:
              - containerPort: 8000
                name: http
            env:
              - name: VLLM_HOST_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
              # LWS injects these automatically
              - name: LWS_LEADER_ADDRESS
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/leader-address']
            args:
              - --model=meta-llama/Llama-3.1-405B-Instruct
              - --tensor-parallel-size=8
              - --pipeline-parallel-size=3
              - --gpu-memory-utilization=0.92
              - --max-model-len=32768
              - --distributed-executor-backend=ray
              - --trust-remote-code
            resources:
              limits:
                nvidia.com/gpu: 8
                rdma/rdma_shared_device_a: 1
              requests:
                memory: "200Gi"
                cpu: "32"
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 300  # Model loading takes time
              periodSeconds: 10
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 600
              periodSeconds: 30
            volumeMounts:
              - name: model-cache
                mountPath: /root/.cache/huggingface
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-store
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "64Gi"
    workerTemplate:
      spec:
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
        containers:
          - name: vllm-worker
            image: vllm/vllm-openai:v0.7.3
            env:
              - name: VLLM_HOST_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
            args:
              - --model=meta-llama/Llama-3.1-405B-Instruct
              - --tensor-parallel-size=8
              - --pipeline-parallel-size=3
              - --gpu-memory-utilization=0.92
              - --distributed-executor-backend=ray
              - --trust-remote-code
            resources:
              limits:
                nvidia.com/gpu: 8
                rdma/rdma_shared_device_a: 1
              requests:
                memory: "200Gi"
                cpu: "32"
            volumeMounts:
              - name: model-cache
                mountPath: /root/.cache/huggingface
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-store
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "64Gi"
---
# Service targets only the leader pods
apiVersion: v1
kind: Service
metadata:
  name: llama-405b
  namespace: ai-inference
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: llama-405b
    role: leader
  ports:
    - port: 8000
      targetPort: 8000
      name: http

NCCL Configuration for Multi-Node

For the pods to communicate efficiently, configure NCCL environment variables:

env:
  # Use InfiniBand for tensor parallel communication
  - name: NCCL_IB_HCA
    value: "mlx5"
  - name: NCCL_IB_GID_INDEX
    value: "3"
  - name: NCCL_SOCKET_IFNAME
    value: "net1"
  - name: NCCL_DEBUG
    value: "WARN"
  # Performance tuning
  - name: NCCL_IB_QPS_PER_CONNECTION
    value: "4"
  - name: NCCL_IB_TC
    value: "136"
  - name: NCCL_IB_SL
    value: "5"

Network Requirements

Multi-node inference requires high-bandwidth, low-latency networking:

# Network attachment for InfiniBand/RoCE
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-network
  namespace: ai-inference
  annotations:
    k8s.v1.cni.cncf.io/resourceName: rdma/rdma_shared_device_a
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "net1",
      "ipam": {
        "type": "whereabouts",
        "range": "10.0.100.0/24"
      }
    }

Without RDMA networking, pipeline parallelism adds 10-50ms per pipeline stage β€” unacceptable for interactive inference.

Health Checks and Failure Recovery

LWS’s RecreateGroupOnPodRestart policy ensures that if any pod in the group fails, the entire group is recreated. This is critical because:

  1. Model shards are loaded at startup into specific ranks
  2. If one worker dies, the remaining pods hold stale NCCL communicators
  3. Partial recovery is impossible β€” the whole group must restart together
spec:
  leaderWorkerTemplate:
    restartPolicy: RecreateGroupOnPodRestart

Startup time for 405B (3 nodes Γ— 8 GPUs):

  • Model download from cache: ~2 minutes
  • Weight loading + sharding: ~5 minutes
  • NCCL ring initialization: ~30 seconds
  • Total cold start: ~8 minutes

Autoscaling with LWS

Scale model replicas based on request queue depth:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-405b-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: llama-405b
  minReplicas: 1
  maxReplicas: 4  # Up to 4 full model instances (12 nodes Γ— 8 GPUs = 96 GPUs)
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "10"  # Scale up when queue exceeds 10 requests
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling (loading time)
    scaleDown:
      stabilizationWindowSeconds: 600  # Wait 10 min before scaling down

LWS vs Alternatives

SolutionMulti-NodeK8s NativeHAAutoscaling
LWSβœ…βœ… (CRD)βœ… (replicas)βœ… (HPA)
Run:aiβœ…βœ… (operator)⚠️ (manual)⚠️ (scheduler)
Ray Serveβœ…βš οΈ (needs Ray cluster)βœ…βœ… (Ray autoscaler)
Manual StatefulSetβœ…βœ…βŒβŒ
NVIDIA NIM + Helmβœ…βœ…βœ…βš οΈ

LWS is the lightest-weight option that provides group semantics (all-or-nothing restart) without requiring a full platform like Ray or Run:ai.

Monitoring

# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llama-405b-metrics
spec:
  selector:
    matchLabels:
      leaderworkerset.sigs.k8s.io/name: llama-405b
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Key metrics to watch:

  • vllm_num_requests_running β€” active inference requests
  • vllm_num_requests_waiting β€” queue depth (trigger for scaling)
  • vllm_gpu_cache_usage_perc β€” KV cache utilization
  • vllm_avg_generation_throughput_toks_per_s β€” output token rate

LWS solves the β€œhow do I keep these pods together” problem that StatefulSets and Deployments weren’t designed for. For multi-node inference, it’s the missing piece between raw Kubernetes and full ML platforms.

Free 30-min AI & Cloud consultation

Book Now