Skip to main content
πŸŽ“ Claude Code Masterclass Learn AI-assisted development on Udemy β€” plus the companion book on Leanpub & Amazon. Start Learning
Fine-Tune LLMs on Kubernetes: Distributed Training Guide
AI

Fine-Tune LLMs on Kubernetes: Distributed Training Guide

Run distributed LLM fine-tuning on Kubernetes with PyTorch FSDP, DeepSpeed, and multi-node GPU training. Job scheduling, checkpointing, and cost management.

LB
Luca Berton
Β· 1 min read

When to Fine-Tune

ScenarioFine-Tune?Alternative
Domain-specific terminologyβœ… YesRAG (if just lookup)
Output format/styleβœ… YesSystem prompt
New knowledge (facts)❌ NoRAG
Reduce hallucinationβœ… YesRAG + guardrails
Smaller model + same qualityβœ… YesDistillation
Cost reduction (fewer tokens)βœ… YesShorter prompts

Distributed Training on Kubernetes

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                Training Job                      β”‚
β”‚                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚  β”‚  Worker 0 β”‚ β”‚  Worker 1 β”‚ β”‚  Worker 2 β”‚     β”‚
β”‚  β”‚  (Rank 0) β”‚ β”‚  (Rank 1) β”‚ β”‚  (Rank 2) β”‚     β”‚
β”‚  β”‚  4x H100  β”‚ β”‚  4x H100  β”‚ β”‚  4x H100  β”‚     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜     β”‚
β”‚        β”‚              β”‚              β”‚           β”‚
β”‚        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚
β”‚                       β”‚                          β”‚
β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”                 β”‚
β”‚              β”‚   NCCL / RDMA   β”‚                 β”‚
β”‚              β”‚  (AllReduce)    β”‚                 β”‚
β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β”‚                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚         Shared Storage (NFS/Lustre)       β”‚   β”‚
β”‚  β”‚  β€’ Model checkpoints                     β”‚   β”‚
β”‚  β”‚  β€’ Training data                         β”‚   β”‚
β”‚  β”‚  β€’ Logs                                  β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

PyTorch FSDP on Kubernetes (Training Operator)

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llama-finetune
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: trainer
              image: myregistry/llm-trainer:latest
              command:
                - torchrun
                - "--nproc_per_node=4"
                - "--nnodes=3"
                - "--node_rank=0"
                - "--master_addr=$(MASTER_ADDR)"
                - "--master_port=29500"
                - "train.py"
                - "--model_name=meta-llama/Llama-3.1-8B"
                - "--dataset=my-domain-data"
                - "--output_dir=/checkpoints/llama-8b-ft"
                - "--fsdp_strategy=FULL_SHARD"
                - "--bf16=true"
                - "--per_device_train_batch_size=4"
                - "--gradient_accumulation_steps=8"
                - "--learning_rate=2e-5"
                - "--num_train_epochs=3"
                - "--save_steps=500"
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/rdma_shared_device_a: "1"
              volumeMounts:
                - name: checkpoints
                  mountPath: /checkpoints
                - name: data
                  mountPath: /data
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "64Gi"
            - name: checkpoints
              persistentVolumeClaim:
                claimName: training-checkpoints
            - name: data
              persistentVolumeClaim:
                claimName: training-data
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: trainer
              image: myregistry/llm-trainer:latest
              command:
                - torchrun
                - "--nproc_per_node=4"
                - "--nnodes=3"
                - "--master_addr=$(MASTER_ADDR)"
                - "--master_port=29500"
                - "train.py"
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/rdma_shared_device_a: "1"

LoRA Fine-Tuning (Single Node)

For most use cases, LoRA is more practical than full fine-tuning:

apiVersion: batch/v1
kind: Job
metadata:
  name: lora-finetune
spec:
  template:
    spec:
      containers:
        - name: trainer
          image: myregistry/llm-trainer:latest
          command:
            - python
            - train_lora.py
            - "--model_name=meta-llama/Llama-3.1-70B-Instruct"
            - "--dataset=/data/my-domain.jsonl"
            - "--output_dir=/checkpoints/lora-adapter"
            - "--lora_r=16"
            - "--lora_alpha=32"
            - "--lora_target_modules=q_proj,k_proj,v_proj,o_proj"
            - "--per_device_train_batch_size=4"
            - "--gradient_accumulation_steps=4"
            - "--bf16=true"
            - "--num_train_epochs=3"
            - "--learning_rate=1e-4"
          resources:
            limits:
              nvidia.com/gpu: "2"  # 70B LoRA fits on 2x A100

Training Script (train_lora.py)

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
print(f"Trainable params: {model.print_trainable_parameters()}")
# Output: trainable params: 83,886,080 || all params: 70,553,706,496 || 0.12%

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="/checkpoints/lora-adapter",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=1e-4,
        bf16=True,
        save_steps=500,
        logging_steps=10,
    ),
    max_seq_length=2048,
)

trainer.train()
trainer.save_model()

Cost Management

Spot Instances for Training

# Karpenter NodePool for training
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: training-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p4d.24xlarge", "p5.48xlarge"]
  disruption:
    consolidationPolicy: WhenEmpty

Checkpointing for Preemption Recovery

# Save checkpoint every N steps
training_args = TrainingArguments(
    save_steps=100,
    save_total_limit=3,
    resume_from_checkpoint=True,  # Auto-resume after spot interruption
)

Cost Comparison

MethodGPU HoursCost (spot)Quality
Full fine-tune 8B24h Γ— 4 GPU$120Best
LoRA 8B8h Γ— 1 GPU$895% of full
QLoRA 70B16h Γ— 2 GPU$3290% of full
Full fine-tune 70B72h Γ— 8 GPU$1,440Best
LoRA 70B24h Γ— 2 GPU$4895% of full

LoRA is 10-30x cheaper than full fine-tuning with minimal quality loss.

Serving Fine-Tuned Models

vLLM with LoRA Adapter

containers:
  - name: vllm
    args:
      - "--model"
      - "meta-llama/Llama-3.1-70B-Instruct"
      - "--enable-lora"
      - "--lora-modules"
      - "my-domain=/checkpoints/lora-adapter"
      - "--max-loras"
      - "4"  # Serve multiple LoRA adapters simultaneously

Request specific adapter:

curl http://vllm:8000/v1/chat/completions \
  -d '{"model": "my-domain", "messages": [...]}'

Free 30-min AI & Cloud consultation

Book Now