When to Fine-Tune
| Scenario | Fine-Tune? | Alternative |
|---|---|---|
| Domain-specific terminology | β Yes | RAG (if just lookup) |
| Output format/style | β Yes | System prompt |
| New knowledge (facts) | β No | RAG |
| Reduce hallucination | β Yes | RAG + guardrails |
| Smaller model + same quality | β Yes | Distillation |
| Cost reduction (fewer tokens) | β Yes | Shorter prompts |
Distributed Training on Kubernetes
Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β Training Job β
β β
β βββββββββββββ βββββββββββββ βββββββββββββ β
β β Worker 0 β β Worker 1 β β Worker 2 β β
β β (Rank 0) β β (Rank 1) β β (Rank 2) β β
β β 4x H100 β β 4x H100 β β 4x H100 β β
β βββββββ¬ββββββ βββββββ¬ββββββ βββββββ¬ββββββ β
β β β β β
β ββββββββββββββββΌβββββββββββββββ β
β β β
β ββββββββββΌβββββββββ β
β β NCCL / RDMA β β
β β (AllReduce) β β
β βββββββββββββββββββ β
β β
β ββββββββββββββββββββββββββββββββββββββββββββ β
β β Shared Storage (NFS/Lustre) β β
β β β’ Model checkpoints β β
β β β’ Training data β β
β β β’ Logs β β
β ββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββPyTorch FSDP on Kubernetes (Training Operator)
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
name: llama-finetune
namespace: ml-training
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
template:
spec:
containers:
- name: trainer
image: myregistry/llm-trainer:latest
command:
- torchrun
- "--nproc_per_node=4"
- "--nnodes=3"
- "--node_rank=0"
- "--master_addr=$(MASTER_ADDR)"
- "--master_port=29500"
- "train.py"
- "--model_name=meta-llama/Llama-3.1-8B"
- "--dataset=my-domain-data"
- "--output_dir=/checkpoints/llama-8b-ft"
- "--fsdp_strategy=FULL_SHARD"
- "--bf16=true"
- "--per_device_train_batch_size=4"
- "--gradient_accumulation_steps=8"
- "--learning_rate=2e-5"
- "--num_train_epochs=3"
- "--save_steps=500"
resources:
limits:
nvidia.com/gpu: "4"
rdma/rdma_shared_device_a: "1"
volumeMounts:
- name: checkpoints
mountPath: /checkpoints
- name: data
mountPath: /data
- name: shm
mountPath: /dev/shm
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: "64Gi"
- name: checkpoints
persistentVolumeClaim:
claimName: training-checkpoints
- name: data
persistentVolumeClaim:
claimName: training-data
Worker:
replicas: 2
template:
spec:
containers:
- name: trainer
image: myregistry/llm-trainer:latest
command:
- torchrun
- "--nproc_per_node=4"
- "--nnodes=3"
- "--master_addr=$(MASTER_ADDR)"
- "--master_port=29500"
- "train.py"
resources:
limits:
nvidia.com/gpu: "4"
rdma/rdma_shared_device_a: "1"LoRA Fine-Tuning (Single Node)
For most use cases, LoRA is more practical than full fine-tuning:
apiVersion: batch/v1
kind: Job
metadata:
name: lora-finetune
spec:
template:
spec:
containers:
- name: trainer
image: myregistry/llm-trainer:latest
command:
- python
- train_lora.py
- "--model_name=meta-llama/Llama-3.1-70B-Instruct"
- "--dataset=/data/my-domain.jsonl"
- "--output_dir=/checkpoints/lora-adapter"
- "--lora_r=16"
- "--lora_alpha=32"
- "--lora_target_modules=q_proj,k_proj,v_proj,o_proj"
- "--per_device_train_batch_size=4"
- "--gradient_accumulation_steps=4"
- "--bf16=true"
- "--num_train_epochs=3"
- "--learning_rate=1e-4"
resources:
limits:
nvidia.com/gpu: "2" # 70B LoRA fits on 2x A100Training Script (train_lora.py)
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
import torch
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-70B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2",
)
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
print(f"Trainable params: {model.print_trainable_parameters()}")
# Output: trainable params: 83,886,080 || all params: 70,553,706,496 || 0.12%
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
args=TrainingArguments(
output_dir="/checkpoints/lora-adapter",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
num_train_epochs=3,
learning_rate=1e-4,
bf16=True,
save_steps=500,
logging_steps=10,
),
max_seq_length=2048,
)
trainer.train()
trainer.save_model()Cost Management
Spot Instances for Training
# Karpenter NodePool for training
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: training-spot
spec:
template:
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["p4d.24xlarge", "p5.48xlarge"]
disruption:
consolidationPolicy: WhenEmptyCheckpointing for Preemption Recovery
# Save checkpoint every N steps
training_args = TrainingArguments(
save_steps=100,
save_total_limit=3,
resume_from_checkpoint=True, # Auto-resume after spot interruption
)Cost Comparison
| Method | GPU Hours | Cost (spot) | Quality |
|---|---|---|---|
| Full fine-tune 8B | 24h Γ 4 GPU | $120 | Best |
| LoRA 8B | 8h Γ 1 GPU | $8 | 95% of full |
| QLoRA 70B | 16h Γ 2 GPU | $32 | 90% of full |
| Full fine-tune 70B | 72h Γ 8 GPU | $1,440 | Best |
| LoRA 70B | 24h Γ 2 GPU | $48 | 95% of full |
LoRA is 10-30x cheaper than full fine-tuning with minimal quality loss.
Serving Fine-Tuned Models
vLLM with LoRA Adapter
containers:
- name: vllm
args:
- "--model"
- "meta-llama/Llama-3.1-70B-Instruct"
- "--enable-lora"
- "--lora-modules"
- "my-domain=/checkpoints/lora-adapter"
- "--max-loras"
- "4" # Serve multiple LoRA adapters simultaneouslyRequest specific adapter:
curl http://vllm:8000/v1/chat/completions \
-d '{"model": "my-domain", "messages": [...]}'