
Unsloth on NVIDIA DGX Spark: 2x Faster

Step-by-step guide to running Unsloth on NVIDIA DGX Spark for 2x faster LLM fine-tuning with CUDA 13 and PyTorch containers.

Luca Berton
· 5 min read

Why Unsloth on DGX Spark

Fine-tuning large language models is expensive — in time, in compute, and in patience. Unsloth changes the equation by delivering 2x faster fine-tuning with significantly less memory usage, making it possible to fine-tune models like Llama 3.1 8B on hardware that would otherwise be too constrained.

The NVIDIA DGX Spark is NVIDIA’s compact AI workstation — a desktop-class device with serious GPU capability. Combining it with Unsloth’s optimized training kernels gives you a local fine-tuning environment that punches well above its weight class.

This guide walks through the complete setup in about an hour.

Prerequisites

Before starting, verify your DGX Spark has the required CUDA toolkit and GPU resources.

Check CUDA version:

nvcc --version

The output should show CUDA 13.0 or later. DGX Spark ships with the CUDA toolkit pre-installed.

Check GPU status:

nvidia-smi

You should see a summary of your GPU information — device name, driver version, memory usage, and temperature. If either command fails, ensure your NVIDIA drivers are properly installed before continuing.
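If you prefer to script the two checks, here is a minimal Python sketch. It assumes `nvcc --version` prints a line containing `release <major>.<minor>`, which is the usual format; the function names are illustrative and not part of any NVIDIA tooling. It only confirms that nvidia-smi is on the PATH, it does not parse its output.

```python
import re
import shutil
import subprocess

MIN_CUDA = (13, 0)  # minimum version stated in this guide


def parse_cuda_release(nvcc_output: str):
    """Return (major, minor) from `nvcc --version` text, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def check_prerequisites() -> bool:
    """Confirm both tools are on PATH and that nvcc reports CUDA >= MIN_CUDA."""
    for tool in ("nvcc", "nvidia-smi"):
        if shutil.which(tool) is None:
            print(f"{tool} not found on PATH")
            return False
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    version = parse_cuda_release(result.stdout)
    ok = version is not None and version >= MIN_CUDA
    print(f"CUDA release: {version}, meets minimum {MIN_CUDA}: {ok}")
    return ok


if __name__ == "__main__":
    check_prerequisites()
```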

Step 1: Pull the PyTorch container

NVIDIA’s NGC container registry provides optimized PyTorch containers with the correct CUDA libraries, cuDNN, and NCCL pre-configured. Using the official container avoids dependency conflicts:

docker pull nvcr.io/nvidia/pytorch:25.11-py3

This is a large image (15-20 GB). On a fast connection it takes a few minutes; on slower connections, plan accordingly.

Step 2: Launch the container

Start an interactive session with full GPU access:

docker run --gpus all \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -it \
  --entrypoint /usr/bin/bash \
  --rm \
  nvcr.io/nvidia/pytorch:25.11-py3

Breaking down the flags:

  • --gpus all — exposes all GPUs to the container
  • --ulimit memlock=-1 — removes memory lock limits (required for large model training)
  • --ulimit stack=67108864 — sets a 64 MB stack size (prevents stack overflow during training)
  • -it — interactive terminal
  • --entrypoint /usr/bin/bash — starts a shell instead of the default entrypoint
  • --rm — automatically removes the container when you exit
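The same flags can be assembled programmatically, which also makes the stack-size arithmetic explicit. This is only a sketch: it builds and prints the command rather than running it, and the helper name `build_docker_command` is made up for illustration.

```python
import shlex

IMAGE = "nvcr.io/nvidia/pytorch:25.11-py3"
STACK_BYTES = 64 * 1024 * 1024  # 64 MiB expressed in bytes = 67108864


def build_docker_command(image: str = IMAGE) -> list:
    """Return the docker run invocation from Step 2 as an argument list."""
    return [
        "docker", "run",
        "--gpus", "all",                     # expose all GPUs
        "--ulimit", "memlock=-1",            # no limit on locked memory
        "--ulimit", f"stack={STACK_BYTES}",  # 64 MiB stack
        "-it",                               # interactive terminal
        "--entrypoint", "/usr/bin/bash",     # shell instead of default entrypoint
        "--rm",                              # remove container on exit
        image,
    ]


command = build_docker_command()
print(shlex.join(command))
```

Passing the list to `subprocess.run(command)` would launch the container; printing it first is a cheap way to confirm the flags before you do.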

Step 3: Install dependencies

Inside the container, install the required Python packages:

pip install transformers peft hf_transfer "datasets==4.3.0" "trl==0.26.1"
pip install --no-deps unsloth unsloth_zoo bitsandbytes

Important: The second command uses --no-deps so that pip does not pull in the dependencies declared by unsloth, unsloth_zoo, and bitsandbytes, which could overwrite the optimized PyTorch and CUDA libraries already present in the NGC container. This is intentional: the container’s pre-built libraries are tuned for NVIDIA hardware.

What each package does

Package        Purpose
transformers   Hugging Face model loading and tokenization
peft           Parameter-Efficient Fine-Tuning (LoRA, QLoRA)
hf_transfer    Fast model downloads from Hugging Face Hub
datasets       Dataset loading and preprocessing
trl            Transformer Reinforcement Learning (includes SFTTrainer)
unsloth        Optimized training kernels for 2x speedup
unsloth_zoo    Model patches and optimizations
bitsandbytes   4-bit and 8-bit quantization for memory efficiency
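Because --no-deps skips pip's dependency resolution, it can be worth confirming afterwards that everything is actually present. The sketch below uses the standard library's importlib.metadata to compare installed versions against the two pins from Step 3. Including torch in the list is an assumption (it should already ship in the NGC image), and the helper names are illustrative.

```python
from importlib import metadata

# Versions pinned in Step 3 of this guide.
PINNED = {"datasets": "4.3.0", "trl": "0.26.1"}
# Packages that only need to be present (torch is assumed to come from the image).
UNPINNED = ["transformers", "peft", "hf_transfer", "unsloth",
            "unsloth_zoo", "bitsandbytes", "torch"]


def installed_version(name: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def report(lookup=installed_version) -> list:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    for name, wanted in PINNED.items():
        found = lookup(name)
        if found != wanted:
            problems.append(f"{name}: expected {wanted}, found {found}")
    for name in UNPINNED:
        if lookup(name) is None:
            problems.append(f"{name}: not installed")
    return problems


if __name__ == "__main__":
    issues = report()
    print("\n".join(issues) if issues else "All packages present")
```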

Step 4: Download the validation script

NVIDIA provides a test script to verify the installation:

curl -O https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/unsloth/assets/test_unsloth.py

This script runs a simple fine-tuning task to confirm everything is wired up correctly — Unsloth patches, GPU access, dataset loading, and training loop.

Step 5: Run the validation

python test_unsloth.py

Expected output:

  1. "Unsloth: Will patch your computer to enable 2x faster free finetuning" — confirms Unsloth’s kernel patches are active
  2. Training progress bars showing loss decreasing over 60 steps
  3. Final training metrics showing completion

If you see all three, your environment is ready for fine-tuning your own models.

What the test script does

The validation script:

  • Loads a small quantized model (typically a 4-bit Llama variant)
  • Applies LoRA adapters using Unsloth’s optimized patching
  • Runs a short fine-tuning job (60 steps) on a sample dataset
  • Reports training loss and throughput metrics
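To get a feel for how small the LoRA adapters are, here is a worked count of trainable parameters. It assumes Llama 3.1 8B's layer shapes (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers), a LoRA rank of 16, and adapters on all seven projection matrices. The rank and target modules used by test_unsloth.py are not stated in this guide, so treat those as illustrative values.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen weight."""
    return rank * (d_in + d_out)


# Llama 3.1 8B shapes; which modules get adapters is an assumption here.
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def total_lora_params(rank: int) -> int:
    """Sum adapter parameters over every projection in every layer."""
    per_layer = sum(lora_params(d_in, d_out, rank)
                    for d_in, d_out in PROJECTIONS.values())
    return per_layer * LAYERS


print(total_lora_params(16))  # 41943040
```

Under these assumptions the adapters hold about 42 million parameters, roughly half a percent of the 8 billion in the base model, which is why only a small share of the weights needs gradients and optimizer state.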

Troubleshooting

CUDA out of memory

If you see CUDA out of memory errors:

# Check GPU memory usage
nvidia-smi

# Reduce batch size in the script
# Edit test_unsloth.py, change per_device_train_batch_size to 1
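Lowering per_device_train_batch_size also lowers the effective batch size, which can change training behaviour. A common way to compensate is to raise gradient_accumulation_steps (a standard Hugging Face training argument) so the product stays the same. The starting values below are hypothetical, since this guide does not list the script's defaults.

```python
def accumulation_steps(target_effective_batch: int,
                       per_device_batch: int,
                       num_devices: int = 1) -> int:
    """Return gradient_accumulation_steps that preserves the effective batch size.

    effective batch = per_device_batch * num_devices * accumulation steps
    """
    per_step = per_device_batch * num_devices
    if target_effective_batch % per_step != 0:
        raise ValueError("target batch must be a multiple of per-step batch")
    return target_effective_batch // per_step


# Hypothetical starting point: batch size 2 with 4 accumulation steps (effective 8).
effective = 2 * 4
# After dropping the per-device batch size to 1 to save memory:
print(accumulation_steps(effective, per_device_batch=1))  # 8
```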

Docker GPU not detected

If nvidia-smi works on the host but not inside the container:

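# Ensure the NVIDIA Container Toolkit is installed
sudo apt install nvidia-container-toolkit

# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker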

Pip install conflicts

If pip install fails with dependency conflicts, ensure you are running inside the NGC container (not the host system) and using --no-deps for Unsloth:

# Verify you are inside the container
ls /.dockerenv        # Exists only inside a Docker container
cat /etc/os-release   # Shows the image's base OS (Ubuntu); it does not name NGC

# Reinstall with --no-deps
pip install --no-deps --force-reinstall unsloth unsloth_zoo bitsandbytes

Slow model download

If model downloads from Hugging Face are slow:

# Enable fast transfers
export HF_HUB_ENABLE_HF_TRANSFER=1

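The hf_transfer package was installed in Step 3 specifically for this. It downloads files in parallel chunks, which is usually much faster on high-bandwidth connections. Export the variable in the same shell, before you run python test_unsloth.py, so the Python process inherits it.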
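If you would rather set the flag from Python than from the shell, ordering matters: huggingface_hub reads this flag when it is first imported, so the variable has to be set before that import happens. A minimal sketch, with an illustrative helper name:

```python
import os
import sys


def enable_hf_transfer() -> bool:
    """Set the flag and report whether it was set early enough to take effect.

    Returns False if huggingface_hub was already imported in this process,
    in which case the library may have read the old value.
    """
    already_imported = "huggingface_hub" in sys.modules
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    return not already_imported


# Call this at the very top of your script, before importing unsloth,
# transformers, or datasets (all of which import huggingface_hub).
if not enable_hf_transfer():
    print("Warning: huggingface_hub was imported before the flag was set")
```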

Next steps: Fine-tune your own model

Once validation passes, customize the script for your use case:

Change the model

# Replace the model in test_unsloth.py (line 32)
model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"

Popular choices for DGX Spark:

  • unsloth/Meta-Llama-3.1-8B-bnb-4bit — Meta’s Llama 3.1, 4-bit quantized
  • unsloth/Mistral-7B-v0.3-bnb-4bit — Mistral 7B, good for code and reasoning
  • unsloth/Phi-3.5-mini-instruct-bnb-4bit — Microsoft’s compact model, fast to fine-tune
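A rough way to compare these choices is to estimate how much memory the weights alone occupy at a given precision. The sketch below uses a nominal 8 billion parameters for the Llama model and ignores quantization metadata, activations, LoRA adapters, and optimizer state, so real usage will be higher.

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in GiB at the given precision."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 8e9  # nominal parameter count for an "8B" model

for bits in (16, 4):
    print(f"{bits}-bit weights: {weight_gib(N_PARAMS, bits):.1f} GiB")
# 16-bit weights: 14.9 GiB
# 4-bit weights: 3.7 GiB
```

The fourfold reduction in weight memory is the main reason the bnb-4bit variants are the ones listed here.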

Use your own dataset

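# Load your custom dataset (line 8); assumes it has a "train" split
dataset = load_dataset("your_dataset_name", split="train")

# Or load from a local JSON/CSV file (local files load as the "train" split)
dataset = load_dataset("json", data_files="your_data.json", split="train")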
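For the local-file route, the json loader accepts JSON Lines: one JSON object per line. Here is a small sketch that writes such a file using only the standard library. The "text" field name and the prompt layout are assumptions for illustration; use whatever field your copy of test_unsloth.py actually reads.

```python
import json
from pathlib import Path


def write_jsonl(records, path) -> Path:
    """Write one JSON object per line, the layout the json loader expects."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Hypothetical training examples; the "text" key is an assumed field name.
examples = [
    {"text": "### Instruction:\nName the capital of France.\n\n### Response:\nParis."},
    {"text": "### Instruction:\nAdd 2 and 3.\n\n### Response:\n5."},
]

if __name__ == "__main__":
    out = write_jsonl(examples, "your_data.json")
    print(f"Wrote {len(examples)} records to {out}")
```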

Adjust training parameters

# Training arguments (line 61)
per_device_train_batch_size = 4   # Increase for faster training (if memory allows)
max_steps = 1000                   # More steps for larger datasets
learning_rate = 2e-4               # Default works well for most tasks
warmup_steps = 10                  # Warm up the learning rate
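To see how warmup_steps, max_steps, and learning_rate interact, the sketch below reproduces a linear warmup followed by linear decay. That is the default scheduler in Hugging Face's TrainingArguments, but the test script may configure a different one, so treat the schedule shape as an assumption.

```python
def lr_at_step(step: int, base_lr: float = 2e-4,
               warmup_steps: int = 10, max_steps: int = 1000) -> float:
    """Learning rate under linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warmup period.
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr down to 0 by the final step.
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 5, 10, 505, 1000):
    print(step, lr_at_step(step))

# With per_device_train_batch_size = 4 and no gradient accumulation,
# 1000 steps process 4 * 1000 = 4000 training examples.
examples_seen = 4 * 1000
```

If your dataset has fewer than 4000 examples, the run covers it more than once; dividing examples_seen by the dataset size gives the approximate number of epochs.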

Save in GGUF format

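For local inference engines like llama.cpp, which use GGUF as their native format (vLLM can also load GGUF files, though its documentation describes that support as experimental):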

# Save as GGUF after training
model.save_pretrained_gguf("output_model", tokenizer, quantization_method="q4_k_m")

Resume from checkpoints

For long training runs, save and resume from checkpoints:

# In TrainingArguments
save_steps = 100
save_total_limit = 3

# Resume training
trainer.train(resume_from_checkpoint=True)
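With resume_from_checkpoint=True, the trainer looks in its output directory for the most recent checkpoint-<step> folder and raises an error if none exists. The sketch below mimics that lookup with the standard library so you can check what would be resumed before starting a long run. It illustrates the idea and is not the library's own code; the "outputs" directory name is hypothetical.

```python
import re
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare step numbers numerically: checkpoint-1000 is newer than checkpoint-900.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    found = latest_checkpoint("outputs")  # hypothetical output_dir
    print(found if found else "No checkpoint found; start training from scratch")
```

With save_steps = 100 and save_total_limit = 3, a 1000-step run should end with only the checkpoints for steps 800, 900, and 1000 on disk.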

How Unsloth achieves 2x speedup

Unsloth’s performance gains come from several optimizations:

  • Custom CUDA kernels — hand-optimized attention and MLP kernels that reduce memory copies
  • Intelligent gradient checkpointing — recomputes activations selectively instead of storing everything
  • Optimized LoRA implementation — fused operations that reduce kernel launch overhead
  • Memory-efficient backpropagation — reduces peak memory usage by 50-70%

These optimizations are applied automatically when you load a model through Unsloth’s API. No code changes needed beyond using FastLanguageModel instead of the standard Hugging Face loader.
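In practice the swap looks roughly like the sketch below: FastLanguageModel.from_pretrained takes the place of AutoModelForCausalLM.from_pretrained, and get_peft_model attaches the LoRA adapters. The sequence length, rank, and target modules are assumed values rather than the ones in test_unsloth.py, and the function needs a GPU and the packages from Step 3 in order to run.

```python
# Loading settings; max_seq_length is an assumed value for illustration.
LOAD_KWARGS = {
    "model_name": "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    "max_seq_length": 2048,
    "load_in_4bit": True,
}

# LoRA settings; rank and target modules are assumptions, not the script's values.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 16,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}


def load_model():
    """Load a 4-bit model through Unsloth and attach LoRA adapters."""
    # Imported lazily so this file can be inspected on a machine without a GPU.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(**LOAD_KWARGS)
    model = FastLanguageModel.get_peft_model(model, **LORA_KWARGS)
    return model, tokenizer
```

Calling `model, tokenizer = load_model()` inside the container returns objects you can hand to a trainer in the usual way.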

Why this matters for AI infrastructure

The DGX Spark + Unsloth combination is significant for teams building AI capabilities:

  • Local fine-tuning removes the dependency on cloud GPU instances for experimentation
  • 2x speedup means faster iteration cycles — critical for the first 90 days of AI platform development
  • 4-bit quantization makes it practical to fine-tune models that would otherwise require much larger GPU memory
  • Reproducible container environment ensures consistency across team members

For production deployment at scale, the fine-tuned models can be exported and served on Kubernetes GPU infrastructure using vLLM, Triton Inference Server, or TGI.
