You fine-tuned a model. Now you need to serve it. NVIDIA NIM’s model-free mode lets you deploy any supported model using a single generic container — no model-specific NIM image required.
One container image passes security review. It serves Llama, Mistral, your fine-tuned variant, or any model with a vLLM-supported architecture. Point it at HuggingFace, S3, NGC, or a local directory.
When to Use Model-Free NIM
| Scenario | Model-Free NIM | Model-Specific NIM |
|---|---|---|
| Fine-tuned custom model | ✅ Best choice | ❌ Not available |
| Day-zero newly released model | ✅ Immediate | ❌ Wait for NVIDIA to publish |
| One container for all models | ✅ Single image | ❌ Different image per model |
| Pre-optimized TensorRT-LLM | ❌ vLLM only | ✅ May include TRT-LLM profiles |
| NGC-curated profiles | ❌ Generic profiles | ✅ Pre-validated for specific hardware |
Key limitation: Model-free NIM uses vLLM as the backend. If a model architecture is unsupported by vLLM, it will not work with model-free NIM either.
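A quick way to see which architecture a checkpoint declares before deploying it is to read its config.json. The following is a minimal sketch, assuming a Hugging Face-format model directory whose config.json contains an `architectures` array. The helper name `model_architectures` is hypothetical and not part of NIM or vLLM, and the printed name still has to be compared by hand against the vLLM supported models list.

```shell
#!/bin/sh
# Print the architecture names declared in <model_dir>/config.json, one per line.
# Assumes a Hugging Face-format checkpoint; model_architectures is an
# illustrative helper, not a NIM or vLLM command.
model_architectures() {
    config="$1/config.json"
    [ -f "$config" ] || { echo "missing $config" >&2; return 1; }
    # Flatten the JSON to one line, capture the contents of the
    # "architectures" array, then strip quotes/whitespace and split on commas.
    tr -d '\n' < "$config" \
        | sed -n 's/.*"architectures"[[:space:]]*:[[:space:]]*\[\([^]]*\)\].*/\1/p' \
        | tr -d '" \t\r' \
        | tr ',' '\n'
}
```

For example, `model_architectures /mnt/models/my-fine-tuned-llama` prints the declared class name for a locally staged model.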
Step 1: Set Up the NIM Container Image
Pull the generic NIM LLM container:
export NIM_LLM_IMAGE=nvcr.io/nim/nim-llm:latest
docker pull $NIM_LLM_IMAGE
Create a local cache directory to avoid re-downloading models on restart:
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p $LOCAL_NIM_CACHE
Step 2: Choose Your Model Source
Model-free NIM supports six model sources:
| Prefix | Source | Example | Authentication |
|---|---|---|---|
| hf:// | HuggingFace Hub | hf://meta-llama/Llama-3.1-8B-Instruct | HF_TOKEN |
| ngc:// | NVIDIA NGC | ngc://nim/meta/llama-3.3-70b-instruct:hf | NGC_API_KEY |
| s3:// | AWS S3 / S3-compatible | s3://my-bucket/my-org/my-model | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY |
| modelscope:// | ModelScope Hub | modelscope://LLM-Research/Llama-3.2-1B-Instruct:d3e55134 | MODELSCOPE_API_TOKEN |
| gs:// | Google Cloud Storage | gs://my-bucket/my-org/my-model | GOOGLE_APPLICATION_CREDENTIALS |
| /absolute/path | Local directory | /mnt/models/my-fine-tuned-llama | None |
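The prefix-to-credential mapping in the table can be checked in a launch script before the container starts. This is a sketch only: `required_credentials` is a hypothetical helper name, and the variable names are taken directly from the table above.

```shell
#!/bin/sh
# Map a NIM_MODEL_PATH value to the credential variables listed in the table.
# required_credentials is an illustrative helper name, not a NIM command.
required_credentials() {
    case "$1" in
        hf://*)         echo "HF_TOKEN" ;;
        ngc://*)        echo "NGC_API_KEY" ;;
        s3://*)         echo "AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY" ;;
        modelscope://*) echo "MODELSCOPE_API_TOKEN" ;;
        gs://*)         echo "GOOGLE_APPLICATION_CREDENTIALS" ;;
        /*)             echo "" ;;  # local absolute path: no credentials
        *)              echo "unrecognized model source: $1" >&2; return 1 ;;
    esac
}
```

Calling `required_credentials "$MODEL"` before `docker run` gives an early failure for a mistyped prefix instead of a container that exits during download.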
HuggingFace (Most Common)
export MODEL=hf://meta-llama/Llama-3.1-8B-Instruct
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-e HF_TOKEN=<your-token> \
-e NIM_MODEL_PATH=$MODEL \
-p 8000:8000 \
$NIM_LLM_IMAGE
Your Fine-Tuned Model from S3
export MODEL=s3://my-bucket/ml-models/my-fine-tuned-llama-70b
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-e NIM_MODEL_PATH=$MODEL \
-e AWS_ACCESS_KEY_ID=<your-key> \
-e AWS_SECRET_ACCESS_KEY=<your-secret> \
-e AWS_REGION=us-east-1 \
-p 8000:8000 \
$NIM_LLM_IMAGE
For S3-compatible storage (MinIO, Ceph):
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-e NIM_MODEL_PATH=s3://nim-models/my-model \
-e AWS_ACCESS_KEY_ID=<key> \
-e AWS_SECRET_ACCESS_KEY=<secret> \
-e AWS_ENDPOINT_URL=http://minio.internal:9000 \
-e AWS_S3_USE_PATH_STYLE=true \
-e AWS_REGION=us-east-1 \
-p 8000:8000 \
$NIM_LLM_IMAGE
Local Directory (Pre-Downloaded)
export MODEL=/mnt/models/my-fine-tuned-llama
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-v /mnt/models:/mnt/models \
-e NIM_MODEL_PATH=$MODEL \
-p 8000:8000 \
$NIM_LLM_IMAGE
No network access required. No credentials needed.
NGC Registry
export MODEL=ngc://nim/meta/llama-3.3-70b-instruct:hf
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-e NIM_MODEL_PATH=$MODEL \
-e NGC_API_KEY=<your-key> \
-p 8000:8000 \
$NIM_LLM_IMAGE
Step 3: Configure Model Profile
Model-free NIM generates profiles at runtime for your model. List them first:
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-e HF_TOKEN=<token> \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
$NIM_LLM_IMAGE \
list-model-profiles
Output:
- Compatible with system and runnable:
  - c214460d... (vllm-tp1-pp1-0bdd169f...) [requires >=13 GB/gpu]
- With LoRA support:
  - 289b03eb... (vllm-tp1-pp1-feat_lora-0bdd169f...) [requires >=13 GB/gpu]
Two ways to select a profile:
Option A: Set NIM_MODEL_PROFILE
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-e HF_TOKEN=<token> \
-e NIM_MODEL_PROFILE=vllm-bf16-tp2-pp1 \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-p 8000:8000 \
$NIM_LLM_IMAGE
Option B: vLLM CLI Arguments (Takes Precedence)
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-e HF_TOKEN=<token> \
-p 8000:8000 \
$NIM_LLM_IMAGE \
hf://meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 2
The vLLM positional argument (hf://...) can also replace NIM_MODEL_PATH. If both are provided, the CLI argument wins.
For detailed profile selection mechanics, see the NIM Model Profiles Guide.
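To avoid copying a profile name by hand, the names can be pulled out of the list-model-profiles output with a small filter. This sketch assumes every profile line follows the `- <id> (<name>) [...]` shape of the sample output above; the real output format may differ between NIM releases, and `profile_names` is a hypothetical helper.

```shell
#!/bin/sh
# Read list-model-profiles output on stdin and print the profile names
# (the text inside parentheses), one per line.
# Assumes each profile line looks like "- <id> (<name>) [...]";
# profile_names is an illustrative helper, not a NIM command.
profile_names() {
    sed -n 's/^[[:space:]]*-[[:space:]]*[^[:space:]]*[[:space:]]*(\([^)]*\)).*/\1/p'
}
```

Piping the listing command from this step into `profile_names` yields candidate values for NIM_MODEL_PROFILE. Section headers such as "With LoRA support:" are skipped because they contain no parenthesized name.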
Step 4: Test the Endpoint
curl -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "What is Kubernetes?"}],
"max_tokens": 256
}'
The endpoint is OpenAI-compatible. Any client that works with OpenAI’s API works with NIM.
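Because the container downloads and loads the model before it serves requests, the first curl call can fail simply because the server is not up yet. A minimal polling sketch follows, assuming the container exposes the OpenAI-style `/v1/models` listing route on the mapped port. The helper name `wait_for_endpoint` and its default attempt count and delay are illustrative choices.

```shell
#!/bin/sh
# Poll <base_url>/v1/models until it answers or the attempts run out.
# Usage: wait_for_endpoint <base_url> [attempts] [delay_seconds]
# wait_for_endpoint is an illustrative helper, not a NIM command.
wait_for_endpoint() {
    base_url="$1"
    attempts="${2:-60}"
    delay="${3:-5}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: treat HTTP errors as failure, -s: silent, -o: discard the body
        if curl -fs -o /dev/null "$base_url/v1/models"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "endpoint at $base_url not ready after $attempts attempts" >&2
    return 1
}
```

Once `wait_for_endpoint http://localhost:8000` succeeds, `curl -s http://localhost:8000/v1/models` shows the model ID the server expects in the `model` field of a chat request.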
Step 5: Deploy on Kubernetes
Single Node with Helm
# values.yaml
image:
  repository: nvcr.io/nim/nim-llm
  tag: "latest"
model:
  modelPath: "hf://meta-llama/Llama-3.1-8B-Instruct"
  hfTokenSecret: hf-token
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
persistence:
  enabled: true
  size: 50Gi
  accessMode: ReadWriteOnce
Create the token secret, then install the chart:
kubectl create secret generic hf-token --from-literal=HF_TOKEN=<your-token>
helm install nim-llm nim-llm/ -f values.yaml
Multi-Node with Custom Model
For models that need multiple nodes, combine model-free with multi-node deployment:
# values.yaml
image:
  repository: nvcr.io/nim/nim-llm
  tag: "latest"
model:
  modelPath: "hf://meta-llama/Llama-3.1-405B-Instruct"
  hfTokenSecret: hf-token
  ngcAPISecret: ngc-api
  jsonLogging: false  # Required for multi-node
multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 8
  pipelineParallelSize: 2
resources:
  limits:
    nvidia.com/gpu: 8
  requests:
    nvidia.com/gpu: 8
persistence:
  enabled: true
  size: 500Gi
  accessMode: ReadWriteMany
  storageClass: <rwx-storage-class>
See the NIM Multi-Node Deployment Guide for full details.
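The parallelism values in a multi-node setup have to account for every GPU: tensor-parallel size times pipeline-parallel size should equal the total GPU count. The check below is a sketch under one assumption that the source does not state: that `workers: 1` means one worker pod in addition to the leader, giving 2 nodes with 8 GPUs each. `check_parallelism` is a hypothetical helper.

```shell
#!/bin/sh
# Check that tensor-parallel x pipeline-parallel equals the GPUs available.
# Usage: check_parallelism <tp> <pp> <gpus_per_node> <nodes>
# check_parallelism is an illustrative helper, not part of the Helm chart.
check_parallelism() {
    world_size=$(( $1 * $2 ))
    total_gpus=$(( $3 * $4 ))
    if [ "$world_size" -eq "$total_gpus" ]; then
        echo "ok: $world_size GPUs"
    else
        echo "mismatch: TP*PP=$world_size but $total_gpus GPUs available" >&2
        return 1
    fi
}
```

With the values from the multi-node example, `check_parallelism 8 2 8 2` reports 16 GPUs, which matches 8 × 2.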
Air-Gapped Deployment
Model-free NIM supports air-gapped environments in two ways:
Local Path (Simplest)
Pre-stage the model and mount it:
# On a connected machine
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
--local-dir /mnt/models/llama-70b
# Transfer to air-gapped environment, then run
docker run --gpus=all \
-v /mnt/models:/mnt/models \
-e NIM_MODEL_PATH=/mnt/models/llama-70b \
-p 8000:8000 \
$NIM_LLM_IMAGE
This path needs no network access, no credentials, and no manifest generation.
Cached Manifest (Remote URI Redeployment)
If you first deployed with a remote URI, NIM caches the manifest:
1. First deploy (network-connected): Run with credentials and a remote URI. NIM downloads the model, generates the manifest, and saves it to NIM_CACHE_PATH.
2. Transfer: Move the cache volume to the air-gapped environment.
3. Redeploy (air-gapped): Mount the same cache. NIM finds the cached nim_runtime_manifest.yaml and skips regeneration. No credentials or network access are required.
To force manifest regeneration after an upstream model update, delete nim_runtime_manifest.yaml from the cache directory before restarting.
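Before moving a cache volume into the air-gapped environment, it is worth confirming that the manifest was actually written. This is a sketch: `find_manifest` is a hypothetical helper, and it searches the whole cache tree rather than assuming the file sits at the top level.

```shell
#!/bin/sh
# Print the path of nim_runtime_manifest.yaml inside a cache directory,
# or fail if the directory holds no cached manifest.
# find_manifest is an illustrative helper, not a NIM command.
find_manifest() {
    cache_dir="$1"
    [ -d "$cache_dir" ] || { echo "no such directory: $cache_dir" >&2; return 1; }
    manifest=$(find "$cache_dir" -type f -name nim_runtime_manifest.yaml | head -n 1)
    [ -n "$manifest" ] || { echo "no cached manifest under $cache_dir" >&2; return 1; }
    echo "$manifest"
}
```

Running `find_manifest "$LOCAL_NIM_CACHE"` on the connected machine gives the file to keep for redeployment, or the file to delete when a regeneration is wanted.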
S3 Environment Variables Reference
| Variable | Required | Purpose |
|---|---|---|
| AWS_ACCESS_KEY_ID | Yes | AWS access key |
| AWS_SECRET_ACCESS_KEY | Yes | AWS secret key |
| AWS_REGION | Yes | AWS region (e.g., us-east-1) |
| AWS_ENDPOINT_URL | S3-compatible only | Custom endpoint (MinIO, Ceph) |
| AWS_S3_USE_PATH_STYLE | S3-compatible only | Set true for path-style endpoints |
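A missing variable from this table tends to surface only as a failed download inside the container, so a pre-flight check in the launch script can save a restart cycle. The sketch below assumes the variables are exported in the calling shell. `missing_s3_vars` is a hypothetical helper, and the path-style note mirrors the MinIO example earlier rather than a documented requirement.

```shell
#!/bin/sh
# Print the names of required S3 variables that are unset or empty.
# Returns non-zero if any are missing. missing_s3_vars is an illustrative helper.
missing_s3_vars() {
    missing=""
    for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_REGION; do
        # printenv only sees exported variables
        if [ -z "$(printenv "$name")" ]; then
            missing="$missing $name"
        fi
    done
    # Assumption: a custom endpoint is normally paired with path-style addressing
    if [ -n "$(printenv AWS_ENDPOINT_URL)" ] && [ "$(printenv AWS_S3_USE_PATH_STYLE)" != "true" ]; then
        echo "note: AWS_ENDPOINT_URL is set but AWS_S3_USE_PATH_STYLE is not true" >&2
    fi
    [ -z "$missing" ] && return 0
    echo "$missing" | sed 's/^ //'
    return 1
}
```

A launch script can call `missing_s3_vars || exit 1` before `docker run` to stop early with the list of absent names.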
Complete Workflow: Fine-Tuned Model to Production
Here is the end-to-end process:
1. Fine-tune model (InstructLab, LoRA, full fine-tune)
│
2. Upload to storage (S3, HuggingFace, NFS)
│
3. List profiles: list-model-profiles
│
4. Test locally: docker run with NIM_MODEL_PATH
│
5. Deploy to Kubernetes: Helm chart with model.modelPath
│
6. (Optional) Multi-node: Enable multiNode for large models
│
7. Monitor: OpenAI-compatible health + metrics endpoints
Example: Fine-Tuned Llama 70B from S3
# Step 1: Upload fine-tuned model to S3
aws s3 sync ./my-fine-tuned-llama-70b s3://ml-models/llama-70b-ft-v2/
# Step 2: List profiles
docker run --gpus=all \
-e NIM_MODEL_PATH=s3://ml-models/llama-70b-ft-v2 \
-e AWS_ACCESS_KEY_ID=$AWS_KEY \
-e AWS_SECRET_ACCESS_KEY=$AWS_SECRET \
-e AWS_REGION=us-east-1 \
$NIM_LLM_IMAGE list-model-profiles
# Step 3: Deploy with TP=4 for 4× A100
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-e NIM_MODEL_PATH=s3://ml-models/llama-70b-ft-v2 \
-e AWS_ACCESS_KEY_ID=$AWS_KEY \
-e AWS_SECRET_ACCESS_KEY=$AWS_SECRET \
-e AWS_REGION=us-east-1 \
-p 8000:8000 \
$NIM_LLM_IMAGE \
--tensor-parallel-size 4
# Step 4: Test
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"llama-70b-ft-v2","messages":[{"role":"user","content":"Hello"}]}'
Troubleshooting
"Architecture not supported"
The model architecture must be supported by vLLM. Check the vLLM supported models list.
Model download hangs
- Verify credentials (HF_TOKEN, NGC_API_KEY, AWS_*)
- Check network connectivity from the container
- For large models, the initial download can take 30+ minutes. Check the container logs for progress.
Profile not found
If NIM_MODEL_PROFILE does not match any generated profile:
- Run list-model-profiles to see available profiles
- Use vLLM CLI arguments instead: --tensor-parallel-size N
Cache issues after model update
Delete the cached manifest to force regeneration:
rm "$LOCAL_NIM_CACHE/nim_runtime_manifest.yaml"
Related Resources
- NIM Model Profiles Guide
- NIM Multi-Node Deployment on Kubernetes
- NIM Multinode Inference (Docker)
- Run:ai Distributed Inference Tutorial
- On-Premises LLM Deployment
- NVIDIA GPU Operator on Kubernetes
- InstructLab Fine-Tuning Guide
- RHEL AI Tutorial
- Official Docs: Model-Free NIM
About the Author
I am Luca Berton, AI and Cloud Advisor. I help enterprises deploy custom and fine-tuned models into production with NVIDIA NIM.