NVIDIA NIM Model-Free Custom Model Deployment Guide 2026

Deploy your own fine-tuned or custom model with NVIDIA NIM. Model-free NIM supports HuggingFace, NGC, S3, local paths, and air-gapped environments.

Luca Berton

You fine-tuned a model. Now you need to serve it. NVIDIA NIM’s model-free mode lets you deploy any supported model using a single generic container — no model-specific NIM image required.

Only one container image has to pass security review. It serves Llama, Mistral, your fine-tuned variant, or any other model with a vLLM-supported architecture. Point it at HuggingFace, S3, NGC, or a local directory.

When to Use Model-Free NIM

| Scenario | Model-Free NIM | Model-Specific NIM |
|---|---|---|
| Fine-tuned custom model | ✅ Best choice | ❌ Not available |
| Day-zero newly released model | ✅ Immediate | ❌ Wait for NVIDIA to publish |
| One container for all models | ✅ Single image | ❌ Different image per model |
| Pre-optimized TensorRT-LLM | ❌ vLLM only | ✅ May include TRT-LLM profiles |
| NGC-curated profiles | ❌ Generic profiles | ✅ Pre-validated for specific hardware |

Key limitation: Model-free NIM uses vLLM as the backend. If a model architecture is unsupported by vLLM, it will not work with model-free NIM either.

Step 1: Set Up the NIM Container Image

Pull the generic NIM LLM container:

export NIM_LLM_IMAGE=nvcr.io/nim/nim-llm:latest

docker pull $NIM_LLM_IMAGE

Create a local cache directory to avoid re-downloading models on restart:

export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

Step 2: Choose Your Model Source

Model-free NIM supports six model sources:

| Prefix | Source | Example | Authentication |
|---|---|---|---|
| hf:// | HuggingFace Hub | hf://meta-llama/Llama-3.1-8B-Instruct | HF_TOKEN |
| ngc:// | NVIDIA NGC | ngc://nim/meta/llama-3.3-70b-instruct:hf | NGC_API_KEY |
| s3:// | AWS S3 / S3-compatible | s3://my-bucket/my-org/my-model | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY |
| modelscope:// | ModelScope Hub | modelscope://LLM-Research/Llama-3.2-1B-Instruct:d3e55134 | MODELSCOPE_API_TOKEN |
| gs:// | Google Cloud Storage | gs://my-bucket/my-org/my-model | GOOGLE_APPLICATION_CREDENTIALS |
| /absolute/path | Local directory | /mnt/models/my-fine-tuned-llama | None |
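
The prefix-to-credential mapping in the table can be checked before the container starts. The following is a minimal sketch in POSIX shell; `required_credentials` and `missing_credentials` are hypothetical helper names, not part of NIM, and the mapping simply mirrors the Authentication column above.

```shell
# Hypothetical helpers, not part of NIM: map a model source to the
# credential variables from the table and report which ones are unset.

# Print the credential variable names for a given NIM_MODEL_PATH value.
required_credentials() {
  case "$1" in
    hf://*)         echo "HF_TOKEN" ;;
    ngc://*)        echo "NGC_API_KEY" ;;
    s3://*)         echo "AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY" ;;
    modelscope://*) echo "MODELSCOPE_API_TOKEN" ;;
    gs://*)         echo "GOOGLE_APPLICATION_CREDENTIALS" ;;
    /*)             echo "" ;;  # local absolute path: no credentials
    *)              echo "unrecognized model source: $1" >&2; return 1 ;;
  esac
}

# Print the required variables that are unset or empty in the current shell.
missing_credentials() {
  _vars=$(required_credentials "$1") || return 1
  _missing=""
  for _var in $_vars; do
    # Look up the value of the variable whose name is stored in $_var.
    eval "_val=\${$_var:-}"
    [ -n "$_val" ] || _missing="$_missing $_var"
  done
  echo "${_missing# }"
}
```

Calling `missing_credentials "$MODEL"` prints nothing when every variable for that source is set, so a wrapper script can fail fast before pulling a large model.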

HuggingFace (Most Common)

export MODEL=hf://meta-llama/Llama-3.1-8B-Instruct

docker run --gpus=all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -e HF_TOKEN=<your-token> \
  -e NIM_MODEL_PATH=$MODEL \
  -p 8000:8000 \
  $NIM_LLM_IMAGE

Your Fine-Tuned Model from S3

export MODEL=s3://my-bucket/ml-models/my-fine-tuned-llama-70b

docker run --gpus=all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -e NIM_MODEL_PATH=$MODEL \
  -e AWS_ACCESS_KEY_ID=<your-key> \
  -e AWS_SECRET_ACCESS_KEY=<your-secret> \
  -e AWS_REGION=us-east-1 \
  -p 8000:8000 \
  $NIM_LLM_IMAGE

For S3-compatible storage (MinIO, Ceph):

docker run --gpus=all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -e NIM_MODEL_PATH=s3://nim-models/my-model \
  -e AWS_ACCESS_KEY_ID=<key> \
  -e AWS_SECRET_ACCESS_KEY=<secret> \
  -e AWS_ENDPOINT_URL=http://minio.internal:9000 \
  -e AWS_S3_USE_PATH_STYLE=true \
  -e AWS_REGION=us-east-1 \
  -p 8000:8000 \
  $NIM_LLM_IMAGE

Local Directory (Pre-Downloaded)

export MODEL=/mnt/models/my-fine-tuned-llama

docker run --gpus=all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v /mnt/models:/mnt/models \
  -e NIM_MODEL_PATH=$MODEL \
  -p 8000:8000 \
  $NIM_LLM_IMAGE

No network access required. No credentials needed.

NGC Registry

export MODEL=ngc://nim/meta/llama-3.3-70b-instruct:hf

docker run --gpus=all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -e NIM_MODEL_PATH=$MODEL \
  -e NGC_API_KEY=<your-key> \
  -p 8000:8000 \
  $NIM_LLM_IMAGE

Step 3: Configure Model Profile

Model-free NIM generates profiles at runtime for your model. List them first:

docker run --gpus=all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -e HF_TOKEN=<token> \
  -e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
  $NIM_LLM_IMAGE \
  list-model-profiles

Output:

- Compatible with system and runnable:
  - c214460d... (vllm-tp1-pp1-0bdd169f...) [requires >=13 GB/gpu]
  - With LoRA support:
    - 289b03eb... (vllm-tp1-pp1-feat_lora-0bdd169f...) [requires >=13 GB/gpu]

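There are two ways to select a profile. Profile names and IDs depend on the model and the available GPUs, so substitute a value that list-model-profiles reports on your system for the illustrative one in Option A: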

Option A: Set NIM_MODEL_PROFILE

docker run --gpus=all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -e HF_TOKEN=<token> \
  -e NIM_MODEL_PROFILE=vllm-bf16-tp2-pp1 \
  -e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
  -p 8000:8000 \
  $NIM_LLM_IMAGE

Option B: vLLM CLI Arguments (Takes Precedence)

docker run --gpus=all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -e HF_TOKEN=<token> \
  -p 8000:8000 \
  $NIM_LLM_IMAGE \
  hf://meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2

The vLLM positional argument (hf://...) can also replace NIM_MODEL_PATH. If both are provided, the CLI argument wins.

For detailed profile selection mechanics, see the NIM Model Profiles Guide.

Step 4: Test the Endpoint

curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is Kubernetes?"}],
    "max_tokens": 256
  }'

The endpoint is OpenAI-compatible. Any client that works with OpenAI’s API works with NIM.
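
Because the API follows the OpenAI schema, the model names the server accepts can be read from `/v1/models` rather than guessed. The sketch below assumes that route is exposed on the same port and parses the response with grep and sed so it needs no extra tools. `model_ids` is a hypothetical helper, the `modelperm-` filter reflects how vLLM labels permission entries in that response, and jq would be more robust where it is available.

```shell
# Hypothetical helper: read an OpenAI-style /v1/models response on stdin
# and print one model id per line.
model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/^"id"[[:space:]]*:[[:space:]]*"//; s/"$//' |
    grep -v '^modelperm-'   # drop permission entries, keep model names
}

# Usage against a running container (assumes the port mapping used above):
#   curl -s http://localhost:8000/v1/models | model_ids
```

The printed id is the value to pass as "model" in chat completion requests.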

Step 5: Deploy on Kubernetes

Single Node with Helm

# values.yaml
image:
  repository: nvcr.io/nim/nim-llm
  tag: "latest"

model:
  modelPath: "hf://meta-llama/Llama-3.1-8B-Instruct"
  hfTokenSecret: hf-token

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

persistence:
  enabled: true
  size: 50Gi
  accessMode: ReadWriteOnce

Create the token secret, then install the chart:

kubectl create secret generic hf-token --from-literal=HF_TOKEN=<your-token>
helm install nim-llm nim-llm/ -f values.yaml

Multi-Node with Custom Model

For models that need multiple nodes, combine model-free with multi-node deployment:

# values.yaml
image:
  repository: nvcr.io/nim/nim-llm
  tag: "latest"

model:
  modelPath: "hf://meta-llama/Llama-3.1-405B-Instruct"
  hfTokenSecret: hf-token
  ngcAPISecret: ngc-api
  jsonLogging: false  # Required for multi-node

multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 8
  pipelineParallelSize: 2

resources:
  limits:
    nvidia.com/gpu: 8
  requests:
    nvidia.com/gpu: 8

persistence:
  enabled: true
  size: 500Gi
  accessMode: ReadWriteMany
  storageClass: <rwx-storage-class>

See the NIM Multi-Node Deployment Guide for full details.
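
The parallelism values have to agree with the hardware: a deployment needs tensorParallelSize × pipelineParallelSize GPUs in total. The sketch below checks that arithmetic for the values above, assuming that workers: 1 means one worker node in addition to a leader node, so two nodes with 8 GPUs each. `check_parallelism` is a hypothetical helper, not a NIM or Helm command.

```shell
# Hypothetical helper: verify that TP x PP fits the GPUs in the cluster.
# Arguments: tensor-parallel size, pipeline-parallel size, node count, GPUs per node.
check_parallelism() {
  tp=$1; pp=$2; nodes=$3; gpus_per_node=$4
  needed=$((tp * pp))                 # GPUs the model is sharded across
  available=$((nodes * gpus_per_node))
  if [ "$needed" -le "$available" ]; then
    echo "ok: TP=$tp x PP=$pp needs $needed of $available GPUs"
  else
    echo "mismatch: TP x PP = $needed, but only $available GPUs are available" >&2
    return 1
  fi
}

# Values from the multi-node example: TP=8, PP=2, 2 nodes (assumed), 8 GPUs each.
check_parallelism 8 2 2 8
```

With these values the check passes with all 16 GPUs in use; dropping to a single node would leave the deployment 8 GPUs short.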

Air-Gapped Deployment

Model-free NIM supports air-gapped environments in two ways:

Local Path (Simplest)

Pre-stage the model and mount it:

# On a connected machine
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
  --local-dir /mnt/models/llama-70b

# Transfer to air-gapped environment, then run
docker run --gpus=all \
  -v /mnt/models:/mnt/models \
  -e NIM_MODEL_PATH=/mnt/models/llama-70b \
  -p 8000:8000 \
  $NIM_LLM_IMAGE

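No network access, no credentials, and no manifest generation are needed at run time. The NIM container image itself must already be present in the air-gapped environment, for example in an internal registry.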

Cached Manifest (Remote URI Redeployment)

If you first deployed with a remote URI, NIM caches the manifest:

  1. First deploy (network-connected): Run with credentials and the remote URI. NIM downloads the model, generates the manifest, and saves it under NIM_CACHE_PATH.

  2. Transfer: Move the cache volume to the air-gapped environment.

  3. Redeploy (air-gapped): Mount the same cache. NIM finds the cached nim_runtime_manifest.yaml and skips regeneration, so no credentials or network access are required.

To force manifest regeneration after an upstream model update, delete nim_runtime_manifest.yaml from the cache directory before restarting.
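
Before moving a cache volume offline, it can help to confirm that the manifest was actually written. The following is a minimal sketch; `airgap_ready` is a hypothetical helper, and it assumes the manifest sits at the top level of the cache directory that is mounted into the container.

```shell
# Hypothetical helper: check that a cache directory contains the runtime
# manifest NIM needs for an offline redeploy.
# Assumes nim_runtime_manifest.yaml is stored at the top level of the cache.
airgap_ready() {
  cache_dir=$1
  if [ -f "$cache_dir/nim_runtime_manifest.yaml" ]; then
    echo "manifest found: $cache_dir can be reused offline"
  else
    echo "no manifest in $cache_dir: run a network-connected deploy first" >&2
    return 1
  fi
}

# Usage: airgap_ready "$LOCAL_NIM_CACHE" && echo "safe to transfer"
```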

S3 Environment Variables Reference

| Variable | Required | Purpose |
|---|---|---|
| AWS_ACCESS_KEY_ID | Yes | AWS access key |
| AWS_SECRET_ACCESS_KEY | Yes | AWS secret key |
| AWS_REGION | Yes | AWS region (e.g., us-east-1) |
| AWS_ENDPOINT_URL | S3-compatible only | Custom endpoint (MinIO, Ceph) |
| AWS_S3_USE_PATH_STYLE | S3-compatible only | Set true for path-style endpoints |

Complete Workflow: Fine-Tuned Model to Production

Here is the end-to-end process:

1. Fine-tune model (InstructLab, LoRA, full fine-tune)

2. Upload to storage (S3, HuggingFace, NFS)

3. List profiles: list-model-profiles

4. Test locally: docker run with NIM_MODEL_PATH

5. Deploy to Kubernetes: Helm chart with model.modelPath

6. (Optional) Multi-node: Enable multiNode for large models

7. Monitor: health and metrics endpoints exposed alongside the OpenAI-compatible API

Example: Fine-Tuned Llama 70B from S3

# Step 1: Upload fine-tuned model to S3
aws s3 sync ./my-fine-tuned-llama-70b s3://ml-models/llama-70b-ft-v2/

# Step 2: List profiles
docker run --gpus=all \
  -e NIM_MODEL_PATH=s3://ml-models/llama-70b-ft-v2 \
  -e AWS_ACCESS_KEY_ID=$AWS_KEY \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET \
  -e AWS_REGION=us-east-1 \
  $NIM_LLM_IMAGE list-model-profiles

# Step 3: Deploy with TP=4 for 4× A100
docker run --gpus=all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -e NIM_MODEL_PATH=s3://ml-models/llama-70b-ft-v2 \
  -e AWS_ACCESS_KEY_ID=$AWS_KEY \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET \
  -e AWS_REGION=us-east-1 \
  -p 8000:8000 \
  $NIM_LLM_IMAGE \
  --tensor-parallel-size 4

# Step 4: Test
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"llama-70b-ft-v2","messages":[{"role":"user","content":"Hello"}]}'

Troubleshooting

"Architecture not supported"

The model architecture must be supported by vLLM. Check the vLLM supported models list.
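
Hugging Face format checkpoints declare their model class in the `architectures` field of config.json, and that class name is what to look for in the vLLM list. The following is a minimal sketch for reading it from a local model directory; `model_architectures` is a hypothetical helper, and the text-based parsing assumes a conventionally formatted config.json.

```shell
# Hypothetical helper: print the architecture class names declared in
# <model_dir>/config.json, one per line.
model_architectures() {
  # Join the file into one line so a multi-line JSON array is matched too.
  tr -d '\n' < "$1/config.json" |
    sed -n 's/.*"architectures"[[:space:]]*:[[:space:]]*\[\([^]]*\)\].*/\1/p' |
    tr -d '" \t' |
    tr ',' '\n'
}

# Usage: model_architectures /mnt/models/my-fine-tuned-llama
```

If the printed class name does not appear in the vLLM supported models list, model-free NIM cannot serve the checkpoint.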

Model download hangs

  • Verify credentials (HF_TOKEN, NGC_API_KEY, AWS_*)
  • Check network connectivity from the container
  • For large models, initial download can take 30+ minutes — check container logs for progress

Profile not found

If NIM_MODEL_PROFILE does not match any generated profile:

  • Run list-model-profiles to see available profiles
  • Use vLLM CLI arguments instead: --tensor-parallel-size N

Cache issues after model update

Delete the cached manifest to force regeneration:

rm "$LOCAL_NIM_CACHE/nim_runtime_manifest.yaml"

About the Author

I am Luca Berton, AI and Cloud Advisor. I help enterprises deploy custom and fine-tuned models into production with NVIDIA NIM.