NVIDIA NIM Model Profiles Selection Guide 2026

NIM Model Profiles: GPU Memory and Selection Guide

NIM model profiles control which precision, parallelism, and backend your LLM uses. This guide covers profile naming, memory classification, and selection chain priority.

Luca Berton
· 4 min read

Every NIM LLM container ships with a model manifest: a catalog of pre-validated configurations called profiles. When the container starts, NIM selects exactly one profile. That profile determines which model files get downloaded and how the inference backend launches.

Getting this right matters. Pick the wrong profile and you waste GPU memory, hit OOM errors, or run at half the throughput your hardware can deliver.

Profile Naming Convention

Profiles follow a predictable naming pattern:

<backend>-<precision>-tp<N>-pp<M>[-lora]
| Component | Values | Meaning |
| --- | --- | --- |
| Backend | vllm, sglang | Inference engine |
| Precision | bf16, fp8, mxfp4, nvfp4 | Quantization format |
| tp | 1, 2, 4, 8 | Tensor parallelism (number of GPUs) |
| pp | 1, 2 | Pipeline parallelism stages |
| -lora | present/absent | LoRA adapter support |

Examples:

  • vllm-bf16-tp1-pp1: BF16 on 1 GPU, no LoRA
  • vllm-fp8-tp4-pp1-lora: FP8 quantized across 4 GPUs with LoRA
  • sglang-h100-bf16-tp8-pp2: SGLang on 8 GPUs with 2-stage pipeline (used for DeepSeek-R1 multinode)

The naming tells you exactly what you are getting: precision, GPU count, and capabilities.
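To make the convention concrete, here is a minimal Python sketch that splits a profile description into its parts. The regular expression is inferred from the examples in this article (including the optional GPU segment in the SGLang example and the -feat_lora suffix that appears in the listing output); it is an assumption, not an official grammar, and `parse_profile` is a hypothetical helper name.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the examples in this article; not an official grammar.
PROFILE_RE = re.compile(
    r"(?P<backend>[a-z]+)-"
    r"(?:(?P<gpu>[a-z0-9]+)-)?"            # optional GPU segment, e.g. "h100"
    r"(?P<precision>bf16|fp8|mxfp4|nvfp4)-"
    r"tp(?P<tp>\d+)-pp(?P<pp>\d+)"
    r"(?P<lora>-(?:feat_)?lora)?"          # "-lora" or "-feat_lora"
)


class Profile(NamedTuple):
    backend: str
    gpu: Optional[str]
    precision: str
    tp: int
    pp: int
    lora: bool


def parse_profile(description: str) -> Profile:
    """Split a profile description into its components."""
    match = PROFILE_RE.fullmatch(description)
    if match is None:
        raise ValueError(f"unrecognized profile description: {description!r}")
    return Profile(
        backend=match["backend"],
        gpu=match["gpu"],
        precision=match["precision"],
        tp=int(match["tp"]),
        pp=int(match["pp"]),
        lora=match["lora"] is not None,
    )


print(parse_profile("sglang-h100-bf16-tp8-pp2"))
```

A description that does not follow the pattern raises ValueError, which is a reasonable signal to fall back to the container's own profile listing.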

Listing Available Profiles

Before deploying, check which profiles your container supports and which are compatible with your hardware:

docker run --rm --gpus=all \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest \
  list-model-profiles

Output groups profiles into three categories:

MODEL PROFILES
- Compatible with system and runnable:
  - dcec66a5... (vllm-bf16-tp1-pp1) [requires >=18 GB/gpu]
  - With LoRA support:
    - d66193b8... (vllm-bf16-tp1-pp1-feat_lora) [requires >=22 GB/gpu]

- Compatible with system but low memory:
  - a1b2c3d4... (vllm-bf16-tp1-pp1) [requires >=45 GB/gpu,
    try --max-model-len=4096 to reduce to >=30 GB/gpu]

- Incompatible with system:
  - 27af459c... (vllm-bf16-tp2-pp1)
  - 30d16624... (vllm-bf16-tp4-pp1)

Each profile shows:

  • Profile ID: a unique 64-character SHA hash (deterministic, version-safe)
  • Description: the human-readable name (vllm-bf16-tp1-pp1)
  • Memory annotation: estimated VRAM requirement per GPU
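If you want to consume this listing from a script, a small parser is enough. The sketch below assumes the output looks exactly like the sample above; the real format may differ between NIM releases, and `parse_listing` is a hypothetical helper, not part of NIM.

```python
import re

# Sample text copied from the listing above; real output may differ by release.
SAMPLE = """\
MODEL PROFILES
- Compatible with system and runnable:
  - dcec66a5... (vllm-bf16-tp1-pp1) [requires >=18 GB/gpu]
  - With LoRA support:
    - d66193b8... (vllm-bf16-tp1-pp1-feat_lora) [requires >=22 GB/gpu]

- Compatible with system but low memory:
  - a1b2c3d4... (vllm-bf16-tp1-pp1) [requires >=45 GB/gpu,
    try --max-model-len=4096 to reduce to >=30 GB/gpu]

- Incompatible with system:
  - 27af459c... (vllm-bf16-tp2-pp1)
  - 30d16624... (vllm-bf16-tp4-pp1)
"""

# Top-level category headers start at column 0.
CATEGORY_RE = re.compile(r"^- ((?:Compatible|Incompatible) with system.*):$")
# Profile entries: "<id> (<description>)" plus an optional memory annotation.
PROFILE_LINE_RE = re.compile(
    r"^\s*- (\S+) \(([^)]+)\)(?: \[requires >=(\d+) GB/gpu)?"
)


def parse_listing(text):
    """Group (profile_id, description, required_gb) tuples by category."""
    result = {}
    current = None
    for line in text.splitlines():
        header = CATEGORY_RE.match(line)
        if header:
            current = header.group(1)
            result[current] = []
            continue
        entry = PROFILE_LINE_RE.match(line)
        if entry and current is not None:
            profile_id, description, gb = entry.groups()
            result[current].append(
                (profile_id, description, int(gb) if gb else None)
            )
    return result


for category, profiles in parse_listing(SAMPLE).items():
    print(category, profiles)
```

The "With LoRA support" sub-header is deliberately ignored, so LoRA profiles stay in their parent category.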

Memory-Based Classification

NIM estimates GPU VRAM for each profile by analyzing model weights, KV cache, activations, and overhead. It then classifies each profile into one of three categories:

| Category | What It Means | What To Do |
| --- | --- | --- |
| Compatible | Estimated VRAM fits in available GPU memory | Deploy normally |
| Low memory | Weights fit, but full context length exceeds VRAM | Reduce --max-model-len (suggestion provided) |
| Incompatible | Weights alone exceed GPU memory | Use higher TP or quantized precision |

For low-memory profiles, NIM tells you exactly what to do:

[requires >=45 GB/gpu, try --max-model-len=4096 to reduce to >=30 GB/gpu]

Apply the suggestion:

docker run --rm -it --gpus=all \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest \
  --max-model-len 4096

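Reducing --max-model-len caps the maximum sequence length (input + output tokens) per request, which lowers the KV cache memory a request can need. It does not shrink the weights: a 70B model at BF16 carries roughly 140 GB of weights, so it needs at least two 80 GB GPUs even at 4096 context. At 128K context, it needs 4+ GPUs.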
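A back-of-the-envelope calculation shows why context length matters so much. The sketch below is not NIM's estimator: it ignores activations and runtime overhead, and the architecture numbers (80 layers, 8 KV heads, head dimension 128, 2-byte cache values) are the published Llama 3.1 70B configuration, used here as an assumption.

```python
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache for one sequence: a key and a value vector per layer per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 1e9


# Assumed Llama 3.1 70B shape: 80 layers, 8 KV heads, head_dim 128.
print(weight_gb(70, 2))                    # 140 GB of BF16 weights
print(kv_cache_gb(4096, 80, 8, 128))       # about 1.3 GB for one 4K sequence
print(kv_cache_gb(131072, 80, 8, 128))     # about 43 GB for one 128K sequence
```

The weights are fixed, while the per-sequence KV cache grows linearly with context, which is why lowering --max-model-len can move a profile out of the low-memory category.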

The Selection Chain

NIM uses a priority-ordered selection chain. The first selector that produces a match wins:

| Priority | Selector | Trigger | Behavior |
| --- | --- | --- | --- |
| 1 (highest) | Default profile | NIM_MODEL_PROFILE="default" | Picks best compatible profile using backend priority |
| 2 | Explicit profile | NIM_MODEL_PROFILE=<id-or-name> | Matches exact profile by SHA or description |
| 3 | Memory-aware | (automatic) | Filters profiles by VRAM fit, prefers non-LoRA unless LoRA enabled |
| 4 (lowest) | Manifest | (no env var) | Uses profile_selection_criteria from manifest based on hardware |
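The "first match wins" idea is a plain chain of selectors. The Python sketch below illustrates the pattern only; it is not NIM's implementation. The two toy selectors, the context dictionary, and the memory figures are hypothetical, and the default and manifest selectors are left out for brevity.

```python
from typing import Callable, Optional, Sequence

# A selector inspects the context and returns a profile name, or None to pass.
Selector = Callable[[dict], Optional[str]]


def explicit_selector(ctx: dict) -> Optional[str]:
    """Honor NIM_MODEL_PROFILE when it names a known profile."""
    value = ctx.get("NIM_MODEL_PROFILE")
    if value and value != "default" and value in ctx["profiles"]:
        return value
    return None


def memory_aware_selector(ctx: dict) -> Optional[str]:
    """Pick the first profile whose estimated VRAM fits the GPU."""
    for name, required_gb in ctx["profiles"].items():
        if required_gb <= ctx["vram_gb"]:
            return name
    return None


def select_profile(selectors: Sequence[Selector], ctx: dict) -> str:
    """Run selectors in priority order; the first non-None answer wins."""
    for selector in selectors:
        choice = selector(ctx)
        if choice is not None:
            return choice
    raise LookupError("no selector produced a profile")


# Hypothetical profiles and per-GPU memory estimates, for illustration only.
context = {
    "profiles": {"vllm-bf16-tp1-pp1": 18, "vllm-fp8-tp4-pp1": 12},
    "vram_gb": 24,
}
chain = [explicit_selector, memory_aware_selector]
print(select_profile(chain, context))
print(select_profile(chain, {**context, "NIM_MODEL_PROFILE": "vllm-fp8-tp4-pp1"}))
```

Because each selector either answers or passes, adding or removing a stage does not change the others.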

Automatic Selection (Most Common)

If you do not set NIM_MODEL_PROFILE, NIM automatically picks the best profile for your hardware:

docker run --rm -it --gpus=all \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

NIM evaluates GPU device, available VRAM, and parallelism constraints. For most deployments, this is sufficient.

Intelligent Default

Setting NIM_MODEL_PROFILE="default" triggers a slightly different path that uses backend priority ordering:

docker run --rm -it --gpus=all \
  -e NIM_MODEL_PROFILE="default" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

Explicit Selection by Profile ID

For production, pin to a specific profile ID. This is deterministic and version-safe, so the profile still matches even if tags change in a future release:

docker run --rm -it --gpus=all \
  -e NIM_MODEL_PROFILE="70edb8bb9f8511ce2ea195e3caebcc3c7191dc27fea0c8d4acf9c0d9a69e43cd" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

Explicit Selection by Description

If the value is not a valid profile ID, NIM tries to match it against descriptions:

docker run --rm -it --gpus=all \
  -e NIM_MODEL_PROFILE="vllm-fp8-tp4-pp1" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

This is what we use in the Run:ai distributed inference tutorial with NIM_MODEL_PROFILE=sglang-h100-bf16-tp8-pp2 for DeepSeek-R1.
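The three kinds of NIM_MODEL_PROFILE values can be told apart with a simple check. This is a sketch under the assumption that profile IDs are 64 lowercase hexadecimal characters, as in the example above; `classify_profile_value` is a hypothetical helper, and NIM's own validation may be stricter.

```python
import re

# Assumption: a profile ID is exactly 64 lowercase hex characters.
PROFILE_ID_RE = re.compile(r"[0-9a-f]{64}")


def classify_profile_value(value: str) -> str:
    """Return 'default', 'id', or 'description' for a NIM_MODEL_PROFILE value."""
    if value == "default":
        return "default"
    if PROFILE_ID_RE.fullmatch(value):
        return "id"
    return "description"


print(classify_profile_value(
    "70edb8bb9f8511ce2ea195e3caebcc3c7191dc27fea0c8d4acf9c0d9a69e43cd"
))
print(classify_profile_value("vllm-fp8-tp4-pp1"))
```

A check like this is handy in deployment tooling that should reject description-based pins in production manifests.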

Configuration Precedence

Backend-native arguments always override profile settings:

Backend CLI args (highest) > NIM_MODEL_PROFILE (lower)

If a profile specifies tp=2 but you pass --tensor-parallel-size 4, the backend launches with TP=4. NIM resolves the overridden values before model download, so the downloaded model files always match the actual launch configuration.
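The precedence rule amounts to a dictionary merge in which the later source wins. The sketch below only models that rule; the key names are illustrative and `effective_config` is a hypothetical helper, not a NIM function.

```python
def effective_config(profile: dict, cli_overrides: dict) -> dict:
    """Merge settings so that backend CLI arguments override the profile."""
    return {**profile, **cli_overrides}


# Illustrative keys: the profile asks for TP=2, the CLI asks for TP=4.
profile_settings = {"tensor_parallel_size": 2, "pipeline_parallel_size": 1}
cli_args = {"tensor_parallel_size": 4}

print(effective_config(profile_settings, cli_args))
```

Settings the CLI does not mention, such as the pipeline size here, keep their profile values.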

Common vLLM CLI Overrides

| Argument | Purpose | Default |
| --- | --- | --- |
| --tensor-parallel-size | Number of tensor-parallel GPUs | 1 |
| --pipeline-parallel-size | Number of pipeline-parallel stages | 1 |
| --max-model-len | Maximum sequence length | Model default |
| --enable-lora | Enable LoRA adapter support | Disabled |

Example, overriding tensor parallelism:

docker run --rm -it --gpus=all \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest \
  --tensor-parallel-size 2

Precision Selection Guide

Choosing the right precision is a tradeoff between quality, memory, and throughput:

| Precision | Memory | Quality | Throughput | When to Use |
| --- | --- | --- | --- | --- |
| bf16 | Baseline | Best | Baseline | Quality-critical, enough VRAM |
| fp8 | ~50% of bf16 | Near-identical | 1.5-2x faster | Default for production: best balance |
| mxfp4 | ~25% of bf16 | Slight degradation | 2-3x faster | High throughput, quality-tolerant |
| nvfp4 | ~25% of bf16 | Better than mxfp4 | 2-3x faster | NVIDIA's optimized 4-bit format |

Recommendation: Use FP8 for production. It roughly halves weight memory relative to BF16 with negligible quality loss, so about twice as many models fit per GPU.
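The memory column follows directly from bytes per parameter. The sketch below estimates the smallest tensor-parallel size whose per-GPU weight shard fits a budget. It is a rough bound, not NIM's classifier: the 4-bit figure ignores scaling-factor overhead, and the 80 GB GPU size and the 70% share of VRAM reserved for weights are assumptions chosen for illustration.

```python
# Nominal bytes per parameter; 4-bit formats ignore scale-factor overhead.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "mxfp4": 0.5, "nvfp4": 0.5}


def min_tp(params_billions: float, precision: str,
           gpu_gb: float = 80.0, weight_fraction: float = 0.7):
    """Smallest TP in (1, 2, 4, 8) whose weight shard fits the per-GPU budget.

    weight_fraction is an assumed share of VRAM for weights; the rest is
    left for KV cache, activations, and overhead. Returns None if even
    TP=8 does not fit.
    """
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    budget_gb = gpu_gb * weight_fraction
    for tp in (1, 2, 4, 8):
        if weights_gb / tp <= budget_gb:
            return tp
    return None


for precision in ("bf16", "fp8", "nvfp4"):
    print(precision, min_tp(70, precision))
```

Under these assumptions a 70B model needs TP=4 at bf16, TP=2 at fp8, and TP=1 at 4-bit precision, which mirrors the 100% / 50% / 25% memory column.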

Practical Decision Tree

Do you need LoRA adapters?
├── Yes → Select a -lora profile
└── No → Continue

How many GPUs per node?
├── 1 GPU → tp1
├── 2 GPUs → tp2
├── 4 GPUs → tp4
└── 8 GPUs → tp8

Does the model fit at bf16?
├── Yes (comfortable) → bf16
├── Yes (tight) → fp8 for headroom
└── No → fp8 or fp4

Multi-node needed? (model exceeds single node)
├── Yes → Set pp>1 + NIM_MULTI_NODE=1
└── No → pp1

Production deployment?
├── Yes → Pin profile by ID
└── Dev/test → Automatic selection is fine
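The first four questions of the tree can be encoded as a small function that proposes a profile description. This is a sketch with assumptions: `suggest_profile` is a hypothetical helper, it picks fp8 whenever bf16 is not a comfortable fit, it uses pp2 for multi-node, and the resulting name still has to exist in your container's profile listing.

```python
def suggest_profile(needs_lora: bool, gpus_per_node: int, bf16_fit: str,
                    multi_node: bool, backend: str = "vllm") -> str:
    """Propose a profile description from the decision tree above.

    bf16_fit is one of "comfortable", "tight", or "no".
    """
    if gpus_per_node not in (1, 2, 4, 8):
        raise ValueError("gpus_per_node must be 1, 2, 4, or 8")
    if bf16_fit not in ("comfortable", "tight", "no"):
        raise ValueError("bf16_fit must be 'comfortable', 'tight', or 'no'")

    # Tight or no fit at bf16: fall back to fp8 (fp4 is the next step down).
    precision = "bf16" if bf16_fit == "comfortable" else "fp8"
    pp = 2 if multi_node else 1          # assumption: two pipeline stages
    suffix = "-lora" if needs_lora else ""
    return f"{backend}-{precision}-tp{gpus_per_node}-pp{pp}{suffix}"


print(suggest_profile(needs_lora=True, gpus_per_node=4,
                      bf16_fit="tight", multi_node=False))
```

Treat the result as a candidate to look up with list-model-profiles, then pin the matching profile ID for production.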

Changes from NIM 1.x

If you are migrating from NIM LLM 1.x, these features are removed:

| Removed Feature | 1.x Usage |
| --- | --- |
| Custom profile selectors | NIM_CUSTOM_SELECTOR_CLASSES="my_selector.MyCustomSelector" |
| Backend priority chain | Automatic TensorRT-LLM > vLLM > SGLang |
| Tag-based selector | NIM_TAGS_SELECTOR="llm_engine=vllm,tp=1" |

Replace all of these with NIM_MODEL_PROFILE using a profile ID or description.
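During a migration it helps to flag leftover 1.x settings before a deployment goes out. The sketch below checks an environment mapping for the two removed variables named in the table; `find_removed_settings` is a hypothetical helper, not part of NIM.

```python
import os

# Environment variables listed as removed in the table above.
REMOVED_VARS = ("NIM_CUSTOM_SELECTOR_CLASSES", "NIM_TAGS_SELECTOR")


def find_removed_settings(env) -> list:
    """Return the removed 1.x variables that are still set in env."""
    return [name for name in REMOVED_VARS if name in env]


# Check the current process environment.
leftovers = find_removed_settings(os.environ)
if leftovers:
    print("Replace with NIM_MODEL_PROFILE:", ", ".join(leftovers))
```

The same function works on any mapping, for example the env section of a parsed deployment manifest.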

About the Author

I am Luca Berton, AI and Cloud Advisor. I design GPU inference platforms for enterprises deploying large language models.
