Every NIM LLM container ships with a model manifest: a catalog of pre-validated configurations called profiles. When the container starts, NIM selects exactly one profile. That profile determines which model files get downloaded and how the inference backend launches.
Getting this right matters. Pick the wrong profile and you waste GPU memory, hit OOM errors, or run at half the throughput your hardware can deliver.
Profile Naming Convention
Profiles follow a predictable naming pattern:
vllm-<precision>-tp<N>-pp1[-lora]

| Component | Values | Meaning |
|---|---|---|
| Backend | vllm, sglang | Inference engine |
| Precision | bf16, fp8, mxfp4, nvfp4 | Quantization format |
| tp | 1, 2, 4, 8 | Tensor parallelism (number of GPUs) |
| pp | 1, 2 | Pipeline parallelism stages |
| -lora | present/absent | LoRA adapter support |
Examples:
- vllm-bf16-tp1-pp1: BF16 on 1 GPU, no LoRA
- vllm-fp8-tp4-pp1-lora: FP8 quantized across 4 GPUs with LoRA
- sglang-h100-bf16-tp8-pp2: SGLang on 8 GPUs with 2-stage pipeline (used for DeepSeek-R1 multinode)
The naming tells you exactly what you are getting: precision, GPU count, and capabilities.
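If you script around profile names, the pattern can be split into its components with a regular expression. This is a minimal sketch, not NIM's own parser: `ProfileName` and `parse_profile_name` are hypothetical names, and the optional GPU segment and the `-feat_lora` suffix are assumptions inferred from the example names in this article.

```python
import re
from typing import NamedTuple, Optional


class ProfileName(NamedTuple):
    backend: str
    gpu: Optional[str]  # e.g. "h100"; None when the name has no GPU segment
    precision: str
    tp: int
    pp: int
    lora: bool


# Assumed grammar: <backend>[-<gpu>]-<precision>-tp<N>-pp<M>[-lora|-feat_lora]
_PROFILE_RE = re.compile(
    r"^(?P<backend>vllm|sglang)"
    r"(?:-(?P<gpu>[a-z0-9]+))?"
    r"-(?P<precision>bf16|fp8|mxfp4|nvfp4)"
    r"-tp(?P<tp>\d+)"
    r"-pp(?P<pp>\d+)"
    r"(?P<lora>-(?:feat_)?lora)?$"
)


def parse_profile_name(name: str) -> ProfileName:
    """Split a profile description into backend, precision, parallelism, and LoRA flag."""
    match = _PROFILE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized profile name: {name!r}")
    return ProfileName(
        backend=match["backend"],
        gpu=match["gpu"],
        precision=match["precision"],
        tp=int(match["tp"]),
        pp=int(match["pp"]),
        lora=match["lora"] is not None,
    )


print(parse_profile_name("vllm-fp8-tp4-pp1-lora"))
print(parse_profile_name("sglang-h100-bf16-tp8-pp2"))
```

Names that do not fit the assumed grammar raise `ValueError`, which is safer than guessing at a component.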
Listing Available Profiles
Before deploying, check which profiles your container supports and which are compatible with your hardware:
docker run --rm --gpus=all \
nvcr.io/nim/meta/llama-3.1-70b-instruct:latest \
list-model-profiles

The output groups profiles into three categories:
MODEL PROFILES
- Compatible with system and runnable:
- dcec66a5... (vllm-bf16-tp1-pp1) [requires >=18 GB/gpu]
- With LoRA support:
- d66193b8... (vllm-bf16-tp1-pp1-feat_lora) [requires >=22 GB/gpu]
- Compatible with system but low memory:
- a1b2c3d4... (vllm-bf16-tp1-pp1) [requires >=45 GB/gpu,
try --max-model-len=4096 to reduce to >=30 GB/gpu]
- Incompatible with system:
- 27af459c... (vllm-bf16-tp2-pp1)
- 30d16624... (vllm-bf16-tp4-pp1)

Each profile shows:
- Profile ID: a unique 64-character SHA hash (deterministic, version-safe)
- Description: the human-readable name (vllm-bf16-tp1-pp1)
- Memory annotation: estimated VRAM requirement per GPU
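If you want to consume this listing from a script, for example to pick a profile in CI, a few regular expressions are enough. The sketch below assumes the output looks exactly like the sample above; the real format may differ in spacing or wording, and `parse_profiles` is a hypothetical helper, not part of NIM.

```python
import re

# Sample text copied from the listing above; real output may differ.
SAMPLE_OUTPUT = """\
MODEL PROFILES
- Compatible with system and runnable:
- dcec66a5... (vllm-bf16-tp1-pp1) [requires >=18 GB/gpu]
- With LoRA support:
- d66193b8... (vllm-bf16-tp1-pp1-feat_lora) [requires >=22 GB/gpu]
- Compatible with system but low memory:
- a1b2c3d4... (vllm-bf16-tp1-pp1) [requires >=45 GB/gpu,
try --max-model-len=4096 to reduce to >=30 GB/gpu]
- Incompatible with system:
- 27af459c... (vllm-bf16-tp2-pp1)
- 30d16624... (vllm-bf16-tp4-pp1)
"""

# A category line is a bullet that ends in ":" and has no parentheses or brackets.
CATEGORY_RE = re.compile(r"^-\s+(?P<category>[^()\[\]]+):$")
# A profile line is "- <id> (<description>)" with an optional memory annotation.
PROFILE_RE = re.compile(
    r"^-\s+(?P<id>\S+)\s+\((?P<description>[^)]+)\)"
    r"(?:\s+\[requires >=(?P<gb>\d+) GB/gpu)?"
)


def parse_profiles(text):
    """Return one dict per profile line, tagged with the most recent category header."""
    category = None
    profiles = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = CATEGORY_RE.match(line)
        if header:
            category = header["category"]
            continue
        entry = PROFILE_RE.match(line)
        if entry:
            gb = entry["gb"]
            profiles.append({
                "id": entry["id"],
                "description": entry["description"],
                "min_gb": int(gb) if gb is not None else None,
                "category": category,
            })
    return profiles


for profile in parse_profiles(SAMPLE_OUTPUT):
    print(profile)
```

Continuation lines such as the `try --max-model-len=...` hint are skipped because they match neither pattern.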
Memory-Based Classification
NIM estimates the GPU VRAM each profile needs by analyzing model weights, KV cache, activations, and overhead. It then assigns each profile to one of three categories:
| Category | What It Means | What To Do |
|---|---|---|
| Compatible | Estimated VRAM fits in available GPU memory | Deploy normally |
| Low memory | Weights fit, but full context length exceeds VRAM | Reduce --max-model-len (suggestion provided) |
| Incompatible | Weights alone exceed GPU memory | Use higher TP or quantized precision |
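The three categories reduce to two comparisons. The following sketch restates the table as code; it is not NIM's estimator, and `classify_profile` with its per-GPU inputs is a hypothetical helper that expects you to supply the estimates yourself.

```python
def classify_profile(weights_gb: float, full_context_gb: float, gpu_gb: float) -> str:
    """Classify a profile from per-GPU estimates.

    weights_gb:      memory for the model weights alone
    full_context_gb: total estimate at the full context length
                     (weights + KV cache + activations + overhead)
    gpu_gb:          memory available on each GPU
    """
    if weights_gb > gpu_gb:
        # Even the weights do not fit: raise TP or use a quantized precision.
        return "incompatible"
    if full_context_gb > gpu_gb:
        # Weights fit, full context does not: reduce --max-model-len.
        return "low memory"
    return "compatible"


print(classify_profile(weights_gb=16, full_context_gb=18, gpu_gb=24))
print(classify_profile(weights_gb=16, full_context_gb=45, gpu_gb=40))
print(classify_profile(weights_gb=140, full_context_gb=160, gpu_gb=80))
```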
For low-memory profiles, NIM tells you exactly what to do:
[requires >=45 GB/gpu, try --max-model-len=4096 to reduce to >=30 GB/gpu]

Apply the suggestion:
docker run --rm -it --gpus=all \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-70b-instruct:latest \
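--max-model-len 4096

Reducing --max-model-len limits the maximum sequence length (input + output tokens) per request, which shrinks the KV cache that must be reserved. It does not shrink the weights: a 70B model at BF16 holds roughly 140 GB of weights (70 billion parameters × 2 bytes), which already exceeds a single A100 80GB, so even short contexts need multiple GPUs or a quantized profile. At 128K context, it needs 4+ GPUs.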
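To see why context length moves the memory estimate so much, you can size the KV cache by hand. This is a back-of-the-envelope sketch, not NIM's estimator: the architecture values (80 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Llama 3.1 70B configuration, and the calculation ignores activations and runtime overhead.

```python
def kv_cache_gib(
    context_tokens: int,
    num_layers: int = 80,      # assumed: Llama 3.1 70B
    num_kv_heads: int = 8,     # assumed: grouped-query attention with 8 KV heads
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # BF16 cache
) -> float:
    """KV cache size in GiB for one sequence of the given length."""
    # The factor 2 covers storing both a key and a value vector per layer.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 2**30


for tokens in (4096, 131072):
    print(f"{tokens} tokens: {kv_cache_gib(tokens)} GiB")
```

With these assumed values, a single 4096-token sequence needs 1.25 GiB of KV cache while a single 131072-token sequence needs 40 GiB, before counting concurrent requests.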
The Selection Chain
NIM uses a priority-ordered selection chain. The first selector that produces a match wins:
| Priority | Selector | Trigger | Behavior |
|---|---|---|---|
| 1 (highest) | Default profile | NIM_MODEL_PROFILE="default" | Picks best compatible profile using backend priority |
| 2 | Explicit profile | NIM_MODEL_PROFILE=<id-or-name> | Matches exact profile by SHA or description |
| 3 | Memory-aware | (automatic) | Filters profiles by VRAM fit, prefers non-LoRA unless LoRA enabled |
| 4 (lowest) | Manifest | (no env var) | Uses profile_selection_criteria from manifest based on hardware |
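A priority-ordered chain like this is a list of selector functions tried in order, where the first non-empty result wins. The sketch below mirrors the table under loud assumptions: every function name, the profile dictionary shape (`id`, `description`, `compatible`, `lora`), and the placeholder logic inside each selector are hypothetical and do not reflect NIM's internals.

```python
from typing import Optional


def select_default(setting: Optional[str], profiles: list) -> Optional[dict]:
    """Priority 1: fires only when the setting is the literal "default"."""
    if setting != "default":
        return None
    # Placeholder for backend-priority ordering: take the first compatible profile.
    return next((p for p in profiles if p["compatible"]), None)


def select_explicit(setting: Optional[str], profiles: list) -> Optional[dict]:
    """Priority 2: exact match on profile ID or description."""
    if not setting:
        return None
    return next((p for p in profiles if setting in (p["id"], p["description"])), None)


def select_memory_aware(setting: Optional[str], profiles: list) -> Optional[dict]:
    """Priority 3: profiles that fit in VRAM, non-LoRA first."""
    fitting = sorted((p for p in profiles if p["compatible"]), key=lambda p: p["lora"])
    return fitting[0] if fitting else None


def select_from_manifest(setting: Optional[str], profiles: list) -> Optional[dict]:
    """Priority 4: placeholder for the manifest's own selection criteria."""
    return profiles[0] if profiles else None


SELECTION_CHAIN = [select_default, select_explicit, select_memory_aware, select_from_manifest]


def select_profile(setting: Optional[str], profiles: list) -> dict:
    """Return the result of the first selector in the chain that produces a match."""
    for selector in SELECTION_CHAIN:
        chosen = selector(setting, profiles)
        if chosen is not None:
            return chosen
    raise LookupError("no profile matched")


EXAMPLE_PROFILES = [
    {"id": "aaa", "description": "vllm-bf16-tp1-pp1-lora", "compatible": True, "lora": True},
    {"id": "bbb", "description": "vllm-bf16-tp1-pp1", "compatible": True, "lora": False},
    {"id": "ccc", "description": "vllm-bf16-tp2-pp1", "compatible": False, "lora": False},
]

print(select_profile(None, EXAMPLE_PROFILES)["description"])
print(select_profile("vllm-bf16-tp2-pp1", EXAMPLE_PROFILES)["description"])
```

In this sketch an unmatched explicit value silently falls through to the next selector; a real deployment would more likely want a hard error there.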
Automatic Selection (Most Common)
If you do not set NIM_MODEL_PROFILE, NIM automatically picks the best profile for your hardware:
docker run --rm -it --gpus=all \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

NIM evaluates the GPU device, available VRAM, and parallelism constraints. For most deployments, this is sufficient.
Intelligent Default
Setting NIM_MODEL_PROFILE="default" triggers a slightly different path that uses backend priority ordering:
docker run --rm -it --gpus=all \
-e NIM_MODEL_PROFILE="default" \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

Explicit Selection by Profile ID
For production, pin to a specific profile ID. Pinning is deterministic and version-safe, so the profile is guaranteed to match even if tags change in a future release:
docker run --rm -it --gpus=all \
-e NIM_MODEL_PROFILE="70edb8bb9f8511ce2ea195e3caebcc3c7191dc27fea0c8d4acf9c0d9a69e43cd" \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

Explicit Selection by Description
If the value is not a valid profile ID, NIM tries to match it against descriptions:
docker run --rm -it --gpus=all \
-e NIM_MODEL_PROFILE="vllm-fp8-tp4-pp1" \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

This is what we use in the Run:ai distributed inference tutorial, which sets NIM_MODEL_PROFILE=sglang-h100-bf16-tp8-pp2 for DeepSeek-R1.
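Before you deploy, it can help to check how a given NIM_MODEL_PROFILE value will be interpreted. The sketch below assumes a profile ID is exactly 64 lowercase hexadecimal characters, as in the pinned example in this article; `classify_profile_setting` is a hypothetical helper, not a NIM API.

```python
import re
from typing import Optional

# Assumption: profile IDs are 64 lowercase hex characters (a SHA-256-style digest).
_PROFILE_ID_RE = re.compile(r"[0-9a-f]{64}")


def classify_profile_setting(value: Optional[str]) -> str:
    """Report which selection path a NIM_MODEL_PROFILE value would take."""
    if not value:
        return "automatic"
    if value == "default":
        return "default"
    if _PROFILE_ID_RE.fullmatch(value):
        return "id"
    # Anything else is treated as a human-readable description.
    return "description"


print(classify_profile_setting(None))
print(classify_profile_setting("default"))
print(classify_profile_setting("vllm-fp8-tp4-pp1"))
print(classify_profile_setting(
    "70edb8bb9f8511ce2ea195e3caebcc3c7191dc27fea0c8d4acf9c0d9a69e43cd"
))
```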
Configuration Precedence
Backend-native arguments always override profile settings:
Backend CLI args (highest) > NIM_MODEL_PROFILE (lower)

If a profile specifies tp=2 but you pass --tensor-parallel-size 4, the backend launches with TP=4. NIM resolves the overridden values before model download, so the downloaded model files always match the actual launch configuration.
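The precedence rule amounts to a dictionary merge in which CLI values are applied last. A minimal sketch, assuming settings are plain key-value pairs; `resolve_launch_config` and the key names are hypothetical and only illustrate the ordering.

```python
def resolve_launch_config(profile_settings: dict, cli_overrides: dict) -> dict:
    """Merge profile settings with backend CLI arguments; CLI values win."""
    resolved = dict(profile_settings)  # start from the profile's values
    # Apply only the arguments the user actually passed (None means "not set").
    resolved.update({k: v for k, v in cli_overrides.items() if v is not None})
    return resolved


profile = {"tensor_parallel_size": 2, "pipeline_parallel_size": 1}
cli = {"tensor_parallel_size": 4, "max_model_len": None}
print(resolve_launch_config(profile, cli))
```

Resolving the merged values first, and only then deciding what to download, is what keeps the model files consistent with the launch configuration.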
Common vLLM CLI Overrides
| Argument | Purpose | Default |
|---|---|---|
| --tensor-parallel-size | Number of tensor-parallel GPUs | 1 |
| --pipeline-parallel-size | Number of pipeline-parallel stages | 1 |
| --max-model-len | Maximum sequence length | Model default |
| --enable-lora | Enable LoRA adapter support | Disabled |
For example, to override tensor parallelism:
docker run --rm -it --gpus=all \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-70b-instruct:latest \
--tensor-parallel-size 2

Precision Selection Guide
Choosing the right precision is a tradeoff between quality, memory, and throughput:
| Precision | Memory | Quality | Throughput | When to Use |
|---|---|---|---|---|
| bf16 | Baseline | Best | Baseline | Quality-critical, enough VRAM |
| fp8 | ~50% of bf16 | Near-identical | 1.5-2x faster | Default for production; best balance |
| mxfp4 | ~25% of bf16 | Slight degradation | 2-3x faster | High throughput, quality-tolerant |
| nvfp4 | ~25% of bf16 | Better than mxfp4 | 2-3x faster | NVIDIA's optimized 4-bit format |
Recommendation: Use FP8 for production. It halves memory requirements with negligible quality loss, effectively doubling the models you can serve per GPU.
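The memory column follows directly from bytes per parameter. The sketch below computes weight memory only; it assumes 2 bytes for bf16, 1 for fp8, and 0.5 for the 4-bit formats, and it ignores the small extra cost of the scale factors that block-scaled formats store, as well as KV cache and activations. `weight_memory_gb` is a hypothetical helper.

```python
# Assumed storage cost per parameter, in bytes.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "mxfp4": 0.5, "nvfp4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (10^9 bytes) for a model of the given size."""
    # 1e9 parameters x bytes per parameter / 1e9 bytes per GB cancels out.
    return params_billion * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    print(precision, weight_memory_gb(70, precision), "GB")
```

For a 70B model this gives 140 GB at bf16, 70 GB at fp8, and 35 GB at 4-bit, matching the ~50% and ~25% figures in the table.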
Practical Decision Tree
Do you need LoRA adapters?
├── Yes → Select a -lora profile
└── No → Continue
How many GPUs per node?
├── 1 GPU → tp1
├── 2 GPUs → tp2
├── 4 GPUs → tp4
└── 8 GPUs → tp8
Does the model fit at bf16?
├── Yes (comfortable) → bf16
├── Yes (tight) → fp8 for headroom
└── No → fp8 or fp4
Multi-node needed? (model exceeds single node)
├── Yes → Set pp>1 + NIM_MULTI_NODE=1
└── No → pp1
Production deployment?
├── Yes → Pin profile by ID
└── Dev/test → Automatic selection is fine

Changes from NIM 1.x
If you are migrating from NIM LLM 1.x, these features are removed:
| Removed Feature | 1.x Usage |
|---|---|
| Custom profile selectors | NIM_CUSTOM_SELECTOR_CLASSES="my_selector.MyCustomSelector" |
| Backend priority chain | Automatic TensorRT-LLM > vLLM > SGLang |
| Tag-based selector | NIM_TAGS_SELECTOR="llm_engine=vllm,tp=1" |
Replace all of these with NIM_MODEL_PROFILE using a profile ID or description.
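If you have many 1.x deployments that used NIM_TAGS_SELECTOR, a small script can shortlist the profile descriptions that match the old tags so you can pick a replacement NIM_MODEL_PROFILE value. This is a hypothetical migration aid, not an NVIDIA tool: it only understands the `llm_engine` and `tp` tags shown in the table above and assumes descriptions follow the naming convention described earlier in this article.

```python
def parse_tags(selector: str) -> dict:
    """Turn "llm_engine=vllm,tp=1" into {"llm_engine": "vllm", "tp": "1"}."""
    return dict(item.split("=", 1) for item in selector.split(",") if item)


def matching_descriptions(selector: str, descriptions: list) -> list:
    """Return the profile descriptions consistent with a 1.x tag selector."""
    tags = parse_tags(selector)
    matches = []
    for description in descriptions:
        parts = description.split("-")
        # The backend is the first segment of the description.
        if "llm_engine" in tags and parts[0] != tags["llm_engine"]:
            continue
        # Tensor parallelism appears as a "tp<N>" segment.
        if "tp" in tags and f"tp{tags['tp']}" not in parts:
            continue
        matches.append(description)
    return matches


available = ["vllm-bf16-tp1-pp1", "vllm-bf16-tp2-pp1", "sglang-h100-bf16-tp8-pp2"]
print(matching_descriptions("llm_engine=vllm,tp=1", available))
```

Feed it the descriptions reported by list-model-profiles, then pin the chosen profile by ID for production.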
Related Resources
- NVIDIA NIM Multinode Inference
- Run:ai Distributed Inference Tutorial
- Run:ai Platform Guide
- NVIDIA GPU Operator on Kubernetes
- The Inference Gold Rush
- On-Premises LLM Deployment
- FinOps for AI GPU Workloads
- Official NVIDIA Docs: Model Profiles
About the Author
I am Luca Berton, AI and Cloud Advisor. I design GPU inference platforms for enterprises deploying large language models.