Before deploying a NIM model, you need to answer three questions: Does NIM support my model? Does it run on my GPU? Which profile should I use?
This article consolidates the official NIM LLM Support Matrix into a single reference with practical guidance.
NIM 2.x Supported Models
NIM LLM 2.0.x ships with model-specific containers for these models:
| Model | Container | Parameters | Precisions | Max TP |
|---|---|---|---|---|
| GPT-OSS 120B | openai/gpt-oss-120b | 120B | MXFP4 | TP8 |
| GPT-OSS 20B | openai/gpt-oss-20b | 20B | MXFP4 | TP8 |
| Llama 3.1 70B Instruct | meta/llama-3.1-70b-instruct | 70B | BF16, FP8, NVFP4 | TP8 |
| Llama 3.1 8B Instruct | meta/llama-3.1-8b-instruct | 8B | BF16, FP8, NVFP4 | TP1 |
| Llama 3.3 70B Instruct | meta/llama-3.3-70b-instruct | 70B | BF16, FP8, NVFP4 | TP8 |
| Nemotron Super 49B v1.5 | nvidia/llama-3.3-nemotron-super-49b-v1.5 | 49B | BF16, FP8, NVFP4 | TP8 |
| Nemotron 3 Nano | nvidia/nemotron-3-nano | Small | BF16, FP8, NVFP4 | TP8 |
| Nemotron 3 Super 120B | nvidia/nemotron-3-super-120b-a12b | 120B (12B active) | BF16, FP8, NVFP4 | TP8 |
| StarCoder2 7B | bigcode/starcoder2-7b | 7B | BF16 | TP2 |
LoRA adapters are supported at every TP level for all models except StarCoder2 and some NVFP4 combinations.
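The table above can be kept as a small in-code catalog so that deployment scripts can filter on it. This is a minimal sketch: the `NimModel` class and the `models_with_precision` helper are hypothetical names introduced here for illustration, and the data is transcribed from the table rather than queried from NIM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class NimModel:
    name: str
    container: str
    precisions: Tuple[str, ...]
    max_tp: int


# Transcribed by hand from the NIM 2.x table above (not fetched from any API).
_STD = ("BF16", "FP8", "NVFP4")
MODELS = [
    NimModel("GPT-OSS 120B", "openai/gpt-oss-120b", ("MXFP4",), 8),
    NimModel("GPT-OSS 20B", "openai/gpt-oss-20b", ("MXFP4",), 8),
    NimModel("Llama 3.1 70B Instruct", "meta/llama-3.1-70b-instruct", _STD, 8),
    NimModel("Llama 3.1 8B Instruct", "meta/llama-3.1-8b-instruct", _STD, 1),
    NimModel("Llama 3.3 70B Instruct", "meta/llama-3.3-70b-instruct", _STD, 8),
    NimModel("Nemotron Super 49B v1.5", "nvidia/llama-3.3-nemotron-super-49b-v1.5", _STD, 8),
    NimModel("Nemotron 3 Nano", "nvidia/nemotron-3-nano", _STD, 8),
    NimModel("Nemotron 3 Super 120B", "nvidia/nemotron-3-super-120b-a12b", _STD, 8),
    NimModel("StarCoder2 7B", "bigcode/starcoder2-7b", ("BF16",), 2),
]


def models_with_precision(precision: str) -> List[str]:
    """Return the names of models that ship a profile in the given precision."""
    return [m.name for m in MODELS if precision in m.precisions]


if __name__ == "__main__":
    print(models_with_precision("MXFP4"))
```

Filtering on `max_tp` works the same way, for example to find models that can be sharded across a full 8-GPU node.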
Profile Matrix by Model
Llama 3.1 / 3.3 70B (Most Popular)
The workhorse model. Full precision and TP coverage:
| Precision | TP1 | TP2 | TP4 | TP8 |
|---|---|---|---|---|
| BF16 | ✅ | ✅ | ✅ | ✅ |
| BF16 + LoRA | ✅ | ✅ | ✅ | ✅ |
| FP8 | ✅ | ✅ | ✅ | ✅ |
| FP8 + LoRA | ✅ | ✅ | ✅ | ✅ |
| NVFP4 | ✅ | ✅ | ✅ | ✅ |
| NVFP4 + LoRA | ✅* | ✅ | ✅ | ✅ |
*Llama 3.3 70B: NVFP4+LoRA not available at TP1.
Recommendation: Use the vllm-fp8-tp2-pp1 profile on 2x A100 80GB or 2x H100. It offers the best cost-performance ratio.
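A profile name such as vllm-fp8-tp2-pp1 appears to encode the backend, precision, tensor-parallel degree, and pipeline-parallel degree. The sketch below parses names of that shape; the `<backend>-<precision>-tp<N>-pp<M>` pattern is inferred from this one example, and `Profile` and `parse_profile` are hypothetical helpers, not part of NIM. Profiles with extra suffixes would need a broader pattern.

```python
import re
from typing import NamedTuple


class Profile(NamedTuple):
    backend: str
    precision: str
    tp: int  # tensor-parallel degree
    pp: int  # pipeline-parallel degree

    @property
    def gpus_required(self) -> int:
        # Each pipeline stage holds one tensor-parallel group.
        return self.tp * self.pp


# Assumed naming scheme, inferred from "vllm-fp8-tp2-pp1".
_PATTERN = re.compile(
    r"(?P<backend>[a-z0-9]+)-(?P<precision>[a-z0-9]+)-tp(?P<tp>\d+)-pp(?P<pp>\d+)"
)


def parse_profile(name: str) -> Profile:
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized profile name: {name!r}")
    return Profile(
        match["backend"], match["precision"], int(match["tp"]), int(match["pp"])
    )


if __name__ == "__main__":
    profile = parse_profile("vllm-fp8-tp2-pp1")
    print(profile, profile.gpus_required)
```

Under this reading, the recommended profile needs two GPUs, which matches the 2x hardware suggestion.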
GPT-OSS 120B / 20B (OpenAI Open Models)
MXFP4 only — aggressively quantized for efficiency:
| Precision | TP1 | TP2 | TP4 | TP8 |
|---|---|---|---|---|
| MXFP4 | ✅ | ✅ | ✅ | ✅ |
| MXFP4 + LoRA | ✅ | ✅ | ✅ | ✅ |
Nemotron Super 120B (MoE — 12B Active)
This is a Mixture of Experts model with 120B total but only 12B active parameters. Profile availability varies significantly by GPU:
- B200/B300/GB200: Full coverage (BF16/FP8/NVFP4, TP1-TP8)
- H100/H200: BF16 from TP2, FP8 from TP1, NVFP4 limited
- A100 80GB: BF16 from TP4, FP8 from TP2
- L40S: FP8 TP8 only, NVFP4 TP4+
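The per-GPU availability listed above can be captured as a lookup of the smallest TP degree per precision. This is a sketch: `MIN_TP` and `min_tp` are hypothetical names, the GPU family keys are labels chosen here, and combinations the list does not spell out (including the "limited" NVFP4 support on H100/H200) are left out rather than guessed.

```python
from typing import Optional

# Smallest TP degree per precision for the 120B MoE model, transcribed from
# the bullet list above. Missing keys mean "not stated", not "unsupported".
MIN_TP = {
    "B200/B300/GB200": {"BF16": 1, "FP8": 1, "NVFP4": 1},
    "H100/H200": {"BF16": 2, "FP8": 1},  # NVFP4 is "limited"; no TP given
    "A100 80GB": {"BF16": 4, "FP8": 2},
    "L40S": {"FP8": 8, "NVFP4": 4},  # FP8 is available at TP8 only
}


def min_tp(gpu_family: str, precision: str) -> Optional[int]:
    """Return the smallest listed TP degree, or None if the list gives none."""
    return MIN_TP.get(gpu_family, {}).get(precision)


if __name__ == "__main__":
    for family in MIN_TP:
        print(family, {p: min_tp(family, p) for p in ("BF16", "FP8", "NVFP4")})
```

A `None` result should be treated as "check the official matrix" rather than as a definitive no.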
Llama 3.1 8B Instruct
Single-GPU model — no multi-GPU profiles needed:
| Precision | TP1 |
|---|---|
| BF16 | ✅ |
| BF16 + LoRA | ✅ |
| FP8 | ✅ |
| FP8 + LoRA | ✅ |
| NVFP4 | ✅ |
| NVFP4 + LoRA | ✅ |
Verified GPU Compatibility
Which Models Run on My GPU?
| GPU | Verified Models |
|---|---|
| B200 | All 9 models |
| B300 SXM6 AC | All 9 models |
| GB200 | All 9 models |
| H200 | All 9 models |
| H200 NVL | Llama 70B/8B/3.3, Nemotron Super 49B, Nemotron Nano, Nemotron Super 120B |
| H100 80GB HBM3 | All 9 models |
| H100 NVL | Llama 70B/8B/3.3, Nemotron Super 49B, Nemotron Nano, Nemotron Super 120B |
| GH200 144G HBM3e | GPT-OSS 120B/20B, Llama 70B/8B/3.3, Nemotron Super 49B, Nemotron Nano, Nemotron Super 120B |
| GH200 480GB | GPT-OSS 20B, Llama 70B/8B/3.3, Nemotron Super 49B, Nemotron Nano |
| A100 SXM4 80GB | GPT-OSS 120B/20B, Llama 70B/8B/3.3, Nemotron Super 49B, Nemotron Nano, Nemotron Super 120B |
| A100 SXM4 40GB | GPT-OSS 120B/20B, Llama 70B/8B, Nemotron Super 49B, Nemotron Nano |
| A10G | GPT-OSS 20B, Llama 70B/3.3 |
| L40S | GPT-OSS 120B/20B, Llama 70B/8B/3.3, Nemotron Super 49B, Nemotron Nano, Nemotron Super 120B |
| RTX PRO 6000 Blackwell SE | GPT-OSS 120B/20B, Llama 70B/8B/3.3, Nemotron Super 49B, Nemotron Nano, Nemotron Super 120B |
| RTX PRO 4500 Blackwell SE | GPT-OSS 20B, Nemotron Super 49B, Nemotron Nano, Nemotron Super 120B |
| GB10 | GPT-OSS 20B, Llama 8B, Nemotron Super 49B, Nemotron Nano |
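The table answers "which models run on my GPU"; inverting it answers "which GPUs are verified for my model". The sketch below transcribes the table into sets and performs that inversion. The short model keys and the `gpus_for` helper are hypothetical, and the reading of shorthand such as "Llama 70B/8B/3.3" as Llama 3.1 70B, Llama 3.1 8B, and Llama 3.3 70B is an assumption.

```python
from typing import List

GPT = {"gpt-oss-120b", "gpt-oss-20b"}
LLAMA = {"llama-3.1-70b", "llama-3.1-8b", "llama-3.3-70b"}
NEMOTRON = {"nemotron-super-49b", "nemotron-nano", "nemotron-super-120b"}
ALL = GPT | LLAMA | NEMOTRON | {"starcoder2-7b"}

# Hand-transcribed from the verified GPU table above.
VERIFIED = {
    "B200": ALL,
    "B300 SXM6 AC": ALL,
    "GB200": ALL,
    "H200": ALL,  # all nine models
    "H200 NVL": LLAMA | NEMOTRON,
    "H100 80GB HBM3": ALL,
    "H100 NVL": LLAMA | NEMOTRON,
    "GH200 144G HBM3e": GPT | LLAMA | NEMOTRON,
    "GH200 480GB": {"gpt-oss-20b", "nemotron-super-49b", "nemotron-nano"} | LLAMA,
    "A100 SXM4 80GB": GPT | LLAMA | NEMOTRON,
    "A100 SXM4 40GB": GPT
    | {"llama-3.1-70b", "llama-3.1-8b", "nemotron-super-49b", "nemotron-nano"},
    "A10G": {"gpt-oss-20b", "llama-3.1-70b", "llama-3.3-70b"},
    "L40S": GPT | LLAMA | NEMOTRON,
    "RTX PRO 6000 Blackwell SE": GPT | LLAMA | NEMOTRON,
    "RTX PRO 4500 Blackwell SE": {"gpt-oss-20b"} | NEMOTRON,
    "GB10": {"gpt-oss-20b", "llama-3.1-8b", "nemotron-super-49b", "nemotron-nano"},
}


def gpus_for(model: str) -> List[str]:
    """Return the GPUs on which the given model key is listed as verified."""
    return sorted(gpu for gpu, models in VERIFIED.items() if model in models)


if __name__ == "__main__":
    print(gpus_for("starcoder2-7b"))
```

Because the data is copied by hand, it should be re-checked against the official matrix whenever a new NIM release lands.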
Key Observations
Blackwell GPUs (B200, B300, GB200) support every model at every precision — the most versatile option.
H100 80GB remains the production workhorse and supports all 9 models. FP8 stores weights in half the bytes of BF16, so it roughly doubles the model size that fits in the same memory.
A100 40GB is limited but functional. Smaller models (8B, 20B) work fine. 70B requires FP8 or NVFP4 quantization.
L40S is the cost-effective inference GPU. Supports most models but larger ones (120B) need TP8 with FP8.
GB10 (DGX Spark) is desktop-class. It is verified mainly for small models (8B, 20B, Nano), plus Nemotron Super 49B.
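The points about FP8 capacity and 70B on A100 40GB come down to weight-memory arithmetic: parameter count times bytes per parameter. The sketch below computes that lower bound. It deliberately ignores KV cache, activations, runtime overhead, and the scale factors that block-quantized 4-bit formats carry, so real requirements are higher; `weight_gb` and `per_gpu_weight_gb` are hypothetical helper names.

```python
# Nominal bits per parameter for each precision named in this article.
BITS = {"BF16": 16, "FP8": 8, "NVFP4": 4, "MXFP4": 4}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BITS[precision] / 8


def per_gpu_weight_gb(params_billion: float, precision: str, tp: int) -> float:
    """Weight memory per GPU when weights are split evenly across tp GPUs."""
    return weight_gb(params_billion, precision) / tp


if __name__ == "__main__":
    for precision in ("BF16", "FP8", "NVFP4"):
        total = weight_gb(70, precision)
        print(f"70B {precision}: {total:.0f} GB total, "
              f"{per_gpu_weight_gb(70, precision, 2):.0f} GB per GPU at TP2")
```

For a 70B model this gives about 140 GB in BF16, 70 GB in FP8, and 35 GB at 4 bits, which is why quantization is what makes 70B feasible on smaller-memory GPUs.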
Model-Free NIM
The generic nvidia/model-free-nim container supports any vLLM-compatible model, not just the ones listed above. Explicitly validated models:
- GPT-OSS 20B
- Apriel Nemotron
- Codestral
Verified GPUs for model-free NIM:
- A100 (40GB PCIe, 80GB PCIe, 40GB SXM4, 80GB SXM4)
- B300 SXM6 AC
- GH200 480GB
- H100 (80GB HBM3, NVL, PCIe)
- H200, H200 NVL
- RTX PRO 4500 Blackwell SE
For deployment details, see the Model-Free NIM Guide.
NIM 1.x Legacy Models
These models are supported in NIM LLM 1.15 and earlier (not yet migrated to 2.x):
| Model | Container |
|---|---|
| DeepSeek-V3.1 Terminus | deepseek-ai/deepseek-v3.1-terminus |
| DeepSeek-V3.2 Exp | deepseek-ai/deepseek-v32-exp-nim |
| GLM-5 | zai-org/glm-5 |
| MiniMax-M2.5 | minimax-ai/minimax-m25 |
| Nemotron Nano 9B v2 (DGX Spark) | nvidia/nvidia-nemotron-nano-9b-v2-dgx-spark |
| Qwen3 Coder Next | qwen/qwen3-coder-next |
| Qwen3 Next 80B A3B Instruct | qwen/qwen3-next-80b-a3b-instruct |
| Qwen3 Next 80B A3B Thinking | qwen/qwen3-next-80b-a3b-thinking |
| Qwen3 32B | qwen/qwen3-32b |
| Qwen3 32B (DGX Spark) | qwen/qwen3-32b-dgx-spark |
| Riva Translate 4B v1.1 | nvidia/riva-translate-4b-instruct-v1.1 |
| Healthcare Text2SQL (8B) | nvidia/llama-3.1-nemotron-nano-8b-healthcare-text2sql-v1.0 |
| Healthcare Text2SQL (49B) | nvidia/llama-3.3-nemotron-super-49b-healthcare-text2sql-v1.0 |
For 1.x deployment, refer to the NIM LLM 1.15 supported models documentation.
Quick Decision Guide
What GPU do you have?
├── B200/B300/GB200 → Any model, any precision, any TP
├── H100 80GB / H200 → Any model, prefer FP8
├── A100 80GB → Most models, prefer FP8 for 70B+
├── A100 40GB → 8B-20B models; 70B with FP8/NVFP4
├── L40S → Most models, FP8 recommended, large models need TP8
├── A10G → 20B and 70B only
└── GB10 → 8B, 20B, Nano, Nemotron Super 49B
What model do you need?
├── General purpose → Llama 3.3 70B (FP8)
├── Code generation → StarCoder2 7B or model-free with Codestral
├── OpenAI compatible → GPT-OSS 20B/120B (MXFP4)
├── NVIDIA optimized → Nemotron Super 49B or 120B
├── Small/edge → Llama 8B or Nemotron Nano
└── Custom/fine-tuned → Model-free NIM
Related Resources
- NIM Model Profiles Guide
- Model-Free NIM Deployment
- Multi-Node Deployment on Kubernetes
- NIM Multinode Inference (Docker)
- NVIDIA GPU Operator on Kubernetes
- On-Premises LLM Deployment
- FinOps for AI GPU Workloads
- Official NIM Support Matrix
About the Author
I am Luca Berton, AI and Cloud Advisor. I help enterprises select the right GPU and model configuration for their inference workloads.
Frequently Asked Questions
Which GPUs are supported by NVIDIA NIM?
The verified GPUs listed in this article are A100 (40/80GB), A10G, L40S, H100, H200, GH200, and Blackwell parts (B200, B300, GB200, RTX PRO 6000/4500 Blackwell SE, GB10). Model availability varies by GPU memory.
Can I run NIM on consumer GPUs like RTX 4090?
NIM is designed for data center GPUs. Consumer GPUs are not officially supported, though smaller models may work with vLLM directly.