
vLLM Recipes: Deploy Any Model on Any GPU

Community-maintained vLLM serve configurations for every major model on NVIDIA H100/H200/B200, Grace-Blackwell, and AMD MI300X/MI325X hardware.

Luca Berton

The hardest part of self-hosting LLMs is not choosing the model; it is figuring out the exact vllm serve command that works for your specific model on your specific GPUs. Tensor parallelism degree, quantization flags, max model length, GPU memory utilization: get one wrong and you either run out of memory (OOM) or leave performance on the table.

vLLM Recipes solves this with a community-maintained catalog of tested configurations. Pick your model, pick your GPU, copy the serve command. Done.

What Is vLLM Recipes?

vLLM Recipes is an open-source project (github.com/vllm-project/recipes) that answers one question:

How do I run model X on hardware Y for task Z?

Each recipe is a structured YAML file containing the exact vllm serve command with all necessary parameters (tensor parallelism, quantization, context length, memory settings), validated against specific GPU configurations.

The site at recipes.vllm.ai renders these into a searchable interface: filter by model family, GPU type, or task, and get a copy-paste command.

Supported Hardware

Recipes cover the full spectrum of inference-grade GPUs:

NVIDIA:

  • H100 (80GB SXM/PCIe): the current workhorse
  • H200 (141GB HBM3e): extended memory for larger models
  • B200 (192GB HBM3e): Blackwell generation
  • B300: next-gen Blackwell
  • Grace-Blackwell (GB200/NVL72): rack-scale systems with a shared NVLink domain

AMD:

  • MI300X (192GB HBM3): competitive ROCm alternative
  • MI325X (256GB HBM3e): extended memory variant
  • MI355X: next-gen CDNA

Model Coverage

The recipe catalog spans every major model family. Here is what is available today:

Reasoning models:

  • DeepSeek-R1, DeepSeek-V3, DeepSeek-V3.1, DeepSeek-V3.2
  • Kimi-K2, Kimi-K2-Think, Kimi-K2.5
  • Qwen3, Qwen3.5, Qwen3-Next
  • Intern-S1

Vision-language models:

  • DeepSeek-OCR
  • Qwen3-VL, Qwen2.5-VL
  • InternVL3.5
  • GLM-4.5V, GLM-4.6V
  • Nemotron-Nano-12B-v2-VL
  • PaddleOCR-VL, HunyuanOCR

Code and specialized models:

  • Qwen3-Coder-480B-A35B
  • Qwen3-ASR (speech recognition)
  • TranslateGemma
  • Jina-reranker-m0

Dense and MoE models:

  • Llama 4 Scout, Llama 3.3-70B, Llama 3.1
  • Mistral Large 3, Ministral-3
  • Phi-4
  • Gemma 4
  • MiniMax-M2 family
  • Ring-1T-FP8 (1 trillion parameters)
  • GPT-OSS (OpenAI open-weight)

Anatomy of a Recipe

Each recipe YAML specifies:

model: deepseek-ai/DeepSeek-V3.1
hardware:
  - gpu: H100-80GB-SXM
    count: 8
    tensor_parallel: 8
serve_command: >
  vllm serve deepseek-ai/DeepSeek-V3.1
  --tensor-parallel-size 8
  --max-model-len 32768
  --gpu-memory-utilization 0.92
  --quantization fp8
  --trust-remote-code

The key parameters that vary per recipe:

  • --tensor-parallel-size: how many GPUs share the model (must match the hardware entry)
  • --max-model-len: maximum context window (longer contexts need more KV-cache memory)
  • --gpu-memory-utilization: how aggressively to fill VRAM (0.90-0.95 typical)
  • --quantization: fp8, awq, gptq, or omitted for native precision
  • --enforce-eager: disable CUDA graphs when memory is tight
  • --enable-prefix-caching: turn on automatic prefix caching for shared-prefix workloads
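To make the mapping from recipe settings to flags concrete, here is a minimal Python sketch that assembles a vllm serve command from the parameters listed above. The function name build_serve_command and its keyword arguments are hypothetical helpers for illustration, not part of vLLM or the recipes project; only the flags themselves come from the recipe shown earlier.

```python
import shlex


def build_serve_command(model, tensor_parallel_size, max_model_len,
                        gpu_memory_utilization=0.90, quantization=None,
                        enforce_eager=False, enable_prefix_caching=False,
                        trust_remote_code=False):
    """Assemble a `vllm serve` argument list (hypothetical helper)."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")

    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    # Optional flags are emitted only when the recipe asks for them.
    if quantization:
        args += ["--quantization", quantization]
    if enforce_eager:
        args.append("--enforce-eager")
    if enable_prefix_caching:
        args.append("--enable-prefix-caching")
    if trust_remote_code:
        args.append("--trust-remote-code")
    return args


cmd = build_serve_command(
    "deepseek-ai/DeepSeek-V3.1",
    tensor_parallel_size=8,
    max_model_len=32768,
    gpu_memory_utilization=0.92,
    quantization="fp8",
    trust_remote_code=True,
)
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the result safe to pass to subprocess.run without shell quoting issues; shlex.join is used only for display.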

Why This Matters

Eliminating Trial and Error

Without recipes, deploying a new model means:

  1. Reading the model card for recommended settings
  2. Guessing tensor parallelism based on VRAM math
  3. Running into OOM, adjusting --max-model-len down
  4. Discovering you need --trust-remote-code or a specific quantization
  5. Repeating for each GPU type in your fleet

Recipes compress this to a single copy-paste. The community has already done the trial and error.
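For a sense of what the "VRAM math" in step 2 looks like, here is a rough Python sketch. It assumes weights dominate memory, that tensor parallelism splits them evenly, and that about 70% of the usable memory on each GPU goes to weights (the rest is left for KV cache and activations). The weight_share value, the candidate TP sizes, and the function names are illustrative assumptions, not the formula from the project's CONTRIBUTING.md.

```python
# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0}


def weights_gb(params_billion, dtype):
    """Approximate weight footprint in decimal GB."""
    return params_billion * BYTES_PER_PARAM[dtype]


def smallest_tp(params_billion, dtype, gpu_gb, utilization=0.90,
                weight_share=0.70, candidates=(1, 2, 4, 8)):
    """Smallest candidate TP degree whose combined weight budget fits the model.

    Returns None when no candidate fits on a single node.
    """
    per_gpu_budget = gpu_gb * utilization * weight_share
    needed = weights_gb(params_billion, dtype)
    for tp in candidates:
        if per_gpu_budget * tp >= needed:
            return tp
    return None


# A 70B-parameter model on 80 GB GPUs, at two precisions.
print(smallest_tp(70, "bf16", 80))  # 4
print(smallest_tp(70, "fp8", 80))   # 2
```

This is a first guess only. The recipe workflow exists precisely because real deployments also depend on context length, CUDA graph memory, and model-specific quirks that this estimate ignores.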

Multi-GPU Configurations

The most valuable recipes are for multi-GPU setups where the interaction between tensor parallelism, pipeline parallelism, and memory allocation is non-obvious:

  • DeepSeek-R1 on 8x H100: TP=8, FP8 quantization, 32K context
  • Llama 4 Scout on 4x H200: TP=4, native precision, 128K context (enough HBM3e)
  • Ring-1T on NVL72: TP=72 across the full rack with Wide-EP
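One reason the interaction is non-obvious is that context length and tensor parallelism both change how much KV cache each GPU holds. The Python sketch below uses the standard estimate for grouped-query attention models and assumes KV heads are sharded evenly across TP ranks. The layer and head counts are believed to match Llama 3.3-70B but should be checked against the model's config.json; models with compressed KV layouts need a different formula.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV-cache bytes for one token across all layers.

    The factor of 2 covers one key and one value vector per layer and KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib_per_gpu(context_len, tensor_parallel, **shape):
    """KV cache for one full-length sequence, per GPU, in GiB.

    Assumes KV heads are split evenly across tensor-parallel ranks.
    """
    per_token = kv_cache_bytes_per_token(**shape)
    return context_len * per_token / tensor_parallel / 2**30


# Assumed shape for a 70B-class GQA model (verify before relying on it).
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128)

for context_len in (32768, 131072):
    for tp in (4, 8):
        gib = kv_cache_gib_per_gpu(context_len, tp, **shape)
        print(f"context={context_len:>6} TP={tp}: {gib:.2f} GiB per GPU per sequence")
```

Under these assumptions, a single 32K-token sequence costs 2.5 GiB per GPU at TP=4, and quadrupling the context quadruples that figure, which is why long-context recipes tend to pair with larger-memory GPUs or higher TP.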

Hardware Comparison

Recipes also serve as an implicit hardware comparison tool. When the same model at the same precision needs 8x H100 but fits on 4x H200, that tells you more about the practical difference between H100 and H200 than any spec sheet.
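A quick way to frame such a comparison is aggregate memory per configuration. The Python sketch below uses the per-GPU capacities listed under Supported Hardware; the 500 GB requirement is an arbitrary illustrative number, and restricting counts to 1, 2, 4, or 8 is an assumption about typical single-node layouts.

```python
# Per-GPU memory in GB, taken from the Supported Hardware list.
GPU_MEMORY_GB = {"H100": 80, "H200": 141, "B200": 192,
                 "MI300X": 192, "MI325X": 256}


def aggregate_memory_gb(gpu, count):
    """Total memory across `count` GPUs of one type."""
    return GPU_MEMORY_GB[gpu] * count


def min_gpu_count(gpu, required_gb, candidates=(1, 2, 4, 8)):
    """Smallest candidate GPU count whose total memory covers the requirement."""
    for count in candidates:
        if aggregate_memory_gb(gpu, count) >= required_gb:
            return count
    return None


# How many GPUs of each type would a hypothetical 500 GB footprint need?
for gpu in GPU_MEMORY_GB:
    count = min_gpu_count(gpu, 500)
    print(f"{gpu}: {count} GPUs ({aggregate_memory_gb(gpu, count)} GB total)")
```

Raw capacity is only part of the picture, since interconnect and memory bandwidth also shape which configurations a recipe ends up recommending.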

Contributing

The project uses structured YAML with validation. To add a recipe:

git clone https://github.com/vllm-project/recipes
cd recipes
pnpm install
# Add your YAML to models/<hf_org>/<hf_repo>.yaml
node scripts/build-recipes-api.mjs  # validates YAML + rebuilds JSON API
pnpm dev  # preview at localhost:3000

CONTRIBUTING.md includes the full schema, the VRAM formula for calculating memory requirements, and the validation steps.
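Before running the project's own validation, it can help to sanity-check a recipe for internal consistency. The Python sketch below is not the project's validator (that is the Node script above); it is a hypothetical check that assumes a recipe shaped like the example in Anatomy of a Recipe, already loaded into a dict.

```python
import shlex


def flag_value(command, flag):
    """Return the value following `flag` in a shell command, or None."""
    tokens = shlex.split(command)
    for i, tok in enumerate(tokens):
        if tok == flag and i + 1 < len(tokens):
            return tokens[i + 1]
        if tok.startswith(flag + "="):
            return tok.split("=", 1)[1]
    return None


def check_recipe(recipe):
    """Collect consistency problems in a recipe dict (hypothetical checks)."""
    problems = []
    command = recipe["serve_command"]
    if recipe["model"] not in shlex.split(command):
        problems.append("serve_command does not reference the model")
    tp_flag = flag_value(command, "--tensor-parallel-size")
    for hw in recipe["hardware"]:
        if hw["tensor_parallel"] > hw["count"]:
            problems.append(f"{hw['gpu']}: tensor_parallel exceeds GPU count")
        if tp_flag is not None and int(tp_flag) != hw["tensor_parallel"]:
            problems.append(f"{hw['gpu']}: --tensor-parallel-size mismatch")
    return problems


recipe = {
    "model": "deepseek-ai/DeepSeek-V3.1",
    "hardware": [{"gpu": "H100-80GB-SXM", "count": 8, "tensor_parallel": 8}],
    "serve_command": (
        "vllm serve deepseek-ai/DeepSeek-V3.1 --tensor-parallel-size 8 "
        "--max-model-len 32768 --gpu-memory-utilization 0.92 "
        "--quantization fp8 --trust-remote-code"
    ),
}
print(check_recipe(recipe))  # []
```

The real schema may use different keys or allow several commands per hardware entry, so treat the field names here as placeholders.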

From Recipes to Production

vLLM Recipes gives you the serve command. For production, you still need:

  • NVIDIA Dynamo: disaggregated serving, SLA-aware autoscaling
  • llm-d: KV-cache-aware routing for multi-pod deployments
  • Wide Expert Parallelism: rack-scale MoE optimization on NVL72
  • NIM: managed inference with model profiles and multi-node support

The recipe gets you running. The infrastructure stack gets you to production SLAs.
