Before putting an LLM into production, you need to answer: how many concurrent users can this deployment handle? What's the P99 latency? What happens under load? NVIDIA's GenAI-Perf is the standard tool for answering these questions against vLLM, TensorRT-LLM, and NIM endpoints.
What GenAI-Perf Measures
GenAI-Perf generates synthetic inference requests and measures:
- Time to First Token (TTFT): how long until the first token appears
- Inter-Token Latency (ITL): time between subsequent tokens
- Output Token Throughput: tokens/second across all concurrent requests
- Request Throughput: completed requests/second
- End-to-End Latency: total time per request (P50, P90, P99)
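To make these definitions concrete, here is a minimal sketch of how the metrics could be derived from raw token arrival timestamps. The function names and the sample timestamps are illustrative assumptions, and GenAI-Perf's own aggregation may differ in detail (for example in how it averages ITL or interpolates percentiles).

```python
import math

def request_metrics(sent_at, token_times):
    """Per-request metrics from a send time and token arrival times (all in seconds)."""
    ttft = token_times[0] - sent_at  # Time to First Token
    # Gaps between consecutive tokens after the first one
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    latency = token_times[-1] - sent_at  # end-to-end latency
    return {"ttft": ttft, "itl": itl, "latency": latency, "tokens": len(token_times)}

def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def output_token_throughput(requests, wall_clock_s):
    """Tokens per second summed over all requests in the measurement window."""
    return sum(r["tokens"] for r in requests) / wall_clock_s

# Hypothetical timestamps for one streamed request with four output tokens
m = request_metrics(sent_at=0.0, token_times=[0.045, 0.057, 0.069, 0.081])
```

Percentiles such as P99 are taken over the per-request values (for example, the `ttft` of every request in the run), while throughput is computed once over the whole measurement window.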
Installation
GenAI-Perf is part of the NVIDIA Triton Inference Server SDK:
pip install genai-perf
# Or from the NGC container
docker run --rm -it \
nvcr.io/nvidia/tritonserver:26.02-py3-sdk \
bash
Basic Benchmark: Single GPU vLLM
# Deploy vLLM endpoint first
# vLLM serving on port 8000 with OpenAI-compatible API
genai-perf profile \
--model mistral-small \
--backend vllm \
--endpoint-type chat \
--url http://inference-endpoint:8000 \
--concurrency 1 \
--synthetic-input-tokens-mean 200 \
--output-tokens-mean 100 \
--num-prompts 100 \
--streaming
Output:
LLM Metrics
┌───────────────────────────┬────────┬────────┬────────┐
│ Metric                    │ P50    │ P90    │ P99    │
├───────────────────────────┼────────┼────────┼────────┤
│ Time to First Token (ms)  │ 45     │ 62     │ 89     │
│ Inter Token Latency (ms)  │ 12     │ 15     │ 21     │
│ Request Latency (ms)      │ 1245   │ 1580   │ 1890   │
│ Output Token Throughput   │ 82 tokens/s              │
│ Request Throughput        │ 0.79 req/s               │
└───────────────────────────┴────────┴────────┴────────┘
Scaling Concurrency
The key question: how does performance degrade as concurrent users increase?
# Test with increasing concurrency
for concurrency in 1 2 4 8 16 32; do
echo "=== Concurrency: $concurrency ==="
genai-perf profile \
--model mistral-small \
--backend vllm \
--endpoint-type chat \
--url http://inference-endpoint:8000 \
--concurrency $concurrency \
--synthetic-input-tokens-mean 200 \
--output-tokens-mean 100 \
--num-prompts 200 \
--streaming
done
Typical results for a single H100 with Mistral Small:
| Concurrency | Throughput (tok/s) | TTFT P50 | TTFT P99 | ITL P99 |
|---|---|---|---|---|
| 1 | 82 | 45ms | 89ms | 21ms |
| 2 | 158 | 48ms | 95ms | 23ms |
| 4 | 295 | 55ms | 120ms | 28ms |
| 8 | 510 | 72ms | 180ms | 35ms |
| 16 | 780 | 110ms | 350ms | 52ms |
| 32 | 920 | 250ms | 800ms | 88ms |
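The sweet spot is usually where throughput plateaus but latency remains acceptable for your SLA. In the table above, going from concurrency 16 to 32 adds only about 18% throughput (780 to 920 tok/s) while TTFT P99 more than doubles (350ms to 800ms).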
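Picking that operating point can be automated once the sweep results are in a table. The sketch below hard-codes the rows from the table above and selects the highest-throughput row that still meets a latency budget; the function names and the example limits are assumptions for illustration, not part of GenAI-Perf.

```python
# Rows copied from the sweep table: (concurrency, throughput tok/s, TTFT P99 ms, ITL P99 ms)
SWEEP = [
    (1, 82, 89, 21),
    (2, 158, 95, 23),
    (4, 295, 120, 28),
    (8, 510, 180, 35),
    (16, 780, 350, 52),
    (32, 920, 800, 88),
]

def best_operating_point(sweep, max_ttft_p99_ms, max_itl_p99_ms):
    """Return the row with the highest throughput that meets both latency limits, or None."""
    ok = [row for row in sweep if row[2] <= max_ttft_p99_ms and row[3] <= max_itl_p99_ms]
    return max(ok, key=lambda row: row[1]) if ok else None

def marginal_gain(sweep):
    """Relative throughput gain of each step over the previous concurrency level."""
    return [(cur[0], round(cur[1] / prev[1] - 1, 2)) for prev, cur in zip(sweep, sweep[1:])]

# Example limits (hypothetical SLA): TTFT P99 <= 500 ms, ITL P99 <= 60 ms
choice = best_operating_point(SWEEP, max_ttft_p99_ms=500, max_itl_p99_ms=60)
gains = marginal_gain(SWEEP)
```

With these limits the sketch selects concurrency 16, and the marginal gains fall from 0.93 at the first doubling to 0.18 at the last, which is the plateau the text describes.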
Multi-Location Testing
In enterprise environments, test from multiple network paths to understand the impact of network topology:
# Test from 3 different locations:
# 1. Same cluster (pod-to-pod)
# 2. Admin node (behind HA proxy/load balancer)
# 3. User laptop (through ingress + TLS)
# Location 1: Cluster-internal
genai-perf profile \
--url http://vllm-service.ai-namespace.svc:8000 \
--concurrency 8 \
--synthetic-input-tokens-mean 200 \
--output-tokens-mean 100
# Location 2: Admin node (via HA Proxy)
genai-perf profile \
--url https://inference.internal.company.com \
--concurrency 8 \
--synthetic-input-tokens-mean 200 \
--output-tokens-mean 100
# Location 3: External (via ingress)
genai-perf profile \
--url https://api.company.com/v1 \
--concurrency 8 \
--synthetic-input-tokens-mean 200 \
--output-tokens-mean 100
Expected latency differences:
| Location | Network Hops | TTFT Overhead |
|---|---|---|
| Cluster-internal | 0 (pod-to-pod) | +0ms |
| Admin node (HA Proxy) | 2-3 hops | +5-15ms |
| External (ingress + TLS) | 4-6 hops | +20-50ms |
Multi-GPU Inference Benchmarking
For distributed inference with tensor parallelism:
# Benchmark with higher concurrency to saturate multi-GPU deployment
genai-perf profile \
--model mistral-small \
--backend vllm \
--endpoint-type chat \
--url http://inference-endpoint:8000 \
--concurrency 16 \
--synthetic-input-tokens-mean 200 \
--output-tokens-mean 200 \
--num-prompts 500 \
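--streaming
# To target a fixed arrival rate instead, swap --concurrency 16 for --request-rate 50 (50 req/s); GenAI-Perf accepts only one of the two load options per run.
Compare single GPU vs multi-GPU: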
| Setup | Max Throughput | TTFT P99 @ max | Cost/token |
|---|---|---|---|
| 1× H100 (TP=1) | 920 tok/s | 800ms | $0.42/1M |
| 2× H100 (TP=2) | 1,650 tok/s | 450ms | $0.46/1M |
| 4× H100 (TP=4) | 2,800 tok/s | 280ms | $0.54/1M |
More GPUs reduce latency and increase throughput, but throughput scales sublinearly (about 1.8× on two GPUs and 3.0× on four), so cost per token increases due to communication overhead.
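The trade-off can be quantified as scaling efficiency: measured throughput divided by what perfect linear scaling would give. The sketch below uses the throughput figures from the table above. The `gpu_hour_usd` parameter is a hypothetical placeholder for whatever your GPUs cost per hour, and the cost function assumes cost is simply proportional to GPU-hours.

```python
# (gpu_count, max throughput in tok/s) taken from the comparison table
SETUPS = [(1, 920), (2, 1650), (4, 2800)]

def scaling_efficiency(setups):
    """Throughput relative to perfect linear scaling from the first (baseline) setup."""
    base_gpus, base_tput = setups[0]
    per_gpu_base = base_tput / base_gpus
    return {gpus: tput / (gpus * per_gpu_base) for gpus, tput in setups}

def cost_per_million_tokens(gpu_count, throughput_tok_s, gpu_hour_usd):
    """USD per one million output tokens, assuming cost scales with GPU-hours only."""
    tokens_per_hour = throughput_tok_s * 3600
    return gpu_count * gpu_hour_usd / tokens_per_hour * 1_000_000

eff = scaling_efficiency(SETUPS)  # roughly {1: 1.0, 2: 0.90, 4: 0.76}
```

Under that assumption, cost per token grows by a factor of 1 / efficiency relative to the single-GPU baseline, so an efficiency of 0.76 means each token costs roughly 31% more on four GPUs.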
Advanced: Input Token Distribution
Real-world traffic has variable prompt lengths. Test with realistic distributions:
# Short prompts (chatbot-style)
genai-perf profile \
--synthetic-input-tokens-mean 100 \
--synthetic-input-tokens-stddev 50 \
--output-tokens-mean 150 \
--output-tokens-stddev 100
# Long prompts (RAG with context)
genai-perf profile \
--synthetic-input-tokens-mean 2000 \
--synthetic-input-tokens-stddev 500 \
--output-tokens-mean 200 \
--output-tokens-stddev 50
# Code generation (long output)
genai-perf profile \
--synthetic-input-tokens-mean 500 \
--synthetic-input-tokens-stddev 200 \
--output-tokens-mean 1000 \
--output-tokens-stddev 300
Interpreting Results for Capacity Planning
SLA-Based Capacity
If your SLA requires TTFT under 500ms at P99:
- Run a concurrency sweep (1 to 64)
- Find the concurrency where TTFT P99 reaches 500ms (e.g., concurrency=24)
- Measure throughput at that concurrency (e.g., 850 tok/s)
- Calculate the sustainable request rate:
requests_per_second = throughput / avg_output_tokens_per_request
- Example:
850 / 150 = ~5.7 requests/second, served with up to 24 requests in flight
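Dividing token throughput by tokens per request gives a rate in requests per second, and Little's law (requests in flight = arrival rate × mean time in system) ties that rate back to the concurrency found in the sweep. The sketch below works through the example numbers. The per-user request rate of 2 requests per minute is a hypothetical assumption; substitute your own traffic profile.

```python
def requests_per_second(throughput_tok_s, avg_output_tokens):
    """Completed requests per second implied by output token throughput."""
    return throughput_tok_s / avg_output_tokens

def mean_latency_s(concurrency, req_per_s):
    """Little's law: requests in flight = arrival rate x mean time in system."""
    return concurrency / req_per_s

def users_supported(req_per_s, requests_per_user_per_min):
    """Users that can be served if each sends the given number of requests per minute."""
    return req_per_s * 60 / requests_per_user_per_min

rps = requests_per_second(850, 150)  # about 5.67 requests/s
latency = mean_latency_s(24, rps)    # about 4.2 s mean time per request
users = users_supported(rps, 2)      # about 170 users at 2 requests/min each
```

The number of active users a deployment supports is therefore usually much larger than the number of simultaneous requests, because most users are reading or typing rather than waiting on a response at any given moment.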
Cost Optimization
# Compare different quantization levels
# FP16 baseline
genai-perf profile --model mistral-fp16 --concurrency 8
# FP8 quantized
genai-perf profile --model mistral-fp8 --concurrency 8
# INT4 (AWQ/GPTQ)
genai-perf profile --model mistral-awq --concurrency 8
Typical quantization impact:
| Precision | Throughput | Quality (MMLU) | Memory |
|---|---|---|---|
| FP16 | 920 tok/s | 82.1% | 24 GB |
| FP8 | 1,350 tok/s | 81.8% | 12 GB |
| INT4 (AWQ) | 1,800 tok/s | 80.5% | 7 GB |
Automation: CI/CD Performance Regression
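The workflow in this section calls a `scripts/check_perf_regression.py` helper that the article does not show. The following is a minimal sketch of what such a script might contain. It assumes both JSON files are flat objects with the hypothetical keys `ttft_p99_ms` and `output_token_throughput`; the real GenAI-Perf export is structured differently, so the loading step would need to be adapted. The 10% relative tolerance is likewise an assumption.

```python
import argparse
import json

def find_regressions(baseline, current, ttft_p99_threshold, throughput_min, tolerance=0.10):
    """Return a list of failure messages; an empty list means the run passed."""
    failures = []
    ttft, tput = current["ttft_p99_ms"], current["output_token_throughput"]
    # Absolute limits taken from the command line
    if ttft > ttft_p99_threshold:
        failures.append(f"TTFT P99 {ttft} ms exceeds limit {ttft_p99_threshold} ms")
    if tput < throughput_min:
        failures.append(f"throughput {tput} tok/s is below minimum {throughput_min} tok/s")
    # Relative drift against the stored baseline
    if ttft > baseline["ttft_p99_ms"] * (1 + tolerance):
        failures.append(f"TTFT P99 regressed more than {tolerance:.0%} vs baseline")
    if tput < baseline["output_token_throughput"] * (1 - tolerance):
        failures.append(f"throughput regressed more than {tolerance:.0%} vs baseline")
    return failures

def main(argv=None):
    parser = argparse.ArgumentParser(description="Fail CI on inference performance regressions")
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--current", required=True)
    parser.add_argument("--ttft-p99-threshold", type=float, required=True)
    parser.add_argument("--throughput-min", type=float, required=True)
    args = parser.parse_args(argv)
    with open(args.baseline) as f:
        baseline = json.load(f)
    with open(args.current) as f:
        current = json.load(f)
    failures = find_regressions(baseline, current, args.ttft_p99_threshold, args.throughput_min)
    for message in failures:
        print("REGRESSION:", message)
    return 1 if failures else 0  # a non-zero exit code fails the CI step
```

When saved as a script, call `sys.exit(main())` under the usual `if __name__ == "__main__":` guard so the return value becomes the process exit code.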
# GitHub Actions: run perf test on every model update
- name: Benchmark inference
  run: |
    genai-perf profile \
      --model ${{ env.MODEL_NAME }} \
      --backend vllm \
      --concurrency 8 \
      --synthetic-input-tokens-mean 200 \
      --output-tokens-mean 100 \
      --num-prompts 100 \
      --output-format json \
      --output-file results.json
- name: Check regression
  run: |
    python scripts/check_perf_regression.py \
      --baseline baseline.json \
      --current results.json \
      --ttft-p99-threshold 500 \
      --throughput-min 800
Related Articles
- Distributed vs Multi-GPU Inference: architecture decisions
- NVIDIA NIM Support Matrix: supported model/GPU combinations
- NVIDIA NIM Model Profiles: performance profile selection
- The Inference Economy: cost optimization strategies
- LLM Quality vs Cost vs Safety: trade-off analysis
Benchmark before you deploy, benchmark after you optimize, benchmark when traffic patterns change. GenAI-Perf makes this a 5-minute task instead of a week-long project.