
GenAI-Perf: Benchmarking LLM Inference with NVIDIA's

How to benchmark vLLM inference endpoints using NVIDIA GenAI-Perf from multiple network locations. Covers token throughput and latency percentiles.

Luca Berton
· 2 min read

Before putting an LLM into production, you need to answer: how many concurrent users can this deployment handle? What's the P99 latency? What happens under load? NVIDIA's GenAI-Perf is a purpose-built tool for answering these questions against vLLM, TensorRT-LLM, and NIM endpoints.

What GenAI-Perf Measures

GenAI-Perf generates synthetic inference requests and measures:

  • Time to First Token (TTFT): how long until the first token appears
  • Inter-Token Latency (ITL): time between subsequent tokens
  • Output Token Throughput: tokens/second across all concurrent requests
  • Request Throughput: completed requests/second
  • End-to-End Latency: total time per request (P50, P90, P99)
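
To make these definitions concrete, here is a minimal Python sketch that derives TTFT, mean ITL, and end-to-end latency from per-token arrival timestamps. The `RequestTrace` structure and the sample timestamps are hypothetical illustrations, not GenAI-Perf's internal data model.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps (seconds) for one streamed request. Hypothetical structure."""
    sent_at: float
    token_times: list[float]  # arrival time of each output token, in order


def ttft_ms(trace: RequestTrace) -> float:
    """Time to First Token: from sending the request to the first token."""
    return (trace.token_times[0] - trace.sent_at) * 1000


def mean_itl_ms(trace: RequestTrace) -> float:
    """Mean Inter-Token Latency: average gap between consecutive tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    if not gaps:  # a single-token response has no inter-token gap
        return 0.0
    return sum(gaps) / len(gaps) * 1000


def e2e_latency_ms(trace: RequestTrace) -> float:
    """End-to-end latency: from sending the request to the last token."""
    return (trace.token_times[-1] - trace.sent_at) * 1000


# Made-up example: first token after 45 ms, then one token every 12 ms
trace = RequestTrace(sent_at=0.0, token_times=[0.045, 0.057, 0.069, 0.081])
print(round(ttft_ms(trace), 1), round(mean_itl_ms(trace), 1), round(e2e_latency_ms(trace), 1))
```

The percentile columns in a benchmark report are then computed over these per-request values across all requests in the run.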

Installation

GenAI-Perf is available from PyPI and also ships in the NVIDIA Triton Inference Server SDK container:

pip install genai-perf

# Or from the NGC container
docker run --rm -it \
  nvcr.io/nvidia/tritonserver:26.02-py3-sdk \
  bash

Basic Benchmark: Single GPU vLLM

# Deploy vLLM endpoint first
# vLLM serving on port 8000 with OpenAI-compatible API

genai-perf profile \
  --model mistral-small \
  --backend vllm \
  --endpoint-type chat \
  --url http://inference-endpoint:8000 \
  --concurrency 1 \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 100 \
  --num-prompts 100 \
  --streaming

Output:

                         LLM Metrics
┌───────────────────────────┬────────┬────────┬────────┐
│ Metric                    │   P50  │   P90  │   P99  │
├───────────────────────────┼────────┼────────┼────────┤
│ Time to First Token (ms)  │   45   │   62   │   89   │
│ Inter Token Latency (ms)  │   12   │   15   │   21   │
│ Request Latency (ms)      │  1245  │  1580  │  1890  │
│ Output Token Throughput   │   82 tokens/s            │
│ Request Throughput        │  0.79 req/s              │
└───────────────────────────┴────────┴────────┴────────┘
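
A quick way to sanity-check a report like this is to see whether the metrics agree with each other. The sketch below is a rough cross-check, not something GenAI-Perf does for you: it assumes request latency is roughly TTFT plus one ITL per remaining token, and it uses Little's law to relate concurrency and latency to request throughput. Percentiles do not add exactly, so expect approximate agreement only.

```python
def estimated_request_latency_ms(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """Approximate latency: first token, then one inter-token gap per remaining token."""
    return ttft_ms + itl_ms * (output_tokens - 1)


def estimated_request_throughput(concurrency: int, request_latency_ms: float) -> float:
    """Little's law: requests/second = requests in flight / latency in seconds."""
    return concurrency / (request_latency_ms / 1000)


# P50 values from the sample output, with 100 output tokens per request
latency = estimated_request_latency_ms(ttft_ms=45, itl_ms=12, output_tokens=100)
rps = estimated_request_throughput(concurrency=1, request_latency_ms=1245)

print(f"estimated latency: {latency:.0f} ms (reported P50: 1245 ms)")
print(f"estimated throughput: {rps:.2f} req/s (reported: 0.79 req/s)")
```

If the estimates and the reported values diverge widely, the run may have been too short or the endpoint may not have been streaming.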

Scaling Concurrency

The key question: how does performance degrade as concurrent users increase?

# Test with increasing concurrency
for concurrency in 1 2 4 8 16 32; do
  echo "=== Concurrency: $concurrency ==="
  genai-perf profile \
    --model mistral-small \
    --backend vllm \
    --endpoint-type chat \
    --url http://inference-endpoint:8000 \
    --concurrency $concurrency \
    --synthetic-input-tokens-mean 200 \
    --output-tokens-mean 100 \
    --num-prompts 200 \
    --streaming
done

Typical results for a single H100 with Mistral Small:

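| Concurrency | Throughput (tok/s) | TTFT P50 | TTFT P99 | ITL P99 |
|-------------|--------------------|----------|----------|---------|
| 1           | 82                 | 45ms     | 89ms     | 21ms    |
| 2           | 158                | 48ms     | 95ms     | 23ms    |
| 4           | 295                | 55ms     | 120ms    | 28ms    |
| 8           | 510                | 72ms     | 180ms    | 35ms    |
| 16          | 780                | 110ms    | 350ms    | 52ms    |
| 32          | 920                | 250ms    | 800ms    | 88ms    |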

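The sweet spot is usually the highest concurrency at which latency still meets your SLA, which tends to be where throughput starts to plateau. In the results above, going from 16 to 32 concurrent requests adds only about 18% throughput (780 to 920 tok/s) while TTFT P99 more than doubles (350ms to 800ms).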
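
Picking that point can be automated once the sweep results are in hand. The following Python sketch uses the figures from the table above; the function names and the 500ms limit are illustrative assumptions, not part of GenAI-Perf.

```python
# (concurrency, throughput in tok/s, TTFT P99 in ms), copied from the sweep table
SWEEP = [
    (1, 82, 89),
    (2, 158, 95),
    (4, 295, 120),
    (8, 510, 180),
    (16, 780, 350),
    (32, 920, 800),
]


def best_concurrency(sweep, ttft_p99_limit_ms):
    """Return the highest-throughput row whose TTFT P99 meets the limit, or None."""
    within_sla = [row for row in sweep if row[2] <= ttft_p99_limit_ms]
    return max(within_sla, key=lambda row: row[1]) if within_sla else None


def marginal_gains(sweep):
    """Relative throughput gain for each step up in concurrency."""
    return [(b[0], (b[1] - a[1]) / a[1]) for a, b in zip(sweep, sweep[1:])]


print(best_concurrency(SWEEP, ttft_p99_limit_ms=500))  # (16, 780, 350)
for concurrency, gain in marginal_gains(SWEEP):
    print(f"concurrency {concurrency}: +{gain:.0%} throughput over the previous step")
```

A shrinking marginal gain combined with a sharply rising TTFT P99 is the signal that the deployment is saturating.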

Multi-Location Testing

In enterprise environments, test from multiple network paths to understand the impact of network topology:

# Test from 3 different locations:
# 1. Same cluster (pod-to-pod)
# 2. Admin node (behind HA proxy/load balancer)
# 3. User laptop (through ingress + TLS)

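# Keep model, token lengths, and concurrency identical so only the network path varies

# Location 1: Cluster-internal
genai-perf profile \
  --model mistral-small \
  --endpoint-type chat \
  --url http://vllm-service.ai-namespace.svc:8000 \
  --concurrency 8 \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 100 \
  --streaming

# Location 2: Admin node (via HA Proxy)
genai-perf profile \
  --model mistral-small \
  --endpoint-type chat \
  --url https://inference.internal.company.com \
  --concurrency 8 \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 100 \
  --streaming

# Location 3: External (via ingress)
genai-perf profile \
  --model mistral-small \
  --endpoint-type chat \
  --url https://api.company.com/v1 \
  --concurrency 8 \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 100 \
  --streaming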

Expected latency differences:

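| Location                 | Network Hops   | TTFT Overhead |
|--------------------------|----------------|---------------|
| Cluster-internal         | 0 (pod-to-pod) | +0ms          |
| Admin node (HA Proxy)    | 2-3 hops       | +5-15ms       |
| External (ingress + TLS) | 4-6 hops       | +20-50ms      |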

Multi-GPU Inference Benchmarking

For distributed inference with tensor parallelism:

# Benchmark with higher concurrency to saturate multi-GPU deployment
genai-perf profile \
  --model mistral-small \
  --backend vllm \
  --endpoint-type chat \
  --url http://inference-endpoint:8000 \
  --concurrency 16 \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 200 \
  --num-prompts 500 \
  --streaming \
  --request-rate 50  # Target 50 req/s

Compare single GPU vs multi-GPU:

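| Setup          | Max Throughput | TTFT P99 @ max | Cost/token |
|----------------|----------------|----------------|------------|
| 1× H100 (TP=1) | 920 tok/s      | 800ms          | $0.42/1M   |
| 2× H100 (TP=2) | 1,650 tok/s    | 450ms          | $0.46/1M   |
| 4× H100 (TP=4) | 2,800 tok/s    | 280ms          | $0.54/1M   |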

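More GPUs reduce latency and increase throughput, but throughput scales sublinearly: in the comparison above, 4 GPUs deliver about 3× the throughput of one (2,800 vs 920 tok/s), not 4×. Tensor-parallel communication overhead absorbs the rest, so cost per token increases.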
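
The scaling behaviour can be quantified with a few lines of Python. This sketch uses the throughput figures from the comparison above and assumes, as a simplification, that cost grows linearly with GPU count; the dollar figures in the table may reflect other factors, so the relative cost printed here will not match them exactly.

```python
SINGLE_GPU_TOK_S = 920  # 1x H100 baseline from the comparison table

# GPU count -> max throughput (tok/s), from the comparison table
SETUPS = {1: 920, 2: 1650, 4: 2800}


def scaling_efficiency(gpus: int, throughput: float, single: float = SINGLE_GPU_TOK_S) -> float:
    """Fraction of ideal linear scaling actually achieved (1.0 = perfect)."""
    return throughput / (gpus * single)


def relative_cost_per_token(gpus: int, throughput: float, single: float = SINGLE_GPU_TOK_S) -> float:
    """GPU time per token relative to one GPU, assuming cost scales with GPU count."""
    return 1 / scaling_efficiency(gpus, throughput, single)


for gpus, throughput in SETUPS.items():
    eff = scaling_efficiency(gpus, throughput)
    cost = relative_cost_per_token(gpus, throughput)
    print(f"{gpus} GPU(s): efficiency {eff:.0%}, relative cost per token {cost:.2f}x")
```

An efficiency well below 100% is the price paid for lower latency; whether it is worth paying depends on your latency SLA.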

Advanced: Input Token Distribution

Real-world traffic has variable prompt lengths. Test with realistic distributions:

# Add --model, --url, and --endpoint-type as in the earlier examples

# Short prompts (chatbot-style)
genai-perf profile \
  --synthetic-input-tokens-mean 100 \
  --synthetic-input-tokens-stddev 50 \
  --output-tokens-mean 150 \
  --output-tokens-stddev 100

# Long prompts (RAG with context)
genai-perf profile \
  --synthetic-input-tokens-mean 2000 \
  --synthetic-input-tokens-stddev 500 \
  --output-tokens-mean 200 \
  --output-tokens-stddev 50

# Code generation (long output)
genai-perf profile \
  --synthetic-input-tokens-mean 500 \
  --synthetic-input-tokens-stddev 200 \
  --output-tokens-mean 1000 \
  --output-tokens-stddev 300
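
To get a feel for what a given mean and standard deviation imply, you can sample the distribution yourself. The sketch below assumes token lengths follow a normal distribution clamped to a minimum of one token; GenAI-Perf's exact sampling may differ, and the helper names are hypothetical.

```python
import math
import random


def sample_lengths(mean: float, stddev: float, n: int, seed: int = 0, minimum: int = 1) -> list[int]:
    """Draw n token lengths from a normal distribution, clamped to at least `minimum`."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return [max(minimum, round(rng.gauss(mean, stddev))) for _ in range(n)]


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile of a list of values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# RAG-style prompts: mean 2000 tokens, stddev 500
rag_inputs = sample_lengths(mean=2000, stddev=500, n=10_000)
print("P50:", percentile(rag_inputs, 50), "P99:", percentile(rag_inputs, 99))
```

Note that a wide spread relative to the mean, such as a stddev of 100 on a mean of 150, pushes some samples to zero or below, so they need clamping. Check that the longest prompts still fit the model's context window.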

Interpreting Results for Capacity Planning

SLA-Based Capacity

If your SLA requires TTFT under 500ms at P99:

  1. Run a concurrency sweep (1 → 64)
  2. Find the highest concurrency where TTFT P99 stays at or under 500ms (e.g., concurrency=24)
  3. Measure throughput at that concurrency (e.g., 850 tok/s)
  4. Calculate: requests_per_second = throughput / avg_output_tokens_per_request
  5. Example: 850 / 150 ≈ 5.7 requests/second, sustained with 24 requests in flight
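
The arithmetic in steps 4 and 5 can be sketched in Python. Dividing token throughput by tokens per request gives a request rate in requests per second; the number of requests in flight is the concurrency found in step 2. The function names are illustrative, and the Little's law step assumes the system is in steady state.

```python
def sustained_request_rate(throughput_tok_s: float, avg_output_tokens: float) -> float:
    """Requests per second the deployment can complete at this throughput."""
    return throughput_tok_s / avg_output_tokens


def mean_request_latency_s(concurrency: int, request_rate: float) -> float:
    """Little's law: concurrency = request rate x mean latency."""
    return concurrency / request_rate


# Example values from the steps above: 850 tok/s at concurrency 24, 150 output tokens per request
rate = sustained_request_rate(throughput_tok_s=850, avg_output_tokens=150)
latency = mean_request_latency_s(concurrency=24, request_rate=rate)

print(f"{rate:.2f} requests/second, about {rate * 3600:.0f} requests/hour")
print(f"implied mean request latency: {latency:.2f} s")
```

To translate this into a user count, you also need an assumption about how often each user sends a request, which the benchmark itself cannot tell you.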

Cost Optimization

# Compare different quantization levels
# FP16 baseline
genai-perf profile --model mistral-fp16 --concurrency 8

# FP8 quantized
genai-perf profile --model mistral-fp8 --concurrency 8

# INT4 (AWQ/GPTQ)
genai-perf profile --model mistral-awq --concurrency 8

Typical quantization impact:

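| Precision  | Throughput  | Quality (MMLU) | Memory |
|------------|-------------|----------------|--------|
| FP16       | 920 tok/s   | 82.1%          | 24 GB  |
| FP8        | 1,350 tok/s | 81.8%          | 12 GB  |
| INT4 (AWQ) | 1,800 tok/s | 80.5%          | 7 GB   |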

Automation: CI/CD Performance Regression

# GitHub Actions: run perf test on every model update
- name: Benchmark inference
  run: |
    genai-perf profile \
      --model ${{ env.MODEL_NAME }} \
      --backend vllm \
      --concurrency 8 \
      --synthetic-input-tokens-mean 200 \
      --output-tokens-mean 100 \
      --num-prompts 100 \
      --output-format json \
      --output-file results.json

- name: Check regression
  run: |
    python scripts/check_perf_regression.py \
      --baseline baseline.json \
      --current results.json \
      --ttft-p99-threshold 500 \
      --throughput-min 800
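
The workflow calls scripts/check_perf_regression.py, which is not shown. A minimal sketch of what such a script could look like follows. It accepts the same four flags used above, but the JSON layout (a `time_to_first_token.p99` value in ms and an `output_token_throughput.avg` value in tok/s) and the 10% maximum drop against the baseline are assumptions; adjust the keys to match the file your benchmark run actually produces.

```python
import argparse
import json
import sys


def find_regressions(current, baseline, ttft_p99_threshold, throughput_min, max_drop=0.10):
    """Return a list of human-readable failures; an empty list means the run passed."""
    # Assumed JSON layout (hypothetical, adapt to your results file):
    # {"time_to_first_token": {"p99": <ms>}, "output_token_throughput": {"avg": <tok/s>}}
    ttft_p99 = current["time_to_first_token"]["p99"]
    throughput = current["output_token_throughput"]["avg"]
    baseline_throughput = baseline["output_token_throughput"]["avg"]

    failures = []
    if ttft_p99 > ttft_p99_threshold:
        failures.append(f"TTFT P99 {ttft_p99:.0f} ms exceeds {ttft_p99_threshold:.0f} ms")
    if throughput < throughput_min:
        failures.append(f"throughput {throughput:.0f} tok/s is below {throughput_min:.0f} tok/s")
    if throughput < baseline_throughput * (1 - max_drop):
        failures.append(
            f"throughput dropped more than {max_drop:.0%} from baseline "
            f"({baseline_throughput:.0f} to {throughput:.0f} tok/s)"
        )
    return failures


def main(argv=None):
    parser = argparse.ArgumentParser(description="Fail the build on a performance regression")
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--current", required=True)
    parser.add_argument("--ttft-p99-threshold", type=float, required=True)
    parser.add_argument("--throughput-min", type=float, required=True)
    args = parser.parse_args(argv)

    with open(args.baseline) as f:
        baseline = json.load(f)
    with open(args.current) as f:
        current = json.load(f)

    failures = find_regressions(current, baseline, args.ttft_p99_threshold, args.throughput_min)
    for failure in failures:
        print(f"REGRESSION: {failure}")
    return 1 if failures else 0  # a non-zero exit code fails the CI step


# Only run as a CLI when arguments are supplied
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main())
```

Keeping the comparison logic in a plain function separate from argument parsing makes it easy to unit test without touching the filesystem.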

Benchmark before you deploy, benchmark after you optimize, benchmark when traffic patterns change. GenAI-Perf makes this a 5-minute task instead of a week-long project.
