
NVIDIA aiperf: Benchmark LLM Inference (TTFT,

AIPerf is NVIDIA's open-source tool for benchmarking generative AI inference, measuring TTFT, ITL, throughput, and latency across vLLM, NIM, TGI, and any.

Luca Berton
· 6 min read

Why You Need a Real LLM Benchmarking Tool

“How fast is your inference endpoint?” is the wrong question. The right questions are:

  • What is the Time to First Token at P99 under 50 concurrent users?
  • What is the Inter-Token Latency when the KV cache is 80% full?
  • At what concurrency does throughput plateau and latency spike?
  • Does your endpoint survive Poisson-distributed bursty traffic?

You cannot answer these with curl and a stopwatch. You need AIPerf.

NVIDIA AIPerf (formerly GenAI-Perf) is an open-source, production-grade benchmarking tool for generative AI inference. It measures every metric that matters (TTFT, TTST, ITL, throughput, latency distributions) across any OpenAI-compatible endpoint with realistic traffic patterns, concurrency control, and detailed reporting.

Quick Start: Benchmark in 5 Minutes

1. Set Up a Local Server

# Start Ollama with a small model
docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  ollama/ollama:latest

docker exec -it ollama ollama pull granite4:350m

2. Install and Run AIPerf

python3 -m venv venv
source venv/bin/activate
pip install aiperf

aiperf profile \
  --model "granite4:350m" \
  --streaming \
  --endpoint-type chat \
  --tokenizer ibm-granite/granite-4.0-micro \
  --url http://localhost:11434

3. Read the Results

AIPerf produces a comprehensive metrics table:

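| Metric | avg | min | max | p99 | p90 | p50 | std |
|---|---|---|---|---|---|---|---|
| Time to First Token (ms) | 7,463 | 7,126 | 9,484 | 9,295 | 7,597 | 7,240 | 677 |
| Inter Token Latency (ms) | 65.31 | 53.06 | 81.31 | 81.24 | 80.64 | 63.79 | 9.09 |
| Output Token Throughput (tok/s) | 6.85 | - | - | - | - | - | - |
| Request Latency (ms) | 13,829 | 9,029 | 27,905 | 27,238 | 21,228 | 11,338 | 5,614 |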

Plus CSV and JSON exports for post-processing.
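To make the table columns concrete, here is a minimal sketch of how avg, min, max, percentiles, and std can be derived from raw per-request samples. The sample values are made up for illustration, and the nearest-rank percentile used here is an assumption: AIPerf may use a different method (for example, interpolation), so small differences from its reported numbers would be expected.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Compute the same columns as the results table for one metric."""
    return {
        "avg": statistics.fmean(samples),
        "min": min(samples),
        "max": max(samples),
        "p99": percentile(samples, 99),
        "p90": percentile(samples, 90),
        "p50": percentile(samples, 50),
        "std": statistics.pstdev(samples),
    }


# Hypothetical TTFT samples in milliseconds (not real AIPerf output).
ttft_ms = [7126, 7180, 7210, 7240, 7240, 7300, 7410, 7597, 9295, 9484]
summary = summarize(ttft_ms)
for column, value in summary.items():
    print(f"{column}: {value:.1f}")
```

With only ten samples, p99 is simply the maximum, which is one reason to run enough requests before trusting tail percentiles.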

Architecture: Why AIPerf Is Different

AIPerf is not a simple HTTP load generator. It is a 9-service multiprocess architecture communicating via ZMQ:

┌─────────────────────────────────────────────┐
│  AIPerf Orchestrator                        │
├──────────┬──────────┬──────────┬────────────┤
│ Request  │ Traffic  │ Response │ Metrics    │
│ Generator│ Shaper   │ Collector│ Aggregator │
├──────────┼──────────┼──────────┼────────────┤
│ Tokenizer│ Dataset  │ Transport│ Reporter   │
│ Service  │ Service  │ Pool     │ Service    │
└──────────┴──────────┴──────────┴────────────┘

Each service runs in its own process, enabling:

  • High concurrency without GIL contention
  • Accurate timing: request generation and response collection are decoupled
  • Extensibility: plugin system with 25+ extension categories

The Metrics That Matter

Time to First Token (TTFT)

The most important user-facing metric. How long from request submission to the first token appearing?

  • Under 500ms: Excellent, users perceive instant response
  • 500ms - 2s: Acceptable for most applications
  • Over 3s: Users start abandoning

TTFT is dominated by prefill time: processing the entire input prompt before generating the first output token. Longer prompts = higher TTFT.

Time to Second Token (TTST)

Often overlooked but critical for perceived quality. The gap between first and second token reveals whether the model has “warmed up” or if there is scheduling overhead.

Inter-Token Latency (ITL)

The time between consecutive output tokens. This determines the streaming speed users experience. At 60ms ITL, text appears at roughly 17 tokens/second (1000 / 60 ≈ 16.7), fast enough to feel like smooth typing.
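The three per-request latency metrics above can all be derived from the arrival timestamps of streamed tokens. The following is a minimal sketch under that assumption; the function name and the timestamps are hypothetical, and ITL is taken here as the mean gap between consecutive tokens, which may differ in detail from how AIPerf computes it.

```python
def token_latency_metrics(request_sent_s, token_times_s):
    """Derive TTFT, TTST, and mean ITL (in ms) from token arrival timestamps (in seconds)."""
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to compute TTST and ITL")
    # Gap between each pair of consecutive tokens, in milliseconds.
    gaps_ms = [(b - a) * 1000 for a, b in zip(token_times_s, token_times_s[1:])]
    itl_ms = sum(gaps_ms) / len(gaps_ms)
    return {
        "ttft_ms": (token_times_s[0] - request_sent_s) * 1000,
        "ttst_ms": gaps_ms[0],
        "itl_ms": itl_ms,
        "tokens_per_s": 1000 / itl_ms,
    }


# Hypothetical stream: request sent at t=0, five tokens arrive afterwards.
metrics = token_latency_metrics(0.0, [0.40, 0.47, 0.53, 0.59, 0.65])
for name, value in metrics.items():
    print(f"{name}: {value:.1f}")
```

In this made-up stream the first gap (70 ms) is slightly longer than the steady-state gaps (60 ms), which is the kind of difference TTST is meant to expose.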

Output Token Throughput

Total tokens generated per second across all concurrent requests. This is your capacity metric: how many users can your endpoint serve simultaneously?

Request Throughput

Requests completed per second. Combined with token throughput, this tells you whether your endpoint is handling many short requests or fewer long ones.
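As a rough illustration of how the two throughput figures relate, the sketch below aggregates a handful of completed requests over the wall-clock window they span. The `CompletedRequest` record and its values are hypothetical, and the choice of window (first start to last end) is an assumption rather than AIPerf's documented definition.

```python
from dataclasses import dataclass


@dataclass
class CompletedRequest:
    start_s: float       # when the request was sent
    end_s: float         # when the last token arrived
    output_tokens: int   # tokens generated for this request


def throughput(requests):
    """Aggregate token and request throughput over the benchmark window."""
    window_s = max(r.end_s for r in requests) - min(r.start_s for r in requests)
    total_tokens = sum(r.output_tokens for r in requests)
    return {
        "output_tokens_per_s": total_tokens / window_s,
        "requests_per_s": len(requests) / window_s,
        "avg_tokens_per_request": total_tokens / len(requests),
    }


# Hypothetical overlapping requests (not real measurements).
completed = [
    CompletedRequest(0.0, 4.0, 100),
    CompletedRequest(1.0, 6.0, 200),
    CompletedRequest(2.0, 10.0, 100),
]
result = throughput(completed)
print(result)
```

Dividing token throughput by request throughput gives the average output length, which is what distinguishes many short requests from fewer long ones.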

Benchmarking Modes

Concurrency Mode (Default)

Fix the number of concurrent requests and measure performance:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --concurrency 10 \
  --request-count 100 \
  --streaming

Request Rate Mode

Send requests at a fixed rate (requests/second) regardless of response time:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --request-rate 5 \
  --request-count 200

Request Rate with Max Concurrency

Dual control. Send at a target rate but cap the maximum number of concurrent requests:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --request-rate 10 \
  --max-concurrency 20

Trace Replay

Replay real production traffic patterns for the most realistic benchmarks:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --trace-file production-traffic.json

Traffic Patterns: Beyond Constant Load

Real inference traffic is bursty. AIPerf supports multiple arrival patterns:

Poisson Distribution

Models natural request arrivals, the most realistic pattern for web-facing endpoints:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --request-rate 10 \
  --arrival-pattern poisson
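For intuition about what Poisson arrivals look like, here is a small sketch that generates send times by drawing exponentially distributed gaps between requests, which is the standard way to simulate a Poisson process. It illustrates the traffic shape only; the function name is hypothetical and this is not AIPerf's internal scheduler.

```python
import random


def poisson_arrival_times(rate_per_s, count, seed=0):
    """Generate send times for a Poisson process with the given average rate."""
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(count):
        # Gaps between Poisson arrivals are exponentially distributed with mean 1/rate.
        now += rng.expovariate(rate_per_s)
        times.append(now)
    return times


arrivals = poisson_arrival_times(rate_per_s=10, count=10_000)
gaps = [b - a for a, b in zip([0.0] + arrivals, arrivals)]
print(f"mean gap: {sum(gaps) / len(gaps):.4f} s")
print(f"shortest gap: {min(gaps):.5f} s, longest gap: {max(gaps):.3f} s")
```

The average gap stays close to 0.1 s at 10 requests/second, but individual gaps vary widely, so the endpoint sees short bursts followed by quiet periods.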

Gamma Distribution

Models traffic with more variance than Poisson, which is useful for simulating enterprise workloads with periodic bursts.
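One way to see the extra variance is to compare the coefficient of variation (standard deviation divided by mean) of the gaps between requests. In the sketch below, a gamma shape of 1 reproduces exponential gaps (the Poisson case), while a shape below 1 keeps the same average rate but clusters requests into bursts. The parameterization is an assumption for illustration; how AIPerf exposes gamma parameters is not covered here.

```python
import random
import statistics


def gamma_gaps(rate_per_s, shape, count, seed=0):
    """Draw inter-arrival gaps from a gamma distribution with mean 1/rate."""
    rng = random.Random(seed)
    scale = 1.0 / (rate_per_s * shape)  # mean of gamma = shape * scale = 1 / rate
    return [rng.gammavariate(shape, scale) for _ in range(count)]


def coefficient_of_variation(values):
    """Standard deviation relative to the mean; 1.0 for exponential gaps."""
    return statistics.pstdev(values) / statistics.fmean(values)


poisson_like = gamma_gaps(rate_per_s=10, shape=1.0, count=20_000)
bursty = gamma_gaps(rate_per_s=10, shape=0.25, count=20_000)

print(f"shape=1.0  mean gap {statistics.fmean(poisson_like):.3f} s, "
      f"CV {coefficient_of_variation(poisson_like):.2f}")
print(f"shape=0.25 mean gap {statistics.fmean(bursty):.3f} s, "
      f"CV {coefficient_of_variation(bursty):.2f}")
```

Both streams average about 10 requests/second, yet the shape 0.25 stream has roughly twice the relative spread in its gaps.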

Gradual Ramping

Smooth ramp-up to identify the exact point where latency degrades:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --concurrency-start 1 \
  --concurrency-end 50 \
  --concurrency-step 5

This is how you find your endpoint’s breaking point: the concurrency level where P99 latency exceeds your SLA.
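Once a ramp has produced a P99 figure per concurrency level, locating the breaking point is a simple scan. The sketch below assumes you have already extracted (concurrency, P99 latency) pairs from the benchmark output; the sweep values and the 1000 ms SLA are hypothetical.

```python
def breaking_point(sweep, sla_ms):
    """Return the lowest concurrency whose P99 latency exceeds the SLA, or None if none does."""
    for concurrency, p99_ms in sorted(sweep):
        if p99_ms > sla_ms:
            return concurrency
    return None


def max_safe_concurrency(sweep, sla_ms):
    """Return the highest tested concurrency below the breaking point, or None."""
    safe = None
    for concurrency, p99_ms in sorted(sweep):
        if p99_ms > sla_ms:
            break
        safe = concurrency
    return safe


# Hypothetical sweep results: (concurrency, P99 latency in ms).
sweep = [(1, 420), (6, 510), (11, 640), (16, 900), (21, 1600), (26, 3100)]
print("breaking point:", breaking_point(sweep, sla_ms=1000))
print("max safe concurrency:", max_safe_concurrency(sweep, sla_ms=1000))
```

The true threshold lies somewhere between the last safe level and the first failing one, so a second ramp with a smaller step across that interval narrows it down.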

Dataset Support

AIPerf supports diverse workload types beyond simple synthetic prompts:

ShareGPT Dataset

Real conversational data from ChatGPT interactions:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --dataset sharegpt

Custom Prompts

Send your exact production prompts:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --input-file my-prompts.jsonl

Synthetic Generation with Sequence Control

Control input/output sequence length distributions:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --input-tokens-mean 512 \
  --input-tokens-stddev 100 \
  --output-tokens-mean 256

Specialized Datasets

  • AIMO: Math reasoning (NuminaMath)
  • MMStar: Vision language model benchmarks
  • MMVU: Video understanding
  • InstructCoder: Code generation
  • SPEED-Bench: Speculative decoding evaluation
  • Agentic Code Generator: Multi-turn coding agent traces for KV cache benchmarking

Supported Endpoints

AIPerf works with any OpenAI-compatible API:

| Endpoint Type | Use Case |
|---|---|
| Chat Completions | Standard LLM chat (vLLM, NIM, TGI, Ollama) |
| Completions | Text completion APIs |
| Embeddings | Embedding model benchmarks |
| Rankings | Reranker model benchmarks |
| Audio | Audio language models |
| Vision | Vision language models (with image inputs) |
| Image Generation | DALL-E compatible APIs |
| Video Generation | SGLang video generation |
| OpenAI Responses API | New Responses API format |

Advanced Features

Warmup Phase

Eliminate cold-start effects from your measurements:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --warmup-requests 10

Multi-URL Load Balancing

Distribute across multiple inference servers:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://server1:8000 http://server2:8000 http://server3:8000 \
  --concurrency 30

GPU Telemetry

Collect DCGM metrics alongside inference benchmarks:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --gpu-telemetry

This correlates inference performance with GPU utilization, memory usage, and power consumption.

Goodput (SLO-Based Throughput)

Measure throughput that actually meets your SLA:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --goodput-ttft 500 \
  --goodput-itl 100

Only requests with TTFT under 500ms AND ITL under 100ms count toward goodput. This is the metric that matters for production: raw throughput is meaningless if half your requests breach SLA.
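The goodput rule described above can be sketched in a few lines: filter requests by both thresholds, then divide by the benchmark duration. The record layout and the sample values are hypothetical, and "under" is interpreted here as strictly less than the threshold.

```python
def goodput(requests, duration_s, ttft_slo_ms, itl_slo_ms):
    """Count only requests that meet BOTH the TTFT and ITL thresholds."""
    good = [
        r for r in requests
        if r["ttft_ms"] < ttft_slo_ms and r["itl_ms"] < itl_slo_ms
    ]
    return {
        "good_requests_per_s": len(good) / duration_s,
        "good_fraction": len(good) / len(requests),
    }


# Hypothetical per-request results collected over a 2-second window.
results = [
    {"ttft_ms": 300, "itl_ms": 60},   # meets both thresholds
    {"ttft_ms": 450, "itl_ms": 120},  # ITL too slow
    {"ttft_ms": 700, "itl_ms": 50},   # TTFT too slow
    {"ttft_ms": 499, "itl_ms": 99},   # meets both, just barely
]
report = goodput(results, duration_s=2.0, ttft_slo_ms=500, itl_slo_ms=100)
print(report)
```

Here raw throughput is 2 requests/second, but only half of the requests meet the SLO, so goodput is 1 request/second.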

Multi-Run Confidence Intervals

Run multiple iterations and get statistical confidence:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --runs 5 \
  --confidence-level 0.95
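To show what a confidence interval over repeated runs means, here is a minimal sketch that computes a 95% interval for the mean of a metric using Student's t distribution, which suits small run counts. The critical values are the standard two-sided 95% table entries; the per-run throughput numbers are hypothetical, and whether AIPerf uses this exact method is an assumption.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom (runs - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_ci_95(run_values):
    """Return (low, high) bounds of the 95% confidence interval for the mean."""
    n = len(run_values)
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch supports 2 to 10 runs")
    mean = statistics.fmean(run_values)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(run_values) / math.sqrt(n)
    half_width = T_CRIT_95[n - 1] * sem
    return mean - half_width, mean + half_width


# Hypothetical output token throughput (tok/s) from five runs.
runs = [98.0, 100.0, 102.0, 101.0, 99.0]
low, high = mean_ci_95(runs)
print(f"mean {statistics.fmean(runs):.1f} tok/s, 95% CI [{low:.2f}, {high:.2f}]")
```

If two configurations have overlapping intervals, the measured difference between them may just be run-to-run noise.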

Request Cancellation Testing

Simulate users abandoning requests, which is critical for testing inference server resilience:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --request-timeout 5000

Prefill Concurrency

Memory-safe benchmarking for long-context workloads. This controls how many requests are in the prefill phase simultaneously:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --prefill-concurrency 4 \
  --input-tokens-mean 32000

UI Modes

Dashboard (Real-Time TUI)

Live terminal dashboard showing metrics updating in real-time:

aiperf profile --ui dashboard ...

Simple (Progress Bars)

Minimal progress indicator:

aiperf profile --ui simple ...

Headless

No UI, which suits CI/CD pipelines:

aiperf profile --ui none ...

Plugin System

AIPerf’s plugin architecture supports 25+ extension categories:

  • Endpoint plugins: Add support for custom inference APIs
  • Dataset plugins: Custom data formats and generators
  • Transport plugins: Custom HTTP clients or protocols
  • Metrics plugins: Additional metric calculations

Create custom plugins by implementing the plugin interface; no core code changes are required.

Practical Benchmarking Workflow

Here is the workflow I use when evaluating inference infrastructure for enterprise deployments:

Step 1: Baseline (Single User)

aiperf profile --model "llama-3.1-70b" --url $URL \
  --concurrency 1 --request-count 50 --streaming

Establishes best-case latency and per-request streaming speed with no contention from other requests.

Step 2: Find the Breaking Point

aiperf profile --model "llama-3.1-70b" --url $URL \
  --concurrency-start 1 --concurrency-end 100 --concurrency-step 10 \
  --request-count 200 --streaming

Identifies where P99 TTFT exceeds your SLA threshold.

Step 3: Sustained Load at Target Concurrency

aiperf profile --model "llama-3.1-70b" --url $URL \
  --concurrency 30 --duration 300 --streaming \
  --arrival-pattern poisson --warmup-requests 20

Five minutes of realistic traffic at your target concurrency reveals memory leaks, GC pauses, and queue buildup.

Step 4: Goodput Validation

aiperf profile --model "llama-3.1-70b" --url $URL \
  --concurrency 30 --duration 300 --streaming \
  --goodput-ttft 1000 --goodput-itl 80

Confirms what percentage of requests actually meet your SLA under sustained load.

AIPerf vs. Other Tools

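| Feature | AIPerf | vegeta/wrk | locust | llm-perf |
|---|---|---|---|---|
| LLM-specific metrics (TTFT, ITL) | Yes | No | No | Yes |
| Streaming support | Yes | No | Limited | Yes |
| Traffic patterns (Poisson, gamma) | Yes | Limited | Yes | No |
| GPU telemetry | Yes | No | No | No |
| Goodput (SLO-based) | Yes | No | No | No |
| Multi-node | Yes | No | Yes | No |
| Plugin system | Yes | No | Yes | No |
| Real-time dashboard | Yes | No | Yes | No |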

AIPerf is purpose-built for LLM inference. Generic HTTP load testers miss the streaming token-level metrics that define LLM user experience.

The Bottom Line

If you are serving LLMs in production and not benchmarking with a tool that understands streaming tokens, TTFT, ITL, and goodput, you are flying blind. AIPerf gives you the visibility to:

  • Right-size GPU allocation based on actual throughput at target latency
  • Compare inference engines (vLLM vs TGI vs NIM) on equal terms
  • Validate autoscaling by finding the concurrency threshold that triggers scale-up
  • Prove SLA compliance with goodput metrics and confidence intervals
  • Catch regressions with reproducible benchmarks in CI/CD

It is Apache 2.0 licensed, actively maintained by NVIDIA, and available via pip install aiperf.

