Why You Need a Real LLM Benchmarking Tool
"How fast is your inference endpoint?" is the wrong question. The right questions are:
- What is the Time to First Token at P99 under 50 concurrent users?
- What is the Inter-Token Latency when the KV cache is 80% full?
- At what concurrency does throughput plateau and latency spike?
- Does your endpoint survive Poisson-distributed bursty traffic?
You cannot answer these with curl and a stopwatch. You need AIPerf.
NVIDIA AIPerf (formerly GenAI-Perf) is an open-source, production-grade benchmarking tool for generative AI inference. It measures the metrics that matter (TTFT, TTST, ITL, throughput, latency distributions) across any OpenAI-compatible endpoint, with realistic traffic patterns, concurrency control, and detailed reporting.
Quick Start: Benchmark in 5 Minutes
1. Set Up a Local Server
# Start Ollama with a small model
docker run -d \
--name ollama \
-p 11434:11434 \
-v ollama-data:/root/.ollama \
ollama/ollama:latest
docker exec -it ollama ollama pull granite4:350m
2. Install and Run AIPerf
python3 -m venv venv
source venv/bin/activate
pip install aiperf
aiperf profile \
--model "granite4:350m" \
--streaming \
--endpoint-type chat \
--tokenizer ibm-granite/granite-4.0-micro \
--url http://localhost:11434
3. Read the Results
AIPerf produces a comprehensive metrics table:
| Metric | avg | min | max | p99 | p90 | p50 | std |
|---|---|---|---|---|---|---|---|
| Time to First Token (ms) | 7,463 | 7,126 | 9,484 | 9,295 | 7,597 | 7,240 | 677 |
| Inter Token Latency (ms) | 65.31 | 53.06 | 81.31 | 81.24 | 80.64 | 63.79 | 9.09 |
| Output Token Throughput (tok/s) | 6.85 | N/A | N/A | N/A | N/A | N/A | N/A |
| Request Latency (ms) | 13,829 | 9,029 | 27,905 | 27,238 | 21,228 | 11,338 | 5,614 |
Plus CSV and JSON exports for post-processing.
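For post-processing, the columns in the table above can be recomputed from raw per-request samples. The sketch below is a minimal illustration, not AIPerf's own code: the `percentile` and `summarize` helpers and the sample values are hypothetical, and the linear-interpolation percentile convention is an assumption (AIPerf may use a different one).

```python
import math
import statistics


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no samples")
    rank = (len(ordered) - 1) * pct / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def summarize(values):
    """Return the same columns as the metrics table for one metric."""
    return {
        "avg": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
        "p99": percentile(values, 99),
        "p90": percentile(values, 90),
        "p50": percentile(values, 50),
        "std": statistics.pstdev(values),  # population std dev (assumption)
    }


if __name__ == "__main__":
    # Hypothetical per-request TTFT samples in milliseconds.
    ttft_samples_ms = [7126.0, 7240.0, 7300.0, 7597.0, 9484.0]
    for name, value in summarize(ttft_samples_ms).items():
        print(f"{name}: {value:.1f}")
```

A single slow request moves p99 and max far more than avg or p50, which is why the tail columns deserve the most attention.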
Architecture: Why AIPerf Is Different
AIPerf is not a simple HTTP load generator. It is a 9-service multiprocess architecture communicating via ZMQ:
┌─────────────────────────────────────────────┐
│             AIPerf Orchestrator             │
├──────────┬──────────┬──────────┬────────────┤
│ Request  │ Traffic  │ Response │ Metrics    │
│ Generator│ Shaper   │ Collector│ Aggregator │
├──────────┼──────────┼──────────┼────────────┤
│ Tokenizer│ Dataset  │ Transport│ Reporter   │
│ Service  │ Service  │ Pool     │ Service    │
└──────────┴──────────┴──────────┴────────────┘
Each service runs in its own process, enabling:
- High concurrency without GIL contention
- Accurate timing: request generation and response collection are decoupled
- Extensibility: plugin system with 25+ extension categories
The Metrics That Matter
Time to First Token (TTFT)
The most important user-facing metric. How long from request submission to the first token appearing?
- Under 500ms: Excellent; users perceive an instant response
- 500ms - 2s: Acceptable for most applications
- Over 3s: Users start abandoning
TTFT is dominated by prefill time, that is, processing the entire input prompt before generating the first output token. Longer prompts mean higher TTFT.
Time to Second Token (TTST)
Often overlooked but critical for perceived quality. The gap between first and second token reveals whether the model has "warmed up" or if there is scheduling overhead.
Inter-Token Latency (ITL)
The time between consecutive output tokens. This determines the streaming speed users experience. At 60ms ITL, text appears at roughly 17 tokens/second (1000 ms / 60 ms ≈ 16.7), fast enough to feel like smooth typing.
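The three latency metrics above all derive from the same raw data: when the request was sent and when each output token arrived. Here is a minimal sketch of that arithmetic, assuming one timestamp per token; the `StreamTiming` type and the helper names are hypothetical and are not part of AIPerf.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    """Timestamps for one streamed request (hypothetical structure)."""
    request_sent_ms: float
    token_arrivals_ms: List[float]  # arrival time of each output token


def ttft_ms(timing: StreamTiming) -> float:
    """Time to First Token: request submission to first token."""
    if not timing.token_arrivals_ms:
        raise ValueError("no tokens received")
    return timing.token_arrivals_ms[0] - timing.request_sent_ms


def ttst_ms(timing: StreamTiming) -> float:
    """Time to Second Token: gap between the first and second token."""
    if len(timing.token_arrivals_ms) < 2:
        raise ValueError("need at least two tokens")
    return timing.token_arrivals_ms[1] - timing.token_arrivals_ms[0]


def mean_itl_ms(timing: StreamTiming) -> float:
    """Mean Inter-Token Latency: average gap between consecutive tokens."""
    arrivals = timing.token_arrivals_ms
    if len(arrivals) < 2:
        raise ValueError("need at least two tokens")
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return sum(gaps) / len(gaps)


def streaming_rate_tokens_per_s(itl_ms: float) -> float:
    """Convert an ITL in milliseconds to a per-request streaming rate."""
    return 1000.0 / itl_ms


if __name__ == "__main__":
    timing = StreamTiming(0.0, [500.0, 560.0, 620.0, 680.0])
    print(ttft_ms(timing), ttst_ms(timing), mean_itl_ms(timing))
    print(round(streaming_rate_tokens_per_s(60.0), 1))
```

Because the gaps telescope, the mean ITL equals (last arrival - first arrival) / (tokens - 1), so it can also be computed from request latency, TTFT, and the output token count alone.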
Output Token Throughput
Total tokens generated per second across all concurrent requests. This is your capacity metric: how many users can your endpoint serve simultaneously?
Request Throughput
Requests completed per second. Combined with token throughput, this tells you whether your endpoint is handling many short requests or fewer long ones.
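To make the relationship between the two throughput metrics concrete, the sketch below computes both from a list of completed requests. The record layout `(start_s, end_s, output_tokens)` and the function names are assumptions for illustration only, and the measurement window is taken as first start to last end, which may differ from how AIPerf defines it.

```python
from typing import List, Tuple

# (start time in s, end time in s, output tokens) per completed request.
Record = Tuple[float, float, int]


def benchmark_window_s(records: List[Record]) -> float:
    """Wall-clock span from the first request start to the last request end."""
    if not records:
        raise ValueError("no records")
    window = max(end for _, end, _ in records) - min(start for start, _, _ in records)
    if window <= 0:
        raise ValueError("window must be positive")
    return window


def output_token_throughput(records: List[Record]) -> float:
    """Total output tokens per second across all concurrent requests."""
    total_tokens = sum(tokens for _, _, tokens in records)
    return total_tokens / benchmark_window_s(records)


def request_throughput(records: List[Record]) -> float:
    """Completed requests per second."""
    return len(records) / benchmark_window_s(records)


if __name__ == "__main__":
    # Hypothetical overlapping requests.
    records = [(0.0, 4.0, 200), (1.0, 6.0, 300), (2.0, 10.0, 500)]
    tok_s = output_token_throughput(records)
    req_s = request_throughput(records)
    print(f"{tok_s:.1f} tok/s, {req_s:.2f} req/s, {tok_s / req_s:.0f} tokens/request")
```

Dividing token throughput by request throughput gives the mean output length per request, which is the quickest way to tell a many-short-requests workload from a few-long-requests one.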
Benchmarking Modes
Concurrency Mode (Default)
Fix the number of concurrent requests and measure performance:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--concurrency 10 \
--request-count 100 \
--streaming
Request Rate Mode
Send requests at a fixed rate (requests/second) regardless of response time:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--request-rate 5 \
--request-count 200
Request Rate with Max Concurrency
Dual control, sending at a target rate while capping the number of concurrent requests:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--request-rate 10 \
--max-concurrency 20
Trace Replay
Replay real production traffic patterns for the most realistic benchmarks:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--trace-file production-traffic.json
Traffic Patterns: Beyond Constant Load
Real inference traffic is bursty. AIPerf supports multiple arrival patterns:
Poisson Distribution
Models natural request arrivals and is the most realistic pattern for web-facing endpoints:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--request-rate 10 \
--arrival-pattern poisson
Gamma Distribution
Models traffic with more variance than Poisson, which is useful for simulating enterprise workloads with periodic bursts.
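The difference between the two arrival patterns is easiest to see in the gaps between requests. The following sketch generates inter-arrival times with Python's standard `random` module; it illustrates the statistics only and is not how AIPerf schedules requests. The function names and the `shape` value are hypothetical.

```python
import random
import statistics


def poisson_gaps(rate_per_s, count, rng):
    """Poisson arrivals: exponentially distributed gaps with mean 1/rate."""
    return [rng.expovariate(rate_per_s) for _ in range(count)]


def gamma_gaps(rate_per_s, count, shape, rng):
    """Gamma-distributed gaps with the same mean rate.

    shape == 1 reproduces Poisson arrivals; shape < 1 makes traffic burstier
    (many short gaps, occasional long ones) at the same average rate.
    """
    scale = 1.0 / (rate_per_s * shape)  # mean gap = shape * scale = 1/rate
    return [rng.gammavariate(shape, scale) for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for a reproducible schedule
    smooth = poisson_gaps(10.0, 10_000, rng)
    bursty = gamma_gaps(10.0, 10_000, 0.5, rng)
    for name, gaps in (("poisson", smooth), ("gamma", bursty)):
        print(name, f"mean={statistics.fmean(gaps):.3f}s",
              f"std={statistics.pstdev(gaps):.3f}s")
```

Both schedules average the same request rate; only the spread of the gaps changes, which is what stresses queueing and batching in the inference server.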
Gradual Ramping
Smooth ramp-up to identify the exact point where latency degrades:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--concurrency-start 1 \
--concurrency-end 50 \
--concurrency-step 5
This is how you find your endpoint's breaking point: the concurrency level where P99 latency exceeds your SLA.
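Once a ramp has produced a P99 latency per concurrency level, locating the breaking point is a simple scan. This is a minimal sketch with made-up numbers; the `find_breaking_point` helper is hypothetical, and the P99 values would in practice come from the exported results.

```python
from typing import Dict, Optional


def find_breaking_point(p99_by_concurrency: Dict[int, float],
                        sla_ms: float) -> Optional[int]:
    """Return the lowest concurrency whose P99 latency exceeds the SLA.

    Returns None if every measured level stays within the SLA.
    """
    for concurrency in sorted(p99_by_concurrency):
        if p99_by_concurrency[concurrency] > sla_ms:
            return concurrency
    return None


if __name__ == "__main__":
    # Hypothetical ramp results: concurrency level -> P99 TTFT in ms.
    ramp = {1: 210.0, 6: 340.0, 11: 480.0, 16: 950.0, 21: 2100.0}
    print(find_breaking_point(ramp, sla_ms=500.0))
```

The last level below the returned value is a reasonable starting point for a sustained-load run, since a short ramp can understate tail latency.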
Dataset Support
AIPerf supports diverse workload types beyond simple synthetic prompts:
ShareGPT Dataset
Real conversational data from ChatGPT interactions:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--dataset sharegpt
Custom Prompts
Send your exact production prompts:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--input-file my-prompts.jsonl
Synthetic Generation with Sequence Control
Control input/output sequence length distributions:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--input-tokens-mean 512 \
--input-tokens-stddev 100 \
--output-tokens-mean 256
Specialized Datasets
- AIMO: Math reasoning (NuminaMath)
- MMStar: Vision language model benchmarks
- MMVU: Video understanding
- InstructCoder: Code generation
- SPEED-Bench: Speculative decoding evaluation
- Agentic Code Generator: Multi-turn coding agent traces for KV cache benchmarking
Supported Endpoints
AIPerf works with any OpenAI-compatible API:
| Endpoint Type | Use Case |
|---|---|
| Chat Completions | Standard LLM chat (vLLM, NIM, TGI, Ollama) |
| Completions | Text completion APIs |
| Embeddings | Embedding model benchmarks |
| Rankings | Reranker model benchmarks |
| Audio | Audio language models |
| Vision | Vision language models (with image inputs) |
| Image Generation | DALL-E compatible APIs |
| Video Generation | SGLang video generation |
| OpenAI Responses API | New Responses API format |
Advanced Features
Warmup Phase
Eliminate cold-start effects from your measurements:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--warmup-requests 10
Multi-URL Load Balancing
Distribute across multiple inference servers:
aiperf profile \
--model "llama-3.1-70b" \
--url http://server1:8000 http://server2:8000 http://server3:8000 \
--concurrency 30
GPU Telemetry
Collect DCGM metrics alongside inference benchmarks:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--gpu-telemetry
This correlates inference performance with GPU utilization, memory usage, and power consumption.
Goodput (SLO-Based Throughput)
Measure throughput that actually meets your SLA:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--goodput-ttft 500 \
--goodput-itl 100
Only requests with TTFT under 500ms AND ITL under 100ms count toward goodput. This is the metric that matters for production: raw throughput is meaningless if half your requests breach SLA.
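The goodput rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than AIPerf's implementation: the record keys `ttft_ms` and `itl_ms`, the `goodput` helper, and the sample data are all hypothetical, and the thresholds are treated as strict upper bounds.

```python
from typing import Dict, List, Tuple


def goodput(records: List[Dict[str, float]], duration_s: float,
            max_ttft_ms: float, max_itl_ms: float) -> Tuple[float, float]:
    """Return (good requests per second, fraction of requests meeting the SLO).

    A request counts only if it satisfies BOTH thresholds.
    """
    if not records or duration_s <= 0:
        raise ValueError("need at least one record and a positive duration")
    good = [
        r for r in records
        if r["ttft_ms"] < max_ttft_ms and r["itl_ms"] < max_itl_ms
    ]
    return len(good) / duration_s, len(good) / len(records)


if __name__ == "__main__":
    # Hypothetical per-request results from a 2-second run.
    records = [
        {"ttft_ms": 320.0, "itl_ms": 60.0},   # meets both thresholds
        {"ttft_ms": 480.0, "itl_ms": 95.0},   # meets both thresholds
        {"ttft_ms": 900.0, "itl_ms": 55.0},   # TTFT too slow
        {"ttft_ms": 300.0, "itl_ms": 140.0},  # ITL too slow
    ]
    rate, fraction = goodput(records, 2.0, max_ttft_ms=500.0, max_itl_ms=100.0)
    print(f"goodput={rate:.1f} req/s, {fraction:.0%} within SLO")
```

In this sample the raw request throughput is 2 req/s but only half of it counts as goodput, which is the gap the metric is designed to expose.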
Multi-Run Confidence Intervals
Run multiple iterations and get statistical confidence:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--runs 5 \
--confidence-level 0.95
Request Cancellation Testing
Simulate users abandoning requests, which is critical for testing inference server resilience:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--request-timeout 5000
Prefill Concurrency
Memory-safe benchmarking for long-context workloads. This controls how many requests are in the prefill phase simultaneously:
aiperf profile \
--model "llama-3.1-70b" \
--url http://inference-server:8000 \
--prefill-concurrency 4 \
--input-tokens-mean 32000
UI Modes
Dashboard (Real-Time TUI)
Live terminal dashboard showing metrics updating in real-time:
aiperf profile --ui dashboard ...
Simple (Progress Bars)
Minimal progress indicator:
aiperf profile --ui simple ...
Headless
No UI, well suited to CI/CD pipelines:
aiperf profile --ui none ...
Plugin System
AIPerf's plugin architecture supports 25+ extension categories:
- Endpoint plugins: Add support for custom inference APIs
- Dataset plugins: Custom data formats and generators
- Transport plugins: Custom HTTP clients or protocols
- Metrics plugins: Additional metric calculations
Create custom plugins by implementing the plugin interface; no core code changes are required.
Practical Benchmarking Workflow
Here is the workflow I use when evaluating inference infrastructure for enterprise deployments:
Step 1: Baseline (Single User)
aiperf profile --model "llama-3.1-70b" --url $URL \
--concurrency 1 --request-count 50 --streaming
Establishes the latency floor: the best per-request performance the endpoint delivers with no contention.
Step 2: Find the Breaking Point
aiperf profile --model "llama-3.1-70b" --url $URL \
--concurrency-start 1 --concurrency-end 100 --concurrency-step 10 \
--request-count 200 --streaming
Identifies where P99 TTFT exceeds your SLA threshold.
Step 3: Sustained Load at Target Concurrency
aiperf profile --model "llama-3.1-70b" --url $URL \
--concurrency 30 --duration 300 --streaming \
--arrival-pattern poisson --warmup-requests 20
Five minutes of realistic traffic at your target concurrency reveals memory leaks, GC pauses, and queue buildup.
Step 4: Goodput Validation
aiperf profile --model "llama-3.1-70b" --url $URL \
--concurrency 30 --duration 300 --streaming \
--goodput-ttft 1000 --goodput-itl 80
Confirms what percentage of requests actually meet your SLA under sustained load.
AIPerf vs. Other Tools
| Feature | AIPerf | vegeta/wrk | locust | llm-perf |
|---|---|---|---|---|
| LLM-specific metrics (TTFT, ITL) | Yes | No | No | Yes |
| Streaming support | Yes | No | Limited | Yes |
| Traffic patterns (Poisson, gamma) | Yes | Limited | Yes | No |
| GPU telemetry | Yes | No | No | No |
| Goodput (SLO-based) | Yes | No | No | No |
| Multi-node | Yes | No | Yes | No |
| Plugin system | Yes | No | Yes | No |
| Real-time dashboard | Yes | No | Yes | No |
AIPerf is purpose-built for LLM inference. Generic HTTP load testers miss the streaming token-level metrics that define LLM user experience.
The Bottom Line
If you are serving LLMs in production and not benchmarking with a tool that understands streaming tokens, TTFT, ITL, and goodput, you are flying blind. AIPerf gives you the visibility to:
- Right-size GPU allocation based on actual throughput at target latency
- Compare inference engines (vLLM vs TGI vs NIM) on equal terms
- Validate autoscaling by finding the concurrency threshold that triggers scale-up
- Prove SLA compliance with goodput metrics and confidence intervals
- Catch regressions with reproducible benchmarks in CI/CD
It is Apache 2.0 licensed, actively maintained by NVIDIA, and available via pip install aiperf.
Benchmarking your inference infrastructure? I help enterprises evaluate, optimize, and scale LLM serving platforms, from GPU selection to autoscaling configuration.