What Is DOCA Perftest?
DOCA Perftest is NVIDIAβs next-generation RDMA benchmarking tool, built on the DOCA SDK. It replaces the traditional perftest suite (ib_write_bw, ib_read_lat, etc.) with a unified binary that supports BlueField-3 DPU features, GPU memory types, and JSON-structured output for automation.
If you run InfiniBand or RoCE networks with BlueField DPUs, DOCA Perftest is the tool you need for validating RDMA throughput and latency before deploying AI training or inference workloads.
Why Not Just Use Traditional Perftest?
Traditional perftest works. It has been the standard RDMA benchmark for over a decade. But it lacks:
- GPU memory support β cannot benchmark GPU Direct RDMA (GDR) or GPU memory types natively
- JSON output β only human-readable text, hard to parse in CI/CD pipelines
- BlueField-specific features β no DPU offload testing, no DOCA integration
- Complex traffic patterns β limited to simple client-server pairs
DOCA Perftest addresses all of these. It is a single binary with two modes: CLI (familiar perftest-like flags) and JSON (structured configuration files).
Installation
DOCA Perftest is included in the DOCA SDK. On a BlueField-3 DPU or a host with DOCA installed:
# Check if doca_perftest is available
which doca_perftest
# If not, install the DOCA tools package
sudo apt install doca-tools # Ubuntu/Debian
sudo dnf install doca-tools # RHEL/RockyVerify the version:
doca_perftest --versionCLI Mode: Quick Bandwidth Test
The CLI mode mirrors traditional perftest syntax. Start a server on one node and a client on another.
Server
doca_perftest --test=write_bw --device=mlx5_0 --serverClient
doca_perftest --test=write_bw --device=mlx5_0 \
--server-ip=192.168.1.100 \
--msg-size=65536 \
--num-iters=10000Available Test Types
| Test | Description |
|---|---|
write_bw | RDMA Write bandwidth |
write_lat | RDMA Write latency |
read_bw | RDMA Read bandwidth |
read_lat | RDMA Read latency |
send_bw | Send/Receive bandwidth |
send_lat | Send/Receive latency |
Key CLI Flags
# Specify message size (bytes)
--msg-size=4096
# Number of iterations
--num-iters=5000
# Number of QPs (queue pairs)
--num-qps=4
# Connection type
--connection-type=RC # Reliable Connected (default)
--connection-type=UC # Unreliable Connected
--connection-type=UD # Unreliable Datagram
# Port number
--port=18515
# GID index for RoCE
--gid-index=3
# Inline data threshold
--inline-size=64JSON Mode: Structured Configuration
JSON mode is where DOCA Perftest shines for automation. Define your test in a JSON file:
{
"test": {
"type": "write_bw",
"device": "mlx5_0",
"connection": {
"type": "RC",
"num_qps": 4
},
"traffic": {
"msg_size": 65536,
"num_iters": 10000,
"sl": 0,
"inline_size": 0
},
"output": {
"format": "json",
"file": "/tmp/results.json"
}
}
}Run with:
# Server
doca_perftest --json=config.json --server
# Client
doca_perftest --json=config.json --server-ip=192.168.1.100The JSON output file contains structured results:
{
"test": "write_bw",
"results": {
"bandwidth_gbps": 198.4,
"bandwidth_msg_rate": 378000,
"msg_size": 65536,
"num_iters": 10000
},
"latency": {
"avg_usec": 1.23,
"p50_usec": 1.15,
"p99_usec": 2.45,
"max_usec": 12.8
}
}This is easy to parse in Python, Grafana, or CI/CD systems for automated performance regression testing.
GPU Memory Types
DOCA Perftest supports benchmarking different memory types β critical for validating GPU Direct RDMA paths:
# Host memory (default)
doca_perftest --test=write_bw --memory-type=host
# GPU memory (GPU Direct RDMA)
doca_perftest --test=write_bw --memory-type=gpu --gpu-id=0
# Managed memory (CUDA unified)
doca_perftest --test=write_bw --memory-type=managed --gpu-id=0
# Device memory on BlueField DPU
doca_perftest --test=write_bw --memory-type=dpuWhy GPU Memory Testing Matters
When training large models across multiple nodes, data flows through this path:
GPU β NVLink β NVSwitch β PCIe β NIC β Network β NIC β PCIe β NVSwitch β NVLink β GPUWith GPU Direct RDMA (GDR), the NIC reads/writes directly to GPU memory, bypassing the CPU and host memory entirely:
GPU β PCIe β NIC β Network β NIC β PCIe β GPUDOCA Perftest with --memory-type=gpu benchmarks this GDR path. If you see significantly lower bandwidth with GPU memory compared to host memory, your GDR path is not working correctly β check PCIe ACS settings, IOMMU configuration, and GPU Direct kernel modules.
Traffic Patterns
Bidirectional Bandwidth
doca_perftest --test=write_bw --bidirectional \
--device=mlx5_0 --server-ip=192.168.1.100Multi-QP Scaling
Test how bandwidth scales with multiple queue pairs:
for qps in 1 2 4 8 16; do
echo "Testing with $qps QPs..."
doca_perftest --test=write_bw --num-qps=$qps \
--device=mlx5_0 --server-ip=192.168.1.100 \
--msg-size=65536 --num-iters=5000
doneMessage Size Sweep
for size in 64 512 4096 65536 1048576; do
doca_perftest --test=write_bw --msg-size=$size \
--device=mlx5_0 --server-ip=192.168.1.100 \
--num-iters=5000
doneLatency Testing
For latency-sensitive workloads (inference, real-time systems):
# RDMA Write latency
doca_perftest --test=write_lat --device=mlx5_0 \
--server-ip=192.168.1.100 --msg-size=4 --num-iters=50000
# Expected results on ConnectX-7 / BlueField-3:
# P50: ~0.8-1.2 usec
# P99: ~1.5-2.5 usecLatency vs Bandwidth Trade-off
| Message Size | Dominated by | Typical Metric |
|---|---|---|
| under 4 KB | Latency | P50/P99 latency |
| 4 KB - 64 KB | Transition | Both matter |
| over 64 KB | Bandwidth | Gbps throughput |
For AI training (NCCL AllReduce), large messages dominate β bandwidth is the primary concern. For inference (KV cache transfer), small messages matter β latency is critical.
SLURM Integration
In HPC clusters managed by SLURM, run DOCA Perftest across node pairs:
#!/bin/bash
#SBATCH --job-name=rdma-bench
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --partition=gpu
NODES=($(scontrol show hostname $SLURM_JOB_NODELIST))
SERVER=${NODES[0]}
CLIENT=${NODES[1]}
# Start server
srun --nodes=1 --nodelist=$SERVER \
doca_perftest --test=write_bw --device=mlx5_0 --server &
sleep 2
# Start client
srun --nodes=1 --nodelist=$CLIENT \
doca_perftest --test=write_bw --device=mlx5_0 \
--server-ip=$SERVER --msg-size=65536 --num-iters=10000
waitAll-to-All Benchmarking
For multi-node AI clusters, benchmark all node pairs:
#!/bin/bash
NODES=(node01 node02 node03 node04)
for i in "${!NODES[@]}"; do
for j in "${!NODES[@]}"; do
if [ $i -lt $j ]; then
echo "Testing ${NODES[$i]} <-> ${NODES[$j]}"
ssh ${NODES[$i]} "doca_perftest --test=write_bw --device=mlx5_0 --server" &
sleep 1
ssh ${NODES[$j]} "doca_perftest --test=write_bw --device=mlx5_0 \
--server-ip=${NODES[$i]} --msg-size=65536 --num-iters=5000"
kill %1 2>/dev/null
fi
done
doneExpected Performance on BlueField-3
| Metric | ConnectX-7 400G | BlueField-3 400G |
|---|---|---|
| Write BW (unidirectional) | ~395 Gbps | ~395 Gbps |
| Write BW (bidirectional) | ~780 Gbps | ~780 Gbps |
| Write latency (4B) | ~0.9 usec | ~1.1 usec |
| GDR Write BW | ~380 Gbps | ~380 Gbps |
If your numbers are significantly below these, check:
- PFC configuration β see Enable PFC on Mellanox ConnectX NICs
- PCIe width β
lspci -vvsshould show x16 Gen5 - NUMA affinity β ensure NIC and GPU are on the same NUMA node
- MTU β InfiniBand uses 4096 by default; RoCE should use 9000 (jumbo frames)
Automating Performance Regression Testing
Combine JSON mode with CI/CD for automated network validation:
#!/usr/bin/env python3
import json
import subprocess
import sys
# Run DOCA Perftest with JSON output
result = subprocess.run([
"doca_perftest", "--test=write_bw",
"--device=mlx5_0", "--server-ip=192.168.1.100",
"--msg-size=65536", "--num-iters=5000",
"--output-format=json", "--output-file=/tmp/result.json"
], capture_output=True, text=True)
with open("/tmp/result.json") as f:
data = json.load(f)
bw = data["results"]["bandwidth_gbps"]
print(f"Bandwidth: {bw} Gbps")
# Fail if bandwidth is below threshold
THRESHOLD = 350.0 # Gbps
if bw < THRESHOLD:
print(f"FAIL: {bw} Gbps below threshold {THRESHOLD} Gbps")
sys.exit(1)
print("PASS: Bandwidth within expected range")Building AI infrastructure with BlueField DPUs? I help enterprises design, benchmark, and optimize RDMA networking for GPU clusters running AI training and inference workloads.
Book an Infrastructure Assessment β