Skip to main content
πŸŽ“ Claude Code Masterclass Learn AI-assisted development on Udemy β€” plus the companion book on Leanpub & Amazon. Start Learning
NVIDIA DOCA Perftest RDMA benchmarking guide for BlueField DPUs
AI

NVIDIA DOCA Perftest: RDMA Benchmarking Guide

How to use DOCA Perftest to benchmark RDMA performance on BlueField-3 DPUs. Covers CLI and JSON modes, GPU memory types, traffic patterns, bandwidth and.

LB
Luca Berton
Β· 3 min read

What Is DOCA Perftest?

DOCA Perftest is NVIDIA’s next-generation RDMA benchmarking tool, built on the DOCA SDK. It replaces the traditional perftest suite (ib_write_bw, ib_read_lat, etc.) with a unified binary that supports BlueField-3 DPU features, GPU memory types, and JSON-structured output for automation.

If you run InfiniBand or RoCE networks with BlueField DPUs, DOCA Perftest is the tool you need for validating RDMA throughput and latency before deploying AI training or inference workloads.

Why Not Just Use Traditional Perftest?

Traditional perftest works. It has been the standard RDMA benchmark for over a decade. But it lacks:

  • GPU memory support β€” cannot benchmark GPU Direct RDMA (GDR) or GPU memory types natively
  • JSON output β€” only human-readable text, hard to parse in CI/CD pipelines
  • BlueField-specific features β€” no DPU offload testing, no DOCA integration
  • Complex traffic patterns β€” limited to simple client-server pairs

DOCA Perftest addresses all of these. It is a single binary with two modes: CLI (familiar perftest-like flags) and JSON (structured configuration files).

Installation

DOCA Perftest is included in the DOCA SDK. On a BlueField-3 DPU or a host with DOCA installed:

# Check if doca_perftest is available
which doca_perftest

# If not, install the DOCA tools package
sudo apt install doca-tools    # Ubuntu/Debian
sudo dnf install doca-tools    # RHEL/Rocky

Verify the version:

doca_perftest --version

CLI Mode: Quick Bandwidth Test

The CLI mode mirrors traditional perftest syntax. Start a server on one node and a client on another.

Server

doca_perftest --test=write_bw --device=mlx5_0 --server

Client

doca_perftest --test=write_bw --device=mlx5_0 \
  --server-ip=192.168.1.100 \
  --msg-size=65536 \
  --num-iters=10000

Available Test Types

TestDescription
write_bwRDMA Write bandwidth
write_latRDMA Write latency
read_bwRDMA Read bandwidth
read_latRDMA Read latency
send_bwSend/Receive bandwidth
send_latSend/Receive latency

Key CLI Flags

# Specify message size (bytes)
--msg-size=4096

# Number of iterations
--num-iters=5000

# Number of QPs (queue pairs)
--num-qps=4

# Connection type
--connection-type=RC    # Reliable Connected (default)
--connection-type=UC    # Unreliable Connected
--connection-type=UD    # Unreliable Datagram

# Port number
--port=18515

# GID index for RoCE
--gid-index=3

# Inline data threshold
--inline-size=64

JSON Mode: Structured Configuration

JSON mode is where DOCA Perftest shines for automation. Define your test in a JSON file:

{
  "test": {
    "type": "write_bw",
    "device": "mlx5_0",
    "connection": {
      "type": "RC",
      "num_qps": 4
    },
    "traffic": {
      "msg_size": 65536,
      "num_iters": 10000,
      "sl": 0,
      "inline_size": 0
    },
    "output": {
      "format": "json",
      "file": "/tmp/results.json"
    }
  }
}

Run with:

# Server
doca_perftest --json=config.json --server

# Client
doca_perftest --json=config.json --server-ip=192.168.1.100

The JSON output file contains structured results:

{
  "test": "write_bw",
  "results": {
    "bandwidth_gbps": 198.4,
    "bandwidth_msg_rate": 378000,
    "msg_size": 65536,
    "num_iters": 10000
  },
  "latency": {
    "avg_usec": 1.23,
    "p50_usec": 1.15,
    "p99_usec": 2.45,
    "max_usec": 12.8
  }
}

This is easy to parse in Python, Grafana, or CI/CD systems for automated performance regression testing.

GPU Memory Types

DOCA Perftest supports benchmarking different memory types β€” critical for validating GPU Direct RDMA paths:

# Host memory (default)
doca_perftest --test=write_bw --memory-type=host

# GPU memory (GPU Direct RDMA)
doca_perftest --test=write_bw --memory-type=gpu --gpu-id=0

# Managed memory (CUDA unified)
doca_perftest --test=write_bw --memory-type=managed --gpu-id=0

# Device memory on BlueField DPU
doca_perftest --test=write_bw --memory-type=dpu

Why GPU Memory Testing Matters

When training large models across multiple nodes, data flows through this path:

GPU β†’ NVLink β†’ NVSwitch β†’ PCIe β†’ NIC β†’ Network β†’ NIC β†’ PCIe β†’ NVSwitch β†’ NVLink β†’ GPU

With GPU Direct RDMA (GDR), the NIC reads/writes directly to GPU memory, bypassing the CPU and host memory entirely:

GPU β†’ PCIe β†’ NIC β†’ Network β†’ NIC β†’ PCIe β†’ GPU

DOCA Perftest with --memory-type=gpu benchmarks this GDR path. If you see significantly lower bandwidth with GPU memory compared to host memory, your GDR path is not working correctly β€” check PCIe ACS settings, IOMMU configuration, and GPU Direct kernel modules.

Traffic Patterns

Bidirectional Bandwidth

doca_perftest --test=write_bw --bidirectional \
  --device=mlx5_0 --server-ip=192.168.1.100

Multi-QP Scaling

Test how bandwidth scales with multiple queue pairs:

for qps in 1 2 4 8 16; do
  echo "Testing with $qps QPs..."
  doca_perftest --test=write_bw --num-qps=$qps \
    --device=mlx5_0 --server-ip=192.168.1.100 \
    --msg-size=65536 --num-iters=5000
done

Message Size Sweep

for size in 64 512 4096 65536 1048576; do
  doca_perftest --test=write_bw --msg-size=$size \
    --device=mlx5_0 --server-ip=192.168.1.100 \
    --num-iters=5000
done

Latency Testing

For latency-sensitive workloads (inference, real-time systems):

# RDMA Write latency
doca_perftest --test=write_lat --device=mlx5_0 \
  --server-ip=192.168.1.100 --msg-size=4 --num-iters=50000

# Expected results on ConnectX-7 / BlueField-3:
# P50: ~0.8-1.2 usec
# P99: ~1.5-2.5 usec

Latency vs Bandwidth Trade-off

Message SizeDominated byTypical Metric
under 4 KBLatencyP50/P99 latency
4 KB - 64 KBTransitionBoth matter
over 64 KBBandwidthGbps throughput

For AI training (NCCL AllReduce), large messages dominate β€” bandwidth is the primary concern. For inference (KV cache transfer), small messages matter β€” latency is critical.

SLURM Integration

In HPC clusters managed by SLURM, run DOCA Perftest across node pairs:

#!/bin/bash
#SBATCH --job-name=rdma-bench
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --partition=gpu

NODES=($(scontrol show hostname $SLURM_JOB_NODELIST))
SERVER=${NODES[0]}
CLIENT=${NODES[1]}

# Start server
srun --nodes=1 --nodelist=$SERVER \
  doca_perftest --test=write_bw --device=mlx5_0 --server &

sleep 2

# Start client
srun --nodes=1 --nodelist=$CLIENT \
  doca_perftest --test=write_bw --device=mlx5_0 \
  --server-ip=$SERVER --msg-size=65536 --num-iters=10000

wait

All-to-All Benchmarking

For multi-node AI clusters, benchmark all node pairs:

#!/bin/bash
NODES=(node01 node02 node03 node04)

for i in "${!NODES[@]}"; do
  for j in "${!NODES[@]}"; do
    if [ $i -lt $j ]; then
      echo "Testing ${NODES[$i]} <-> ${NODES[$j]}"
      ssh ${NODES[$i]} "doca_perftest --test=write_bw --device=mlx5_0 --server" &
      sleep 1
      ssh ${NODES[$j]} "doca_perftest --test=write_bw --device=mlx5_0 \
        --server-ip=${NODES[$i]} --msg-size=65536 --num-iters=5000"
      kill %1 2>/dev/null
    fi
  done
done

Expected Performance on BlueField-3

MetricConnectX-7 400GBlueField-3 400G
Write BW (unidirectional)~395 Gbps~395 Gbps
Write BW (bidirectional)~780 Gbps~780 Gbps
Write latency (4B)~0.9 usec~1.1 usec
GDR Write BW~380 Gbps~380 Gbps

If your numbers are significantly below these, check:

  1. PFC configuration β€” see Enable PFC on Mellanox ConnectX NICs
  2. PCIe width β€” lspci -vvs should show x16 Gen5
  3. NUMA affinity β€” ensure NIC and GPU are on the same NUMA node
  4. MTU β€” InfiniBand uses 4096 by default; RoCE should use 9000 (jumbo frames)

Automating Performance Regression Testing

Combine JSON mode with CI/CD for automated network validation:

#!/usr/bin/env python3
import json
import subprocess
import sys

# Run DOCA Perftest with JSON output
result = subprocess.run([
    "doca_perftest", "--test=write_bw",
    "--device=mlx5_0", "--server-ip=192.168.1.100",
    "--msg-size=65536", "--num-iters=5000",
    "--output-format=json", "--output-file=/tmp/result.json"
], capture_output=True, text=True)

with open("/tmp/result.json") as f:
    data = json.load(f)

bw = data["results"]["bandwidth_gbps"]
print(f"Bandwidth: {bw} Gbps")

# Fail if bandwidth is below threshold
THRESHOLD = 350.0  # Gbps
if bw < THRESHOLD:
    print(f"FAIL: {bw} Gbps below threshold {THRESHOLD} Gbps")
    sys.exit(1)

print("PASS: Bandwidth within expected range")

Building AI infrastructure with BlueField DPUs? I help enterprises design, benchmark, and optimize RDMA networking for GPU clusters running AI training and inference workloads.

Book an Infrastructure Assessment β†’

Free 30-min AI & Cloud consultation

Book Now