
Dell PowerScale + NVIDIA GPUDirect Storage Guide

Luca Berton
#dell#powerscale#onefs#nvidia#gpudirect-storage#gds#nfs#rdma#gpu-io#storage

Modern AI training, inference, and analytics pipelines don’t just need fast GPUs—they need a fast, predictable way to feed those GPUs with data. As datasets grow, the “time to first batch” and the steady-state throughput of reading training shards, feature stores, embeddings, or simulation outputs can become the real limiter. That’s the problem Dell Technologies and NVIDIA are addressing with the combination of Dell’s scale-out NAS and NVIDIA’s direct-to-GPU storage path.

The core idea: stop bouncing data through CPU memory

NVIDIA GPUDirect Storage (GDS) creates a direct DMA data path between storage and GPU memory, avoiding “bounce buffers” in CPU system memory. The practical impact is higher effective bandwidth, lower I/O latency, and less CPU utilization—especially valuable when your CPUs are already busy with networking, preprocessing, orchestration, or running multiple GPU jobs per node. (NVIDIA Docs)
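Before expecting any of those benefits, it is worth confirming that the GDS components are actually present on a node. A minimal sketch, assuming a standard CUDA-bundled GDS install (tool paths and the gdscheck vs. gdscheck.py name vary by release):

# The nvidia-fs kernel module provides the direct DMA path; it must be loaded.
lsmod | grep nvidia_fs

# gdscheck prints the cuFile/GDS configuration and which filesystems are
# recognized for the direct path on this host (gdscheck.py in newer releases).
/usr/local/cuda/gds/tools/gdscheck -p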


What Dell PowerScale brings to the table

Dell PowerScale is a scale-out NAS platform designed to grow performance and capacity by adding nodes. In the performance report, Dell’s message is straightforward: as GPU-accelerated analytics and model training intensify I/O demands, storage must scale linearly and stay consistent under load. Dell positions PowerScale (with OneFS) as that elastic back-end—especially when paired with high-speed networking and RDMA-enabled access paths. (Dell Technologies Info Hub)


How GDS works (in the parts you actually feel)

GDS is delivered via NVIDIA’s Magnum IO stack and is typically used through cuFile APIs (or through frameworks/libraries that integrate them). The model is explicit: the application opens the file (traditionally with O_DIRECT), registers the file handle and, optionally, its GPU buffers with cuFile, and then issues cuFileRead/cuFileWrite calls that move data directly between storage and GPU memory.

NVIDIA emphasizes that this “explicit, proactive” approach avoids overhead from reactive paging/faulting patterns and can deliver the biggest benefit when your pipeline is GPU-first (GPU is the first/last to touch the data). (NVIDIA Docs)

A key detail: traditionally, direct transfers have relied on opening files with O_DIRECT (plus alignment requirements on buffers and offsets), though NVIDIA notes that newer releases can take the GDS-driven path in more of these cases as long as buffers and offsets are aligned. (NVIDIA Docs)
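A quick way to confirm that a given NFS mount tolerates direct, aligned I/O at your intended size is a plain dd test. This is not a GDS benchmark, just a sanity check on the path; the file name below is hypothetical, and the mount point is the one used later in this post:

# Aligned O_DIRECT reads against the mount (sanity check only, not a GDS test).
# 512 KiB matches the I/O size highlighted in the report's sequential read tests.
dd if=/mnt/f600_gdsio1/sample.bin of=/dev/null bs=512K iflag=direct count=2048 status=progress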


The integration pattern: PowerScale + NFS over RDMA + GDS

Dell’s report focuses on a concrete, common architecture for GPU clusters: GPU servers running GDS, equipped with RDMA-capable NICs, mounting PowerScale (OneFS) exports over NFSv3 with RDMA (NFSoRDMA) across high-speed Ethernet.

Dell states that the testing demonstrates that PowerScale OneFS with NFSoRDMA is compatible with and supported by NVIDIA GDS, and that the platform scales to meet growth demands. (Dell Technologies Info Hub)

From NVIDIA’s side, the Release Notes explicitly list Dell’s platform in the ecosystem of third-party storage solutions where GDS is available, and the support matrix includes a PowerScale entry (e.g., PowerScale 9.2.0.0 paired with early GDS versions). (NVIDIA Docs)
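Once the share is mounted, it is also worth confirming that the client actually negotiated NFSv3 over RDMA rather than silently falling back to TCP. A small sketch, assuming the mount point from the example later in this post:

# Show the negotiated options for each NFS mount on the client;
# look for proto=rdma and vers=3 in the output.
nfsstat -m

# Or inspect the kernel's view of the mount directly.
grep f600_gdsio1 /proc/mounts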


What the Dell performance report actually found

Dell used NVIDIA’s gdsio utility (included in the GDS package) to drive the workload and measure performance. (Dell Technologies Info Hub)
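As an illustration (not Dell’s exact command line), a gdsio invocation shaped like the reported tests could look like the sketch below: 512 KiB I/Os, sequential reads, several worker threads against one GPU, run once over the direct GDS path and once through a CPU bounce buffer for contrast. Flag meanings and transfer-mode numbering follow gdsio’s help text and can differ between GDS releases, so confirm with gdsio -h; the test file name and sizes are hypothetical.

# Sequential reads, 512 KiB I/O size, 8 worker threads, GPU 0, direct GDS path.
# -x 0 = GPUDirect Storage, -I 0 = sequential read, -i = I/O size,
# -s = file size, -T = run time in seconds (confirm numbering with gdsio -h).
gdsio -f /mnt/f600_gdsio1/gdsio.dat -d 0 -w 8 -s 64G -i 512K -x 0 -I 0 -T 120

# Same workload through a CPU bounce buffer (storage -> CPU -> GPU) for contrast.
gdsio -f /mnt/f600_gdsio1/gdsio.dat -d 0 -w 8 -s 64G -i 512K -x 2 -I 0 -T 120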

Key results (high signal)

In the reported gdsio sequential read tests (512 KiB I/O size, multiple threads per GPU, large file sizes), Dell highlights:

Configuration choices that matter (and why)

To keep results focused on raw I/O and reduce background effects, Dell disabled or tuned several storage features during testing:

On the compute side, Dell documents GDS installation/validation, plus the use of dual 100 Gbps NICs and explicit mapping of mount points to specific PowerScale front-end IPs—aimed at maximizing throughput and avoiding hot spots. (Dell Technologies Info Hub)
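The “explicit mapping” point translates into something simple: mount the export more than once, pin each mount to a different PowerScale front-end IP, and give each GPU/NIC pair its own mount point. A sketch using the same options as the report’s example, with placeholder IPs:

# Two mounts of the same export, each pinned to a different front-end IP,
# so traffic from different NICs/GPUs doesn't pile onto a single interface.
mount -o proto=rdma,port=20049,vers=3,rsize=524288,wsize=524288 \
  <powerscale_frontend_ip_1>:/ifs/benchmark /mnt/f600_gdsio1
mount -o proto=rdma,port=20049,vers=3,rsize=524288,wsize=524288 \
  <powerscale_frontend_ip_2>:/ifs/benchmark /mnt/f600_gdsio2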


Don’t ignore topology: NUMA + PCIe “hops” can make or break GDS

A practical insight from the report is that GPUDirect-style benefits depend heavily on where your GPUs and NICs sit in the PCIe/NUMA topology.

Dell stresses limiting the number of “hops” between GPU and NIC, grouping GPUs and NICs by NUMA affinity, and using tools like nvidia-smi topo (and classic Linux tools like lspci) to understand the layout. (Dell Technologies Info Hub)

This is one of the easiest ways to lose performance “mysteriously”: the storage path might be fast, but traffic is bouncing across sockets/interconnects before it even reaches the NIC that talks to the storage.
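In practice, checking this takes only a few commands: dump the GPU/NIC topology matrix, find which NUMA node each device sits on, and pin the I/O workload accordingly. A sketch, with the interface name, NUMA node, and test file as placeholders:

# GPU <-> NIC topology matrix: prefer PIX/PXB (same PCIe switch or root complex)
# over paths that cross the CPU interconnect between the GPU and its NIC.
nvidia-smi topo -m

# NUMA node of a given RDMA-capable NIC (interface name is a placeholder).
cat /sys/class/net/ens3f0/device/numa_node

# Run the I/O workload pinned to that NUMA node's CPUs and memory.
numactl --cpunodebind=1 --membind=1 \
  gdsio -f /mnt/f600_gdsio1/gdsio.dat -d 0 -w 8 -s 64G -i 512K -x 0 -I 0 -T 120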


A minimal “what this looks like” checklist

Below is a distilled version of the operational pattern implied by the Dell + NVIDIA docs:

  1. Confirm GDS sees your filesystem path as supported

  2. Use NFSoRDMA and consistent mount mapping

    • Dell mounts NFS with RDMA options (proto=rdma, NFSv3, tuned rsize/wsize) and maps mounts to specific front-end IPs. (Dell Technologies Info Hub)
  3. Align data path with topology

Example (illustrative, based on Dell’s approach):

mount -o proto=rdma,port=20049,vers=3,rsize=524288,wsize=524288 \
  <powerscale_frontend_ip>:/ifs/benchmark /mnt/f600_gdsio1

(Dell Technologies Info Hub)
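If you want the same mapping to survive reboots, the equivalent entry can live in /etc/fstab. This is a sketch with the same options and a placeholder IP, not something taken from the report:

# /etc/fstab sketch: same options as the manual mount above, applied at boot.
<powerscale_frontend_ip>:/ifs/benchmark  /mnt/f600_gdsio1  nfs  proto=rdma,port=20049,vers=3,rsize=524288,wsize=524288,nofail  0  0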


Where this combo shines

Putting it together, PowerScale + GDS is most compelling when datasets are large, the pipeline is GPU-first (the GPU is the first or last component to touch the data), and the CPUs are already busy with preprocessing, networking, or multiple concurrent GPU jobs per node.

In that scenario, GDS reduces wasted copies and CPU overhead, while PowerScale provides the scale-out throughput and namespace that big GPU farms tend to demand. (NVIDIA Docs)
