gpu-nvme-direct

GPU-Initiated NVMe I/O on Consumer Hardware

Samuel Cortes — naranjositos.tech — February 2026

3,350 MB/s sustained | 99% PCIe link utilization | Zero CPU in data path

Abstract

We present gpu-nvme-direct, a system that enables a consumer GPU (NVIDIA GeForce RTX 3090) to autonomously initiate NVMe storage I/O operations via PCIe MMIO, with no CPU involvement in the data path. Unlike prior work that requires enterprise GPUs with native peer-to-peer (P2P) DMA support, gpu-nvme-direct runs on commodity hardware (AMD Ryzen 7 5800X, B450 motherboard) costing approximately $2,000.

We evaluate on two NVMe devices: a WD SN530 (PCIe 3.0 x4), where gpu-nvme-direct achieves 2,666 MB/s (78% of link bandwidth), and a WD SN740 (a PCIe 4.0 x4 device that runs at PCIe 3.0 speed on the B450 platform), where it reaches 3,350 MB/s sustained (99% of link bandwidth). Integrated into a streaming inference engine, gpu-nvme-direct achieves 0.06 tok/s for Llama 70B from NVMe alone, and 0.50 tok/s when combined with tiered caching (with Q4_K_M quantization and layer skipping) on a single RTX 3090.
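The utilization figures above can be sanity-checked with a little arithmetic. The sketch below divides each reported throughput by its reported utilization to recover the link bandwidth those percentages imply, then compares it with the raw PCIe 3.0 x4 line rate (8 GT/s per lane, 128b/130b encoding), since the Hardware section lists the SN740 as running at Gen3 on the B450 board. The function names are illustrative, and reading the gap to the raw rate as packet and protocol overhead is an assumption, not something the source states.

```python
# Reported figures from the abstract: (throughput in MB/s, fraction of link bandwidth).
REPORTED = {
    "WD SN530": (2666, 0.78),
    "WD SN740": (3350, 0.99),
}


def implied_link_bandwidth(throughput_mb_s: float, utilization: float) -> float:
    """Link bandwidth (MB/s) for which `throughput_mb_s` equals `utilization` of it."""
    return throughput_mb_s / utilization


def pcie_raw_bandwidth_mb_s(gt_per_s: float, lanes: int) -> float:
    """Raw line rate in MB/s for PCIe 3.0 or later, which use 128b/130b encoding."""
    bits_per_s = gt_per_s * 1e9 * lanes * (128 / 130)
    return bits_per_s / 8 / 1e6


if __name__ == "__main__":
    raw = pcie_raw_bandwidth_mb_s(8.0, 4)  # PCIe 3.0 x4
    print(f"raw PCIe 3.0 x4 line rate: {raw:.0f} MB/s")
    for device, (throughput, utilization) in REPORTED.items():
        implied = implied_link_bandwidth(throughput, utilization)
        print(f"{device}: implied link bandwidth {implied:.0f} MB/s "
              f"({implied / raw:.0%} of raw line rate)")
```

Both devices imply roughly the same usable link bandwidth, a little under the raw Gen3 x4 rate, which is consistent with both drives sharing the same class of link.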

Key Results

- 3,350 MB/s sustained (SN740)
- 99% PCIe link utilization
- 2.2x over CPU baselines
- 0.50 tok/s Llama 70B

Architecture

 GPU (RTX 3090)              NVMe (SN740)
+-------------+             +-----------+
| CUDA kernel |  ==MMIO==>  | BAR0      |
| (1 thread)  |  doorbell   | registers |
| sq_submit   |  writes     |           |
| cq_poll     |             +-----------+
+------+------+                   |
       |                          | DMA
       v                          v
+------+------+             +-----------+
| Host pinned |  <==DMA==   | NVMe DMA  |
| memory      |  data       | engine    |
| SQ/CQ/Data  |             |           |
+-------------+             +-----------+

The GPU writes NVMe doorbell registers via PCIe posted MMIO writes (PTX st.mmio.relaxed.sys). The CPU is not involved in the I/O data path. This works on AMD consumer platforms where P2P reads fail, because only posted writes are needed to drive an NVMe controller.
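To make the submission path concrete, here is a minimal host-side model of what a single submit involves: build a 64-byte Read command, place it in the submission queue, advance the tail, and write the new tail to the queue's doorbell. The doorbell offsets and command layout follow the NVMe base specification. Everything else is an assumption for illustration: the helper names (`sq_tail_doorbell`, `build_read_sqe`, `SimulatedQueue`), queue ID 1, a doorbell stride of 0, and 512-byte logical blocks. This is plain Python that records the MMIO write instead of performing it; it is not the project's CUDA kernel.

```python
import struct

DOORBELL_BASE = 0x1000  # NVMe spec: doorbell registers start at BAR0 + 0x1000
NVME_OPC_READ = 0x02    # NVM command set: Read
SQE_SIZE = 64           # every submission queue entry is 64 bytes


def sq_tail_doorbell(qid: int, dstrd: int = 0) -> int:
    """BAR0 offset of the tail doorbell for submission queue `qid`."""
    return DOORBELL_BASE + (2 * qid) * (4 << dstrd)


def cq_head_doorbell(qid: int, dstrd: int = 0) -> int:
    """BAR0 offset of the head doorbell for completion queue `qid`."""
    return DOORBELL_BASE + (2 * qid + 1) * (4 << dstrd)


def build_read_sqe(cid: int, nsid: int, prp1: int, prp2: int,
                   slba: int, nblocks: int) -> bytes:
    """Pack a Read command; `nblocks` is the 1-based count of logical blocks."""
    cdw0 = NVME_OPC_READ | (cid << 16)   # opcode in bits 7:0, command ID in 31:16
    return struct.pack(
        "<IIQQQQIIIIII",
        cdw0, nsid,
        0,                        # CDW2-3: reserved for Read
        0,                        # metadata pointer (unused here)
        prp1, prp2,               # data pointers (physical addresses)
        slba & 0xFFFFFFFF,        # CDW10: starting LBA, low 32 bits
        slba >> 32,               # CDW11: starting LBA, high 32 bits
        (nblocks - 1) & 0xFFFF,   # CDW12: block count is zero-based
        0, 0, 0,                  # CDW13-15
    )


class SimulatedQueue:
    """Hypothetical stand-in for a submission ring in pinned host memory."""

    def __init__(self, depth: int, qid: int = 1):
        self.depth, self.qid, self.tail = depth, qid, 0
        self.ring = bytearray(SQE_SIZE * depth)
        self.mmio_writes = []     # (BAR0 offset, value) pairs, recorded not issued

    def sq_submit(self, sqe: bytes) -> int:
        slot = self.tail
        self.ring[slot * SQE_SIZE:(slot + 1) * SQE_SIZE] = sqe
        self.tail = (self.tail + 1) % self.depth
        # On real hardware the entry must be visible to the controller before
        # this doorbell write is issued; here we only record the write.
        self.mmio_writes.append((sq_tail_doorbell(self.qid), self.tail))
        return slot


if __name__ == "__main__":
    queue = SimulatedQueue(depth=32)
    # 4 KB read = 8 blocks of 512 bytes (assumed LBA size); fits in PRP1 alone.
    queue.sq_submit(build_read_sqe(cid=0, nsid=1, prp1=0x10000000, prp2=0,
                                   slba=0, nblocks=8))
    offset, value = queue.mmio_writes[-1]
    print(f"doorbell write: BAR0+{offset:#x} <- {value}")
```

The only device-visible side effect of a submit is that one 32-bit write to BAR0, which is why posted writes alone are enough to drive the controller; completions arrive as DMA writes into the completion queue.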

Throughput Comparison (QD=32)

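| Block size | gpu-nvme-direct (MB/s) | CPU memcpy (MB/s) | CPU pinned (MB/s) | cuFile (GDS) (MB/s) |
|---|---|---|---|---|
| 4 KB | 336 | 180 | 153 | 151 |
| 16 KB | 1,393 | 626 | 516 | 510 |
| 64 KB | 2,111 | 1,373 | 1,152 | 1,144 |
| 256 KB | 2,666 | 1,731 | 1,409 | 1,396 |
| 512 KB | 2,634 | 2,008 | 1,700 | 1,696 |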

Note: gpu-nvme-direct uses the slower SN530 (Gen3 x4). CPU baselines use the faster 980 PRO (Gen4 x4). Despite this 2x device asymmetry, gpu-nvme-direct achieves higher throughput at all block sizes.
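The per-baseline speedups follow directly from the throughput table. The short sketch below recomputes them; the dictionary simply transcribes the table (all values in MB/s), and the names `THROUGHPUT_MB_S` and `speedups` are illustrative.

```python
# Rows transcribed from the QD=32 throughput table, all in MB/s:
# (gpu-nvme-direct, CPU memcpy, CPU pinned, cuFile (GDS))
THROUGHPUT_MB_S = {
    "4 KB": (336, 180, 153, 151),
    "16 KB": (1393, 626, 516, 510),
    "64 KB": (2111, 1373, 1152, 1144),
    "256 KB": (2666, 1731, 1409, 1396),
    "512 KB": (2634, 2008, 1700, 1696),
}
BASELINES = ("CPU memcpy", "CPU pinned", "cuFile (GDS)")


def speedups(row):
    """Ratio of gpu-nvme-direct throughput to each baseline for one table row."""
    direct, *baseline_values = row
    return {name: direct / value for name, value in zip(BASELINES, baseline_values)}


if __name__ == "__main__":
    for block_size, row in THROUGHPUT_MB_S.items():
        ratios = ", ".join(f"{name}: {r:.2f}x" for name, r in speedups(row).items())
        print(f"{block_size:>6}  {ratios}")
```

The advantage is largest at small block sizes and narrows as blocks grow, which fits a picture where per-request overhead dominates small transfers and the link itself dominates large ones; that reading is an interpretation of the numbers rather than a claim from the source.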

LLM Inference

| Configuration | Bandwidth | tok/s |
|---|---|---|
| mmap+memcpy (baseline) | 1.5 GB/s | 0.028 |
| gpu-nvme-direct NVMe-only (SN530) | 2.1 GB/s | 0.04 |
| gpu-nvme-direct NVMe-only (SN740) | 3.35 GB/s | 0.06 |
| Tiered VRAM/RAM (no NVMe) | 6.5 GB/s | 0.20 |
| Tiered + layer skip | 6.5 GB/s | 0.27 |
| Tiered + Q4_K_M + skip | 6.5 GB/s | 0.50 |

Llama 3.1-70B, Q6_K quantization (~42 GB, 80 layers), single RTX 3090.
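For the streaming rows, a rough ceiling on token rate is bandwidth divided by model size. The sketch below applies that to the three configurations that stream from storage, assuming every generated token requires reading all ~42 GB of weights exactly once. That assumption is ours, not the source's, and it does not hold for the tiered rows, where cached layers or skipped layers reduce the bytes moved per token. The names are illustrative.

```python
MODEL_SIZE_GB = 42.0  # approximate Q6_K model size quoted in the text

# (configuration, bandwidth in GB/s, measured tok/s) from the inference table
STREAMING_ROWS = [
    ("mmap+memcpy (baseline)", 1.5, 0.028),
    ("gpu-nvme-direct NVMe-only (SN530)", 2.1, 0.04),
    ("gpu-nvme-direct NVMe-only (SN740)", 3.35, 0.06),
]


def streaming_bound_tok_s(bandwidth_gb_s: float,
                          model_size_gb: float = MODEL_SIZE_GB) -> float:
    """Upper bound on tok/s if each token streams the whole model once."""
    return bandwidth_gb_s / model_size_gb


def efficiency(measured_tok_s: float, bandwidth_gb_s: float) -> float:
    """Measured rate as a fraction of the streaming bound."""
    return measured_tok_s / streaming_bound_tok_s(bandwidth_gb_s)


if __name__ == "__main__":
    for name, bandwidth, measured in STREAMING_ROWS:
        bound = streaming_bound_tok_s(bandwidth)
        print(f"{name}: bound {bound:.3f} tok/s, measured {measured} tok/s "
              f"({efficiency(measured, bandwidth):.0%} of bound)")
```

Under this assumption all three streaming configurations land at a similar fraction of their bound, which suggests token rate in these rows tracks storage bandwidth closely.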

Hardware

| Component | Detail |
|---|---|
| GPU | NVIDIA RTX 3090 (GA102, 24 GB, PCIe 3.0 x8 on B450) |
| CPU | AMD Ryzen 7 5800X (8C/16T) |
| Platform | ASUS ROG STRIX B450-F GAMING II, 48 GB DDR4 |
| NVMe (test) | WD SN740 512 GB (Gen4 device, Gen3 on B450) |
| NVMe (boot) | WD SN530 1 TB (Gen3 x4) |
| OS/CUDA | Ubuntu 25.10, kernel 6.17, CUDA 13.1 |

Citation

@article{cortes2026gpunvmedirect,
  title={gpu-nvme-direct: GPU-Initiated NVMe I/O on Consumer Hardware},
  author={Cortes, Samuel},
  year={2026},
  url={https://github.com/xaskasdf/gpu-nvme-direct}
}