gpu-nvme-direct: GPU-Initiated NVMe I/O on Consumer Hardware

Samuel Cortes — naranjositos.tech — February 2026

3,350 MB/s sustained | 99% PCIe link utilization | Zero CPU in data path

Abstract

We present gpu-nvme-direct, a system that lets a consumer GPU (NVIDIA GeForce RTX 3090) initiate NVMe storage I/O autonomously via PCIe MMIO, with no CPU involvement in the data path. Unlike prior work that requires enterprise GPUs with native peer-to-peer (P2P) DMA support, gpu-nvme-direct runs on commodity hardware (AMD Ryzen 7 5800X, B450 motherboard) costing approximately $2,000.

We evaluate on two NVMe devices: a WD SN530 (PCIe 3.0 x4), where gpu-nvme-direct achieves 2,666 MB/s (78% of link bandwidth), and a WD SN740 (a PCIe 4.0 x4 device that runs at Gen3 speed on the B450 platform), where it sustains 3,350 MB/s (99% of link bandwidth). Integrated into a streaming inference engine, gpu-nvme-direct achieves 0.06 tok/s for Llama 3.1-70B when streaming from NVMe alone, and 0.50 tok/s when combined with tiered caching, on a single RTX 3090.
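As a quick consistency check on the utilization figures, the sketch below backs out the usable link bandwidth implied by each throughput and percentage pair. The function name is hypothetical and the calculation is only the division implied by the abstract, not a measurement. Both drives land near 3.4 GB/s, which would be consistent with both running over a Gen3 x4 link on this platform.

```python
# (throughput in MB/s, stated fraction of link bandwidth), taken from the abstract.
measurements = {
    "SN530": (2666, 0.78),
    "SN740": (3350, 0.99),
}

def implied_link_mb_s(throughput_mb_s, utilization):
    """Usable link bandwidth implied by a throughput and its stated utilization."""
    return throughput_mb_s / utilization

implied = {name: implied_link_mb_s(t, u) for name, (t, u) in measurements.items()}

for name, bw in implied.items():
    print(f"{name}: ~{bw:.0f} MB/s implied usable link bandwidth")
```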

Architecture

The GPU writes NVMe doorbell registers via PCIe posted MMIO writes (PTX st.relaxed.mmio.sys). The CPU is not involved in the I/O data path. This works on AMD consumer platforms where P2P reads fail, because only posted writes are needed to drive an NVMe controller.
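To make the doorbell mechanism concrete, here is a minimal host-side sketch in Python of the two pieces of data involved: a 64-byte NVMe Read submission queue entry and the BAR0 offset of the doorbell the GPU would write. The field layout and doorbell formula follow the NVMe specification; the function names, the example addresses, and the 512-byte LBA size are assumptions for illustration and are not taken from the gpu-nvme-direct source.

```python
import struct

NVME_OPC_READ = 0x02     # NVM command set opcode for Read
DOORBELL_BASE = 0x1000   # doorbell registers start at this BAR0 offset

def sq_tail_doorbell_offset(qid, dstrd=0):
    """BAR0 offset of the submission queue tail doorbell for queue qid.

    dstrd is the doorbell stride field (CAP.DSTRD) reported by the controller.
    """
    return DOORBELL_BASE + (2 * qid) * (4 << dstrd)

def cq_head_doorbell_offset(qid, dstrd=0):
    """BAR0 offset of the completion queue head doorbell for queue qid."""
    return DOORBELL_BASE + (2 * qid + 1) * (4 << dstrd)

def build_read_sqe(cid, nsid, slba, num_blocks, prp1, prp2=0):
    """Pack a 64-byte Read command (hypothetical helper, little-endian)."""
    cdw0 = NVME_OPC_READ | (cid << 16)   # opcode in bits 7:0, command ID in 31:16
    cdw10 = slba & 0xFFFFFFFF            # starting LBA, low 32 bits
    cdw11 = slba >> 32                   # starting LBA, high 32 bits
    cdw12 = (num_blocks - 1) & 0xFFFF    # number of blocks is zero-based
    return struct.pack(
        "<IIQQQQIIIIII",
        cdw0, nsid,
        0,             # CDW2-3, reserved
        0,             # metadata pointer, unused here
        prp1, prp2,    # data buffer addresses
        cdw10, cdw11, cdw12, 0, 0, 0,
    )

# Example: read 8 blocks (4 KB if the LBA size is 512 bytes) into an assumed buffer.
sqe = build_read_sqe(cid=1, nsid=1, slba=0, num_blocks=8, prp1=0x10000000)
print(len(sqe), hex(sq_tail_doorbell_offset(qid=1)))
```

In the system described above, the final step would be a single posted 32-bit store of the new tail index to that doorbell offset, issued from the GPU rather than the CPU.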

Throughput Comparison (QD=32)

Block Size	gpu-nvme-direct (MB/s)	CPU memcpy (MB/s)	CPU pinned (MB/s)	cuFile (GDS) (MB/s)
4 KB	336	180	153	151
16 KB	1,393	626	516	510
64 KB	2,111	1,373	1,152	1,144
256 KB	2,666	1,731	1,409	1,396
512 KB	2,634	2,008	1,700	1,696

Note: gpu-nvme-direct is measured on the slower SN530 (Gen3 x4), while the CPU baselines are measured on the faster 980 PRO (Gen4 x4). Despite this nominal 2x interface advantage for the baselines, gpu-nvme-direct achieves higher throughput at every block size.
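The size of the gap can be read off the table directly. The sketch below computes the speedup of gpu-nvme-direct over the best CPU-side baseline at each block size, using only the figures above; the variable names are illustrative. By this measure the advantage ranges from about 1.3x at 512 KB to about 2.2x at 16 KB.

```python
# Throughput in MB/s at QD=32, copied from the table:
# (gpu-nvme-direct, CPU memcpy, CPU pinned, cuFile)
rows = {
    "4 KB":   (336, 180, 153, 151),
    "16 KB":  (1393, 626, 516, 510),
    "64 KB":  (2111, 1373, 1152, 1144),
    "256 KB": (2666, 1731, 1409, 1396),
    "512 KB": (2634, 2008, 1700, 1696),
}

# Speedup over the fastest of the three baselines at each block size.
speedup = {
    block: gpu / max(baselines)
    for block, (gpu, *baselines) in rows.items()
}

for block, ratio in speedup.items():
    print(f"{block}: {ratio:.2f}x over best baseline")
```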

LLM Inference

Configuration	Bandwidth	tok/s
mmap+memcpy (baseline)	1.5 GB/s	0.028
gpu-nvme-direct NVMe-only (SN530)	2.1 GB/s	0.04
gpu-nvme-direct NVMe-only (SN740)	3.35 GB/s	0.06
Tiered VRAM/RAM (no NVMe)	6.5 GB/s	0.20
Tiered + layer skip	6.5 GB/s	0.27
Tiered + Q4_K_M + skip	6.5 GB/s	0.50

Llama 3.1-70B, Q6_K quantization (~42 GB, 80 layers), single RTX 3090.
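For the streaming rows, a rough upper bound on token rate is bandwidth divided by the bytes read per token. The sketch below assumes every token requires streaming the full model once and takes the ~42 GB figure from the caption at face value; both are simplifying assumptions, and the result scales inversely with the model size used. The bound is not meant for the tiered rows, where layers held in VRAM presumably do not need to be transferred for each token.

```python
MODEL_GB = 42.0  # assumed bytes streamed per token, from the caption's ~42 GB

def streaming_bound_tok_s(bandwidth_gb_s, model_gb=MODEL_GB):
    """Upper bound on tokens/s if each token streams the whole model once."""
    return bandwidth_gb_s / model_gb

# (configuration, bandwidth in GB/s, measured tok/s) for the streaming rows.
streaming_rows = [
    ("mmap+memcpy (baseline)", 1.5, 0.028),
    ("gpu-nvme-direct NVMe-only (SN530)", 2.1, 0.04),
    ("gpu-nvme-direct NVMe-only (SN740)", 3.35, 0.06),
]

for name, bw, measured in streaming_rows:
    bound = streaming_bound_tok_s(bw)
    print(f"{name}: bound {bound:.3f} tok/s, measured {measured} "
          f"({measured / bound:.0%} of bound)")
```

Under these assumptions the measured rates sit at roughly 75 to 80 percent of the bound, which suggests the NVMe-only configurations are limited mainly by storage bandwidth.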

Hardware

Component	Detail
GPU	NVIDIA RTX 3090 (GA102, 24 GB, PCIe 3.0 x8 on B450)
CPU	AMD Ryzen 7 5800X (8C/16T)
Platform	ASUS ROG STRIX B450-F GAMING II, 48 GB DDR4
NVMe (test)	WD SN740 512 GB (Gen4 device, Gen3 on B450)
NVMe (boot)	WD SN530 1 TB (Gen3 x4)
OS/CUDA	Ubuntu 25.10, kernel 6.17, CUDA 13.1

Citation

@article{cortes2026gpunvmedirect,
  title={gpu-nvme-direct: GPU-Initiated NVMe I/O on Consumer Hardware},
  author={Cortes, Samuel},
  year={2026},
  url={https://github.com/xaskasdf/gpu-nvme-direct}
}