GPU-Initiated NVMe I/O on Consumer Hardware
We present gpu-nvme-direct, a system that lets a consumer GPU (NVIDIA GeForce RTX 3090) initiate NVMe storage I/O autonomously via PCIe MMIO, with no CPU involvement in the data path. Unlike prior work that requires enterprise GPUs with native peer-to-peer (P2P) DMA support, gpu-nvme-direct runs on commodity hardware (AMD Ryzen 7 5800X, B450 motherboard) costing approximately $2,000.
We evaluate on two NVMe devices: a WD SN530 (PCIe 3.0 x4), where gpu-nvme-direct achieves 2,666 MB/s (78% of link bandwidth), and a WD SN740 (PCIe 4.0 x4 device, running at PCIe 3.0 on the B450 platform), where it sustains 3,350 MB/s (99% of link bandwidth). Integrated into a streaming inference engine, gpu-nvme-direct achieves 0.06 tok/s for Llama 3.1-70B from NVMe alone, and 0.50 tok/s when combined with tiered caching (with Q4_K_M quantization and layer skipping) on a single RTX 3090.
The GPU rings the NVMe doorbell registers with PCIe posted MMIO writes (PTX st.mmio.relaxed.sys). The CPU is not involved in the I/O data path. This works on AMD consumer platforms where P2P reads fail, because driving an NVMe controller requires only posted writes.
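To make the doorbell mechanism concrete, the following is a minimal sketch of the address arithmetic behind a doorbell write, following the register layout in the NVMe base specification (doorbells start at offset 0x1000 in BAR0, spaced by `4 << CAP.DSTRD` bytes). It is not the project's GPU-side code: the function names, the default stride, and the queue depth in the usage note are illustrative assumptions.

```python
# Sketch only: doorbell address arithmetic from the NVMe base specification.
# All names here are hypothetical and not taken from gpu-nvme-direct.

DOORBELL_BASE = 0x1000  # doorbell registers start at this offset in BAR0


def doorbell_stride_bytes(cap_dstrd: int) -> int:
    """Distance between consecutive doorbell registers: 4 << CAP.DSTRD bytes."""
    return 4 << cap_dstrd


def sq_tail_doorbell_offset(qid: int, cap_dstrd: int = 0) -> int:
    """BAR0 offset of the tail doorbell for submission queue `qid`."""
    return DOORBELL_BASE + (2 * qid) * doorbell_stride_bytes(cap_dstrd)


def cq_head_doorbell_offset(qid: int, cap_dstrd: int = 0) -> int:
    """BAR0 offset of the head doorbell for completion queue `qid`."""
    return DOORBELL_BASE + (2 * qid + 1) * doorbell_stride_bytes(cap_dstrd)


def next_tail(tail: int, queue_depth: int) -> int:
    """New tail index to write to the doorbell after enqueuing one command."""
    return (tail + 1) % queue_depth
```

For example, with `CAP.DSTRD = 0`, I/O queue 1 has its submission tail doorbell at `sq_tail_doorbell_offset(1)` (0x1008) and its completion head doorbell at `cq_head_doorbell_offset(1)` (0x100C). A 32-bit posted write of `next_tail(tail, queue_depth)` to the former is what tells the controller that a new command is available.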
| Block Size | gpu-nvme-direct (MB/s) | CPU memcpy (MB/s) | CPU pinned (MB/s) | cuFile (GDS) (MB/s) |
|---|---|---|---|---|
| 4 KB | 336 | 180 | 153 | 151 |
| 16 KB | 1,393 | 626 | 516 | 510 |
| 64 KB | 2,111 | 1,373 | 1,152 | 1,144 |
| 256 KB | 2,666 | 1,731 | 1,409 | 1,396 |
| 512 KB | 2,634 | 2,008 | 1,700 | 1,696 |
Note: the gpu-nvme-direct figures were measured on the slower SN530 (Gen3 x4), while the CPU baselines were measured on the faster 980 PRO (Gen4 x4). Despite this 2x device asymmetry in the baselines' favor, gpu-nvme-direct achieves higher throughput at every block size.
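The size of the gap is easier to see as a ratio. The sketch below recomputes the speedup of gpu-nvme-direct over each baseline from the throughput figures in the table above. The dictionary layout and the `speedups` helper are illustrative, not part of the project.

```python
# Throughput in MB/s, copied from the block-size table above.
THROUGHPUT_MBPS = {
    "4 KB":   {"gpu-nvme-direct": 336,  "CPU memcpy": 180,  "CPU pinned": 153,  "cuFile (GDS)": 151},
    "16 KB":  {"gpu-nvme-direct": 1393, "CPU memcpy": 626,  "CPU pinned": 516,  "cuFile (GDS)": 510},
    "64 KB":  {"gpu-nvme-direct": 2111, "CPU memcpy": 1373, "CPU pinned": 1152, "cuFile (GDS)": 1144},
    "256 KB": {"gpu-nvme-direct": 2666, "CPU memcpy": 1731, "CPU pinned": 1409, "cuFile (GDS)": 1396},
    "512 KB": {"gpu-nvme-direct": 2634, "CPU memcpy": 2008, "CPU pinned": 1700, "cuFile (GDS)": 1696},
}


def speedups(row: dict) -> dict:
    """Ratio of gpu-nvme-direct throughput to each baseline in one table row."""
    direct = row["gpu-nvme-direct"]
    return {name: direct / mbps for name, mbps in row.items() if name != "gpu-nvme-direct"}


if __name__ == "__main__":
    for block_size, row in THROUGHPUT_MBPS.items():
        ratios = ", ".join(f"{name}: {ratio:.2f}x" for name, ratio in speedups(row).items())
        print(f"{block_size:>6}  {ratios}")
```

On these figures the ratio ranges from roughly 1.3x (512 KB against CPU memcpy) to roughly 2.7x (16 KB against cuFile). Because the two sides were measured on different drives, the ratios compare end-to-end configurations, not the I/O paths in isolation.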
| Configuration | Bandwidth | tok/s |
|---|---|---|
| mmap+memcpy (baseline) | 1.5 GB/s | 0.028 |
| gpu-nvme-direct NVMe-only (SN530) | 2.1 GB/s | 0.04 |
| gpu-nvme-direct NVMe-only (SN740) | 3.35 GB/s | 0.06 |
| Tiered VRAM/RAM (no NVMe) | 6.5 GB/s | 0.20 |
| Tiered + layer skip | 6.5 GB/s | 0.27 |
| Tiered + Q4_K_M + skip | 6.5 GB/s | 0.50 |
Model: Llama 3.1-70B with Q6_K quantization (~42 GB, 80 layers) unless a row states otherwise, on a single RTX 3090.
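As a sanity check on the streaming rows, the sketch below computes a rough upper bound on tok/s. It assumes (this is not stated in the source) that generating one token requires streaming the full ~42 GB of weights once at the listed bandwidth, with no compute or scheduling overhead. The function and variable names are illustrative.

```python
MODEL_SIZE_GB = 42.0  # approximate Q6_K model size from the caption above


def streaming_bound_tok_per_s(bandwidth_gb_per_s: float,
                              model_size_gb: float = MODEL_SIZE_GB) -> float:
    """Upper bound on tok/s if every token streams all weights exactly once."""
    return bandwidth_gb_per_s / model_size_gb


# (bandwidth in GB/s, reported tok/s) for the rows that stream the whole model.
REPORTED = {
    "mmap+memcpy (baseline)": (1.5, 0.028),
    "gpu-nvme-direct NVMe-only (SN530)": (2.1, 0.04),
    "gpu-nvme-direct NVMe-only (SN740)": (3.35, 0.06),
}

if __name__ == "__main__":
    for name, (bandwidth, reported) in REPORTED.items():
        bound = streaming_bound_tok_per_s(bandwidth)
        print(f"{name}: bound {bound:.3f} tok/s, reported {reported} tok/s")
```

Under this assumption the reported rates sit at roughly 75 to 80% of the bound, which is consistent with these configurations being limited by storage bandwidth. The tiered rows are left out because weights cached in VRAM presumably do not need to be transferred for every token, so a single-bandwidth model does not apply to them.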
| Component | Detail |
|---|---|
| GPU | NVIDIA RTX 3090 (GA102, 24 GB, PCIe 3.0 x8 on B450) |
| CPU | AMD Ryzen 7 5800X (8C/16T) |
| Platform | ASUS ROG STRIX B450-F GAMING II, 48 GB DDR4 |
| NVMe (test) | WD SN740 512 GB (Gen4 device, Gen3 on B450) |
| NVMe (boot) | WD SN530 1 TB (Gen3 x4) |
| OS/CUDA | Ubuntu 25.10, kernel 6.17, CUDA 13.1 |