Design Gateway — Advanced Reference Design

Scalable NVMe over TCP Initiator on Alveo Card

Advanced Reference Design for FPGA-Accelerated Distributed NVMe Storage Access

NVMeTCP25G-IP · 4 × 25G Ethernet · Zero CPU Protocol Load · PCIe XRT DMA · AMD Alveo U50

~9,500 MB/s · 4× Read Throughput

~9,000 MB/s · 4× Write Throughput

Built on an AMD Alveo accelerator card and powered by multiple NVMeTCP25G-IP cores, this reference design enables a host system to access four remote NVMe SSDs simultaneously through four independent 25G Ethernet links — with a scalable FPGA architecture ready for more NVMe/TCP sessions and zero CPU involvement in protocol processing.

The Challenge

Remove the CPU Bottleneck from NVMe over TCP Host Access

A standard NIC-based NVMe/TCP host relies on CPU software to handle all protocol layers — a bottleneck that grows with each additional remote SSD. This Alveo-based design offloads TCP/IP and NVMe/TCP processing into FPGA logic, keeping the CPU free for the application while the card handles data movement to host memory over PCIe/XRT.

Aspect	⚠ Standard NIC Approach	✓ NVMeTCP25G-IP on Alveo
Protocol Processing	CPU processes TCP/IP stack and NVMe/TCP protocol	Hardware offload for TCP/IP and NVMe/TCP host operations
Data Path	Multiple intermediate data copies in host memory	Independent DMA paths for multiple 25G sessions
Scalability	Scaling to multiple remote SSDs increases CPU and memory load	FPGA scalability for additional NVMe/TCP channels

Design Principles

Four Keys to Scalable Remote NVMe SSD Access

The design combines FPGA protocol offload, independent 25G Ethernet channels, direct host-memory transfer, and a scalable IP-core architecture to create a practical platform for high-performance remote storage access.

Full NVMe/TCP Host Offload

NVMeTCP25G-IP integrates the TCP/IP stack and NVMe/TCP host functions in hardware for read and write access to remote NVMe SSDs, with zero CPU involvement in protocol processing.

Four Independent 25G Links

Four 25G Ethernet connections operate simultaneously, allowing the host to access four remote NVMe SSDs in parallel with full per-session bandwidth isolation.

PCIe Accelerator Integration

The Alveo card plugs into the host system as a PCIe accelerator, using XRT for register access and high-speed DMA transfers directly into host memory.

Scalable FPGA Architecture

Additional NVMeTCP25G-IP instances can be integrated to expand the number of remote NVMe SSD sessions, scaling storage bandwidth for larger deployments.

Target Markets

Ideal for Applications Requiring Distributed High-Speed Storage

The design is especially attractive where data is generated, stored, or processed across multiple remote locations but must be accessed by a central host with minimal CPU overhead.

AI / HPC

Offload NVMe/TCP transport entirely to the Alveo FPGA, freeing host CPUs for training and inference. A GPU server fitted with the Alveo card connects through a 25G Ethernet switch to remote target servers and accesses model data, intermediate results, and inference artifacts in parallel, with no host software protocol stack in the data path for any session.

Remote Target Servers

Datasets / Model Checkpoints: Training data ingestion and checkpoint persistence

Shared Scratch / Results: Intermediate outputs and distributed job results

Inference Models: Deployed model weights and serving artifacts

Video & Media Processing

Sustain high-bandwidth, multi-stream media workflows without burdening the host CPU. A GPU server with the Alveo card routes 25GbE sessions through a switch to dedicated media servers — enabling simultaneous codec processing, frame-accurate playback, and adaptive HTTP delivery from independent NVMe storage targets.

Remote Target Servers

Transcode Server: Codec processing for format and bitrate conversion

Player Server: Frame scheduler and timecode management

Edge Player: Segment packaging and HTTP delivery

Measured Results

Performance Matrix

Benchmarked on a single Alveo U50 card. All throughput figures are sustained transfer rates over 25G Ethernet to remote NVMe SSDs, with TCP/IP and NVMe/TCP fully offloaded to FPGA logic — the host CPU contributes 0% to protocol processing.

Configuration	Read Speed	Write Speed	CPU Usage
NVMeTCP25G ×1	2,679 MB/s	2,531 MB/s	0%
NVMeTCP25G ×4	~9,500 MB/s	~9,000 MB/s	0%

A single IP core sustains 2,679 MB/s read and 2,531 MB/s write, roughly 86% and 81% of the 3,125 MB/s raw line rate of a 25G Ethernet link. Scaling to four independent NVMeTCP25G-IP instances delivers ~9,500 MB/s read and ~9,000 MB/s write, about 89% of ideal 4× scaling, so throughput grows close to linearly with the number of cores. Because the entire protocol stack runs in FPGA logic, host CPU load for protocol processing remains at 0% throughout.
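The relationship between the table figures, the link rate, and the four-core scaling can be checked with a few lines of arithmetic. The sketch below assumes decimal megabytes (1 MB = 10^6 bytes) and compares against the raw 25 Gb/s line rate before any Ethernet, TCP/IP, or NVMe/TCP framing overhead. The function names are illustrative, and the four-core inputs are the approximate ("~") values quoted above.

```python
# Figures copied from the performance table (MB/s, decimal megabytes assumed).
MEASURED = {
    1: {"read": 2679, "write": 2531},
    4: {"read": 9500, "write": 9000},  # approximate ("~") values in the table
}

# Raw 25 Gb/s link expressed in MB/s, before any protocol framing overhead.
LINE_RATE_MBPS = 25_000 / 8  # 3,125 MB/s


def link_utilisation(per_link_mbps: float) -> float:
    """Fraction of the raw 25G line rate used by one session."""
    return per_link_mbps / LINE_RATE_MBPS


def scaling_efficiency(direction: str, cores: int = 4) -> float:
    """Measured multi-core throughput divided by ideal linear scaling."""
    ideal = MEASURED[1][direction] * cores
    return MEASURED[cores][direction] / ideal


if __name__ == "__main__":
    for direction in ("read", "write"):
        util = link_utilisation(MEASURED[1][direction])
        eff = scaling_efficiency(direction)
        print(f"{direction}: {util:.1%} of raw line rate, "
              f"{eff:.1%} of ideal 4x scaling")
```

Because the four-session figures are rounded, the efficiency values are estimates rather than exact measurements.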

Technical Specs

System Requirements

FPGA Accelerator Card	AMD Alveo U50 (16 nm UltraScale+ FPGA, PCIe Gen3 x16)
Network Interfaces	100G to 4× 25G breakout cable
IP Core	`NVMeTCP25G-IP` — 4 instances, each managing one 25GbE session
Protocol Support	NVMe/TCP (NVMe over TCP/IP)
Host Interface	PCIe Gen3 x16 — standard add-in card slot
Target System	Any Linux PC or server running the NVMe/TCP target driver (`nvmet`) with an NVMe SSD
Host OS	Ubuntu 20.04.1
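For readers unfamiliar with the target-side requirement, the following dry-run sketch lists the configfs operations that the Linux `nvmet` driver typically needs in order to export one NVMe SSD over TCP. It only builds and prints a plan and touches nothing on the system. The NQN, IP address, and device path are hypothetical placeholders, and the sketch assumes the `nvmet` and `nvmet-tcp` kernel modules are already loaded. It is not taken from the reference design; the Demo Instruction Manual remains the authority for the actual setup.

```python
from pathlib import PurePosixPath

# Standard configfs mount point for the Linux NVMe target driver.
NVMET_ROOT = PurePosixPath("/sys/kernel/config/nvmet")


def nvmet_tcp_plan(nqn, device, addr, port_id=1, nsid=1, svc_port=4420):
    """Return ordered (action, path, value) steps exporting one device over NVMe/TCP.

    For "write" steps, value is the string written to the attribute file.
    For the "symlink" step, path is the link location and value is its target.
    """
    subsys = NVMET_ROOT / "subsystems" / nqn
    ns = subsys / "namespaces" / str(nsid)
    port = NVMET_ROOT / "ports" / str(port_id)
    return [
        ("mkdir", str(subsys), None),
        ("write", str(subsys / "attr_allow_any_host"), "1"),
        ("mkdir", str(ns), None),
        ("write", str(ns / "device_path"), device),
        ("write", str(ns / "enable"), "1"),
        ("mkdir", str(port), None),
        ("write", str(port / "addr_trtype"), "tcp"),
        ("write", str(port / "addr_adrfam"), "ipv4"),
        ("write", str(port / "addr_traddr"), addr),
        ("write", str(port / "addr_trsvcid"), str(svc_port)),
        # Linking the subsystem into the port is what starts the listener.
        ("symlink", str(port / "subsystems" / nqn), str(subsys)),
    ]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    for step in nvmet_tcp_plan("nqn.2024-01.example:ssd0",
                               "/dev/nvme0n1", "192.168.25.10"):
        print(step)
```

In a four-session setup, each of the four targets would be configured this way with its own address, one per 25G link.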

Live Demonstration

Free Evaluation Demo

Watch the full demo of four NVMeTCP25G-IP sessions running simultaneously on the Alveo U50 card — real hardware, real NVMe SSDs, real 25G Ethernet links.

For more details, please refer to the demo videos and documentation published on our website.

📄 Reference Design Document · 📋 Demo Instruction Manual · 💾 Free Evaluation Demo Bitfile (Alveo U50) · ✉️ Contact Us

DEMO VIDEO

Chapter 1 · Hardware Overview & Setup

Chapter 2 · 4 Sessions with DMA on Alveo U50
