minimal

minimal

A jekyll theme for minimal editions

Scalable Network Stack supporting TCP/IP, RoCEv2, UDP/IP at 10-100Gbit/s

Prerequisites

  • Xilinx Vivado 2022.2
  • cmake 3.0 or higher

Supported boards (out of the box)

  • Xilinx VC709
  • Xilinx VCU118
  • Alpha Data ADM-PCIE-7V3

Git submodules

This repository uses git submodules, so do one of the following:

# When cloning:
git clone --recurse-submodules [email protected]/this/repo.git
# Later, if you forgot or when submodules have been updated:
git submodule update --init --recursive

Compiling HLS modules

  1. Create a build directory

    mkdir build
    cd build
    
  2. Configure build

    cmake .. -DFNS_PLATFORM=xilinx_u55c_gen3x16_xdma_3_202210_1 -DFNS_DATA_WIDTH=64
    

All cmake options: | Name | Values | Desription | | -------------------------------- | ---------------- | ----------------------------------------------------------------------- | | FNS_PLATFORM | xilinx_u55c_gen3x16_xdma_3_202210_1 | Target platform to build | FNS_DATA_WIDTH | <8,16,32,64> | Data width of the network stack in bytes | | FNS_ROCE_STACK_MAX_QPS | 500 | Maximum number of queue pairs the RoCE stack can support | FNS_TCP_STACK_MSS | #value | Maximum segment size of the TCP/IP stack | | FNS_TCP_STACK_FAST_RETRANSMIT_EN | <0,1> | Enabling TCP fast retransmit | FNS_TCP_STACK_NODELAY_EN | <0,1> | Toggles Nagle's Algorithm on/off | FNS_TCP_STACK_MAX_SESSIONS | #value | Maximum number of sessions the TCP/IP stack can support | FNS_TCP_STACK_RX_DDR_BYPASS_EN | <0,1> | Enabling DDR bypass on the RX path | FNS_TCP_STACK_WINDOW_SCALING_EN | <0,1> | Enalbing TCP Window scaling option

  1. Build HLS IP cores and install them into IP repository
    make ip
    

For an example project including the TCP/IP stack or the RoCEv2 stack with DMA to host memory checkout our Distributed Accelerator OS DavOS.

Working with individual HLS modules

  1. Setup build directory, e.g. for the TCP module
cd hls/toe
mkdir build
cd build
cmake .. -DFNS_PLATFORM=xilinx_u55c_gen3x16_xdma_3_202210_1 -DFNS_DATA_WIDTH=64
  1. Run
    make csim # C-Simulation (csim_design)
    make synth # Synthesis (csynth_design)
    make cosim # Co-Simulation (cosim_design)
    make ip # Export IP (export_design)
    

Interfaces

All interfaces are using the AXI4-Stream protocol. For AXI4-Streams carrying network/data packets, we use the following definition in HLS:

template <int D>
struct net_axis {
    ap_uint<D>    data;
    ap_uint<D/8>  keep;
    ap_uint<1>    last;
};

TCP/IP

Open Connection

To open a connection the destination IP address and TCP port have to provided through the s_axis_open_conn_req interface. The TCP stack provides an answer to this request through the m_axis_open_conn_rsp interface which provides the sessionID and a boolean indicating if the connection was openend successfully.

Interface definition in HLS:

struct ipTuple {
    ap_uint<32>    ip_address;
    ap_uint<16>    ip_port;
};
struct openStatus {
    ap_uint<16>    sessionID;
    bool        success;
};

void toe(...
    hls::stream<ipTuple>& openConnReq,
    hls::stream<openStatus>& openConnRsp,
    ...);

Close Connection

To close a connection the sessionID has to be provided to the s_axis_close_conn_req interface. The TCP/IP stack does not provide a notification upon completion of this request, however it is guranteeed that the connection is closed eventually.

Interface definition in HLS:

hls::stream<ap_uint<16> >& closeConnReq,

Open a TCP port to listen on

To open a port to listen on (e.g. as a server), the port number has to be provided to s_axis_listen_port_req. The port number has to be in range of active ports: 0 - 32767. The TCP stack will respond through the m_axis_listen_port_rsp interface indicating if the port was set to the listen state succesfully.

Interface definition in HLS:

hls::stream<ap_uint<16> >& listenPortReq,
hls::stream<bool>& listenPortRsp,

Receiving notifications from the TCP stack

The application using the TCP stack can receive notifications through the m_axis_notification interface. The notifications either indicate that new data is available or that a connection was closed.

Interface definition in HLS:

struct appNotification {
    ap_uint<16>            sessionID;
    ap_uint<16>            length;
    ap_uint<32>            ipAddress;
    ap_uint<16>            dstPort;
    bool                closed;
};

hls::stream<appNotification>& notification,

Receiving data

If data is available on a TCP/IP session, i.e. a notification was received. Then this data can be requested through the s_axis_rx_data_req interface. The data as well as the sessionID are then received through the m_axis_rx_data_rsp_metadata and m_axis_rx_data_rsp interface.

Interface definition in HLS:

struct appReadRequest {
    ap_uint<16> sessionID;
    ap_uint<16> length;
};

hls::stream<appReadRequest>& rxDataReq,
hls::stream<ap_uint<16> >& rxDataRspMeta,
hls::stream<net_axis<WIDTH> >& rxDataRsp,

Waveform of receiving a (data) notification, requesting data, and receiving the data:

Transmitting data

When an application wants to transmit data on a TCP connection, it first has to check if enough buffer space is available. This check/request is done through the s_axis_tx_data_req_metadata interface. If the response through the m_axis_tx_data_rsp interface from the TCP stack is positive. The application can send the data through the s_axis_tx_data_req interface. If the response from the TCP stack is negative the application can retry by sending another request on the s_axis_tx_data_req_metadata interface.

Interface definition in HLS:

struct appTxMeta {
    ap_uint<16> sessionID;
    ap_uint<16> length;
};
struct appTxRsp {
    ap_uint<16> sessionID;
    ap_uint<16> length;
    ap_uint<30> remaining_space;
    ap_uint<2>  error;
};

hls::stream<appTxMeta>& txDataReqMeta,
hls::stream<appTxRsp>& txDataRsp,
hls::stream<net_axis<WIDTH> >& txDataReq,

Waveform of requesting a data transmit and transmitting the data.

RoCE (RDMA over Converged Ethernet)

The new RDMA-version (02/2024) is adapted from the one used in Coyote (https://github.com/fpgasystems/Coyote) and fully compatible to the RoCE-v2 standard, thus able to communicate to standard NICs (such as i.e. Mellanox-cards). It is proven to run at 100 Gbit / s, allowing for low latency and high throughput comparable to the results achievable with mentioned ASIC-based NICs.

The whole included design is defined in a Block Diagram as follows:

The packet processing pipeline is coded in Vitis-HLS and included in "roce_v2_ip", consisting of separate modules for the IPv4-, UDP- and InfiniBand-Headers. In the top-level-module "roce_stack.sv", this pipeline is then combined with HDL-coded ICRC-calculation and RDMA-flow control.

For actual usage of the RDMA-stack, it needs to be integrated into a full FPGA-networking stack and combined with some kind of shell that enables DMA-exchange with the host for both commands and memory access. An example for that is Coyote with a networking stack as depicted in the following block diagram:

Load Queue Pair (QP)

Before any RDMA operations can be executed the Queue Pairs have to established out-of-band (e.g. over TCP/IP) by the hosts. The host can the load the QP into the RoCE stack through the s_axis_qp_interface and s_axis_qp_conn_interface interface. Interface definition in HLS:

typedef enum {RESET, INIT, READY_RECV, READY_SEND, SQ_ERROR, ERROR} qpState;

struct qpContext {
    qpState        newState;
    ap_uint<24> qp_num;
    ap_uint<24> remote_psn;
    ap_uint<24> local_psn;
    ap_uint<16> r_key;
    ap_uint<48> virtual_address;
};
struct ifConnReq {
    ap_uint<16> qpn;
    ap_uint<24> remote_qpn;
    ap_uint<128> remote_ip_address;
    ap_uint<16> remote_udp_port;
};

hls::stream<qpContext>&    s_axis_qp_interface,
hls::stream<ifConnReq>&    s_axis_qp_conn_interface,

Issue RDMA commands

RDMA commands can be issued to RoCE stack through the s_axis_tx_meta interface. In case the commands transmits data. This data can be either originate from the host memory as specified by the local_vaddr or can originate from the application on the FPGA. In the latter case the local_vaddr is set to 0 and the data is provided through the s_axis_tx_data interface.

Interface definition in HLS:

typedef enum {APP_READ, APP_WRITE, APP_PART, APP_POINTER, APP_READ_CONSISTENT} appOpCode;

struct txMeta {
    appOpCode     op_code;
    ap_uint<24> qpn;
    ap_uint<48> local_vaddr;
    ap_uint<48> remote_vaddr;
    ap_uint<32> length;
};
hls::stream<txMeta>& s_axis_tx_meta,
hls::stream<net_axis<WIDTH> >& s_axis_tx_data,

Waveform of issuing a RDMA read request:

Waveform of issuing an RDMA write request where data on the FPGA is transmitted:

Publications

  • D. Sidler, G. Alonso, M. Blott, K. Karras et al., Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware, in FCCM’15, Paper, Slides

  • D. Sidler, Z. Istvan, G. Alonso, Low-Latency TCP/IP Stack for Data Center Applications, in FPL'16, Paper

  • D. Sidler, Z. Wang, M. Chiosa, A. Kulkarni, G. Alonso, StRoM: smart remote memory, in EuroSys'20, Paper

Citations

If you use the TCP/IP or RDMA stacks in your project please cite one of the following papers and/or link to the github project:

@inproceedings{DBLP:conf/fccm/SidlerABKVC15,
  author       = {David Sidler and
                  Gustavo Alonso and
                  Michaela Blott and
                  Kimon Karras and
                  Kees A. Vissers and
                  Raymond Carley},
  title        = {Scalable 10Gbps {TCP/IP} Stack Architecture for Reconfigurable Hardware},
  booktitle    = {23rd {IEEE} Annual International Symposium on Field-Programmable Custom
                  Computing Machines, {FCCM} 2015, Vancouver, BC, Canada, May 2-6, 2015},
  pages        = {36--43},
  publisher    = {{IEEE} Computer Society},
  year         = {2015},
  doi          = {10.1109/FCCM.2015.12}

@inproceedings{DBLP:conf/fpl/SidlerIA16,
  author       = {David Sidler and
                  Zsolt Istv{\'{a}}n and
                  Gustavo Alonso},
  title        = {Low-latency {TCP/IP} stack for data center applications},
  booktitle    = {26th International Conference on Field Programmable Logic and Applications,
                  {FPL} 2016, Lausanne, Switzerland, August 29 - September 2, 2016},
  pages        = {1--4},
  publisher    = {{IEEE}},
  year         = {2016},
  doi          = {10.1109/FPL.2016.7577319}
}

@inproceedings{DBLP:conf/eurosys/SidlerWCKA20,
  author       = {David Sidler and
                  Zeke Wang and
                  Monica Chiosa and
                  Amit Kulkarni and
                  Gustavo Alonso},
  title        = {StRoM: smart remote memory},
  booktitle    = {EuroSys '20: Fifteenth EuroSys Conference 2020, Heraklion, Greece,
                  April 27-30, 2020},
  pages        = {29:1--29:16},
  publisher    = {{ACM}},
  year         = {2020},
  doi          = {10.1145/3342195.3387519}
}

@PHDTHESIS{sidler2019innetworkdataprocessing,
    author = {Sidler, David},
    publisher = {ETH Zurich},
    year = {2019-09},
    copyright = {In Copyright - Non-Commercial Use Permitted},
    title = {In-Network Data Processing using FPGAs},
}

@INPROCEEDINGS{sidler2020strom,
    author = {Sidler, David and Wang, Zeke and Chiosa, Monica and Kulkarni, Amit and Alonso, Gustavo},
    booktitle = {Proceedings of the Fifteenth European Conference on Computer Systems},
    title = {StRoM: Smart Remote Memory},
    doi = {10.1145/3342195.3387519},
}

Contributors