This post discusses the network considerations for operating RoCE v2.
- Introduction to RoCE
- Network Requirements
- What is RDMA?
- Network Considerations for Global Pause, PFC and QoS with Mellanox Switches and Adapters
- HowTo Configure RoCE v2 for ConnectX-3 Pro using Mellanox SwitchX Switches
Introduction to RoCE
What is RoCE?
RDMA over Converged Ethernet (RoCE) is a network protocol that leverages Remote Direct Memory Access (RDMA) capabilities to dramatically accelerate communications between applications hosted on clusters of servers and storage arrays. RoCE incorporates the IBTA RDMA semantics to allow devices to perform direct memory-to-memory transfers at the application level without involving the host CPU. Both the transport processing and the memory translation and placement are performed by hardware, resulting in dramatically lower latency, higher throughput, and better performance compared to software-based protocols.
What are the differences between RoCE v1 and RoCE v2?
As originally implemented and standardized by the InfiniBand Trade Association (IBTA), RoCE was envisioned as a layer 2 protocol. Effectively, the IBTA layer 1 and 2 fields are replaced by the corresponding Ethernet fields. Specifically, at layer 2 the local routing header (LRH) is replaced by an Ethernet MAC header and frame check sequence. The EtherType field indicates that the payload encapsulates the RoCE protocol, which implements the IBTA protocol above layer 2. In addition, the IBTA network management (subnet manager) is replaced by standard Ethernet layer 2 management protocols.
This approach has the advantages that it is simple to implement, strictly layered, and preserves the application-level API verbs that sit above the channel interface. The disadvantage is the limited scalability of a layer 2 Ethernet deployment, caused by broadcast domains and the complexity of IP allocation constraints across a flat subnet. In addition, certain switches may forward RoCE packets on a slower exception path than the more common IP packets. These limitations have driven the demand for RoCE to operate in layer 3 (routable) environments. Fortunately, a straightforward extension of the RoCE framework allows it to be readily transported across layer 3 networks. As shown in the figure below, a layer 3 capable RoCE protocol simply continues up the stack, replaces the optional L3 global routing header (GRH) with the standard IP networking header, and adds a UDP header as a stateless encapsulation of the layer 4 payload. This is a very natural extension of RoCE, as the layer 3 header is already based on an IP address and thus this substitution is straightforward. In addition, the UDP encapsulation is a standard type of L4 packet and thus is forwarded efficiently by routers as a mainstream data path operation.
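The difference between the two encapsulations can be illustrated with a small classifier. This is a minimal sketch, not a full parser: it assumes IPv4 only, at most one 802.1Q tag, and relies on the well-known values for the RoCE v1 EtherType (0x8915) and the RoCE v2 UDP destination port (4791). The function name `classify_frame` is hypothetical.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915   # RoCE v1: identified by EtherType at layer 2
IPV4_ETHERTYPE = 0x0800
VLAN_ETHERTYPE = 0x8100      # 802.1Q tag
UDP_PROTO = 17
ROCE_V2_UDP_DPORT = 4791     # RoCE v2: identified by UDP destination port


def classify_frame(frame: bytes) -> str:
    """Classify a raw Ethernet frame as 'RoCE v1', 'RoCE v2' or 'other'.

    Sketch assumptions: IPv4 only, at most one VLAN tag, no length checks.
    """
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    offset = 14                               # start of the L3 header
    if ethertype == VLAN_ETHERTYPE:           # skip the 4-byte 802.1Q tag
        ethertype = struct.unpack_from("!H", frame, 16)[0]
        offset = 18
    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCE v1"
    if ethertype == IPV4_ETHERTYPE:
        ihl = (frame[offset] & 0x0F) * 4      # IPv4 header length in bytes
        proto = frame[offset + 9]             # IPv4 protocol field
        if proto == UDP_PROTO:
            dport = struct.unpack_from("!H", frame, offset + ihl + 2)[0]
            if dport == ROCE_V2_UDP_DPORT:
                return "RoCE v2"
    return "other"
```

Because RoCE v2 is recognized from ordinary IP and UDP fields, a router can forward it like any other UDP datagram, which is the point made above.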
What is the RoCE v2 packet format?
RoCE can operate in either a lossless or a lossy network.
RoCE over a lossy network is called Resilient RoCE. See Introduction to Resilient RoCE - FAQ to learn more about it.
How do I achieve lossless Ethernet L2 network?
At the link level, a lossless network can be achieved by using flow control. Flow control is achieved either by enabling global pause across the network or by using priority flow control (PFC). PFC is a link-level protocol that allows a receiver to assert flow control, telling the transmitter to pause sending traffic for a specified priority. PFC supports flow control on individual priorities, as specified in the class of service field of the 802.1Q VLAN tag. Thus it is possible for a single link to carry both lossless traffic to support RoCE and other best-effort traffic on a lower-priority class of service.
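To make the per-priority pause concrete, here is a sketch of how a PFC frame is laid out on the wire according to IEEE 802.1Qbb: a MAC Control frame (EtherType 0x8808) with opcode 0x0101, a priority-enable vector, and eight pause timers, one per priority. The helper name `build_pfc_frame` and its arguments are hypothetical; in practice these frames are generated by the NIC or switch hardware, not by software.

```python
import struct

PFC_DST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101                          # global pause uses 0x0001 instead


def build_pfc_frame(src_mac: bytes, pause_quanta: dict) -> bytes:
    """Build a PFC frame (without FCS) pausing the given priorities.

    pause_quanta maps priority (0-7) to a pause time in quanta
    (0-65535, one quantum = 512 bit times). A quanta of 0 resumes traffic.
    """
    enable_vector = 0
    timers = [0] * 8
    for prio, quanta in pause_quanta.items():
        if not 0 <= prio <= 7:
            raise ValueError("priority must be in 0..7")
        if not 0 <= quanta <= 0xFFFF:
            raise ValueError("quanta must fit in 16 bits")
        enable_vector |= 1 << prio           # mark this priority's timer as valid
        timers[prio] = quanta
    header = PFC_DST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *timers)
    return (header + payload).ljust(60, b"\x00")  # pad to minimum frame size
```

For example, `build_pfc_frame(bytes(6), {3: 0xFFFF})` pauses only priority 3 for the maximum time, leaving best-effort traffic on the other priorities unaffected.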
If I run RoCE v2, should I use PFC or global pause for lossless L2 subnet?
In a converged environment, lossy traffic shares the same physical link with lossless RoCE traffic. Typically, separate dedicated buffering and queue resources are allocated within switches and routers for the lossless and best-effort traffic classes, which effectively isolates these flows from one another. Although global pause is easier to configure and might work well in lab conditions, it is recommended to use PFC in an operational network in order to differentiate between flows. Otherwise, in case of congestion, important lossy traffic such as control protocols may be affected. Therefore, RoCE should run on a VLAN with a PFC-enabled priority, while the (lossy) control protocols run on a different priority without flow control.
How do I preserve lossless characteristics on L3 network (between L2 subnets)?
Operating RoCE at layer 3 requires that the lossless characteristics of the network are preserved across the L3 routers that connect layer 2 subnets. The intervening L3 routers should be configured to transport layer 2 PFC lossless priorities across the router between the Ethernet interfaces on the respective subnets. This can typically be accomplished through standard router configuration mechanisms that map the received layer 2 priority settings to the corresponding layer 3 Differentiated Services Code Point (DSCP) QoS setting. The peer host should mark the RDMA packet with DSCP and/or L2 priority bits (PCP). There are two ways for the router to extract the priority from the packet: from the DSCP (in this case the packet may be untagged) or from the PCP (in this case the packet must carry a VLAN tag, as the PCP is part of the VLAN tag). The router should keep the DSCP bits unchanged and make sure the L2 PCP bits (if a VLAN tag exists) are copied to the next network.
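The two ways of extracting the priority can be sketched as follows. The bit positions are standard (PCP is the top 3 bits of the 802.1Q TCI, DSCP is the top 6 bits of the IPv4 ToS byte), but the function names, the `trust` argument, and the default DSCP-to-priority mapping (`dscp >> 3`) are assumptions made for illustration; the actual mapping is whatever is configured on the router.

```python
def pcp_from_tci(tci: int) -> int:
    """Return the 3-bit PCP from a 16-bit 802.1Q Tag Control Information field."""
    return (tci >> 13) & 0x7


def dscp_from_tos(tos: int) -> int:
    """Return the 6-bit DSCP from the IPv4 ToS / Traffic Class byte."""
    return (tos >> 2) & 0x3F


def select_priority(trust, tci=None, tos=None, dscp_to_prio=None):
    """Pick the switch priority for a packet (hypothetical helper).

    trust="L2": use the PCP, so the packet must carry a VLAN tag.
    trust="L3": use the DSCP, so the packet may be untagged.
    """
    if trust == "L2":
        if tci is None:
            raise ValueError("PCP trust requires a VLAN-tagged packet")
        return pcp_from_tci(tci)
    if trust == "L3":
        if tos is None:
            raise ValueError("DSCP trust requires an IP header")
        dscp = dscp_from_tos(tos)
        if dscp_to_prio is not None:
            return dscp_to_prio[dscp]
        return dscp >> 3  # assumed default: top 3 DSCP bits select the priority
    raise ValueError("trust must be 'L2' or 'L3'")
```

With these assumptions, a packet marked with DSCP 26 (ToS byte 0x68) and a packet tagged with PCP 3 both land on priority 3, so the same PFC-enabled priority can be used on both sides of the router.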
What happens when I use multi-path routing (ECMP) on L3 networks?
Rather than being constrained by layer 2 link-breaking protocols such as the spanning tree algorithm, layer 3 networks can implement forwarding algorithms that take much better advantage of the available network connectivity. Advanced data center networks can use multi-path routing mechanisms for load balancing and improved utilization. One commonly used mechanism for achieving these goals is Equal-Cost Multi-Path (ECMP) routing. For each received packet, the L3 routers make a forwarding decision based not just on the destination IP address but also on other fields within the packet. Where there are many possible paths to a given endpoint, ECMP allows different flows to select different paths and thus to leverage the available connectivity. The path selection for a given packet is based on the destination IP address and a hashed value of other packet fields. Note that while different flows can exploit different paths, the values used to select the output port for forwarding are deterministic, so packet ordering within a given flow is preserved.
In addition, when using RDMA Reliable Connection (RC), the UDP source port is scrambled per QP. This helps the ECMP hash function spread different RDMA flows across different spines in a large L3 network.
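The effect of the per-QP UDP source port on path selection can be illustrated with a toy ECMP model. This is a sketch under stated assumptions: real routers use vendor-specific hash functions and field selections, whereas here `zlib.crc32` over the 5-tuple stands in for the hash, and `ecmp_path` is a hypothetical name.

```python
import ipaddress
import struct
import zlib


def ecmp_path(src_ip, dst_ip, proto, sport, dport, n_paths):
    """Return the index of the equal-cost path chosen for a flow.

    CRC32 is only a stand-in for a router's hash function, and the field
    order in the key is arbitrary for this sketch.
    """
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!BHH", proto, dport, sport)
    )
    return zlib.crc32(key) % n_paths


# Two QPs between the same pair of hosts: identical IPs and destination
# port (4791 for RoCE v2), but different UDP source ports.
path_qp_a = ecmp_path("10.0.0.1", "10.0.1.1", 17, 49152, 4791, 4)
path_qp_b = ecmp_path("10.0.0.1", "10.0.1.1", 17, 49153, 4791, 4)
```

Every packet of a given QP carries the same 5-tuple and therefore hashes to the same path, which preserves ordering. Different QPs differ only in the source port, and that is the field that lets the hash place them on different paths.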