Understanding RoCEv2 Congestion Management

Version 13

    This post explains the basic behavior of RoCEv2 Congestion Management (RCM).

    This post is basic and meant for beginners.





    Modeling the Problem

    Congestion Control is used to reduce packet drops in lossy environments and to mitigate congestion spreading. It also reduces switch buffer utilization, which in turn reduces latency and improves burst tolerance.


    The figure below demonstrates a victim flow scenario. In the absence of Congestion Control, flow X->Y suffers from reduced bandwidth due to flow F->G, which experiences congestion. With Congestion Control, the rate of flow F->G is reduced to its share of the congested link, enabling flow X->Y to obtain the remaining capacity of the shared link.


    Note: Victim flows may occur even in single switch configurations. For example, in Figure 1, a possible flow A->Y could also be stalled by pause frames sent by the switch due to flow A->G.




    ECN Frame Format (IP/TCP)

    ECN was initially defined for TCP/IP in RFC 3168 by embedding a congestion indication in the IP header and an acknowledgement in the TCP header. However, the IP congestion indication may also be used by other transports such as RoCEv2. ECN-capable switches and routers mark packets in a transport-independent manner when congestion is detected.


    Here are the first four octets of the IP header:



    Here is the full IP header packet format:


    [The figure was taken from here]
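
    As a minimal sketch of where the ECN field sits, the following Python snippet decodes the first four octets of an IPv4 header (version, IHL, DSCP, ECN, total length). The function name and the sample bytes are illustrative assumptions, not taken from any library; the codepoint names follow RFC 3168.

```python
import struct

# ECN codepoints as defined in RFC 3168 (two low-order bits of the second octet).
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def parse_ipv4_first_word(header: bytes) -> dict:
    """Decode the first four octets of an IPv4 header (hypothetical helper)."""
    if len(header) < 4:
        raise ValueError("need at least 4 bytes")
    ver_ihl, tos, total_length = struct.unpack("!BBH", header[:4])
    ecn = tos & 0b11
    return {
        "version": ver_ihl >> 4,        # upper nibble of octet 0
        "ihl_words": ver_ihl & 0x0F,    # header length in 32-bit words
        "dscp": tos >> 2,               # upper six bits of octet 1
        "ecn": ecn,                     # lower two bits of octet 1
        "ecn_name": ECN_NAMES[ecn],
        "total_length": total_length,
    }


# Made-up sample: IPv4, 20-byte header, DSCP 26, ECT(0), 1024-byte datagram.
fields = parse_ipv4_first_word(bytes([0x45, 0x6A, 0x04, 0x00]))
print(fields)
```

    The same two low-order bits are what a switch rewrites to CE when it marks a packet, which is why the marking works regardless of the transport carried above IP.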




    RoCEv2 Congestion Management (IB Spec Annex A17.9.3, RCM for RoCEv2)

    In TCP/IP traffic, the ECN acknowledgement is carried back in the TCP header. The question arises of what happens for RDMA traffic carried as RoCEv2, which is encapsulated in UDP and has no TCP header.


    The RoCEv2 standard defines RoCEv2 Congestion Management (RCM). RCM provides the capability to avoid congestion hot spots and optimize the throughput of the fabric. With RCM, incipient congestion in the fabric is reported back to the traffic sources, which in turn react by throttling down their injection rates, thus preventing the negative effects of fabric buffer saturation and increased queuing delays. Congestion Management is also relevant for co-existing TCP/UDP/IP traffic. However, assuming the intended use of a distinct set of priorities for RoCEv2 and the other traffic (each set of priorities having a bandwidth allocation), the effects of congestion and the reaction (or lack of it) should not impact one another.


    For signaling of congestion, RCM relies on the mechanism defined in RFC 3168 (ECN) shown above. Upon congestion that involves RoCEv2 traffic, network devices mark the packets using the ECN field in the IP header. This congestion indication is interpreted by destination end-nodes in the spirit of the FECN congestion indication flag of the Base Transport Header (BTH). In other words, when ECN-marked packets arrive at their intended destination, the congestion notification is reflected back to the source, which in turn reacts by rate limiting the packet injection for the QP in question.


    RCM is optional normative behavior. RoCEv2 HCAs that implement RCM shall follow the rules specified in this section:

    • When receiving a valid RoCEv2 packet with a value of '11' in its IP.ECN field, the HCA shall generate a RoCEv2 CNP, formatted as shown in the figure below, directed to the source of the received packet. The HCA may choose to send a single CNP for multiple such ECN-marked packets on a given QP.
    • When receiving a RoCEv2 CNP, the HCA shall reduce the rate of injection for the QP indicated in the RoCEv2 CNP. The amount of rate change is determined by a configurable rate reduction parameter.
    • The HCA should increase the injection rate on a QP when a configurable amount of elapsed time and/or a configurable number of bytes have been transmitted on that QP since the reception of the most recent RoCEv2 CNP for that QP.
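
    The second and third rules above can be sketched as a small per-QP rate limiter. This is only an illustration under stated assumptions: the class name, the multiplicative reduction, the fixed additive increase step, and all default values are hypothetical, and it is not the algorithm any particular HCA implements. The spec only requires that the reduction and the time/byte thresholds be configurable.

```python
from dataclasses import dataclass, field


@dataclass
class QpRateLimiter:
    """Toy reaction-point model for one QP (all parameters are assumptions)."""
    line_rate_gbps: float = 100.0
    reduction_factor: float = 0.5        # configurable rate reduction parameter
    recovery_time_us: float = 50.0       # configurable elapsed-time threshold
    recovery_bytes: int = 128 * 1024     # configurable byte-count threshold
    increase_step_gbps: float = 5.0      # hypothetical additive increase
    min_rate_gbps: float = 0.1
    rate_gbps: float = field(init=False)
    us_since_cnp: float = field(default=0.0, init=False)
    bytes_since_cnp: int = field(default=0, init=False)

    def __post_init__(self) -> None:
        self.rate_gbps = self.line_rate_gbps

    def on_cnp(self) -> None:
        """Rule 2: a CNP for this QP reduces its injection rate."""
        self.rate_gbps = max(self.min_rate_gbps,
                             self.rate_gbps * self.reduction_factor)
        self.us_since_cnp = 0.0
        self.bytes_since_cnp = 0

    def on_progress(self, elapsed_us: float, bytes_sent: int) -> None:
        """Rule 3: raise the rate once enough time or bytes have passed."""
        self.us_since_cnp += elapsed_us
        self.bytes_since_cnp += bytes_sent
        if (self.us_since_cnp >= self.recovery_time_us
                or self.bytes_since_cnp >= self.recovery_bytes):
            self.rate_gbps = min(self.line_rate_gbps,
                                 self.rate_gbps + self.increase_step_gbps)
            # Design choice of this sketch: restart the counters so the
            # rate climbs one step per recovery period.
            self.us_since_cnp = 0.0
            self.bytes_since_cnp = 0


limiter = QpRateLimiter()
limiter.on_cnp()                      # congestion reported: rate is halved
limiter.on_progress(50.0, 4096)       # recovery period elapsed: one step up
print(limiter.rate_gbps)
```

    Restarting the counters after each increase is one possible reading of "since the most recent CNP"; an implementation could equally keep counting from the CNP and increase at multiples of the thresholds.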


    Here is the RoCEv2 CNP Packet format:



    See https://cw.infinibandta.org/document/dl/7781 and RoCEv2 CNP Packet Format Example for more info.


    Basic Terminology


    RP (Injector): Reaction Point - the end node that performs rate limitation to prevent congestion
    NP: Notification Point - the end node that receives the packets from the injector and sends back notifications to the injector about the congestion situation
    CP: Congestion Point - the switch queue in which congestion happens
    CNP: The RoCEv2 Congestion Notification Packet - the notification message an NP sends to the RP when it receives CE-marked packets



    Flow Control and Congestion Management Relationship

    Global Pause Flow Control or PFC configuration is orthogonal to RCM, which means that each of them can be enabled or disabled independently. However, it is strongly recommended to enable Flow Control or PFC in the network along with RCM.


    The Stages of Congestion Control Loop

    The RoCEv2 congestion control loop consists of the following stages:



    1. The injecting end station (Injector) must set the ECN bits in the IP header. According to RFC 3168, the values of the bits are as follows (ECT: ECN-Capable Transport):


    00  Not-ECT
    01  ECT(1)
    10  ECT(0)
    11  CE (Congestion Experienced)


         The injecting end station sets the ECN field in the IPv4 or IPv6 header to the ECN-capable value ECT(0) (‘10’). This is done at the application level, which allows the application to specify the RoCE flows for which the HW should perform L3 Congestion Control.
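
    As a small worked example of this step, the snippet below composes the 8-bit ToS / Traffic Class value from a DSCP and the ECT(0) codepoint. The function name and the DSCP value are assumptions chosen for illustration; how the resulting value is handed to the hardware depends on the software stack and is not shown here.

```python
ECT0 = 0b10  # ECN-capable transport codepoint ECT(0), per RFC 3168


def traffic_class(dscp: int, ecn: int = ECT0) -> int:
    """Combine a 6-bit DSCP and a 2-bit ECN codepoint into one octet."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must fit in 2 bits")
    return (dscp << 2) | ecn


# Hypothetical flow using DSCP 26 with congestion control enabled.
print(hex(traffic_class(26)))        # 0x6a
# The same flow opting out of ECN (Not-ECT).
print(hex(traffic_class(26, 0b00)))  # 0x68
```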


    2. The RoCEv2 packet travels from the injector into the network as follows:






    3. When a queue is congested, the router(s) may, instead of dropping the packet, check that its ECN field indicates an ECN-capable transport and turn ON the CE bits in the IP header.
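
    A rough model of this step for IPv4 is sketched below. The single queue-depth threshold, the function names, and the sample header are assumptions made for illustration; real switches typically use configurable, often probabilistic, marking profiles. Because the ECN field is covered by the IPv4 header checksum, the sketch recomputes the checksum after marking.

```python
def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of 16-bit words over the IPv4 header."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def mark_or_drop(header: bytes, queue_depth: int, threshold: int):
    """Return the header to forward, or None if the packet is dropped."""
    if queue_depth < threshold:
        return header                        # no congestion: forward as is
    if header[1] & 0b11 == 0b00:
        return None                          # Not-ECT: dropping is the only signal
    marked = bytearray(header)
    marked[1] |= 0b11                        # set the CE codepoint
    marked[10:12] = b"\x00\x00"              # zero the checksum field first
    marked[10:12] = ipv4_checksum(bytes(marked)).to_bytes(2, "big")
    return bytes(marked)


def sample_header(ecn: int) -> bytes:
    """Made-up 20-byte IPv4/UDP header, 10.0.0.1 -> 10.0.0.2."""
    hdr = bytearray([0x45, ecn & 0b11, 0x00, 0x54, 0, 0, 0x40, 0,
                     64, 17, 0, 0, 10, 0, 0, 1, 10, 0, 0, 2])
    hdr[10:12] = ipv4_checksum(bytes(hdr)).to_bytes(2, "big")
    return bytes(hdr)


marked = mark_or_drop(sample_header(0b10), queue_depth=150, threshold=100)
print(bin(marked[1] & 0b11))                 # ECN field of the forwarded packet
```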


    4. In case of congestion, the packet arrives at the receiver from the network as follows:




    5. The receiving end station filters the packets that have the CE bits turned ON and whose traffic type is RoCE, triggers the event, and releases the packet to the normal processing flow.
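
    The filter in this step can be sketched in software as follows, assuming an IPv4 packet handed over as raw bytes starting at the IP header. RoCEv2 traffic is recognized here by the UDP destination port 4791; the function name is hypothetical, and in practice this check is done by the HCA hardware.

```python
import struct

ROCEV2_UDP_PORT = 4791   # IANA-assigned UDP destination port for RoCEv2
IPPROTO_UDP = 17


def is_ce_marked_rocev2(packet: bytes) -> bool:
    """True if the IPv4 packet carries RoCEv2 and its ECN field is CE."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        return False
    ihl = (packet[0] & 0x0F) * 4             # IP header length in bytes
    if ihl < 20 or len(packet) < ihl + 8:
        return False
    if packet[9] != IPPROTO_UDP:
        return False
    (dst_port,) = struct.unpack("!H", packet[ihl + 2:ihl + 4])
    return dst_port == ROCEV2_UDP_PORT and (packet[1] & 0b11) == 0b11
```

    Packets that match are counted as congestion events for the notification logic and are then processed normally; packets that do not match skip the event and go straight to normal processing.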


    6-7. To limit the load caused by generating notification traffic, the receiving end station should aggregate congestion notifications for each injector (QP). At most one CNP is sent to the injector every x microseconds.

           The ECN bits in the CNP are set to ‘01’ to ensure that the packet is not dropped by the IP routers. The IBA BTH is built as defined above.
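
    The aggregation described in stages 6-7 can be sketched as a per-flow timer. The class name, the flow key, and the 50 microsecond default are assumptions made for illustration; the interval corresponds to the "x microseconds" above and is implementation-specific.

```python
class CnpCoalescer:
    """Toy notification-point model: at most one CNP per flow per interval."""

    def __init__(self, min_interval_us: float = 50.0) -> None:
        self.min_interval_us = min_interval_us
        self._last_cnp_us = {}               # flow key -> time of last CNP

    def on_packet(self, flow_key, ecn: int, now_us: float) -> bool:
        """Return True if a CNP should be sent for this received packet."""
        if ecn != 0b11:                      # only CE-marked packets count
            return False
        last = self._last_cnp_us.get(flow_key)
        if last is not None and now_us - last < self.min_interval_us:
            return False                     # covered by the previous CNP
        self._last_cnp_us[flow_key] = now_us
        return True


np_side = CnpCoalescer(min_interval_us=50.0)
flow = ("10.0.0.1", 0x12)                    # hypothetical (source IP, QP) key
arrivals_us = [0.0, 10.0, 49.0, 50.0, 120.0]
print([np_side.on_packet(flow, 0b11, t) for t in arrivals_us])
# -> [True, False, False, True, True]
```

    Five CE-marked packets produce three CNPs here; each flow is tracked independently, so a burst on one QP does not suppress notifications for another.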


    8. The CNP travels from the receiver into the network as follows:



    See here for more info: RoCEv2 CNP Packet Format Example.


    9. The IP routers treat the InfiniBand CNP as a regular IP packet.


    10. The CNP arrives at the injecting end station. The injecting end station identifies it by its ECN bits (‘01’) and its RoCE traffic type, and applies the corresponding rate limiter to that flow.