Understanding RoCEv2 Congestion Management

Version 17

    This post explains the basics of congestion management for RoCEv2.






    The Problem

    Network congestion occurs in a switch when the incoming traffic exceeds the bandwidth of the outgoing link on which it has to be transmitted. A typical example is multiple senders sending traffic to the same destination at the same time. Switch buffers can absorb temporary congestion, but when congestion lasts too long, the buffers fill to capacity. Once a buffer is full, newly arriving packets are dropped. Dropped packets reduce application performance because of the latency cost of retransmission and the added complexity in the transport protocol. Lossless networks implement flow control, which pauses traffic on the incoming link before the buffer overflows and thereby prevents packet drops. However, flow control by itself causes the congestion spreading problem.
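
    The difference between temporary and sustained congestion can be illustrated with a small sketch. It is a toy model, not a description of any real switch: the function name, the tick-based timing, and the numbers are all assumptions made for illustration. It models a tail-drop egress queue whose arrival rate is twice its drain rate.

```python
def simulate_egress_queue(arrivals_per_tick, drain_per_tick, buffer_capacity, ticks):
    """Toy tail-drop queue. Returns (final queue depth, total dropped packets)."""
    depth = 0
    dropped = 0
    for _ in range(ticks):
        depth += arrivals_per_tick              # packets arriving during this tick
        if depth > buffer_capacity:             # buffer is full: the excess is dropped
            dropped += depth - buffer_capacity
            depth = buffer_capacity
        depth = max(0, depth - drain_per_tick)  # the egress link drains the queue
    return depth, dropped


# Short burst: 3 ticks of 2x overload are absorbed by a 10-packet buffer.
print(simulate_egress_queue(4, 2, 10, 3))    # (6, 0)
# Sustained overload: the same traffic for 20 ticks overflows the buffer.
print(simulate_egress_queue(4, 2, 10, 20))   # (8, 32)
```

    In a lossless network the drop branch would be replaced by a pause sent upstream, which is what leads to the congestion spreading described next.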


    To understand the congestion spreading problem, consider the figure below. Assume that ports A through E on Switch 1 are all sending packets to port G, so that port G is receiving data at 100% of its capacity to send it. Also assume that port F, on the adjacent Switch 2, is transmitting data to port G on Switch 1 at 20% of the total link bandwidth. Since egress port G is backed up, port F will transmit packets until it is paused by flow control. At this point port G is congested, but there is no detrimental side effect, since all ports (A-F) are served as quickly as port G is able to serve them.


    Now consider a port X on Switch 2, which is sending packets to port Y on Switch 1 at 20% of the path's bandwidth. Port G, the source of the congestion, is not anywhere in the path from port X to port Y. One might expect that, since port F was only using 20% of the inter-switch link's bandwidth, the remaining 80% would be available for port X, which is much more than port X requires. However, this is not the case: traffic from port F will eventually cause flow control to send pauses on the inter-switch link, which also throttles the traffic from port X even though it never passes through port G.



    Congestion Control


    Congestion control is used to reduce packet drops in lossy networks or congestion spreading in lossless networks. It also reduces switch buffer occupancy, and hence decreases latency and improves burst tolerance. The approach is to limit the injection rate of flows at the ports that are the root cause of the congestion (ports A-F), so that other ports (port X) are not affected.

    When the injection rate of ports A-F is limited to what port G can handle, ports A-F should not see significant degradation (after all, their packets were going to wait anyway). Packets sent from port X to port Y, however, should flow normally, because flow control will not send pauses: congestion control aims to keep switch buffer occupancy low, so flow control does not kick in.


    Current RoCE congestion control relies on Explicit Congestion Notification (ECN) in order to operate.


    Explicit Congestion Notification (ECN)

    ECN was initially defined for TCP/IP in RFC 3168. It embeds a congestion indication in the IP header and an acknowledgement in the TCP header. ECN-compatible switches and routers mark packets when congestion is detected. The congestion indication in the IP header is also used by RoCEv2 congestion control.


    Below is the format of the first four octets of the IP header:



    Below is the IP header packet format:


    [The figure was taken from here]
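
    As a rough illustration of where the ECN bits live, the sketch below reads and writes the two-bit ECN field in the second octet of an IPv4 header (six bits of DSCP followed by two bits of ECN, per RFC 3168). The helper names are hypothetical and chosen only for this example.

```python
from enum import IntEnum


class Ecn(IntEnum):
    """ECN codepoints as defined in RFC 3168."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def get_dscp(tos_byte: int) -> int:
    """Upper six bits of the second IPv4 header octet."""
    return (tos_byte >> 2) & 0x3F


def get_ecn(tos_byte: int) -> Ecn:
    """Lower two bits of the second IPv4 header octet."""
    return Ecn(tos_byte & 0b11)


def set_ecn(tos_byte: int, ecn: Ecn) -> int:
    """Return the octet with its ECN bits replaced and its DSCP bits untouched."""
    return (tos_byte & 0xFC) | int(ecn)


tos = (26 << 2) | Ecn.ECT_1        # DSCP 26 with ECT(1); the DSCP value is arbitrary
marked = set_ecn(tos, Ecn.CE)      # what a congested switch would write
print(get_dscp(marked), get_ecn(marked).name)
```

    A device that rewrites this octet in a real IPv4 packet must also update the IP header checksum, which this sketch leaves out.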




    RoCEv2 Congestion Management


    The RoCEv2 standard defines RoCEv2 Congestion Management (RCM). RCM provides the capability to avoid congestion hot spots and optimize the throughput of the fabric. With RCM, incipient congestion in the fabric is reported back to the traffic sources, which in turn react by throttling down their injection rates, thus preventing the negative effects of fabric buffer saturation and increased queuing delays. Congestion management is also relevant for co-existing TCP/UDP/IP traffic. However, assuming that RoCEv2 and the other traffic use distinct sets of priorities (each set having its own bandwidth allocation), the effects of congestion and the reaction to it (or lack of reaction) in one set should not impact the other.


    For signaling of congestion, RCM relies on the mechanism defined in RFC 3168 (ECN) shown above. Upon congestion that involves RoCEv2 traffic, network devices mark the packets using the ECN field in the IP header. This congestion indication is interpreted by destination end nodes in the spirit of the FECN congestion indication flag of the Base Transport Header (BTH). In other words, as ECN-marked packets arrive at their intended destination, the congestion notification is reflected back to the source, which in turn reacts by rate limiting the packet injection for the QP in question.


    RCM is optional normative behavior. RoCEv2 HCAs that implement RCM shall follow the rules specified in this section:

    • When receiving a valid RoCEv2 packet with a value of ‘11’ in its IP.ECN field, the HCA shall generate a RoCEv2 CNP, formatted as shown in the figure below, directed to the source of the received packet. The HCA may choose to send a single CNP for multiple such ECN-marked packets on a given QP.
    • When receiving a RoCEv2 CNP, the HCA shall reduce the rate of injection for the QP indicated in the RoCEv2 CNP (The amount of rate change is determined by a configurable rate reduction parameter).
    • The HCA should increase the injection rate on a QP when a configurable amount of elapsed time and/or a configurable number of bytes have been transmitted on that QP since the reception of the most recent RoCEv2 CNP for that QP.
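
    The last two rules can be sketched as a per-QP rate limiter. This is a minimal illustration under stated assumptions, not the algorithm used by any particular HCA: the class name, the multiplicative decrease, the fixed additive increase step, and all default parameter values are hypothetical stand-ins for the "configurable" parameters the rules mention.

```python
class QpRateLimiter:
    """Toy reaction-point rate limiter for a single QP (all parameters are assumed)."""

    def __init__(self, line_rate_gbps, reduction_factor=0.5, increase_step_gbps=5.0,
                 increase_timer_us=50, increase_byte_count=32768, min_rate_gbps=0.1):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps                # start at full line rate
        self.reduction_factor = reduction_factor
        self.increase_step = increase_step_gbps
        self.increase_timer_us = increase_timer_us
        self.increase_byte_count = increase_byte_count
        self.min_rate = min_rate_gbps
        self.last_event_us = 0                    # time of last CNP or last increase
        self.bytes_since_event = 0

    def on_cnp(self, now_us):
        """Rule 2: a CNP for this QP reduces the injection rate."""
        self.rate = max(self.min_rate, self.rate * (1.0 - self.reduction_factor))
        self.last_event_us = now_us
        self.bytes_since_event = 0

    def on_transmit(self, now_us, nbytes):
        """Rule 3: raise the rate after enough time or enough bytes without a CNP."""
        self.bytes_since_event += nbytes
        timer_expired = now_us - self.last_event_us >= self.increase_timer_us
        bytes_reached = self.bytes_since_event >= self.increase_byte_count
        if timer_expired or bytes_reached:
            self.rate = min(self.line_rate, self.rate + self.increase_step)
            self.last_event_us = now_us
            self.bytes_since_event = 0


limiter = QpRateLimiter(line_rate_gbps=100.0)
limiter.on_cnp(now_us=0)                 # 100 -> 50
limiter.on_cnp(now_us=10)                # 50 -> 25
limiter.on_transmit(now_us=70, nbytes=1000)   # 60 us without a CNP: 25 -> 30
print(limiter.rate)
```

    Production implementations typically use more elaborate recovery phases; the sketch only shows the shape of the decrease and increase rules.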


    The RoCEv2 CNP format is depicted in the following figure.



    See https://cw.infinibandta.org/document/dl/7781 and RoCEv2 CNP Packet Format Example for more info.


    Basic Terminology


    RP (Injector): Reaction Point - the end node that performs rate limitation to prevent congestion
    NP: Notification Point - the end node that receives packets from the injector and sends notifications back to the injector to indicate the congestion situation
    CP: Congestion Point - the switch queue in which congestion happens
    CNP: The RoCEv2 Congestion Notification Packet - the notification message an NP sends to the RP when it receives CE-marked packets




    Congestion Control Loop

    The RoCEv2 congestion control loop consists of the following stages:



    1. The injecting end station (Injector) must set the ECN bits in the IP header. According to RFC 3168, the values of the bits are as follows (ECT: ECN-Capable Transport):


    00    Not ECT
    01    ECT(1)
    10    ECT(0)
    11    CE (Congestion Experienced)


         The injecting NIC sets the ECN field in the IP header to ECT(1) (‘01’). (Note: ECT(0) and ECT(1) can be used interchangeably.)


    2. RoCEv2 packet goes from the injector to the network as follows:






    3. The router(s), in case of a congested queue, may examine the ECN field and, if the packet is ECN capable, turn ON the CE bit in the IP header instead of dropping the packet.
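
    The marking decision in this step can be sketched as follows. The single queue-depth threshold is a simplifying assumption made for this example (real switches often mark probabilistically between a minimum and a maximum threshold), and the function name and return convention are hypothetical.

```python
# ECN codepoints from RFC 3168
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def handle_packet(ecn, queue_depth, mark_threshold):
    """Decide what a congested queue does with a packet.

    Returns a tuple (action, ecn) where action is "forward" or "drop".
    Assumes a single hypothetical threshold on the queue depth.
    """
    if queue_depth < mark_threshold:
        return "forward", ecn          # no congestion: pass the packet unchanged
    if ecn == NOT_ECT:
        return "drop", ecn             # sender is not ECN capable: drop instead of mark
    return "forward", CE               # ECT(0), ECT(1) or already CE: forward as CE


print(handle_packet(ECT_1, queue_depth=5, mark_threshold=10))
print(handle_packet(ECT_1, queue_depth=12, mark_threshold=10))
print(handle_packet(NOT_ECT, queue_depth=12, mark_threshold=10))
```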


    4. The packet arrives from the network to the receiver in case of congestion as follows:




    5. The receiving end station filters the packets that have the CE bit turned ON and the RoCE traffic type, triggers the event, and releases the packet to the normal processing flow.


    6-7. To avoid the load of generating excessive notification traffic, the receiving end station should aggregate Congestion Notifications for each injector (QP). A CN packet is sent to the injector at most once every x microseconds.

           The ECN bits in the CNP are set to ‘01’ so that the packet is not dropped by the IP routers. The IBA BTH header will be built as defined above.
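
    The aggregation in stages 6-7 can be sketched as a per-QP timer at the notification point. This is an illustrative sketch only: the class and method names are hypothetical, the flow key can be any hashable identifier of the injector's QP, and the interval stands in for the "x microseconds" above.

```python
class CnpGenerator:
    """Toy notification point that coalesces CNPs per injector QP."""

    def __init__(self, min_interval_us):
        self.min_interval_us = min_interval_us   # the "x microseconds" in the text
        self._last_cnp_us = {}                   # flow key -> time the last CNP was sent

    def on_ce_packet(self, flow_key, now_us):
        """Called for each CE-marked RoCEv2 packet. Returns True if a CNP should be sent."""
        last = self._last_cnp_us.get(flow_key)
        if last is not None and now_us - last < self.min_interval_us:
            return False                         # a CNP for this QP was sent recently
        self._last_cnp_us[flow_key] = now_us
        return True


np_side = CnpGenerator(min_interval_us=50)
print(np_side.on_ce_packet("qp1", now_us=0))    # True: first CE packet on this QP
print(np_side.on_ce_packet("qp1", now_us=10))   # False: coalesced with the previous CNP
print(np_side.on_ce_packet("qp2", now_us=10))   # True: a different QP has its own timer
print(np_side.on_ce_packet("qp1", now_us=50))   # True: the interval has elapsed
```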


    8. The CNP packet goes from the receiver to the network as follows:



    See here for more info: RoCEv2 CNP Packet Format Example.


    9. The IP routers treat the InfiniBand CNP as a regular IP packet.


    10. The CNP arrives at the injecting end station. The injecting end station filters the packet by its ECN bits (‘01’) and its RoCE packet type, and applies the corresponding rate limiter to that flow.