Mellanox ConnectX devices implement the DC-QCN algorithm for congestion control of RoCE flows. DC-QCN combines ideas from the Data Center TCP (DCTCP) and Quantized Congestion Notification (QCN) algorithms. It was developed in collaboration with Microsoft Research and is documented in the SIGCOMM'15 paper "Congestion control for large scale RDMA deployments", listed below.
This post is meant for advanced network engineers who wish to understand RoCE congestion control in more depth.
- Congestion control for large scale RDMA deployments
- RDMA/RoCE Solutions
- Understanding RoCEv2 Congestion Management
- RoCEv2 CNP Packet Format Example
The DC-QCN algorithm relies on Explicit Congestion Notification (ECN) marking in the switch. ECN is a common feature of commodity data center switches.
Two bits in the IP header are used to indicate congestion: the ECN field, which shares a byte with the DiffServ (DSCP) field. When the switch experiences congestion, it sets these two bits to '11' (Congestion Experienced, CE).
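As a minimal sketch of the bit-level check, the snippet below reads and sets the ECN field in the IP ToS / Traffic Class byte. The codepoint values follow RFC 3168; the function names are illustrative and are not part of any Mellanox API.

```python
# ECN codepoints from RFC 3168: the two low-order bits of the
# IPv4 ToS / IPv6 Traffic Class byte. The upper six bits carry DSCP.
NOT_ECT = 0b00  # sender is not ECN-capable
ECT_1 = 0b01    # ECN-capable transport, codepoint ECT(1)
ECT_0 = 0b10    # ECN-capable transport, codepoint ECT(0)
CE = 0b11       # Congestion Experienced

ECN_MASK = 0b11


def ecn_bits(tos_byte: int) -> int:
    """Return the 2-bit ECN field of a ToS / Traffic Class byte."""
    return tos_byte & ECN_MASK


def mark_congestion(tos_byte: int) -> int:
    """Set the ECN field to CE, as a congested switch would.

    Not-ECT packets are returned unchanged, because only
    ECN-capable packets may be marked (illustrative helper).
    """
    if ecn_bits(tos_byte) == NOT_ECT:
        return tos_byte
    return tos_byte | CE


def is_congestion_marked(tos_byte: int) -> bool:
    """True when the ECN field equals '11' (CE)."""
    return ecn_bits(tos_byte) == CE
```

The receiver NIC only needs the `is_congestion_marked` style of check; the DSCP bits in the upper part of the byte are left untouched by marking.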
The congestion marking is a probabilistic function of the queue length, as depicted in the following figure. Two queue-length thresholds define the marking probability. When the queue length is below the lower threshold, no packets are ECN-marked. When the queue length is above the upper threshold, all packets transmitted from the queue are ECN-marked. When the queue length is between the two thresholds, packets are ECN-marked with a probability that grows linearly with the queue length.
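The marking rule above can be sketched as a small function. This is an illustration only: the names `k_min`, `k_max` and `p_max` are assumptions (the text names only a lower and an upper threshold), and `p_max`, the probability reached at the upper threshold, defaults to 1.0 so that the curve is a plain linear ramp.

```python
import random


def marking_probability(queue_len: float, k_min: float, k_max: float,
                        p_max: float = 1.0) -> float:
    """Probability that a packet leaving the queue is ECN-marked.

    k_min, k_max and p_max are hypothetical parameter names.
    """
    if queue_len <= k_min:
        return 0.0  # below the lower threshold: never mark
    if queue_len > k_max:
        return 1.0  # above the upper threshold: always mark
    # Between the thresholds: linear in the queue length.
    return p_max * (queue_len - k_min) / (k_max - k_min)


def should_mark(queue_len: float, k_min: float, k_max: float,
                p_max: float = 1.0, rng=random.random) -> bool:
    """Draw one marking decision; rng returns a float in [0, 1)."""
    return rng() < marking_probability(queue_len, k_min, k_max, p_max)
```

For example, with thresholds of 10 and 50 (in whatever unit the switch measures queue length), a queue length of 30 sits halfway along the ramp, so about half of the departing packets would be marked.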
The ECN mark travels with the data packet to the receiver NIC. The receiver NIC creates a Congestion Notification Packet (CNP) and sends it to the sender of the ECN-marked packet. The CNP identifies the QP whose packet was marked. For more information on the CNP packet format, see RoCEv2 CNP Packet Format Example.
When the sender NIC receives a CNP, it throttles the transmission rate of that QP according to the algorithm described next.
The DC-QCN Algorithm
The DC-QCN rate throttling algorithm is described by the following chart. Briefly, the algorithm continuously increases the rate of the QP based on an internal timer and a count of sent bytes, and reduces the rate of the QP when CNPs arrive. In addition, it maintains a parameter called alpha, which estimates the degree of congestion in the network and is used in rate reduction.
The algorithm is defined by three parallel flows:
1. Alpha update (congestion grade estimation)
2. Rate decrease
3. Rate increase
Time is divided into slots of a configurable period. For each slot, the NIC records whether a CNP arrived during that slot; it does not record whether more than one CNP arrived in the same slot. The alpha parameter is a moving average of the fraction of slots in which a CNP arrived. At the end of every period, alpha is updated by the following equation:
new_alpha = (1-g)*old_alpha + g*CNP_arrived
where g is a constant parameter between 0 and 1, and CNP_arrived is a single-bit field that indicates whether a CNP arrived in the last time slot.
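The alpha update can be sketched directly from the equation above. The function names and the example value of g are illustrative assumptions; the actual values used by the device are configuration dependent.

```python
def update_alpha(old_alpha: float, cnp_arrived: bool, g: float) -> float:
    """One end-of-slot update: new_alpha = (1-g)*old_alpha + g*CNP_arrived."""
    if not 0.0 < g < 1.0:
        raise ValueError("g must be strictly between 0 and 1")
    return (1.0 - g) * old_alpha + g * (1.0 if cnp_arrived else 0.0)


def run_slots(alpha: float, slots, g: float):
    """Apply the update for a sequence of slots; return alpha after each slot."""
    history = []
    for cnp_arrived in slots:
        alpha = update_alpha(alpha, cnp_arrived, g)
        history.append(alpha)
    return history


if __name__ == "__main__":
    # Hypothetical g = 0.5: two congested slots, then two quiet slots.
    print(run_slots(0.0, [True, True, False, False], g=0.5))
```

Because each update is a weighted average of the old alpha and a value of 0 or 1, alpha stays within [0, 1] as long as it starts there. A smaller g makes alpha react more slowly to both the onset and the end of congestion.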
Time is divided into slots of a configurable period, which may differ from the alpha update period. If a CNP arrived in the last time slot (again, with no indication of whether more than one CNP arrived in that slot), the QP rate is reduced by the following equation:
new_rate = old_rate * (1 - alpha/2)
In addition, several parameters used for rate increase are reset, as detailed below.
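A short worked sketch of the rate decrease equation follows. The function name and the example rates are illustrative; the units are arbitrary (for example Gb/s).

```python
def decrease_rate(old_rate: float, alpha: float) -> float:
    """Apply new_rate = old_rate * (1 - alpha/2)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    return old_rate * (1.0 - alpha / 2.0)


if __name__ == "__main__":
    # alpha = 1 (persistent congestion) halves the rate;
    # a small alpha (rare congestion) trims it only slightly.
    for alpha in (1.0, 0.5, 0.125):
        print(alpha, decrease_rate(100.0, alpha))
```

Since alpha lies between 0 and 1, a single decrease event cuts the rate by at most half, and the cut shrinks as the estimated congestion level falls.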
The rate increase logic is very similar to the one defined by QCN. It is divided into three sequential phases:
1. Fast Recovery
2. Active Increase (Probing)
3. Hyper-active Increase (Probing)
The transition from one phase to the next is governed by a count of the rate increase events in the current phase. Once the number of rate increase events in a phase passes a predefined threshold, the logic moves to the next phase (this is a simplified explanation). A rate decrease event resets all the counters related to rate increase and returns the logic to the Fast Recovery phase. In addition, upon a rate decrease, the current rate before the reduction is stored in a parameter called target_rate.
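The phase transitions described above can be sketched as a small state machine. This follows the simplified explanation in the text: a single hypothetical `threshold` is shared by all phases, and a phase is left once its event count reaches that threshold. The class and attribute names are assumptions for illustration.

```python
from enum import Enum


class Phase(Enum):
    FAST_RECOVERY = 1
    ACTIVE_INCREASE = 2
    HYPER_ACTIVE_INCREASE = 3


class PhaseTracker:
    """Tracks the rate-increase phase of one QP (illustrative sketch)."""

    def __init__(self, threshold: int):
        self.threshold = threshold        # hypothetical: events per phase
        self.phase = Phase.FAST_RECOVERY
        self.events_in_phase = 0
        self.target_rate = None           # set on the first rate decrease

    def on_rate_increase_event(self) -> Phase:
        """Count one increase event and advance the phase if needed."""
        self.events_in_phase += 1
        last_phase = self.phase is Phase.HYPER_ACTIVE_INCREASE
        if self.events_in_phase >= self.threshold and not last_phase:
            self.phase = Phase(self.phase.value + 1)
            self.events_in_phase = 0
        return self.phase

    def on_rate_decrease(self, rate_before_reduction: float) -> None:
        """Remember the pre-reduction rate and restart from Fast Recovery."""
        self.target_rate = rate_before_reduction
        self.phase = Phase.FAST_RECOVERY
        self.events_in_phase = 0
```

A QP that keeps receiving CNPs therefore stays in Fast Recovery, while a QP that sees no congestion for long enough progresses to the more aggressive probing phases.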
A rate increase event occurs when a predefined time period has passed since the previous rate increase, or when a predefined number of bytes has been sent since the previous rate increase, provided that no rate decrease event occurred in the meantime.
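The trigger condition can be sketched as follows. The sketch follows the wording above: both the timer and the byte counter are measured from the previous rate increase, and a rate decrease restarts both. The class name, method names and parameters are hypothetical, and time is passed in explicitly so the logic stays deterministic.

```python
class IncreaseTrigger:
    """Decides when a rate increase event fires for one QP (sketch)."""

    def __init__(self, time_period: float, byte_threshold: int,
                 now: float = 0.0):
        self.time_period = time_period        # hypothetical, e.g. microseconds
        self.byte_threshold = byte_threshold  # hypothetical, in bytes
        self._restart(now)

    def _restart(self, now: float) -> None:
        self.window_start = now
        self.bytes_sent = 0

    def on_bytes_sent(self, nbytes: int) -> None:
        """Account for bytes transmitted on the QP."""
        self.bytes_sent += nbytes

    def on_rate_decrease(self, now: float) -> None:
        """A rate decrease cancels progress toward the next increase."""
        self._restart(now)

    def poll(self, now: float) -> bool:
        """Return True if a rate increase event fires at time `now`."""
        timer_expired = now - self.window_start >= self.time_period
        bytes_reached = self.bytes_sent >= self.byte_threshold
        if timer_expired or bytes_reached:
            self._restart(now)
            return True
        return False
```

The byte counter lets a fast flow probe for bandwidth more often than the timer alone would allow, while the timer guarantees that a slow flow still recovers.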
During the Fast Recovery phase, every rate increase event moves the rate halfway toward target_rate: current_rate = (current_rate + target_rate)/2. This allows a fast return toward the rate at which congestion occurred, and a more careful increase as the rate approaches that value.
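A small worked example of the Fast Recovery step shows how quickly the gap to target_rate closes. The helper names and the starting rates are illustrative.

```python
def fast_recovery_step(current_rate: float, target_rate: float) -> float:
    """One Fast Recovery event: move halfway toward target_rate."""
    return (current_rate + target_rate) / 2.0


def fast_recovery_trace(current_rate: float, target_rate: float,
                        events: int):
    """Return the rate after each of `events` consecutive increase events."""
    trace = []
    for _ in range(events):
        current_rate = fast_recovery_step(current_rate, target_rate)
        trace.append(current_rate)
    return trace


if __name__ == "__main__":
    # Rate was cut from 100 to 50; three increase events follow.
    print(fast_recovery_trace(50.0, 100.0, 3))  # [75.0, 87.5, 93.75]
```

Each event halves the remaining distance to target_rate, so the steps are large right after a reduction and shrink as the rate nears the level at which congestion was last seen.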
In the Active Increase (Probing) and Hyper-active Increase (Probing) phases, every rate increase event raises the rate by a constant value. This allows the flow to gain throughput when bandwidth frees up.
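The constant-step increase can be sketched as below, following the simplified description in the text. The cap at `line_rate` and the example step sizes are assumptions added for illustration; the text does not state the step values or whether they differ between the two probing phases.

```python
def additive_increase(current_rate: float, step: float,
                      line_rate: float) -> float:
    """Raise the rate by a constant step, never above line_rate.

    The line_rate cap is an assumption of this sketch.
    """
    return min(current_rate + step, line_rate)


if __name__ == "__main__":
    rate = 90.0
    # Hypothetical step of 5 rate units per increase event.
    for _ in range(3):
        rate = additive_increase(rate, step=5.0, line_rate=100.0)
        print(rate)
```

Unlike Fast Recovery, the step here does not shrink as the rate grows, so the flow keeps probing for spare bandwidth until the next CNP triggers a rate decrease.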