This post shows how to enable and configure DCQCN (RoCE congestion control) in Linux for ConnectX-4 adapters.
- Server installed with ConnectX-4
- Linux OS + latest MLNX_OFED
Commands for MLNX_OFED 4.1 or above
This change is also planned to be added to the upstream kernel.
In MLNX_OFED 4.1 and above, the parameters have moved to the following debugfs path:
# ls -l /sys/kernel/debug/mlx5/0000\:04\:00.0/cc_params/
-rw------- 1 root root 0 May 18 12:36 np_cnp_dscp
-rw------- 1 root root 0 May 18 12:36 np_cnp_prio
-rw------- 1 root root 0 May 18 12:36 np_cnp_prio_mode
-rw------- 1 root root 0 May 18 12:36 rp_ai_rate
-rw------- 1 root root 0 May 18 12:36 rp_byte_reset
-rw------- 1 root root 0 May 18 12:36 rp_clamp_tgt_rate
-rw------- 1 root root 0 May 18 12:36 rp_clamp_tgt_rate_ati
-rw------- 1 root root 0 May 18 12:36 rp_dce_tcp_g
-rw------- 1 root root 0 May 18 12:36 rp_dce_tcp_rtt
-rw------- 1 root root 0 May 18 12:36 rp_gd
-rw------- 1 root root 0 May 18 12:36 rp_hai_rate
-rw------- 1 root root 0 May 18 12:36 rp_initial_alpha_value
-rw------- 1 root root 0 May 18 12:36 rp_min_dec_fac
-rw------- 1 root root 0 May 18 12:36 rp_min_rate
-rw------- 1 root root 0 May 18 12:36 rp_rate_reduce_monitor_period
-rw------- 1 root root 0 May 18 12:36 rp_rate_to_set_on_first_cnp
-rw------- 1 root root 0 May 18 12:36 rp_threshold
-rw------- 1 root root 0 May 18 12:36 rp_time_reset
1. Starting from MLNX_OFED 4.1, ECN is enabled by default (in the firmware).
2. rp_max_rate and np_min_time_between_cnps were removed.
To learn more about those parameters, see DC-QCN Parameters.
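To review all current values at once, the files in the directory can be read in a loop. The following is a minimal sketch, not part of MLNX_OFED: the function name dump_cc_params is hypothetical, and the PCI address in the example path is the one from the listing above and will differ on other systems. Reading the files requires root and a mounted debugfs.

```shell
# dump_cc_params DIR
# Print "name = value" for every regular file in a cc_params directory.
# (Hypothetical helper; DIR is supplied by the caller.)
dump_cc_params() {
    local dir="$1" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        printf '%s = %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Example (run as root; the PCI address is system specific):
# dump_cc_params '/sys/kernel/debug/mlx5/0000:04:00.0/cc_params'
```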
Mapping between old and new parameter names
|Old Name||New Name|
0 - means use L2 priority (configured).
1 - means use the priority from the incoming packet
|enable (for NP)||N/A (can be controlled by the firmware)|
|enable (for RP)||N/A (controlled by the firmware)|
Commands for MLNX_OFED 4.0 or below
Note: DCQCN parameters are located at the following path: /sys/class/net/<interface>/ecn
# ls /sys/class/net/ens785f1/ecn
Notification point (NP) parameters:
# ls /sys/class/net/ens785f1/ecn/roce_np
cnp_802p_prio // PCP of CNP packets
cnp_dscp //DSCP of CNP packets
enable //enable congestion control (responding with CNPs to ECN-marked arrived RoCE packets)
min_time_between_cnps //minimal time between two consecutive CNPs sent
// if ECN-marked RoCE packet arrives in a period smaller than min_time_between_cnps
// since previous sent CNP, no CNP will be sent as a response
Reaction point (RP) parameters:
# ls /sys/class/net/ens785f1/ecn/roce_rp
clamp_tgt_rate // If set, when receiving a CNP, the target rate is always updated
// to be the current rate (contrary to original algorithm)
dce_tcp_g // Weight of the new sampling in moving average calculation of alpha
enable // Enable congestion control (rate limiting of flows after CNP arrival)
rate_reduce_monitor_period // Minimal interval for rate reduction for a flow. If a CNP is received
// during the interval, the flow rate is reduced at the beginning of the next
// rate_reduce_monitor_period interval to (1-Alpha/Gd)*CurrentRate.
// rpg_gd is given as log2(Gd), where Gd may only be powers of 2.
rpg_ai_rate // Rate increase in AI mode
rpg_gd // If a CNP is received, the flow rate is reduced at the beginning of the next
// rate_reduce_monitor_period interval to (1-Alpha/Gd)*CurrentRate.
// rpg_gd is given as log2(Gd), where Gd may only be a power of 2.
rpg_max_rate // The maximum rate, in Mbits per second, at which a congestion-controlled flow
// can transmit. Once this limit is reached, the flow is not rate limited any more.
rpg_min_rate // The minimum rate, in Mbits per second, to which a flow can be limited.
rpg_time_reset // Time counter for rate increase event
clamp_tgt_rate_after_time_inc // If set, when receiving a CNP, the target rate is updated to be the
// current rate also if the last rate increase event was due to the timer,
// and not only due to the byte counter (contrary to original algorithm)
dce_tcp_rtt // Window for sampling of moving average calculation of alpha
initial_alpha_value // Initial alpha value for a new QP
rate_to_set_on_first_cnp // The rate that is set for the flow, upon first CNP received, in Mbps.
rpg_byte_reset // Byte counter for rate increase event
rpg_hai_rate // Rate increase in HAI mode
rpg_min_dec_fac // Maximal factor by which the rate can be reduced (2 means that the new rate can be divided by 2 at maximum)
rpg_threshold // Number of rate increase events for switching between Fast Recovery, Active Increase, Hyper Active Increase modes.
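The rate reduction formula (1-Alpha/Gd)*CurrentRate, with rpg_gd holding log2(Gd), can be worked through with integer arithmetic. This is only a sketch of the arithmetic: the function name reduced_rate is hypothetical, and it assumes that Alpha is an integer expressed on the same scale as Gd. The values 1023 and 11 are illustrative, not documented defaults.

```shell
# reduced_rate CURRENT_RATE ALPHA RPG_GD
# Computes (1 - Alpha/Gd) * CurrentRate, where Gd = 2^RPG_GD.
# Assumption: ALPHA is an integer on the same scale as Gd.
# Integer division truncates the subtracted amount.
reduced_rate() {
    local rate="$1" alpha="$2" log2_gd="$3"
    local gd=$(( 1 << log2_gd ))
    echo $(( rate - rate * alpha / gd ))
}

# 100000 Mbps flow, Alpha = 1023, rpg_gd = 11 (Gd = 2048):
reduced_rate 100000 1023 11   # prints 50049
```

With these illustrative values Alpha/Gd is just under 1/2, so the flow rate is roughly halved.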
Driver Configuration for OFED 4.0 or below
1. Enable DCQCN on specific priority
For example, to enable DCQCN on priority 3 as RP and NP, run:
# echo 1 > /sys/class/net/ens785f1/ecn/roce_np/enable/3
# echo 1 > /sys/class/net/ens785f1/ecn/roce_rp/enable/3
You may run that on all priorities (0-7).
- Notification Point: Sending CNP packets is handled globally per port; enabling any priority here turns it on.
- Reaction Point: Handling CNP is configured per priority.
- ECN bits in the IP header are always marked with 01 for RoCE traffic (whether or not RoCE CC is enabled on a priority).
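To enable DCQCN on all priorities (0-7) as both NP and RP, the two echo commands can be wrapped in a loop. This is a minimal sketch: the function name enable_dcqcn_all is hypothetical, and the ecn directory is passed as an argument so that the interface name (ens785f1 in this post) can be replaced with your own.

```shell
# enable_dcqcn_all ECN_DIR
# Write 1 to the NP and RP enable file of every priority (0-7).
# ECN_DIR is /sys/class/net/<interface>/ecn on MLNX_OFED 4.0 or below.
enable_dcqcn_all() {
    local ecn_dir="$1" prio
    for prio in 0 1 2 3 4 5 6 7; do
        echo 1 > "$ecn_dir/roce_np/enable/$prio"
        echo 1 > "$ecn_dir/roce_rp/enable/$prio"
    done
}

# Example (run as root):
# enable_dcqcn_all /sys/class/net/ens785f1/ecn
```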
2. It is recommended to set the CNP DSCP or 802p (PCP) priority values on the NP and to configure guaranteed QoS in the switches for this value.
# echo 48 > /sys/class/net/ens785f1/ecn/roce_np/cnp_dscp
# echo 6 > /sys/class/net/ens785f1/ecn/roce_np/cnp_802p_prio
The reason for this is that in case of congestion, it is desired that the CNP packets can bypass the congested data packets and reach the source of congested flows faster.
3. If you are using more than one TC, map priorities to TCs using the mlnx_qos tool.
In this example you can see that priority 0 is mapped to tc0.
# mlnx_qos -i ens785f1 -p 0,1,2,3,4,5,6,7
priority 0 1 2 3 4 5 6 7
enabled 0 0 0 0 0 0 0 0
tc: 0 ratelimit: unlimited, tsa: vendor
tc: 1 ratelimit: unlimited, tsa: vendor
tc: 2 ratelimit: unlimited, tsa: vendor
tc: 3 ratelimit: unlimited, tsa: vendor
tc: 4 ratelimit: unlimited, tsa: vendor
tc: 5 ratelimit: unlimited, tsa: vendor
tc: 6 ratelimit: unlimited, tsa: vendor
tc: 7 ratelimit: unlimited, tsa: vendor
Non Volatile Considerations
The driver configuration is not persistent. For a persistent configuration, use the mlxconfig tool.
1. Make sure the MFT package is installed. If it is not, download it from Mellanox.com and install it.
2. Start MFT
# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Unloading MST PCI module (unused) - Success
3. Get the device status
# mst status
MST PCI module is not loaded
MST PCI configuration module loaded
/dev/mst/mt4115_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:02:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
4. Enable DCQCN
Use the mlxconfig command (part of MFT) to set the bitmap parameter ROCE_CC_PRIO_MASK_P1, or ROCE_CC_PRIO_MASK_P2 for the second port.
This parameter is an 8-bit bitmap. The LSB is mapped to priority 0 (tc0), while the MSB is mapped to priority 7 (tc7).
For example, to enable DCQCN on tc0 (00000001b = 0x1), run:
# mlxconfig -d /dev/mst/mt4115_pciconf0 -y s ROCE_CC_PRIO_MASK_P1=0x1
Another example, to enable DCQCN on tc5 (00100000b = 0x20), run:
# mlxconfig -d /dev/mst/mt4115_pciconf0 -y s ROCE_CC_PRIO_MASK_P1=0x20
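When several priorities are needed, the bitmap value is the OR of one bit per priority. The helper below is a small sketch for computing it. The name prio_mask is hypothetical and is not part of MFT; it only prints the value to pass to mlxconfig.

```shell
# prio_mask PRIO [PRIO...]
# Print the 8-bit bitmap (in hex) with one bit set per given priority:
# bit 0 (LSB) is priority 0, bit 7 (MSB) is priority 7.
prio_mask() {
    local mask=0 prio
    for prio in "$@"; do
        mask=$(( mask | (1 << prio) ))
    done
    printf '0x%x\n' "$mask"
}

prio_mask 0       # prints 0x1
prio_mask 5       # prints 0x20
prio_mask 0 3 5   # prints 0x29
```

The printed value can then be used as ROCE_CC_PRIO_MASK_P1 in the mlxconfig commands shown above.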
5. Set CNP priority and DSCP fields.
For example, to set the CNP egress priority to 6 and the egress DSCP value to 48, run:
# mlxconfig -d /dev/mst/mt4115_pciconf0 -y s CNP_DSCP_P1=48 CNP_802P_PRIO_P1=6
Note: MLNX_OFED 3.4 has a known bug with the CNP_DSCP_P1 and CNP_802P_PRIO_P1 configuration: the values are not reflected correctly in the driver.
Other DCQCN-related parameters can be found in the MFT User Manual on Mellanox.com.
6. Reset the firmware to load the parameters
# mlxfwreset -d /dev/mst/mt4115_pciconf0 -y reset
Minimal reset level for device, /dev/mst/mt4115_pciconf0:
3: Driver restart and PCI reset
Continue with reset?[y/N] y
-I- Stopping Driver -Done
-I- Sending Reset Command To Fw -Done
-I- Resetting PCI -Done
-I- Starting Driver -Done
-I- Restarting MST -Done
-I- FW was loaded successfully.
After the driver reset, you can verify that the configuration was updated by querying the device again.
There are two ways to verify this:
One is to check that the firmware defaults were changed (via mlxconfig query).
The second is to check that the driver actually reflects this configuration (via sysfs parameters, cat /sys…).
1. Verify firmware defaults.
# mst start
# mlxconfig -d /dev/mst/mt4115_pciconf0 q
ROCE_CC_PRIO_MASK_P1, CNP_DSCP_P1 and CNP_802P_PRIO_P1 should show the current configuration.
2. Check that DCQCN is enabled in the driver (that is, that the driver reflects the configuration).
# cat /sys/class/net/ens3/ecn/roce_np/enable/*
# cat /sys/class/net/ens3/ecn/roce_rp/enable/*
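The output of cat with a wildcard does not show which value belongs to which priority. The sketch below prints the NP and RP state per priority. The function name show_dcqcn_state is hypothetical, and the ecn directory is passed as an argument (ens3 is the interface used in this example).

```shell
# show_dcqcn_state ECN_DIR
# Print the NP and RP enable value for every priority (0-7).
# ECN_DIR is /sys/class/net/<interface>/ecn on MLNX_OFED 4.0 or below.
show_dcqcn_state() {
    local ecn_dir="$1" prio
    for prio in 0 1 2 3 4 5 6 7; do
        printf 'prio %s: np=%s rp=%s\n' "$prio" \
            "$(cat "$ecn_dir/roce_np/enable/$prio")" \
            "$(cat "$ecn_dir/roce_rp/enable/$prio")"
    done
}

# Example:
# show_dcqcn_state /sys/class/net/ens3/ecn
```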
Use the perftest package (e.g. ib_write_bw) with the --sl (or -S) Service Level flag (skprio), which sets the SL priority to match the ECN priority and the TC on the switch.
-S 3 in this example sets the user priority of the RDMA packets to 3.
If you use a different User Priority (for example 0, with DSCP 0), change the -S parameter accordingly.
# ib_write_bw --report_gbits -D5 -d mlx5_0 -F -x 6 -S 3 22.214.171.124
Fast and Simple Testing (No QoS)
If you already understand the theory (it is complex), or just wish to get the configuration commands to enable ECN on the adapter, check the following procedure.
In this plan, all traffic goes with no QoS, over priority 0, traffic class 0. Refer to HowTo Configure Resilient RoCE End-to-End Using ConnectX-4 and Spectrum (No QoS) for more information.