HowTo Configure DCQCN (RoCE CC) for ConnectX-4 (Linux)

Version 20

    This post shows how to enable and configure DCQCN (RoCE congestion control) in Linux for ConnectX-4 adapters.

     


    Setup

    • Server installed with ConnectX-4
    • Linux OS + latest MLNX_OFED

     

    Commands for MLNX_OFED 4.1 or above

    In MLNX_OFED 4.1 and above, the DCQCN parameters have moved to the debugfs path shown below. This is also planned to be added to the upstream kernel.

     

    /sys/kernel/debug/mlx5/<PCI BUS>/cc_params/

     

    For example:

    # ls -l /sys/kernel/debug/mlx5/0000\:04\:00.0/cc_params/

     

    -rw------- 1 root root 0 May 18 12:36 np_cnp_dscp

    -rw------- 1 root root 0 May 18 12:36 np_cnp_prio

    -rw------- 1 root root 0 May 18 12:36 np_cnp_prio_mode

    -rw------- 1 root root 0 May 18 12:36 rp_ai_rate

    -rw------- 1 root root 0 May 18 12:36 rp_byte_reset

    -rw------- 1 root root 0 May 18 12:36 rp_clamp_tgt_rate

    -rw------- 1 root root 0 May 18 12:36 rp_clamp_tgt_rate_ati

    -rw------- 1 root root 0 May 18 12:36 rp_dce_tcp_g

    -rw------- 1 root root 0 May 18 12:36 rp_dce_tcp_rtt

    -rw------- 1 root root 0 May 18 12:36 rp_gd

    -rw------- 1 root root 0 May 18 12:36 rp_hai_rate

    -rw------- 1 root root 0 May 18 12:36 rp_initial_alpha_value

    -rw------- 1 root root 0 May 18 12:36 rp_min_dec_fac

    -rw------- 1 root root 0 May 18 12:36 rp_min_rate

    -rw------- 1 root root 0 May 18 12:36 rp_rate_reduce_monitor_period

    -rw------- 1 root root 0 May 18 12:36 rp_rate_to_set_on_first_cnp

    -rw------- 1 root root 0 May 18 12:36 rp_threshold

    -rw------- 1 root root 0 May 18 12:36 rp_time_reset

     

    Notes:

    1. Starting from MLNX_OFED 4.1, ECN is enabled by default (in the firmware).

    2. The rp_max_rate and np_min_time_between_cnps parameters were removed.

     

    To learn more about those parameters, see DC-QCN Parameters.
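To review the current values of all parameters at once, a small helper can walk the cc_params directory. This is a minimal sketch rather than a Mellanox tool: the function name dump_cc_params is hypothetical, and it assumes debugfs is mounted and that you run it as root (the files above are readable by root only).

```shell
#!/usr/bin/env bash
# Print every DCQCN parameter under a cc_params directory as "name=value".
# The directory argument is the debugfs path shown above, for example
# /sys/kernel/debug/mlx5/0000:04:00.0/cc_params
dump_cc_params() {
    local dir="$1" f
    if [ ! -d "$dir" ]; then
        echo "cc_params directory not found: $dir" >&2
        return 1
    fi
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

Usage: dump_cc_params /sys/kernel/debug/mlx5/0000:04:00.0/cc_params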

     

    Mapping between old and new parameter names

    Old Name                         New Name
    cnp_802p_prio                    np_cnp_prio
    cnp_dscp                         np_cnp_dscp
    N/A                              np_cnp_prio_mode
                                       0 - use the L2 priority (configured)
                                       1 - use the priority from the incoming packet
    enable (for NP)                  N/A (can be controlled by the firmware)
    min_time_between_cnps            N/A
    clamp_tgt_rate                   rp_clamp_tgt_rate
    dce_tcp_g                        rp_dce_tcp_g
    enable (for RP)                  N/A (controlled by the firmware)
    rate_reduce_monitor_period       rp_rate_reduce_monitor_period
    rpg_ai_rate                      rp_ai_rate
    rpg_gd                           rp_gd
    rpg_max_rate                     N/A
    rpg_min_rate                     rp_min_rate
    rpg_time_reset                   rp_time_reset
    clamp_tgt_rate_after_time_inc    rp_clamp_tgt_rate_ati
    dce_tcp_rtt                      rp_dce_tcp_rtt
    initial_alpha_value              rp_initial_alpha_value
    rate_to_set_on_first_cnp         rp_rate_to_set_on_first_cnp
    rpg_byte_reset                   rp_byte_reset
    rpg_hai_rate                     rp_hai_rate
    rpg_min_dec_fac                  rp_min_dec_fac
    rpg_threshold                    rp_threshold
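When porting scripts written for MLNX_OFED 4.0 or below, the mapping above can be encoded as a lookup. The following is a sketch only: new_cc_name is a hypothetical helper name, and it covers exactly the names listed in the mapping.

```shell
# Translate a pre-4.1 parameter name to its MLNX_OFED 4.1+ name.
# Prints the new name, or "N/A" for parameters that no longer have an
# equivalent file. Returns 1 for names that are not in the mapping.
new_cc_name() {
    case "$1" in
        cnp_802p_prio)                 echo np_cnp_prio ;;
        cnp_dscp)                      echo np_cnp_dscp ;;
        clamp_tgt_rate_after_time_inc) echo rp_clamp_tgt_rate_ati ;;
        clamp_tgt_rate|dce_tcp_g|dce_tcp_rtt|initial_alpha_value)
            echo "rp_$1" ;;
        rate_reduce_monitor_period|rate_to_set_on_first_cnp)
            echo "rp_$1" ;;
        rpg_max_rate|min_time_between_cnps|enable)
            echo N/A ;;
        rpg_ai_rate|rpg_gd|rpg_min_rate|rpg_time_reset)
            echo "rp_${1#rpg_}" ;;
        rpg_byte_reset|rpg_hai_rate|rpg_min_dec_fac|rpg_threshold)
            echo "rp_${1#rpg_}" ;;
        *)
            echo "unknown parameter: $1" >&2
            return 1 ;;
    esac
}
```

For instance, new_cc_name rpg_ai_rate prints rp_ai_rate, and new_cc_name rpg_max_rate prints N/A.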

     

     

    Commands for MLNX_OFED 4.0 or below

    Note: DCQCN parameters are located under the following path: /sys/class/net/<interface>/ecn

    # ls /sys/class/net/ens785f1/ecn
    roce_np roce_rp

     

    Notification point (NP) parameters:

    # ls /sys/class/net/ens785f1/ecn/roce_np

    cnp_802p_prio          // PCP of CNP packets

    cnp_dscp               // DSCP of CNP packets

    enable                 // Enable congestion control (respond with CNPs to arriving ECN-marked RoCE packets)

    min_time_between_cnps  // Minimal time between two consecutive CNPs sent.

                           //   If an ECN-marked RoCE packet arrives less than min_time_between_cnps

                           //   after the previously sent CNP, no CNP is sent in response.

     

    Reaction point (RP) parameters:

    # ls /sys/class/net/ens785f1/ecn/roce_rp

    clamp_tgt_rate                // If set, when receiving a CNP, the target rate is always updated

                                  //   to be the current rate (contrary to the original algorithm)

     

    dce_tcp_g                     // Weight of the new sampling in moving average calculation of alpha

     

    enable                        // Enable congestion control (rate limiting of flows after CNP arrival)

     

    rate_reduce_monitor_period    // Minimal interval for rate reduction for a flow. If a CNP is received

                                  //  during the interval, the flow rate is reduced at the beginning of the next

                                  //  rate_reduce_monitor_period interval to (1-Alpha/Gd)*CurrentRate.

                                  //  rpg_gd is given as log2(Gd), where Gd may only be powers of 2.

     

    rpg_ai_rate                   // Rate increase in AI mode

     

    rpg_gd                        // If a CNP is received, the flow rate is reduced at the beginning of the next

                                  //   rate_reduce_monitor_period interval to (1-Alpha/Gd)*CurrentRate.

                                  //   rpg_gd is given as log2(Gd), where Gd may only be a power of 2.

     

    rpg_max_rate                  // The maximum rate, in Mbits per second, at which a congestion-controlled flow

                                  //   can transmit. Once this limit is reached, the flow is no longer rate limited.

     

    rpg_min_rate                  // The minimum value, in Mbits per second, for rate to limit.

     

    rpg_time_reset                // Time counter for rate increase event

     

    clamp_tgt_rate_after_time_inc // If set, when receiving a CNP, the target rate is updated to be the

                                  //   current rate also if the last rate increase event was due to the timer,

                                  //   and not only due to the byte counter (contrary to the original algorithm)

     

    dce_tcp_rtt                  // Window for sampling of moving average calculation of alpha

     

    initial_alpha_value          // Initial alpha value for a new QP

     

    rate_to_set_on_first_cnp     // The rate, in Mbps, that is set for the flow upon the first CNP received.

     

    rpg_byte_reset               // Byte counter for rate increase event

     

     

    rpg_hai_rate                 // Rate increase in HAI mode

     

     

     

    rpg_min_dec_fac              // Maximal factor by which the rate can be reduced (2 means that the new rate can be divided by 2 at maximum)

     

    rpg_threshold                // Number of rate increase events for switching between Fast Recovery, Active Increase, Hyper Active Increase modes.
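The rate reduction described for rate_reduce_monitor_period and rpg_gd, (1-Alpha/Gd)*CurrentRate with rpg_gd = log2(Gd), can be worked through with integer arithmetic. This is an illustrative sketch, not the firmware's implementation. The scale of Alpha is not stated in this post, so the sketch assumes Alpha is a raw integer on the same scale as Gd; the function name reduced_rate and the sample values are hypothetical.

```shell
# Integer sketch of the rate reduction (1 - Alpha/Gd) * CurrentRate.
# Arguments: current rate (Mbps), alpha (raw integer), rpg_gd (= log2(Gd)).
# Assumption: alpha and Gd share the same integer scale.
# The division truncates toward zero, as shell arithmetic is integer-only.
reduced_rate() {
    local rate="$1" alpha="$2" log2_gd="$3"
    local gd=$(( 1 << log2_gd ))
    echo $(( rate - rate * alpha / gd ))
}
```

With these assumed inputs, reduced_rate 10000 1023 11 uses Gd = 2048 and prints 5005, roughly half of the current rate, while an alpha of 0 leaves the rate unchanged.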

     

    Configuration

    Driver Configuration for OFED 4.0 or below

    1. Enable DCQCN on a specific priority

    For example, to enable DCQCN on priority 3 as RP and NP, run:

    # echo 1 > /sys/class/net/ens785f1/ecn/roce_np/enable/3

    # echo 1 > /sys/class/net/ens785f1/ecn/roce_rp/enable/3

     

    Notes:

    • You may run this for any of the priorities (0-7).

    • Notification Point: CNP transmission is handled globally per port, so enabling any priority here turns it on.

    • Reaction Point: CNP handling is configured per priority.
    • ECN bits in the IP header are always marked with 01 for RoCE traffic (whether or not RoCE CC is enabled on a priority).
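The two echo commands above can be wrapped so that NP and RP are always enabled together and the priority is range-checked first. This is a minimal sketch for MLNX_OFED 4.0 or below; enable_dcqcn is a hypothetical helper name, and the ecn directory is passed in as an argument rather than hard-coded.

```shell
# Enable DCQCN for one priority as both NP and RP.
# Arguments: the interface's ecn directory (e.g. /sys/class/net/ens785f1/ecn)
# and a priority in the range 0-7.
enable_dcqcn() {
    local ecn_dir="$1" prio="$2" role
    case "$prio" in
        [0-7]) ;;
        *) echo "priority must be 0-7, got: $prio" >&2; return 1 ;;
    esac
    for role in roce_np roce_rp; do
        echo 1 > "$ecn_dir/$role/enable/$prio" || return 1
    done
}
```

Usage: enable_dcqcn /sys/class/net/ens785f1/ecn 3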

     

    2. It is recommended to set the CNP DSCP or 802.1p (PCP) priority value on the NP, and to configure guaranteed QoS in the switches for this value.

    For example:

    # echo 48 > /sys/class/net/ens785f1/ecn/roce_np/cnp_dscp

    # echo 6 > /sys/class/net/ens785f1/ecn/roce_np/cnp_802p_prio

    The reason is that under congestion, CNP packets should be able to bypass the congested data packets and reach the source of the congested flows faster.
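Since DSCP is a 6-bit field (0-63) and PCP is a 3-bit field (0-7), it can help to validate both values before writing them. The helper below is a sketch under those assumptions; set_cnp_marking and is_uint are hypothetical names, and the roce_np directory is passed as an argument.

```shell
# Return 0 if the argument is a non-empty string of decimal digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
}

# Write the CNP DSCP and 802.1p (PCP) values after range-checking them.
# Arguments: roce_np directory (e.g. /sys/class/net/ens785f1/ecn/roce_np),
# DSCP (0-63), PCP (0-7).
set_cnp_marking() {
    local np_dir="$1" dscp="$2" pcp="$3"
    if ! is_uint "$dscp" || [ "$dscp" -gt 63 ]; then
        echo "DSCP must be 0-63, got: $dscp" >&2
        return 1
    fi
    if ! is_uint "$pcp" || [ "$pcp" -gt 7 ]; then
        echo "PCP must be 0-7, got: $pcp" >&2
        return 1
    fi
    echo "$dscp" > "$np_dir/cnp_dscp" &&
    echo "$pcp" > "$np_dir/cnp_802p_prio"
}
```

Usage: set_cnp_marking /sys/class/net/ens785f1/ecn/roce_np 48 6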

     

    3. If you are using more than one TC, map priorities to TCs using the mlnx_qos tool.

    In this example, each priority is mapped to the TC with the same number (priority 0 to tc0, priority 1 to tc1, and so on).

    # mlnx_qos -i ens785f1 -p 0,1,2,3,4,5,6,7

    PFC configuration:

      priority    0   1   2   3   4   5   6   7

      enabled     0   0   0   0   0   0   0   0  

     

     

    tc: 0 ratelimit: unlimited, tsa: vendor

      priority:  0

    tc: 1 ratelimit: unlimited, tsa: vendor

      priority:  1

    tc: 2 ratelimit: unlimited, tsa: vendor

      priority:  2

    tc: 3 ratelimit: unlimited, tsa: vendor

      priority:  3

    tc: 4 ratelimit: unlimited, tsa: vendor

      priority:  4

    tc: 5 ratelimit: unlimited, tsa: vendor

      priority:  5

    tc: 6 ratelimit: unlimited, tsa: vendor

      priority:  6

    tc: 7 ratelimit: unlimited, tsa: vendor

      priority:  7

     

     

     

    Non Volatile Considerations

    The driver configuration is not persistent. For a persistent configuration, use the mlxconfig tool.

     

    1. Make sure the MFT package is installed. If it is not, download it from Mellanox.com and install it.

     

    2. Start MFT

    # mst start

    Starting MST (Mellanox Software Tools) driver set

    Loading MST PCI module - Success

    Loading MST PCI configuration module - Success

    Create devices

    Unloading MST PCI module (unused) - Success

     

    3. Get the device status

    # mst status

    MST modules:

    ------------

        MST PCI module is not loaded

        MST PCI configuration module loaded

     

    MST devices:

    ------------

    /dev/mst/mt4115_pciconf0         - PCI configuration cycles access.

                                       domain:bus:dev.fn=0000:02:00.0 addr.reg=88 data.reg=92

                                       Chip revision is: 00

     

     

    4. Enable DCQCN

    Use the mlxconfig command to set the bitmap parameter ROCE_CC_PRIO_MASK_P1, or ROCE_CC_PRIO_MASK_P2 for the second port.

    This parameter is an 8-bit bitmap. The LSB is mapped to priority 0 (tc0), while the MSB is mapped to priority 7 (tc7).

     

    For example, to enable DCQCN on tc0 (00000001b), run:

     

    # mlxconfig -d /dev/mst/mt4115_pciconf0 -y s ROCE_CC_PRIO_MASK_P1=0x1

     

    Another example, to enable DCQCN on tc5 (00100000b = 0x20), run:

     

    # mlxconfig -d /dev/mst/mt4115_pciconf0 -y s ROCE_CC_PRIO_MASK_P1=0x20
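To avoid working out the bitmap by hand, the value can be computed from a list of priorities, with bit N set for priority N (the LSB is priority 0). This is a small sketch; prio_mask is a hypothetical helper name and not part of MFT.

```shell
# Compute the 8-bit priority bitmap for ROCE_CC_PRIO_MASK_P1/P2.
# Arguments: one or more priorities in the range 0-7.
# Prints the mask in hexadecimal, e.g. 0x20 for priority 5.
prio_mask() {
    local mask=0 p
    for p in "$@"; do
        case "$p" in
            [0-7]) mask=$(( mask | (1 << p) )) ;;
            *) echo "priority must be 0-7, got: $p" >&2; return 1 ;;
        esac
    done
    printf '0x%x\n' "$mask"
}
```

For instance, prio_mask 0 prints 0x1 and prio_mask 5 prints 0x20, matching the two examples above, while prio_mask 0 3 prints 0x9. The result can be passed to mlxconfig as ROCE_CC_PRIO_MASK_P1="$(prio_mask 3)".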

     

     

    5. Set CNP priority and DSCP fields.

    For example, to set the CNP egress priority to 6 and the egress DSCP value to 48, run:

     

    # mlxconfig -d /dev/mst/mt4115_pciconf0 -y s CNP_DSCP_P1=48 CNP_802P_PRIO_P1=6

     

     

    Note: MLNX_OFED 3.4 has a known bug with the CNP_DSCP_P1 and CNP_802P_PRIO_P1 configuration: the values are not reflected correctly in the driver.

     

    Other DCQCN-related parameters can be found in the MFT User Manual on Mellanox.com.

     

    6. Reset the firmware to load the parameters

    # mlxfwreset -d /dev/mst/mt4115_pciconf0 -y reset

     

    Minimal reset level for device, /dev/mst/mt4115_pciconf0:

     

    3: Driver restart and PCI reset

    Continue with reset?[y/N] y

    -I- Stopping Driver                         -Done

    -I- Sending Reset Command To Fw             -Done

    -I- Resetting PCI                           -Done

    -I- Starting Driver                         -Done

    -I- Restarting MST                          -Done

    -I- FW was loaded successfully.

     

    Verification

     

    After the driver reset, verify that the configuration was updated by querying the device again.

     

    There are two ways to verify that:

    • The first is to check that the firmware defaults were changed (via mlxconfig query).

    • The second is to check that the driver actually reflects this configuration (via sysfs parameters, cat  /sys…).

     

    1. Verify firmware defaults.

    # mst start

    # mlxconfig -d /dev/mst/mt4115_pciconf0 q

     

    ROCE_CC_PRIO_MASK_P1, CNP_DSCP_P1 and CNP_802P_PRIO_P1 should show the current configuration.

     

    2. Check that DCQCN is enabled in the driver (that is, the driver reflects the configuration).

    # cat  /sys/class/net/ens3/ecn/roce_np/enable/*

    1

    0

    0

    0

    0

    0

    0

    0

    # cat  /sys/class/net/ens3/ecn/roce_rp/enable/*

    1

    0

    0

    0

    0

    0

    0

    0
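To compare the driver state with the firmware bitmap directly, the per-priority enable files can be folded back into an 8-bit mask. This is a sketch that assumes each file under enable/ is named after its priority (0-7) and holds 0 or 1, as in the output above; enabled_mask is a hypothetical helper name.

```shell
# Build the 8-bit priority bitmap from a driver enable directory
# (e.g. /sys/class/net/ens3/ecn/roce_rp/enable), where the file named N
# holds 1 if DCQCN is enabled for priority N. Missing files count as 0.
enabled_mask() {
    local dir="$1" mask=0 p
    for p in 0 1 2 3 4 5 6 7; do
        if [ "$(cat "$dir/$p" 2>/dev/null)" = "1" ]; then
            mask=$(( mask | (1 << p) ))
        fi
    done
    printf '0x%x\n' "$mask"
}
```

For the output shown above (only priority 0 enabled), enabled_mask prints 0x1, which should correspond to the ROCE_CC_PRIO_MASK_P1 value that was set with mlxconfig.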

     

     

    Performance Testing

    Use the perftest package (e.g. ib_write_bw) with the --sl (or -S) flag, which sets the Service Level (skprio). Set it to the priority on which ECN is enabled and which maps to the intended TC on the switch.

    For example, the following command uses -S 3, which sets the user priority of the RDMA packets to 3. If you use a different User Priority, change the -S parameter.

    # ib_write_bw --report_gbits -D5 -d mlx5_0  -F  -x 6 -S 3 12.12.12.9

     

    Fast and Simple Testing (No QoS)

    If you already understand the theory (it is complex), or just want the configuration commands to enable ECN on the adapter, follow this procedure.

    In this plan, all traffic is sent with no QoS, over priority 0 and traffic class 0. Refer to HowTo Configure Resilient RoCE End-to-End Using ConnectX-4 and Spectrum (No QoS) for more information.