HowTo Configure Resilient RoCE End-to-End Using ConnectX-4 and Spectrum (No QoS)

Version 16

    This post describes how to configure and test resilient RoCE end-to-end using two servers equipped with ConnectX-4 adapters (MLNX_OFED v3.4) and a Spectrum switch (MLNX-OS).

    In this environment, there is no specific QoS configuration on the network.

     

    References

     

    Overview

    • Create congestion in the setup. In this example, traffic is sent from a 100G port to a 40G port.
    • [Optional] Add a monitoring server to monitor the traffic
    • Enable ECN on the adapter and switch
    • No VLANs are being used
    • Default configuration of QoS is used (IP DSCP priority 0)
    • No link-level flow control (neither global pause nor PFC) is used
    • The latest MLNX_OFED and the latest MLNX-OS are installed

       

      Setup

       

       

       

       

      Server Configuration

      Perform the following procedure on each server:

      1. Set the default RoCE mode to v2 (this command is not persistent). For more information, refer to HowTo Set the Default RoCE Mode When Using RDMA CM.

      # cma_roce_mode -d mlx5_0 -p 1 -m 2

      RoCE V2

       

      2. Enable ECN on all priorities.

      # echo 1 > /sys/class/net/ens3/ecn/roce_np/enable/0   <-- 0 means priority 0

      # echo 1 > /sys/class/net/ens3/ecn/roce_rp/enable/0

      # echo 1 > /sys/class/net/ens3/ecn/roce_np/enable/1

      # echo 1 > /sys/class/net/ens3/ecn/roce_rp/enable/1

      # echo 1 > /sys/class/net/ens3/ecn/roce_np/enable/2  

      # echo 1 > /sys/class/net/ens3/ecn/roce_rp/enable/2

      ...

      # echo 1 > /sys/class/net/ens3/ecn/roce_np/enable/7  

      # echo 1 > /sys/class/net/ens3/ecn/roce_rp/enable/7

       

      Note: These commands are non-persistent. To keep ECN enabled after reboot, enable it in the firmware using the mlxconfig command.

      # mlxconfig -d /dev/mst/mt4115_pciconf0 -y s ROCE_CC_PRIO_MASK_P1=0xFF

      Refer to HowTo Configure DCQCN (RoCE CC) values for ConnectX-4 (Linux) for the full procedure.
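
      The per-priority commands in step 2 differ only in the priority number, so they can be collapsed into a loop. The following is a minimal sketch, assuming the same sysfs layout as in the commands above. The function name enable_roce_ecn and its optional second argument (an alternative root directory for dry runs) are introduced here for illustration only and are not part of MLNX_OFED.

```shell
# Sketch: enable ECN for the RoCE notification point (roce_np) and
# reaction point (roce_rp) on priorities 0-7 of one interface.
# enable_roce_ecn is a hypothetical helper, not an MLNX_OFED tool.
enable_roce_ecn() {
    local ifname=$1
    local sysroot=${2:-/sys/class/net}   # override only for dry runs
    local prio role
    for prio in 0 1 2 3 4 5 6 7; do
        for role in roce_np roce_rp; do
            # Same write as the manual commands: 1 = enabled
            echo 1 > "$sysroot/$ifname/ecn/$role/enable/$prio" || return 1
        done
    done
}

# Usage (as root): enable_roce_ecn ens3
```

      Like the manual commands, this setting does not survive a reboot.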

       

       

      Spectrum Switch Configuration (MLNX-OS)

      Before starting, it is recommended to verify that you start with the default configuration (default running-config).

       

      1. Force the speed on ports 1/2 and 1/3.

      The speed on port 1/2 is forced only to create synthetic congestion; the speed on port 1/3 is forced for the optional monitoring server.

      switch (config) # interface ethernet 1/2 speed 40000 force    <--- this is done to create synthetic congestion

      switch (config) # interface ethernet 1/3 speed 40000 force    <--- This port is connected to the monitoring server equipped with ConnectX-3 40G (in this example)

       

      2. Set ECN/RED on the relevant switch ports. Configure ECN marking upon congestion for all types of traffic (all traffic classes).

      switch (config) # interface ethernet 1/1-1/2 traffic-class 0 congestion-control both minimum-absolute 150 maximum-absolute 1500

      switch (config) # interface ethernet 1/1-1/2 traffic-class 1 congestion-control both minimum-absolute 150 maximum-absolute 1500

      switch (config) # interface ethernet 1/1-1/2 traffic-class 2 congestion-control both minimum-absolute 150 maximum-absolute 1500

      switch (config) # interface ethernet 1/1-1/2 traffic-class 3 congestion-control both minimum-absolute 150 maximum-absolute 1500

      switch (config) # interface ethernet 1/1-1/2 traffic-class 4 congestion-control both minimum-absolute 150 maximum-absolute 1500

      switch (config) # interface ethernet 1/1-1/2 traffic-class 5 congestion-control both minimum-absolute 150 maximum-absolute 1500

      switch (config) # interface ethernet 1/1-1/2 traffic-class 6 congestion-control both minimum-absolute 150 maximum-absolute 1500

      switch (config) # interface ethernet 1/1-1/2 traffic-class 7 congestion-control both minimum-absolute 150 maximum-absolute 1500

      For more details, see HowTo Configure ECN on Mellanox Ethernet Switches (Spectrum).

       

      3. [Optional] Create a monitoring session for the 40G port 1/2 (source) towards the monitoring port 1/3 (destination).

      switch (config) # monitor session 1

      switch (config) # monitor session 1 add source interface ethernet 1/2 direction both

      switch (config) # monitor session 1 destination interface ethernet 1/3 force

      switch (config) # monitor session 1 no shutdown
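
      The eight ECN/RED commands in step 2 differ only in the traffic-class number. As a convenience, the sketch below prints them from a loop so that they can be pasted into the MLNX-OS CLI; it runs on any Linux host and sends nothing to the switch. The helper name print_ecn_config is hypothetical, and the port range and thresholds are simply the values used in this post.

```shell
# Sketch: print the per-traffic-class ECN/RED commands from step 2.
# print_ecn_config is a hypothetical helper; it only writes text to stdout.
print_ecn_config() {
    local ports=$1 min_kb=$2 max_kb=$3 tc
    for tc in 0 1 2 3 4 5 6 7; do
        printf 'interface ethernet %s traffic-class %s congestion-control both minimum-absolute %s maximum-absolute %s\n' \
            "$ports" "$tc" "$min_kb" "$max_kb"
    done
}

# Values used in this post: ports 1/1-1/2, thresholds 150 and 1500
print_ecn_config 1/1-1/2 150 1500
```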

       

      For other switch vendors, refer to the vendor's switch documentation.

       

      Verification

      • The expected result is at least 90% of the bandwidth achieved when flow control is enabled. In this example, 36Gb/s was received on a 40G link (expected bandwidth is ~38Gb/s).
      • It is recommended to use at least 8 QPs/threads running in parallel on different cores to better utilize the link.
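
      The 90% criterion is a simple ratio check. The sketch below shows one way to script it, assuming that the ~38Gb/s figure is the baseline measured with flow control enabled. The helper name meets_target is hypothetical; awk is used because the shell has no floating-point arithmetic.

```shell
# Sketch: check a measured aggregate bandwidth against the 90% criterion.
# meets_target is a hypothetical helper; it returns 0 (success) when
# measured >= baseline * fraction.
meets_target() {
    local measured=$1 baseline=$2 fraction=${3:-0.90}
    awk -v m="$measured" -v b="$baseline" -v f="$fraction" \
        'BEGIN { exit !(m >= b * f) }'
}

# Figures from this post: 36 Gb/s measured, ~38 Gb/s assumed as the baseline.
# The threshold is 38 * 0.90 = 34.2 Gb/s, so this prints PASS.
if meets_target 36 38; then echo "PASS"; else echo "FAIL"; fi
```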

       

      When using Perftest Package for benchmarking:

      • Force RoCEv2 Options:
        1. Use the -R flag when RDMA CM is used and RoCEv2 is configured as the default (cma_roce_mode -d mlx5_0 -p 1 -m 2).
        2. Use the -x flag with the proper GID index for RoCEv2 (run show_gids to find the GID index to use).
      • Use the -f flag to start measuring after 2 seconds.
      • Refer to the help of the commands to learn about all the other possible options.

       

      Running Example for RDMA CM

       

      Server command:

      # for i in {0..7} ; do taskset -c $i ib_write_bw --report_gbits -D 10 -f 2 -F -R -p $((20000+i)) & done
      ...

      Client command:

      # for i in {0..7} ; do taskset -c $i ib_write_bw --report_gbits -D 10 -f 2 -F -R 1.1.1.2 -p $((20000+i)) & done | grep 65536 | awk '{sum+=$4} END {print sum}'

      36.05

       

      Example for Running a Command When Using GID Index

       

      1. Get the GID table (using the show_gids script), and select the right index. In this example, index 4 is used.

      # show_gids

      DEV     PORT  INDEX  GID                                      IPv4          VER   DEV
      ---     ----  -----  ---                                      ------------  ---   ---
      mlx5_0  1     0      fe80:0000:0000:0000:e61d:2dff:feca:c19e                V1    ens3
      mlx5_0  1     1      fe80:0000:0000:0000:e61d:2dff:feca:c19e                V2    ens3
      mlx5_0  1     2      fe80:0000:0000:0000:e61d:2dff:feca:c19e                V1.5  ens3
      mlx5_0  1     3      0000:0000:0000:0000:0000:ffff:0101:0101  1.1.1.1       V1    ens3
      mlx5_0  1     4      0000:0000:0000:0000:0000:ffff:0101:0101  1.1.1.1       V2    ens3
      mlx5_0  1     5      0000:0000:0000:0000:0000:ffff:0101:0101  1.1.1.1       V1.5  ens3

       

      2. Run the following commands using -x 4 (instead of -R).

       

      Server command:

      # for i in {0..7} ; do taskset -c $i ib_write_bw --report_gbits -D 10 -f 2 -F -x 4 -p $((20000+i)) & done
      ...

      Client command:

      # for i in {0..7} ; do taskset -c $i ib_write_bw --report_gbits -D 10 -f 2 -F -x 4 1.1.1.2 -p $((20000+i)) & done | grep 65536 | awk '{sum+=$4} END {print sum}'

      36.06
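
      The GID index chosen by hand in step 1 can also be picked with a short filter. This is a sketch only, assuming the show_gids column layout shown above (rows that carry an IPv4 address have seven whitespace-separated columns). The helper name rocev2_gid_index is hypothetical.

```shell
# Sketch: print the RoCE v2 GID index for a given IPv4 address,
# reading show_gids output on stdin.
# rocev2_gid_index is a hypothetical helper, not part of MLNX_OFED.
# Expected columns on rows with an address: DEV PORT INDEX GID IPv4 VER DEV
rocev2_gid_index() {
    awk -v ip="$1" \
        'NF == 7 && $5 == ip && toupper($6) == "V2" { print $3; exit }'
}

# Usage: show_gids | rocev2_gid_index 1.1.1.1
```

      With the table from step 1 as input, the filter selects index 4, which is the value passed to -x in the commands above.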

       

      Switch Monitoring

      Monitor the ECN counters in MLNX-OS on the 40G port 1/2. The ECN marked packets counter is expected to rise constantly, because the client (100G port) keeps trying to push more traffic until it reaches the bandwidth limit and receives CNP frames from the server (40G port).

      # show interfaces ethernet 1/2 congestion-control

      Interface ethernet: 1/2

       

      ECN marked packets: 4328

      TC-0

              Mode: ECN

              Threshold mode: absolute

              Minimum threshold: 150 KB

              Maximum threshold: 1500 KB

              RED dropped packets: 0

      TC-1

              Mode: none

      TC-2

              Mode: none

      TC-3

              Mode: none

      TC-4

              Mode: none

      TC-5

              Mode: none

      TC-6

              Mode: none

      TC-7

              Mode: none
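
      To confirm that the counter keeps rising, the output can be saved twice and the two values compared. The sketch below extracts the counter from saved text on a Linux host; it assumes the output format shown above. The helper name ecn_marked and the file names in the usage comment are hypothetical.

```shell
# Sketch: extract the "ECN marked packets" counter from saved output of
# "show interfaces ethernet 1/2 congestion-control" (text on stdin).
# ecn_marked is a hypothetical helper.
ecn_marked() {
    awk -F': *' '/ECN marked packets/ { print $2 + 0; exit }'
}

# Usage with two hypothetical captures taken a few seconds apart:
#   before=$(ecn_marked < sample1.txt)
#   after=$(ecn_marked < sample2.txt)
#   [ "$after" -gt "$before" ] && echo "ECN marking is active"
```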

       

      Wireshark Monitoring

       

      Use Wireshark to capture the packets from the switch (monitored traffic from the 40G port 1/2).

       

      1. Make sure you see ECN-capable frames (ECN = 10b or 01b). Refer to the attachment below to download the captured pcap file.

       

       

      2. Make sure the switch has marked the packet as congested (ECN = 11b).

      Use the Wireshark filter ip.dsfield.ecn == 0x3. The direction is from the 100G server to the 40G server; in this case, the source IP is 1.1.1.1. Refer to the attachment below to download the captured pcap file.

       

       

      3. CNP frames were observed (InfiniBand BTH opcode is 0x81, over UDP, source port 0).

      Use the filter infiniband.bth.opcode == 0x81. If you do not have the latest Wireshark installed, use udp.srcport == 0 instead (the UDP source port of CNP frames is zero).

      Refer to the attachment below to download the pcap file captured.

       

       

      4. Review the Wireshark attachments below.
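
      The same three checks can be run from the command line with tshark, the terminal version of Wireshark. This is a minimal sketch using the display filters given in steps 1 to 3. The helper name count_matching and the file name capture.pcap are placeholders, and the infiniband filter may require a recent tshark, as noted above for Wireshark.

```shell
# Sketch: count packets in a capture that match a display filter.
# count_matching is a hypothetical helper; capture.pcap is a placeholder.
count_matching() {
    tshark -r "$1" -Y "$2" 2>/dev/null | awk 'END { print NR }'
}

pcap=capture.pcap
# Run only when tshark is installed and the capture file is readable
if command -v tshark >/dev/null 2>&1 && [ -r "$pcap" ]; then
    echo "ECN-capable (01b/10b):   $(count_matching "$pcap" 'ip.dsfield.ecn == 1 || ip.dsfield.ecn == 2')"
    echo "Congestion marked (11b): $(count_matching "$pcap" 'ip.dsfield.ecn == 3')"
    echo "CNP frames (0x81):       $(count_matching "$pcap" 'infiniband.bth.opcode == 0x81')"
fi
```

      Non-zero counts for all three filters correspond to the observations in steps 1 to 3.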