HowTo Configure Resilient RoCE (ECN) End-to-End Using ConnectX-4 and Spectrum (Trust L2)

Version 12

    The following procedure explains how to configure Resilient RoCE in a basic setup using ECN and "trust L2" Quality of Service (PCP-based QoS).

    This is a general overview describing the running-configuration. For a more detailed explanation, refer to other posts related to ECN and RoCE Congestion Control at: RDMA/RoCE Solutions.

     

    Note: This post is very similar to HowTo Configure Lossless RoCE (PFC + ECN) End-to-End Using ConnectX-4 and Spectrum (Trust L2) with the following differences:

    - In this case, Priority Flow Control (PFC) is disabled in the network (switches and adapters).

    - The switch buffer configuration is different since it uses only one pool.

    The rest of the configuration is the same.

     


    Setup

In this example, we use one server equipped with a ConnectX-4 Lx adapter (10G) and another server equipped with a ConnectX-4 adapter (100G). Both are connected through three Spectrum switches (two ToR switches and a spine).

    In this setup you create synthetic congestion between the servers, while the 100G interface sends traffic to the 10G interface via the switch.

• PFC is disabled on the servers and the switches.
• ECN is enabled on both the servers and the switches.
• All traffic is mapped to switch buffer pool 0.
• RDMA is running over L2 at priority 3.
• CNP egress priority is set to 6.
• Non-RDMA traffic (for example, TCP) is running over priority 0.
• Both servers should have the latest version of MLNX_OFED installed, and the switches must have the latest MLNX-OS software installed.

     

    Note: The DSCP value on the packet is not relevant in this procedure as we are using Trust L2. To learn more about trust, see Understanding QoS Classification (Trust) on Spectrum Switches.

     

    L3 Considerations

If the setup is on a larger scale and contains router ports, the priority should be preserved between the router ports, either by preserving the PCP (using 802.1Q VLAN encapsulation) or by using the L3 DSCP field (Trust L3).

The ingress priority should be mapped to the right switch-priority (for example, RDMA traffic must be mapped to switch-priority 3).

     

    Configuration

     

    L3 Setup Configuration

The steps in this section are a prerequisite: they establish L3 connectivity between the servers via the switches.

     

1. For easier setup, change the default interface name to eth1; see the procedure here: HowTo Change Network Interface Name in Linux Permanently.

     

    2. Create a VLAN interface on the server.

     

    a. Enable the 8021q Linux kernel module and run it on the Linux server:

    # modprobe 8021q
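Note that modprobe does not persist across reboots. On systemd-based distributions, one common way to load the module at boot is a modules-load.d drop-in (the path below is an assumption; adjust it for your distribution):

```shell
# Load the 802.1Q VLAN module now and on every subsequent boot
modprobe 8021q
echo 8021q > /etc/modules-load.d/8021q.conf

# Verify that the module is loaded
lsmod | grep 8021q
```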

     

b. Add the configuration file on each server (with a different VLAN ID and IP address for each).

Server S5, for example:

    # cat  /etc/sysconfig/network-scripts/ifcfg-eth1.5

     

    VLAN=yes

    TYPE=Vlan

    DEVICE=eth1.5

    PHYSDEV=eth1

    VLAN_ID=5

    REORDER_HDR=0

    BOOTPROTO=static

    DEFROUTE=yes

    IPV4_FAILURE_FATAL=no

NAME=eth1.5

    ONBOOT=yes

    IPADDR=1.1.5.2

    NETMASK=255.255.255.0

    NM_CONTROLLED=no

     

    Server S6:

    # cat  /etc/sysconfig/network-scripts/ifcfg-eth1.6

     

VLAN=yes

TYPE=Vlan

DEVICE=eth1.6

PHYSDEV=eth1

VLAN_ID=6

REORDER_HDR=0

BOOTPROTO=static

DEFROUTE=yes

IPV4_FAILURE_FATAL=no

NAME=eth1.6

ONBOOT=yes

IPADDR=1.1.6.2

NETMASK=255.255.255.0

NM_CONTROLLED=no

     

    3. Set routing ports on the switches.

     

In this example, the ToR links (Tor-1 and Tor-2) to the hosts will be configured with VLAN interfaces (VLANs 5 and 6), while the links between the ToRs and the spine will be configured as router interfaces.

On the router interfaces, VLAN tags are added (encapsulation dot1q) to preserve the L2 priority.

     

    Tor-1

    switch (config) # ip routing vrf default

     

    switch (config) # interface ethernet 1/1 no switchport force

switch (config) # interface ethernet 1/1 encapsulation dot1q vlan 6 force

    switch (config) # interface ethernet 1/1 ip address 1.1.2.2 255.255.255.0

     

    switch (config) # vlan 5

    switch (config) # interface vlan 5

    switch (config) # interface vlan 5 ip address 1.1.5.1 255.255.255.0

    switch (config) # interface ethernet 1/5 switchport mode trunk

     

    switch (config) # interface loopback 1

    switch (config) # interface loopback 1 ip address 127.1.1.13 255.255.255.255

     

    Tor-2

    switch (config) # ip routing vrf default

     

    switch (config) # interface ethernet 1/1 no switchport force

    switch (config) # interface ethernet 1/1 encapsulation dot1q vlan 1 force

    switch (config) # interface ethernet 1/1 ip address 1.1.1.2 255.255.255.0

     

    switch (config) # vlan 6

    switch (config) # interface vlan 6

    switch (config) # interface vlan 6 ip address 1.1.6.1 255.255.255.0

    switch (config) # interface ethernet 1/6 switchport mode trunk

     

    switch (config) # interface loopback 1

    switch (config) # interface loopback 1 ip address 127.1.1.14 255.255.255.255

     

    Spine

    switch (config) # ip routing vrf default

     

    switch (config) # interface ethernet 1/1 no switchport force

    switch (config) # interface ethernet 1/1 encapsulation dot1q vlan 1 force

    switch (config) # interface ethernet 1/1 ip address 1.1.1.1 255.255.255.0

     

    switch (config) # interface ethernet 1/2 no switchport force

    switch (config) # interface ethernet 1/2 encapsulation dot1q vlan 6 force

    switch (config) # interface ethernet 1/2 ip address 1.1.2.1 255.255.255.0

     

    switch (config) # interface loopback 1

    switch (config) # interface loopback 1 ip address 127.1.1.11 255.255.255.255

     

4. Enable Open Shortest Path First (OSPF) on the switches as follows:

     

    Tor-1

    switch (config) # protocol ospf

    switch (config) # router ospf 1 vrf default

    switch (config) # router ospf 1 vrf default router-id 127.1.1.13

    switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0

    switch (config) # interface vlan 5 ip ospf area 0.0.0.0

     

    Tor-2

    switch (config) # protocol ospf

    switch (config) # router ospf 1 vrf default

    switch (config) # router ospf 1 vrf default router-id 127.1.1.14

    switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0

    switch (config) # interface vlan 6 ip ospf area 0.0.0.0

     

    Spine

    switch (config) # protocol ospf

    switch (config) # router ospf 1 vrf default

    switch (config) # router ospf 1 vrf default router-id 127.1.1.11

    switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0

    switch (config) # interface ethernet 1/2 ip ospf area 0.0.0.0

     

5. Change the speed of the relevant port (in this example, ToR-2 port 1/6) to 10G in order to create synthetic congestion between the servers. Run the following:

    switch (config) # interface ethernet 1/6 speed 10000 force

     

6. Set the route to the far-end subnet on each server (S5 and S6).

    For server S5:

    # ip route add 1.1.0.0/16 via 1.1.5.1

     

    For server S6:

    # ip route add 1.1.0.0/16 via 1.1.6.1
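These ip route commands are not persistent. On RHEL-family systems, a persistent route can be kept in a route-&lt;interface&gt; file (the path and format below follow the network-scripts convention and are an assumption for your distribution). For server S5, for example:

```
# /etc/sysconfig/network-scripts/route-eth1.5
1.1.0.0/16 via 1.1.5.1
```

For server S6, use the matching gateway (1.1.6.1) in route-eth1.6.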

     

7. Check L3 connectivity by pinging between the servers.

     

    At this point, ping should be running between the servers.

     

    Setup QoS (Servers)

    Run the following on both servers:

     

    1. Make sure Priority Flow Control (PFC) is disabled on the adapters by running:

    # mlnx_qos -i eth1 --pfc 0,0,0,0,0,0,0,0

     

    For additional information, see mlnx_qos.

     

Alternatively, stop the lldpad service so that DCBX does not override this setting:

    # service lldpad stop

     

     

    For more details on PFC configuration, see HowTo Configure PFC on ConnectX-4 .


    2. Enable ECN on priority 3.

    # echo 1 > /sys/class/net/eth1/ecn/roce_np/enable/3

    # echo 1 > /sys/class/net/eth1/ecn/roce_rp/enable/3

     

    Note: This command is not persistent.

     

    For more details on ECN, see HowTo Configure DCQCN (RoCE CC) for ConnectX-4 (Linux).

     

    3. Set CNP L2 egress priority to 6.

    # echo 6 > /sys/class/net/eth1/ecn/roce_np/cnp_802p_prio

     

    Note: This command is not persistent.

     

    4. Enable ECN on the TCP traffic:

    # sysctl -w net.ipv4.tcp_ecn=1

    net.ipv4.tcp_ecn = 1

     

    Note: This command is not persistent.

     

    5. Set RoCE mode to V2 for RDMA CM traffic.

    # cma_roce_mode -d mlx5_0 -p 1 -m 2

    For more details, see HowTo Set the Default RoCE Mode When Using RDMA CM.

     

    6. Set the default ToS to 24 (DSCP 6) mapped to skprio 4.

    # cma_roce_tos -d mlx5_0 -t 24

    For more info, see HowTo Set Egress ToS/DSCP on RDMA-CM QPs.
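As a sanity check, the ToS-to-DSCP relationship used above can be verified with a few lines of Python (the DSCP field is the upper six bits of the ToS byte; the lower two bits are the ECN field):

```python
def tos_to_dscp(tos: int) -> int:
    """DSCP is the upper 6 bits of the 8-bit ToS byte."""
    return tos >> 2

def dscp_to_tos(dscp: int) -> int:
    """ToS value with the two ECN bits left at zero."""
    return dscp << 2

print(tos_to_dscp(24))  # ToS 24 -> DSCP 6, as passed to cma_roce_tos above
print(dscp_to_tos(6))   # DSCP 6 -> ToS 24
```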

     

    7. Set the egress priority map (skprio 4 mapped to L2 priority 3).

    # vconfig set_egress_map eth1.5 4 3

     

In the following example, TCP is sent over priority 0, so you do not need to change the default settings.

    See HowTo Set Egress Priority VLAN on Linux for more options.
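Note that vconfig is deprecated on newer distributions. Where only iproute2 is available, the same skprio-to-priority mapping can be applied when the VLAN interface is created (the egress-qos-map option below is standard iproute2 syntax; verify it against your iproute2 version):

```shell
# Create eth1.5 with skprio 4 mapped to VLAN (L2) priority 3 on egress
ip link add link eth1 name eth1.5 type vlan id 5 egress-qos-map 4:3
```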

     

    Setup QoS (Switch)

1. Make sure PFC is disabled on the switch by running:

    # no dcb priority-flow-control enable force

     

2. Enable ECN on traffic classes 0 and 3, and give CNP traffic (traffic class 6) strict egress priority, as follows:

    # interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

    # interface ethernet 1/1-1/32 traffic-class 0 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

    # interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict

     

    3. Set up the buffer configuration with all traffic mapped to the same pool.

    # pool ePool0 direction egress-mc size 10485760 type dynamic

    # pool iPool0 direction ingress size 10485760 type dynamic

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg6 map pool iPool0 type lossy reserved 20480 shared alpha 8

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg3 map pool iPool0 type lossy reserved 20480 shared alpha 8

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg0 bind switch-priority 0          <This is the default configuration>

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg6 bind switch-priority 6

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg3 bind switch-priority 3

     

Note: Lossy traffic (TCP or any other background traffic) is buffered in lossy pool 0, which does not require any additional reserved buffers; a hidden buffer of 20 KB per port is reserved by default.
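The alpha value in the dynamic-pool bindings above controls how much of the pool's free space a single port group may consume. Under the usual dynamic-threshold model (an assumption here; confirm against your switch documentation), the cap is alpha/(1+alpha) of the currently free buffer:

```python
def dynamic_threshold(alpha: float, free_bytes: int) -> float:
    """Maximum shared-buffer usage for one port group under the
    dynamic-threshold model: alpha/(1+alpha) of the free space."""
    return alpha / (1.0 + alpha) * free_bytes

# With alpha 8 (as configured above) and the 10 MB pool from the example,
# a single port group may take up to ~8/9 of the free buffer.
print(round(dynamic_threshold(8, 10485760)))
```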

     

    Test the RDMA Layer

    1. Get the GID index using show_gids.

     

    In this example, we want to use eth1.5 over RoCE v2. To do that, we need to use GID INDEX 5.

    # show_gids

    DEV    PORT INDEX     GID                                        IPv4            VER DEV

    ---    ---- -----     ---                                        ------------    --- ---

    mlx5_0  1    0       fe80:0000:0000:0000:e61d:2dff:fef2:a488                     v1  eth1

    mlx5_0  1    1       fe80:0000:0000:0000:e61d:2dff:fef2:a488                     v2  eth1

    mlx5_0  1    2       0000:0000:0000:0000:0000:ffff:0101:0105       1.1.1.5       v1  eth1

    mlx5_0  1    3       0000:0000:0000:0000:0000:ffff:0101:0105       1.1.1.5       v2  eth1

mlx5_0  1    4       0000:0000:0000:0000:0000:ffff:0101:0502       1.1.5.2       v1  eth1.5

mlx5_0  1    5       0000:0000:0000:0000:0000:ffff:0101:0502       1.1.5.2       v2  eth1.5
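The GID for a VLAN interface is simply the IPv4-mapped IPv6 form of the interface address, so you can predict which GID to expect for a given IP with the Python standard library:

```python
import ipaddress

def ipv4_to_gid(ip: str) -> str:
    """Return the IPv4-mapped IPv6 address (::ffff:a.b.c.d),
    expanded the way show_gids prints it."""
    return ipaddress.IPv6Address("::ffff:" + ip).exploded

print(ipv4_to_gid("1.1.5.2"))  # 0000:0000:0000:0000:0000:ffff:0101:0502
```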

     

    2. Run performance benchmarks. It is recommended that you use multiple QPs (multiple threads).

     

    For example, run the ib_send_bw (or any other InfiniBand test from the Perftest Package) server on the host with the 10Gb/s link.

     

    Use -S 3 to set the egress map to L2 priority 3 as follows:

     

    # for i in {0..7} ; do taskset -c $i ib_send_bw -R -x 5 -d mlx5_0 -F --report_gbits -f 2 -D 10 -S 3 -p $((10000+i)) & done

     

    Run the client on the other host:

    # for i in {0..7} ; do taskset -c $i ib_send_bw -R  -x 5 -d mlx5_0 -F --report_gbits -f 2 -D 10 -S 3 1.1.5.2 -p $((10000+i)) & done  | grep 65536 | awk '{sum+=$4} END {print sum}'

     

3. To verify that your setup is correct, add a mirroring port on the switch and copy the traffic to a third server (refer to the figure in the Setup section).

     

    To learn more about switch mirroring, see HowTo Configure Port Mirroring on Mellanox Ethernet Switches.

For example, run the following on one of the switches to verify the session on the relevant port:

    switch (config) # show monitor session 1

     

    Session 1

    Admin:  Enable

    Status: Up

    Truncate:   Disable

    Destination interface: eth1/3

    Congestion type: drop-excessive-frames

    Header format: local

               -switch priority: 0

     

    Source interfaces

    Interface  direction

    --------------------------

    eth1/1     both

     

Run Wireshark on the monitor server, and make sure that:

1. RDMA traffic is being sent with a VLAN tag on priority 3 (as configured).

2. RoCE v2 is used (UDP destination port 4791).

     

     

    3. If you have congestion in the network, you should be able to see CNP traffic on priority 6 (as configured) (refer to RoCEv2 CNP Packet Format Example for information about the packet format).

     

     

See the attached Wireshark capture (below).

     

     

    Other Considerations

    TCP Flows

    If you plan to run other traffic types such as TCP, make sure that you use priority 0 on the VLAN.

     

    Egress Scheduling (QoS)

    When you send more traffic types, you can set different weights for different traffic flows (for example 60% RDMA, 40% TCP). Check the mlnx_qos tool for the servers, and refer to Understanding TC Scheduling on Mellanox Spectrum Switches (WRR, SP) for information about the switch configuration.
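On the adapter side, such a split could be sketched with mlnx_qos as below. The --tsa/--tcbw flags and the traffic-class assignments are assumptions based on common MLNX_OFED versions; check mlnx_qos --help on your system before applying:

```shell
# Hypothetical ETS split on the adapter: TC0 (TCP) gets 40% and
# TC3 (RDMA) gets 60% of the link bandwidth; weights must sum to 100.
mlnx_qos -i eth1 --tsa=ets,ets,ets,ets,ets,ets,ets,ets \
         --tcbw=40,0,0,60,0,0,0,0
```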

     

    Other Switch Running Configuration Examples

See the attached switch configuration examples below, for example for the Cisco Nexus 3132Q and 3132C and the Arista DCS-7050QX-32-F.

     

    Debugging ECN and PFC

     

    Switch Counters and Buffer Levels

    1. Check the port counters on the switch by running:

    # show interface ethernet 1/1 counters

The RoCE congestion control mechanism adjusts the sender rate according to the egress port congestion thresholds, which dramatically reduces drops on the congested port and helps the adapter queues release traffic smoothly.

     

    2. Check ECN marking.

    # show interfaces ethernet 1/1 congestion-control

The ECN-marked packet counters on traffic class 3 should increase on the 10G port due to congestion.

     

    3. Check the buffer status by running:

    # show buffer status interfaces ethernet 1/1

     

Make sure that you send RDMA traffic on priority 3, CNP traffic on priority 6, and TCP on priority 0 (check the MaxUsage column for those priority groups).

     

4. Check the QoS counters for PFC, PG, TC, and switch-priority; for example, see QoS Counters on Mellanox Spectrum Switches (PFC, PG, TC, Switch-priority).

     

    Server Counters

    1. Get port priority counters on priority 0 (TCP) and 3 (RDMA) by running:

    # watch -n 1 "ethtool -S eth1 | grep prio"

     

2. To get CNP counters, refer to HowTo Read CNP Counters on Mellanox adapters.

     

    Using Startup Scripts and Running Config

     

To keep things simple, copy and paste these startup scripts to the servers and switches, and adjust them as needed.

     

    Server Startup Script Example

    echo 1 > /sys/class/net/eth1/ecn/roce_np/enable/3

    echo 1 > /sys/class/net/eth1/ecn/roce_rp/enable/3

    echo 6 > /sys/class/net/eth1/ecn/roce_np/cnp_802p_prio

    sysctl -w net.ipv4.tcp_ecn=1

    # route add -net <network> netmask <mask> gw <gateway IP> 

    cma_roce_mode -d mlx5_0 -p 1 -m 2

    cma_roce_tos -d mlx5_0 -t 24

    vconfig set_egress_map eth1.5 4 3

     

    Switch running-config

    See attached.

     

    Troubleshooting

1. Ensure that RoCE v2 is used, and check with Wireshark that RoCE is being sent over the UDP layer.

    2. Make sure that you send the packet over the VLAN interface on the proper priority.

3. If you cross routers, make sure that the priority is preserved when traffic is mapped from one subnet (VLAN) to the other.

    4. Use Trust L2 across the network (the priority is taken from the L2 header).

5. Make sure the CNP packets are being sent: check the counters on the servers, and look for the actual CNP packets in Wireshark.

    6. Verify that RDMA traffic is being sent on the right priority.