HowTo Configure Resilient RoCE (ECN) End-to-End Using ConnectX-4 and Spectrum (Trust L3)

Version 8

    The following procedure explains how to configure Resilient RoCE in a basic setup using ECN and "trust L3" Quality of Service (QoS) (DSCP-based QoS).

    This is a general overview describing the running-configuration. For a more detailed explanation, refer to other posts related to ECN and RoCE Congestion Control at: RDMA/RoCE Solutions.

     

    Note: This post is very similar to HowTo Configure Resilient RoCE (ECN) End-to-End Using ConnectX-4 and Spectrum (Trust L2) with the following differences:

    - The priority in the network is taken from L3 DSCP bits and not from L2 PCP. The switches are configured with Trust L3 on the QoS.

    - You do not need to have VLAN encapsulation on the router ports (between Spine and ToR).

    The rest of the configuration is the same.

     


    Setup

    In this example, one server is equipped with ConnectX-4 Lx (10G) and another with ConnectX-4 (100G). Both are connected through three Spectrum switches (Tor-1, Tor-2, and Spine).

    The plan is to create synthetic congestion between the servers, while the 100G interface sends traffic to the 10G interface via the switch where:

    • Priority Flow Control (PFC) is disabled on the servers and the switches.
    • ECN is enabled on both servers and on the switches.
    • All traffic is mapped to switch buffer pool 0.
    • RDMA is running over L3 DSCP 6.
    • CNP egress DSCP is set to 48.
    • Non-RDMA traffic (for example TCP) is running over L3 DSCP 0.
    • A VLAN header between the router ports is not needed, as the priority is taken from the DSCP field in the IP header.
    • Both servers should have the latest MLNX_OFED version installed, and the switches must have the latest MLNX-OS software installed.

     

    Note: L2 Priority (PCP) value on the packet is not relevant in this procedure as you are using Trust L3. To learn more about trust, see Understanding QoS Classification (Trust) on Spectrum Switches.

     

     

     

     

     

     

    Configuration

     

    L3 Setup Configuration

    Follow the steps in this section to set up L3 connectivity between the servers via the switches.

     

    1. For easier setup, change the default interface name to eth1, see the procedure here: HowTo Change Network Interface Name in Linux Permanently.

     

    2. Create a VLAN interface on the server.

     

    a. Load the 8021q Linux kernel module on the server:

    # modprobe 8021q

     

    b. Add the file for each server (with a different IP address for each).

    Server S5 for example:

    # cat  /etc/sysconfig/network-scripts/ifcfg-eth1.5

     

    VLAN=yes

    TYPE=Vlan

    DEVICE=eth1.5

    PHYSDEV=eth1

    VLAN_ID=5

    REORDER_HDR=0

    BOOTPROTO=static

    DEFROUTE=yes

    IPV4_FAILURE_FATAL=no

    NAME=eth1.5

    ONBOOT=yes

    IPADDR=1.1.5.2

    NETMASK=255.255.255.0

    NM_CONTROLLED=no

     

    Server S6:

    # cat  /etc/sysconfig/network-scripts/ifcfg-eth1.6

     

    VLAN=yes

    TYPE=Vlan

    DEVICE=eth1.6

    PHYSDEV=eth1

    VLAN_ID=6

    REORDER_HDR=0

    BOOTPROTO=static

    DEFROUTE=yes

    IPV4_FAILURE_FATAL=no

    NAME=eth1.6

    ONBOOT=yes

    IPADDR=1.1.6.2

    NETMASK=255.255.255.0

    NM_CONTROLLED=no
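
The two files differ only in the VLAN ID and the IP address, and the device name has to match the suffix of the file name. As a minimal sketch, the file body can be generated from those parameters. The function name gen_vlan_ifcfg is hypothetical (it is not part of MLNX_OFED), and the S6 values assume VLAN 6 with address 1.1.6.2, matching the Tor-2 VLAN interface 1.1.6.1 configured below.

```shell
#!/bin/sh
# Hypothetical helper: print an ifcfg body for a VLAN sub-interface.
# Usage: gen_vlan_ifcfg <physical-dev> <vlan-id> <ip-address>
gen_vlan_ifcfg() {
    physdev=$1; vlan_id=$2; ipaddr=$3
    cat <<EOF
VLAN=yes
TYPE=Vlan
DEVICE=${physdev}.${vlan_id}
PHYSDEV=${physdev}
VLAN_ID=${vlan_id}
REORDER_HDR=0
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
NAME=${physdev}.${vlan_id}
ONBOOT=yes
IPADDR=${ipaddr}
NETMASK=255.255.255.0
NM_CONTROLLED=no
EOF
}

# Print the body for server S6 (assumed VLAN 6, address 1.1.6.2).
# Redirect the output to /etc/sysconfig/network-scripts/ifcfg-eth1.6 to install it.
gen_vlan_ifcfg eth1 6 1.1.6.2
```
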

     

    3. Set routing ports on the switches.

     

    In this example, the ToR links (Tor-1 and Tor-2) to the hosts are configured with VLAN interfaces (VLANs 5 and 6), while the links between the ToRs and the Spine are configured as router interfaces.

    Because the switches use Trust L3, the router interfaces do not need VLAN encapsulation (dot1q): the priority is carried in the DSCP field of the IP header.

     

    Tor-1

    switch (config) # ip routing vrf default

     

    switch (config) # interface ethernet 1/1 no switchport force

    switch (config) # interface ethernet 1/1 ip address 1.1.2.2 255.255.255.0

     

    switch (config) # vlan 5

    switch (config) # interface vlan 5

    switch (config) # interface vlan 5 ip address 1.1.5.1 255.255.255.0

    switch (config) # interface ethernet 1/5 switchport mode trunk

     

    switch (config) # interface loopback 1

    switch (config) # interface loopback 1 ip address 127.1.1.13 255.255.255.255

     

    Tor-2

    switch (config) # ip routing vrf default

     

    switch (config) # interface ethernet 1/1 no switchport force

    switch (config) # interface ethernet 1/1 ip address 1.1.1.2 255.255.255.0

     

    switch (config) # vlan 6

    switch (config) # interface vlan 6

    switch (config) # interface vlan 6 ip address 1.1.6.1 255.255.255.0

    switch (config) # interface ethernet 1/6 switchport mode trunk

     

    switch (config) # interface loopback 1

    switch (config) # interface loopback 1 ip address 127.1.1.14 255.255.255.255

     

    Spine

    switch (config) # ip routing vrf default

     

    switch (config) # interface ethernet 1/1 no switchport force

    switch (config) # interface ethernet 1/1 ip address 1.1.1.1 255.255.255.0

     

    switch (config) # interface ethernet 1/2 no switchport force

    switch (config) # interface ethernet 1/2 ip address 1.1.2.1 255.255.255.0

     

    switch (config) # interface loopback 1

    switch (config) # interface loopback 1 ip address 127.1.1.11 255.255.255.255

     

    4. Enable Open Shortest Path First (OSPF) on the switches.

     

    Tor-1

    switch (config) # protocol ospf

    switch (config) # router ospf 1 vrf default

    switch (config) # router ospf 1 vrf default router-id 127.1.1.13

    switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0

    switch (config) # interface vlan 5 ip ospf area 0.0.0.0

     

    Tor-2

    switch (config) # protocol ospf

    switch (config) # router ospf 1 vrf default

    switch (config) # router ospf 1 vrf default router-id 127.1.1.14

    switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0

    switch (config) # interface vlan 6 ip ospf area 0.0.0.0

     

    Spine

    switch (config) # protocol ospf

    switch (config) # router ospf 1 vrf default

    switch (config) # router ospf 1 vrf default router-id 127.1.1.11

    switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0

    switch (config) # interface ethernet 1/2 ip ospf area 0.0.0.0

     

    5. Change the speed of the relevant port (in this example, Tor-2 port 1/6) to 10G to create synthetic congestion between the servers. Run:

    switch (config) # interface ethernet 1/6 speed 10000 force



    6. On each server (S5 and S6), add a route to the far-end network through the local ToR.

    For server S5:

    # ip route add 1.1.0.0/16 via 1.1.5.1



    For server S6:

    # ip route add 1.1.0.0/16 via 1.1.6.1



    7. Check L3 connectivity by pinging between the servers.

     

    At this point, ping should be running between the servers.

     

    Setup QoS (Servers)

    Run the following on both servers:

     

    1. Make sure PFC is disabled on the adapters by running:

    # mlnx_qos -i eth1 --pfc 0,0,0,0,0,0,0,0

     

    For additional information see mlnx_qos.

     

    Another way to achieve the same thing is to use lldptool.

    # service lldpad stop

     

     

    For more details on the Priority-based Flow Control (PFC) configuration, see HowTo Configure PFC on ConnectX-4.


    2. Enable ECN on priority 3.

    # echo 1 > /sys/class/net/eth1/ecn/roce_np/enable/3

    # echo 1 > /sys/class/net/eth1/ecn/roce_rp/enable/3

     

    Note: This command is not persistent.

     

    For more details on ECN, see HowTo Configure DCQCN (RoCE CC) for ConnectX-4 (Linux).

     

    3. Set the egress DSCP value of CNP packets to 48.

    # echo 48 > /sys/class/net/eth1/ecn/roce_np/cnp_dscp

     

    Note: This command is not persistent.

     

    4. Enable ECN for TCP traffic:

    # sysctl -w net.ipv4.tcp_ecn=1

    net.ipv4.tcp_ecn = 1

     

    Note: This command is not persistent.

     

    5. Set the RoCE mode to V2 for RDMA CM traffic.

    # cma_roce_mode -d mlx5_0 -p 1 -m 2

    For more details, see HowTo Set the Default RoCE Mode When Using RDMA CM.

     

    6. Set the default ToS to 24 (DSCP 6) mapped to skprio 4.

    # cma_roce_tos -d mlx5_0 -t 24

    For more information, see HowTo Set Egress ToS/DSCP on RDMA-CM QPs.
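
The ToS value 24 corresponds to DSCP 6 because DSCP occupies the upper six bits of the ToS byte, while the lower two bits carry ECN. The following sketch shows the conversion in both directions. The function names are hypothetical helpers and are not part of MLNX_OFED.

```shell
#!/bin/sh
# DSCP is the upper six bits of the ToS byte; the lower two bits are ECN.
tos_to_dscp() { echo $(( $1 >> 2 )); }
dscp_to_tos() { echo $(( $1 << 2 )); }

tos_to_dscp 24    # RDMA traffic: ToS 24 is DSCP 6
dscp_to_tos 48    # CNP traffic: DSCP 48 is ToS 192
```
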

     

    7. Set the egress priority mapping (skprio 4 mapped to L2 priority 3) as follows:

    # vconfig set_egress_map eth1.5 4 3

     

    In this example, TCP is sent over priority 0, so you do not need to change the default setting.

    See HowTo Set Egress Priority VLAN on Linux for more options.

     

    Setup QoS (Switch)

    1. Make sure PFC is disabled on the switch by running:

    # no dcb priority-flow-control enable force

     

    2. Enable ECN on traffic class 3 and configure CNP with egress strict priority as follows:

    # interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

    # interface ethernet 1/1-1/32 traffic-class 0 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

    # interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict

     

    3. Set up the buffer configuration with all traffic mapped to the same pool.

    # pool ePool0 direction egress-mc size 10485760 type dynamic

    # pool iPool0 direction ingress size 10485760 type dynamic

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg6 map pool iPool0 type lossy reserved 20480 shared alpha 8

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg3 map pool iPool0 type lossy reserved 20480 shared alpha 8

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg0 bind switch-priority 0          <This is the default configuration>

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg6 bind switch-priority 6

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg3 bind switch-priority 3

     

    Note: Lossy traffic (TCP or any other background traffic) is buffered in lossy pool0, which does not require that you have any additional reserved buffers. A buffer with 20KB is reserved per port by default (hidden).

     

    4. Set Trust L3 on all ports. The priority is specified in the DSCP field.

    switch (config) # interface ethernet 1/1-1/32 qos trust L3

    5. Map the DSCP priority to the proper switch priority on all ports.

    switch (config) # interface ethernet 1/1-1/32 qos map dscp 6 to switch-priority 3

    switch (config) # interface ethernet 1/1-1/32 qos map dscp 48 to switch-priority 6   <This is the default>
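
To make the effect of the two map commands explicit, the sketch below models the resulting lookup. It assumes that the default Spectrum mapping assigns each block of eight DSCP values to one switch priority (DSCP divided by 8), which is consistent with DSCP 48 mapping to switch priority 6 by default; verify this against your MLNX-OS release. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical model of the DSCP to switch-priority lookup in this setup.
# Assumption: the default map is switch-priority = DSCP / 8.
dscp_to_switch_prio() {
    case $1 in
        6) echo 3 ;;                 # explicit map configured in step 5
        *) echo $(( $1 / 8 )) ;;     # assumed default mapping
    esac
}

for dscp in 0 6 48; do
    echo "DSCP $dscp -> switch-priority $(dscp_to_switch_prio "$dscp")"
done
```

Under this assumption, DSCP 6 would share switch priority 0 with TCP traffic if the explicit entry were missing, which is why only the RDMA mapping needs to be changed.
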

     

     

    Test the RDMA Layer

    1. Get the GID index using show_gids.

     

    In this example, we want to use eth1.5 over RoCE v2. To do that, we need to use GID INDEX 5.

    # show_gids

    DEV    PORT INDEX     GID                                        IPv4            VER DEV

    ---    ---- -----     ---                                        ------------    --- ---

    mlx5_0  1    0       fe80:0000:0000:0000:e61d:2dff:fef2:a488                     v1  eth1

    mlx5_0  1    1       fe80:0000:0000:0000:e61d:2dff:fef2:a488                     v2  eth1

    mlx5_0  1    2       0000:0000:0000:0000:0000:ffff:0101:0105       1.1.1.5       v1  eth1

    mlx5_0  1    3       0000:0000:0000:0000:0000:ffff:0101:0105       1.1.1.5       v2  eth1

    mlx5_0  1    4       0000:0000:0000:0000:0000:ffff:0202:0205       1.1.5.2       v1  eth1.5

    mlx5_0  1    5       0000:0000:0000:0000:0000:ffff:0202:0205       1.1.5.2       v2  eth1.5
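
Instead of reading the index from the table by eye, the lookup can be scripted. The sketch below assumes the column layout shown above, where the last column is the net device and the one before it is the RoCE version. The helper name gid_index_for is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read show_gids output on stdin and print the GID
# index whose net device and RoCE version match the arguments.
# Matching on the last two columns also works for rows without an IPv4 address.
gid_index_for() {
    awk -v dev="$1" -v ver="$2" '$NF == dev && $(NF-1) == ver { print $3 }'
}

# Usage on a server with MLNX_OFED installed:
#   show_gids | gid_index_for eth1.5 v2
```
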

     

    2. Run performance benchmarks. It is recommended that you use multiple QPs (multiple threads).

     

    For example, run the ib_send_bw (or any other InfiniBand test from the Perftest Package) server on the host with the 10Gb/s link.

     

    Use -T 24 to set the ToS to 24 (DSCP 6). Run:

     

    # for i in {0..7} ; do taskset -c $i ib_send_bw -R -x 5 -d mlx5_0 -F --report_gbits -f 2 -D 10 -T 24 -p $((10000+i)) & done

     

    Run the client on the other host:

    # for i in {0..7} ; do taskset -c $i ib_send_bw -R  -x 5 -d mlx5_0 -F --report_gbits -f 2 -D 10 -T 24 1.1.5.2 -p $((10000+i)) & done  | grep 65536 | awk '{sum+=$4} END {print sum}'
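
The pipeline at the end of the client command adds up the results of the eight processes. The sketch below isolates that step. It assumes the perftest result layout in which the fourth column of a result row is the average bandwidth, and the input rows are made-up values for illustration only, not measured results. The function name sum_bw is hypothetical.

```shell
#!/bin/sh
# Sum the fourth column (average bandwidth) of the 65536-byte result rows.
sum_bw() {
    grep 65536 | awk '{ sum += $4 } END { print sum }'
}

# Made-up result rows from two processes, for illustration only.
printf '%s\n' \
    ' 65536      1000      0.00      1.25      0.002' \
    ' 65536      1000      0.00      1.25      0.002' | sum_bw
```
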

     

    3. Verify that your setup is correct. You should add a mirroring port on the switch and copy the traffic to a 3rd server (refer to the figure in the setup section).

     

    To learn more about switch mirroring, see HowTo Configure Port Mirroring on Mellanox Ethernet Switches.

    For example, run the following on one of the switches to check the mirroring session on the relevant port:

    switch (config) # show monitor session 1

     

    Session 1

    Admin:  Enable

    Status: Up

    Truncate:   Disable

    Destination interface: eth1/3

    Congestion type: drop-excessive-frames

    Header format: local

               -switch priority: 0

     

    Source interfaces

    Interface  direction

    --------------------------

    eth1/1     both

     

    Run Wireshark on the monitor server, and make sure that:

    1. RDMA traffic is being sent with DSCP 6 (as configured).

    2. RoCE v2 is used, that is, the traffic is sent to the RoCE UDP port (4791).

    3. If there is congestion in the network, you should see CNP traffic (opcode 0x81) with DSCP value 48 (0x30), as configured. To understand the packet format, refer to RoCEv2 CNP Packet Format Example.

     

     

    See the attached wireshark capture (below).
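
The same checks can be narrowed down with a capture filter. The sketch below builds a pcap filter expression that matches RoCE v2 packets (UDP destination port 4791) carrying a given DSCP value. The helper name and the interface name in the usage comment are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: build a pcap filter for RoCE v2 packets with a given DSCP.
# ip[1] is the IPv4 ToS byte; masking with 0xfc keeps the six DSCP bits,
# so the comparison value is the DSCP shifted left by two.
roce_dscp_filter() {
    echo "udp dst port 4791 and ip[1] & 0xfc == $(( $1 << 2 ))"
}

roce_dscp_filter 6     # RDMA data traffic
roce_dscp_filter 48    # CNP packets

# Usage on the monitor server (interface name is an assumption):
#   tcpdump -i eth1 -nn "$(roce_dscp_filter 48)"
```
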

     

     

    Other Considerations

    TCP Flows

    If you plan to run other traffic types such as TCP, make sure that you use other DSCP values (for example DSCP 0).

     

    Egress Scheduling (QoS)

    When you send more traffic types, you can set different weights for different traffic flows (for example 60% RDMA, 40% TCP). Check the mlnx_qos tool for the servers, and refer to Understanding TC Scheduling on Mellanox Spectrum Switches (WRR, SP) that describes the switch configurations.

     

    Other Switch Running Configuration Examples

    See the attachment for other switch configuration examples, such as Cisco Nexus 3132Q, 3132C and Arista DCS 7050QX-32-F.

     

    Debugging ECN and PFC

     

    Switch Counters and Buffer Levels

    1. Check the port counters on the switch by running:

    # show interface ethernet 1/1 counters

    The RoCE Congestion Control mechanism controls the sender rate as per egress port congestion thresholds, which reduces pauses being sent toward senders dramatically. This can help the adapter queues to release traffic smoothly.

     

    2. Check ECN marking.

    # show interfaces ethernet 1/1 congestion-control

    The ECN marked packets on traffic class 3 should increase towards the 10G port due to congestion.

     

    3. Check the buffer status by running:

    # show buffer status interfaces ethernet 1/1

     

    Make sure that you send RDMA traffic on priority 3, CNP traffic on priority 6, and TCP on priority 0 (check the MaxUsage column for the corresponding priority groups).

     

    4. Check the QoS counters for PFC, PG, TC and switch priority. For details, see QoS Counters on Mellanox Spectrum Switches (PFC, PG, TC, Switch-priority).

     

    Server Counters

    1. Get port priority counters on priority 0 (TCP) and 3 (RDMA) by running:

    # watch -n 1 "ethtool -S eth1 | grep prio"
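
The grep above shows the counters of all eight priorities. To watch only priorities 0 and 3, the output can be filtered further. The sketch below assumes that the per-priority counter names contain "prio<N>_" (for example rx_prio3_bytes); check the names reported by your driver. The helper name prio_counters is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: keep only the counters of the given priorities.
# Builds an extended regex such as "prio0_|prio3_" from the arguments.
prio_counters() {
    pattern=$(printf 'prio%s_|' "$@")
    grep -E "${pattern%|}"
}

# Usage: ethtool -S eth1 | prio_counters 0 3
```
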

     

    2. To get CNP counters refer to HowTo Read CNP Counters on Mellanox adapters.

     

    Startup Scripts and Running Config

     

    To simplify things, copy and paste these startup scripts to the servers and switches, and adjust as needed.

     

    Server Startup Script example

    echo 1 > /sys/class/net/eth1/ecn/roce_np/enable/3

    echo 1 > /sys/class/net/eth1/ecn/roce_rp/enable/3

    echo 48 > /sys/class/net/eth1/ecn/roce_np/cnp_dscp

    sysctl -w net.ipv4.tcp_ecn=1

    # route add -net <network> netmask <mask> gw <gateway IP> 

    cma_roce_mode -d mlx5_0 -p 1 -m 2

    cma_roce_tos -d mlx5_0 -t 24

    vconfig set_egress_map eth1.5 4 3

     

    Switch Running Config

    See attached.

     

    Troubleshooting

    1. Ensure that RoCE v2 is used: check with Wireshark that RoCE is being sent over UDP.

    2. Make sure that you send the packets over the VLAN interface with the proper priority.

    3. Use Trust L3 across the network (the priority is taken from the DSCP field in the L3 header).

    4. Make sure that CNP packets are being sent: check the counters on the servers, and verify in Wireshark that the CNP packets are actually on the wire.

    5. Verify that RDMA traffic is being sent with the right priority.