HowTo Configure Resilient RoCE (ECN) End-to-End Using ConnectX-4 and Spectrum (Trust L3)

Version 7

    The following procedure explains how to configure Resilient RoCE in a basic setup using ECN, and "trust L3" QoS (DSCP based QoS).

    This is a general overview of the running-configuration. For a more detailed explanation, refer to other posts related to ECN and RoCE Congestion Control at: RDMA/RoCE Solutions.

     

    Note: This post is very similar to HowTo Configure Resilient RoCE (ECN) End-to-End Using ConnectX-4 and Spectrum (Trust L2)

    The main differences are:

    - Priority in the network is taken from the L3 DSCP bits and not from the L2 PCP bits. The switches are configured with Trust L3 QoS.

    - No need for VLAN encapsulation on the router ports (between Spine and ToR).

    The rest is the same.

     


    Setup

    In this example, one server is equipped with a ConnectX-4 Lx (10G) adapter and another server with a ConnectX-4 (100G) adapter; both are connected through three Spectrum switches.

    The plan is to create synthetic congestion between the servers, while the 100G interface will send traffic to the 10G interface via the switch.

    • PFC is disabled on the servers and the switches.
    • ECN is enabled on both servers and the switches.
    • All traffic is mapped to switch buffer pool 0.
    • RDMA traffic runs over L3 DSCP 6.
    • CNP egress DSCP is set to 48.
    • Non-RDMA traffic (e.g. TCP) runs over L3 DSCP 0.
    • There is no need for a VLAN header between the router ports, as the priority is taken from the DSCP field in the IP header.
    • Make sure both servers are installed with the latest MLNX_OFED and the switches are installed with the latest MLNX-OS software.

     

    Note: L2 Priority (PCP) value on the packet is not relevant in this procedure as we are using Trust L3. To learn more about trust, see Understanding QoS Classification (Trust) on Spectrum Switches.

     


    Configuration

     

    L3 Setup Configuration

    This section is a prerequisite and shows how to set L3 connectivity between the servers via the switches.

     

    1. For easier setup, change the default interface name to eth1, see the procedure here: HowTo Change Network Interface Name in Linux Permanently.

     

    2. Create a VLAN interface on each server.

     

    a. Enable the 8021q Linux kernel module. Run on the Linux server:

    # modprobe 8021q

     

    b. Create the interface configuration file on each server (each with its own VLAN ID and IP address).

    Server S5 for example:

    # cat  /etc/sysconfig/network-scripts/ifcfg-eth1.5

     

    VLAN=yes

    TYPE=Vlan

    DEVICE=eth1.5

    PHYSDEV=eth1

    VLAN_ID=5

    REORDER_HDR=0

    BOOTPROTO=static

    DEFROUTE=yes

    IPV4_FAILURE_FATAL=no

    NAME=eth1.5

    ONBOOT=yes

    IPADDR=1.1.5.2

    NETMASK=255.255.255.0

    NM_CONTROLLED=no

     

    Server S6:

    # cat  /etc/sysconfig/network-scripts/ifcfg-eth1.6

     

    VLAN=yes

    TYPE=Vlan

    DEVICE=eth1.6

    PHYSDEV=eth1

    VLAN_ID=6

    REORDER_HDR=0

    BOOTPROTO=static

    DEFROUTE=yes

    IPV4_FAILURE_FATAL=no

    NAME=eth1.6

    ONBOOT=yes

    IPADDR=1.1.6.2

    NETMASK=255.255.255.0

    NM_CONTROLLED=no

     

    3. Set Routing ports on the switches.

     

    In this example, the ToR links to the hosts are configured with VLAN interfaces (VLANs 5 and 6), while the links between the ToRs and the Spine are configured as router ports.

    Because the network uses Trust L3, there is no need for dot1q encapsulation on the router ports: the priority is carried in the DSCP field of the IP header.

     

    ToR-1

    switch (config) # ip routing vrf default

     

    switch (config) # interface ethernet 1/1 no switchport force

    switch (config) # interface ethernet 1/1 ip address 1.1.2.2 255.255.255.0

     

    switch (config) # vlan 5

    switch (config) # interface vlan 5

    switch (config) # interface vlan 5 ip address 1.1.5.1 255.255.255.0

    switch (config) # interface ethernet 1/5 switchport mode trunk

     

    switch (config) # interface loopback 1

    switch (config) # interface loopback 1 ip address 127.1.1.13 255.255.255.255

     

    ToR-2

    switch (config) # ip routing vrf default

     

    switch (config) # interface ethernet 1/1 no switchport force

    switch (config) # interface ethernet 1/1 ip address 1.1.1.2 255.255.255.0

     

    switch (config) # vlan 6

    switch (config) # interface vlan 6

    switch (config) # interface vlan 6 ip address 1.1.6.1 255.255.255.0

    switch (config) # interface ethernet 1/6 switchport mode trunk

     

    switch (config) # interface loopback 1

    switch (config) # interface loopback 1 ip address 127.1.1.14 255.255.255.255

     

    Spine

    switch (config) # ip routing vrf default

     

    switch (config) # interface ethernet 1/1 no switchport force

    switch (config) # interface ethernet 1/1 ip address 1.1.1.1 255.255.255.0

     

    switch (config) # interface ethernet 1/2 no switchport force

    switch (config) # interface ethernet 1/2 ip address 1.1.2.1 255.255.255.0

     

    switch (config) # interface loopback 1

    switch (config) # interface loopback 1 ip address 127.1.1.11 255.255.255.255

     

    4. Enable OSPF on the switches.

     

    ToR-1

    switch (config) # protocol ospf

    switch (config) # router ospf 1 vrf default

    switch (config) # router ospf 1 vrf default router-id 127.1.1.13

    switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0

    switch (config) # interface vlan 5 ip ospf area 0.0.0.0

     

    ToR-2

    switch (config) # protocol ospf

    switch (config) # router ospf 1 vrf default

    switch (config) # router ospf 1 vrf default router-id 127.1.1.14

    switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0

    switch (config) # interface vlan 6 ip ospf area 0.0.0.0

     

    Spine

    switch (config) # protocol ospf

    switch (config) # router ospf 1 vrf default

    switch (config) # router ospf 1 vrf default router-id 127.1.1.11

    switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0

    switch (config) # interface ethernet 1/2 ip ospf area 0.0.0.0

     

    5. Change the speed of the relevant port (in our example, ToR-2 port 1/6) to 10G, in order to create synthetic congestion between the servers. Run:

    switch (config) # interface ethernet 1/6 speed 10000 force

     

    6. Set a route to the far-end server on S5 and S6.

    For server S5:

    # ip route add 1.1.0.0/16 via 1.1.5.1

     

    For server S6:

    # ip route add 1.1.0.0/16 via 1.1.6.1
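    The single /16 route above covers every subnet used in this setup (1.1.1.0/24, 1.1.2.0/24, 1.1.5.0/24, 1.1.6.0/24). A quick sanity check with Python's ipaddress module:

```python
import ipaddress

# The aggregate route configured on both servers.
route = ipaddress.ip_network("1.1.0.0/16")

# Addresses used in this setup: server VLAN IPs and router links.
addrs = ["1.1.5.2", "1.1.6.2", "1.1.1.1", "1.1.2.2"]

for a in addrs:
    assert ipaddress.ip_address(a) in route

print("all addresses covered by 1.1.0.0/16")
```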

     

    7. Check L3 connectivity (ping between the servers).

     

    At this point, ping should be running between the servers.

     

    Setup QoS (Servers)

    Run the following on both servers:

     

    1. Make sure PFC is disabled on the adapters. Run:

    # mlnx_qos -i eth1 --pfc 0,0,0,0,0,0,0,0

     

    see also mlnx_qos.

     

    Another option is to use lldptool. In this setup, it is enough to stop the lldpad service so it does not override the configuration:

    # service lldpad stop

     

     

    For more details on PFC configuration, see HowTo Configure PFC on ConnectX-4 .


    2. Enable ECN on priority 3. Run:

    # echo 1 > /sys/class/net/eth1/ecn/roce_np/enable/3

    # echo 1 > /sys/class/net/eth1/ecn/roce_rp/enable/3

     

    Note: This command is not persistent.

     

    For more details on ECN, see HowTo Configure ECN for ConnectX-4 (Linux)
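    The two echo commands above enable the notification-point (roce_np) and reaction-point (roce_rp) sides of ECN for one priority. A small sketch that builds the equivalent sysfs paths for any interface and priority; the path layout is the one shown above, and actually writing to the files requires root and the mlx5 driver:

```python
def ecn_sysfs_paths(iface, prio):
    """Return the roce_np/roce_rp enable paths for a given priority."""
    base = f"/sys/class/net/{iface}/ecn"
    return [f"{base}/roce_np/enable/{prio}",
            f"{base}/roce_rp/enable/{prio}"]

for path in ecn_sysfs_paths("eth1", 3):
    print(path)
    # On a live system (as root): open(path, "w").write("1")
```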

     

    3. Set the CNP L3 egress DSCP value to 48. Run:

    # echo 48 > /sys/class/net/eth1/ecn/roce_np/cnp_dscp

     

    Note: This command is not persistent.

     

    4. Enable ECN for TCP traffic:

    # sysctl -w net.ipv4.tcp_ecn=1

    net.ipv4.tcp_ecn = 1

     

    Note: This command is not persistent.

     

    5. Set RoCE mode to V2 for RDMA CM traffic.

    # cma_roce_mode -d mlx5_0 -p 1 -m 2

    For more details, see HowTo Set the Default RoCE Mode When Using RDMA CM.

     

    6. Set default ToS to 24 (DSCP 6) mapped to skprio 4.

    # cma_roce_tos -d mlx5_0 -t 24

    For more info, see HowTo Set Egress ToS/DSCP on RDMA-CM QPs.
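    The ToS value passed to cma_roce_tos carries the DSCP in its upper six bits (DSCP = ToS >> 2), which is why ToS 24 corresponds to DSCP 6 and the CNP DSCP of 48 corresponds to a ToS byte of 192:

```python
def dscp_to_tos(dscp):
    # DSCP occupies the upper 6 bits of the 8-bit ToS field.
    return dscp << 2

def tos_to_dscp(tos):
    return tos >> 2

assert dscp_to_tos(6) == 24    # RDMA traffic in this post
assert tos_to_dscp(24) == 6
assert dscp_to_tos(48) == 192  # CNP traffic (DSCP 48)
print(hex(48))                 # 0x30, the DSCP value seen in wireshark
```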

     

    7. Set the egress priority map (skprio 4 mapped to L2 priority 3).

    # vconfig set_egress_map eth1.5 4 3

     

    In this example, TCP is sent over priority 0, so there is no need to change the default.

    See HowTo Set Egress Priority VLAN on Linux for more options.

     

    Setup QoS (Switch)

    1. Make sure PFC is disabled on the switch. Run:

    # no dcb priority-flow-control enable force

     

    2. Enable ECN on traffic classes 0 and 3, and configure CNP traffic (traffic class 6) with strict egress priority. Run:

    # interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

    # interface ethernet 1/1-1/32 traffic-class 0 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

    # interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict
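    ECN marking between the minimum and maximum thresholds is typically described as a WRED-style linear ramp: nothing is marked below the minimum, everything above the maximum. An illustrative model with the 150/1500 values from the commands above; the switch's exact marking function is an implementation detail, so treat this only as a mental model:

```python
def ecn_mark_probability(queue_depth, minimum=150, maximum=1500):
    """Illustrative WRED-style ramp: 0 below min, 1 above max, linear between."""
    if queue_depth <= minimum:
        return 0.0
    if queue_depth >= maximum:
        return 1.0
    return (queue_depth - minimum) / (maximum - minimum)

assert ecn_mark_probability(100) == 0.0    # uncongested: no marking
assert ecn_mark_probability(2000) == 1.0   # saturated: every packet marked
print(ecn_mark_probability(825))           # midpoint of the ramp
```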

     

    3. Set up the buffer configuration.

    All traffic is mapped to the same pool.

    # pool ePool0 direction egress-mc size 10485760 type dynamic

    # pool iPool0 direction ingress size 10485760 type dynamic

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg6 map pool iPool0 type lossy reserved 20480 shared alpha 8

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg3 map pool iPool0 type lossy reserved 20480 shared alpha 8

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg0 bind switch-priority 0          <This is the default configuration>

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg6 bind switch-priority 6

    # interface ethernet 1/1-1/32 ingress-buffer iPort.pg3 bind switch-priority 3

     

    Note: Lossy traffic (TCP or any other background traffic) is buffered in lossy pool 0, which does not require any additional reserved buffer. 20KB of buffer is reserved by default per port (hidden).
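    The shared-buffer alpha value controls a dynamic threshold: a port group may consume up to alpha/(1+alpha) of the currently free pool space. This is the commonly documented dynamic-threshold formula; the switch's exact accounting may differ, so treat the numbers as illustrative. With alpha 8 as configured above, each port group can take up to 8/9 of the free pool:

```python
def dynamic_threshold(free_bytes, alpha):
    # Dynamic shared-buffer threshold: alpha / (1 + alpha) of free pool space.
    return free_bytes * alpha / (1 + alpha)

pool = 10485760  # iPool0 size from the configuration above
print(int(dynamic_threshold(pool, 8)))  # ~8/9 of the pool when it is empty
```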

     

    4. Set Trust L3 on all ports. The priority will be taken from the DSCP field.

    switch (config) # interface ethernet 1/1-1/32 qos trust L3

    5. Map the DSCP priority to the proper switch priority on all ports.

    switch (config) # interface ethernet 1/1-1/32 qos map dscp 6 to switch-priority 3

    switch (config) # interface ethernet 1/1-1/32 qos map dscp 48 to switch-priority 6   <This is the default>

     

     

    Test the RDMA Layer

    1. Get the GID index using show_gids.

     

    In this example, we want to use eth1.5 over RoCE v2. To do that, we need to use GID index 5.

    # show_gids

    DEV    PORT INDEX     GID                                        IPv4            VER DEV

    ---    ---- -----     ---                                        ------------    --- ---

    mlx5_0  1    0       fe80:0000:0000:0000:e61d:2dff:fef2:a488                     v1  eth1

    mlx5_0  1    1       fe80:0000:0000:0000:e61d:2dff:fef2:a488                     v2  eth1

    mlx5_0  1    2       0000:0000:0000:0000:0000:ffff:0101:0105       1.1.1.5       v1  eth1

    mlx5_0  1    3       0000:0000:0000:0000:0000:ffff:0101:0105       1.1.1.5       v2  eth1

    mlx5_0  1    4       0000:0000:0000:0000:0000:ffff:0101:0502       1.1.5.2       v1  eth1.5

    mlx5_0  1    5       0000:0000:0000:0000:0000:ffff:0101:0502       1.1.5.2       v2  eth1.5
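    The non-link-local entries in the table above are IPv4-mapped GIDs (::ffff:a.b.c.d). A short helper that recovers the dotted IPv4 address from a GID string, useful for double-checking which GID index belongs to which interface:

```python
def gid_to_ipv4(gid):
    """Decode an IPv4-mapped RoCE GID (::ffff:a.b.c.d) to dotted-quad form."""
    groups = gid.split(":")
    if groups[5].lower() != "ffff":
        return None  # not IPv4-mapped (e.g. a link-local fe80:: GID)
    hi, lo = int(groups[6], 16), int(groups[7], 16)
    return f"{hi >> 8}.{hi & 0xff}.{lo >> 8}.{lo & 0xff}"

print(gid_to_ipv4("0000:0000:0000:0000:0000:ffff:0101:0105"))  # 1.1.1.5
```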

     

    2. Run performance benchmarks. It is recommended to use multiple QPs (multiple threads).

     

    For example, run the ib_send_bw (or any other IB test from Perftest Package) server on the host with the 10Gb/s link.

     

    Use -T 24 to set the ToS to 24 (DSCP 6), matching the QoS configuration above. Run:

     

    # for i in {0..7} ; do taskset -c $i ib_send_bw -R -x 5 -d mlx5_0 -F --report_gbits -f 2 -D 10 -T 24 -p $((10000+i)) & done

     

    Run the client on the other host:

    # for i in {0..7} ; do taskset -c $i ib_send_bw -R  -x 5 -d mlx5_0 -F --report_gbits -f 2 -D 10 -T 24 1.1.5.2 -p $((10000+i)) & done  | grep 65536 | awk '{sum+=$4} END {print sum}'
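    The grep/awk pipeline above sums the average-bandwidth column of the 65536-byte result lines across the eight parallel clients. The same aggregation as a hedged Python sketch; perftest's exact column layout can vary between versions, so the field index and the sample lines below are assumptions for illustration:

```python
def sum_bandwidth(report_lines, msg_size="65536", bw_column=3):
    """Sum the average-bandwidth column of perftest result lines
    for a given message size (bw_column is a 0-based field index)."""
    total = 0.0
    for line in report_lines:
        fields = line.split()
        if fields and fields[0] == msg_size:
            total += float(fields[bw_column])
    return total

sample = [
    "65536   1000   9.21   9.18   0.017",  # hypothetical per-process results
    "65536   1000   0.62   0.60   0.001",
]
print(sum_bandwidth(sample))  # aggregate Gb/s across the processes
```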

     

    3. To make sure everything works, it is worth adding a mirroring port on the switch and copying the traffic to a third server (see the figure at the top of the post).

     

    To learn more about switch mirroring, see HowTo Configure Port Mirroring on Mellanox Ethernet Switches.

    For example, run the following on one of the switches, on the relevant port:

    switch (config) # show monitor session 1

     

    Session 1

    Admin:  Enable

    Status: Up

    Truncate:   Disable

    Destination interface: eth1/3

    Congestion type: drop-excessive-frames

    Header format: local

               -switch priority: 0

     

    Source interfaces

    Interface  direction

    --------------------------

    eth1/1     both

     

    Run wireshark on the monitor server and verify the following:

    1. RDMA traffic is being sent with DSCP 6 (as configured).

    2. RoCE v2 is used (RoCE UDP destination port 4791).

     

     

    3. If there is congestion in the network, you should be able to see CNP traffic (opcode 0x81; to understand the packet format, see RoCEv2 CNP Packet Format Example) with DSCP value 48 (0x30), as configured.

     

     

    See attached wireshark capture (below).

     

     

    Other Considerations

    TCP Flows

    If you plan to run other traffic types such as TCP, make sure to use other DSCP values for them (e.g. DSCP 0).

     

    Egress Scheduling (QoS)

    When you send more traffic types, you can set different weights for different traffic flows (e.g. 60% RDMA, 40% TCP). Check the mlnx_qos tool for the servers, and see also Understanding TC Scheduling on Mellanox Spectrum Switches (WRR, SP) for the switch configuration.

     

    Other Switch Running Configuration Examples

    See attached below other switch configuration examples, such as Cisco Nexus 3132Q, 3132C and Arista DCS 7050QX-32-F.

     

    Debugging ECN and PFC

     

    Switch Counters and Buffer Levels

    1. Check the port counters on the switch. Run:

    # show interface ethernet 1/1 counters

    The RoCE congestion control mechanism adjusts the sender rate according to the egress port congestion thresholds, dramatically reducing the backpressure sent toward the senders and helping the adapter queues release traffic smoothly.

     

    2. Check ECN marking.

    # show interfaces ethernet 1/1 congestion-control

    The ECN marked packets on traffic class 3 should increase towards the 10G port due to congestion.

     

    3. Check the buffer status. Run:

    # show buffer status interfaces ethernet 1/1

     

    Make sure to send RDMA traffic on priority 3, CNP traffic on priority 6 and TCP on priority 0 (check the MaxUsage column for those port groups).

     

    4. Check the QoS counters for PFC, PG, TC and switch priority. For examples, see QoS Counters on Mellanox Spectrum Switches (PFC, PG, TC, Switch-priority).

     

    Server Counters

    1. Get port priority counters on priority 0 (TCP) and 3 (RDMA). Run:

    # watch -n 1 "ethtool -S eth1 | grep prio"
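    Hedged sketch: a parser for the per-priority lines that ethtool -S prints. Counter names such as prio3_bytes follow the mlx5 convention but are an assumption here; adjust the pattern to what your driver actually reports:

```python
import re

def prio_counters(ethtool_output):
    """Collect {counter_name: value} for per-priority ethtool counters."""
    counters = {}
    for line in ethtool_output.splitlines():
        m = re.match(r"\s*(prio\d+_\w+):\s*(\d+)", line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

sample = """\
     prio0_bytes: 1024
     prio3_bytes: 204800
     prio3_packets: 200
"""
print(prio_counters(sample))
```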

     

    2. To get CNP counters, refer to HowTo Read CNP Counters on Mellanox Adapters.

     

    Startup-scripts and Running Config

     

    To keep things simple, copy and paste these startup scripts to the servers and switches, adjusting as needed.

     

    Server startup-script example

    echo 1 > /sys/class/net/eth1/ecn/roce_np/enable/3

    echo 1 > /sys/class/net/eth1/ecn/roce_rp/enable/3

    echo 48 > /sys/class/net/eth1/ecn/roce_np/cnp_dscp

    sysctl -w net.ipv4.tcp_ecn=1

    # route add -net <network> netmask <mask> gw <gateway IP> 

    cma_roce_mode -d mlx5_0 -p 1 -m 2

    cma_roce_tos -d mlx5_0 -t 24

    vconfig set_egress_map eth1.5 4 3

     

    Switch running-config

    See attached.

     

    Troubleshooting

    1. Make sure that RoCEv2 is used; check with wireshark that RoCE is being sent over the UDP layer.

    2. Make sure the packets are sent over the VLAN interface with the proper priority.

    3. Use Trust L3 across the network (the priority is taken from the DSCP field in the L3 header).

    4. Make sure CNP packets are being sent: check the counters on the servers, and look for the actual CNP packets in wireshark.

    5. Verify that RDMA traffic is being sent on the right priority.