How To Configure Lossless RoCE (PFC + ECN) End-to-End Using ConnectX-4 and Spectrum (Trust L2)

Version 49

    This is a configuration guide for Lossless RoCE in a basic setup using PFC, ECN and "Trust L2" QoS (PCP based QoS).

    This is only a general overview of the running configuration. For more in-depth information, refer to the PFC/ECN and RoCE Congestion Control posts at RDMA/RoCE Solutions.

     

    Note:

    This post parallels HowTo Configure Resilient RoCE (ECN) End-to-End Using ConnectX-4 and Spectrum (Trust L2), with two main differences:

    - PFC is enabled in the network (switches and adapters)

    - Switch buffer configuration uses two pools

     


    Setup

    The setup in this guide consists of two servers connected to a Spectrum switch: one server is equipped with ConnectX-4 Lx (10G) and the other with ConnectX-4 (100G). RDMA traffic runs on priority 3.

    • PFC and ECN are enabled on both servers and on the switch.
    • Non-RDMA traffic (e.g. TCP) and CNP traffic are mapped to switch buffer pool 0.
    • RDMA traffic is mapped to switch buffer pool 1.
    • RDMA runs over L2 priority 3.
    • PFC is enabled on priority 3.
    • ECN is enabled for RDMA traffic (priority 3) and for TCP traffic (priority 0).
    • The CNP egress priority is set to 6.
    • Non-RDMA traffic (e.g. TCP) runs over priority 0.
    • Both servers must be installed with the latest MLNX_OFED, and the switch with the latest MLNX-OS software.

     

    The plan is to create synthetic congestion between the servers by sending traffic through the switch from the 100G interface to the 10G interface.

     

    Note: The packet DSCP value is not relevant to this process because we are using Trust L2.

    To learn more about trust, see Understanding QoS Classification (Trust) on Spectrum Switches.

     

     

     

     

     

     

    L3 Considerations

    If a larger setup with router ports is used, the priority should be preserved between the router ports, either by preserving the PCP (using 802.1q VLAN encapsulation) or by using the L3 DSCP field (Trust L3).

    The ingress priority should be mapped to the right switch-priority (e.g. RDMA traffic will be mapped to switch-priority 3).
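To see why 802.1q encapsulation preserves the priority, recall that the PCP occupies the top three bits of the 16-bit VLAN TCI field, followed by a 1-bit DEI and the 12-bit VLAN ID. The following is a minimal sketch that computes the TCI for a given priority and VLAN; the helper name vlan_tci is hypothetical and used only for illustration.

```shell
# vlan_tci PCP VID [DEI] -> prints the 16-bit 802.1Q TCI in hex.
# Layout: PCP (3 bits) | DEI (1 bit) | VLAN ID (12 bits).
# The function name is illustrative, not part of any Mellanox tool.
vlan_tci() {
    pcp=$1; vid=$2; dei=${3:-0}
    printf '0x%04x\n' $(( (pcp << 13) | (dei << 12) | vid ))
}

vlan_tci 3 5    # RDMA: priority 3, VLAN 5 -> 0x6005
vlan_tci 6 5    # CNP:  priority 6, VLAN 5 -> 0xc005
```

A tagged router port carries this field unchanged from hop to hop, which is what lets the next switch classify the packet with Trust L2.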

     

    Configuration

     

    L3 Setup Configuration

    This prerequisite section shows how to set up L3 connectivity between the servers via the switches.

    For an easier setup, change the default interface name to eth1 (see HowTo Change Network Interface Name in Linux Permanently) and create a VLAN interface on each server.

     

    1. To enable the 8021q Linux kernel module, run the following on each Linux server:

    # modprobe 8021q

     

    2. Add a VLAN interface configuration file on each server (with a different IP address).

    For example, on server S5:

    # cat  /etc/sysconfig/network-scripts/ifcfg-eth1.5

     

    VLAN=yes

    TYPE=Vlan

    DEVICE=eth1.5

    PHYSDEV=eth1

    VLAN_ID=5

    REORDER_HDR=0

    BOOTPROTO=static

    DEFROUTE=yes

    IPV4_FAILURE_FATAL=no

    NAME=eth1.5

    ONBOOT=yes

    IPADDR=1.1.5.2

    NETMASK=255.255.255.0

    NM_CONTROLLED=no

     

    Replicate the previous step on Server S6.

    # cat  /etc/sysconfig/network-scripts/ifcfg-eth1.6

     

    VLAN=yes

    TYPE=Vlan

    DEVICE=eth1.6

    PHYSDEV=eth1

    VLAN_ID=6

    REORDER_HDR=0

    BOOTPROTO=static

    DEFROUTE=yes

    IPV4_FAILURE_FATAL=no

    NAME=eth1.6

    ONBOOT=yes

    IPADDR=1.1.6.2

    NETMASK=255.255.255.0

    NM_CONTROLLED=no
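Because the two files differ only in the VLAN ID and the IP address, they can be generated from parameters instead of being edited by hand. This is a minimal sketch, assuming the same fixed netmask and options as the files above; the function name gen_vlan_ifcfg is hypothetical.

```shell
# gen_vlan_ifcfg PHYSDEV VLAN_ID IPADDR -> prints an ifcfg file body on stdout.
# Hypothetical helper; the option values mirror the example files in this guide.
gen_vlan_ifcfg() {
    physdev=$1; vid=$2; ip=$3
    cat <<EOF
VLAN=yes
TYPE=Vlan
DEVICE=${physdev}.${vid}
PHYSDEV=${physdev}
VLAN_ID=${vid}
REORDER_HDR=0
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
NAME=${physdev}.${vid}
ONBOOT=yes
IPADDR=${ip}
NETMASK=255.255.255.0
NM_CONTROLLED=no
EOF
}

# Example: print the file body for a server in VLAN 5.
gen_vlan_ifcfg eth1 5 1.1.5.2
```

To install the result, redirect the output to the matching file, for example /etc/sysconfig/network-scripts/ifcfg-eth1.5. Deriving DEVICE, VLAN_ID and NAME from one parameter keeps the three fields from drifting apart.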

     

    3. Set routing ports on the switches:

     

    The next steps configure the ToR links to the hosts with VLAN interfaces (VLANs 5 and 6), and the links between the ToRs and the spine as router interfaces.

    On the router interfaces, VLAN tagging is added (encapsulation dot1q) to preserve the L2 priority.

     

    Tor-1

    switch (config) # ip routing vrf default

     

    switch (config) # interface ethernet 1/1 no switchport force

    switch (config) # interface ethernet 1/1 encapsulation dot1q vlan 1 force

    switch (config) # interface ethernet 1/1 ip address 1.1.2.2 255.255.255.0

     

    switch (config) # vlan 5

    switch (config) # interface vlan 5

    switch (config) # interface vlan 5 ip address 1.1.5.1 255.255.255.0

    switch (config) # interface ethernet 1/5 switchport mode trunk

     

    switch (config) # interface loopback 1

    switch (config) # interface loopback 1 ip address 127.1.1.13 255.255.255.255

     

    Tor-2

    switch (config) # ip routing vrf default

     

    switch (config) # interface ethernet 1/1 no switchport force

    switch (config) # interface ethernet 1/1 encapsulation dot1q vlan 1 force

    switch (config) # interface ethernet 1/1 ip address 1.1.1.2 255.255.255.0

     

    switch (config) # vlan 6

    switch (config) # interface vlan 6

    switch (config) # interface vlan 6 ip address 1.1.6.1 255.255.255.0

    switch (config) # interface ethernet 1/6 switchport mode trunk

     

    switch (config) # interface loopback 1

    switch (config) # interface loopback 1 ip address 127.1.1.14 255.255.255.255

     

    Spine

    switch (config) # ip routing vrf default

     

    switch (config) # interface ethernet 1/1 no switchport force

    switch (config) # interface ethernet 1/1 encapsulation dot1q vlan 1 force

    switch (config) # interface ethernet 1/1 ip address 1.1.1.1 255.255.255.0

     

    switch (config) # interface ethernet 1/2 no switchport force

    switch (config) # interface ethernet 1/2 encapsulation dot1q vlan 6 force

    switch (config) # interface ethernet 1/2 ip address 1.1.2.1 255.255.255.0

     

    switch (config) # interface loopback 1

    switch (config) # interface loopback 1 ip address 127.1.1.11 255.255.255.255

     

    4. Enable OSPF on the switches:

     

    Tor-1

    switch (config) # protocol ospf

    switch (config) # router ospf 1 vrf default

    switch (config) # router ospf 1 vrf default router-id 127.1.1.13

    switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0

    switch (config) # interface vlan 5 ip ospf area 0.0.0.0

     

    Tor-2

    switch (config) # protocol ospf

    switch (config) # router ospf 1 vrf default

    switch (config) # router ospf 1 vrf default router-id 127.1.1.14

    switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0

    switch (config) # interface vlan 6 ip ospf area 0.0.0.0

     

    Spine

    switch (config) # protocol ospf

    switch (config) # router ospf 1 vrf default

    switch (config) # router ospf 1 vrf default router-id 127.1.1.11

    switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0

    switch (config) # interface ethernet 1/2 ip ospf area 0.0.0.0

     

    5. Change the speed of the relevant port (ToR2 port 1/6, to 10G in the example below):

    The intent is to create synthetic congestion between the servers.

    switch (config) # interface ethernet 1/6 speed 10000 force

     

    6. Set a route to the far-end server (on S5 and S6):

    For server S5:

    # ip route add 1.1.0.0/16 via 1.1.5.1

     

    For server S6:

    # ip route add 1.1.0.0/16 via 1.1.6.1

     

    7. Check L3 connectivity (ping between the servers).

     

    At this point, ping between the servers should succeed.
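The connectivity check can be scripted so that it is easy to repeat after each configuration change. This is a minimal sketch: the function name check_peer and the PING override variable are hypothetical, and the -c/-W options assume the iputils version of ping.

```shell
# check_peer ADDR -> prints "reachable: ADDR" or "unreachable: ADDR".
# PING can be overridden, which allows a dry run without a network.
PING=${PING:-ping}
check_peer() {
    if $PING -c 3 -W 2 "$1" >/dev/null 2>&1; then
        echo "reachable: $1"
    else
        echo "unreachable: $1"
        return 1
    fi
}
```

On a server, first run check_peer against the local gateway (for example check_peer 1.1.5.1 on S5) and then against the address of the far-end server. If the gateway answers but the far end does not, the problem is more likely in the routing between the switches than in the server VLAN interface.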

     

    Setup QoS (Servers)

    Run the following on both servers:

     

    1. Configure PFC on the adapters (enable PFC on priority 3 in the firmware):

    # mlnx_qos -i eth1 --pfc 0,0,0,1,0,0,0,0

     

    See also mlnx_qos.

     

    Another option is to use lldptool:

    # service lldpad start

    # lldptool -T -i eth1 -V PFC enabled=3

     

    For more details on PFC configuration, see HowTo Configure PFC on ConnectX-4.
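The --pfc argument of mlnx_qos is a list of eight 0/1 flags, one per priority from 0 to 7, so priority 3 is the fourth entry. A minimal sketch that builds this list from priority numbers follows; the helper name pfc_mask is hypothetical.

```shell
# pfc_mask PRIO... -> prints the 8-entry, comma-separated list used with
# "mlnx_qos --pfc": 1 for each listed priority, 0 for all others.
pfc_mask() {
    mask=""
    for p in 0 1 2 3 4 5 6 7; do
        bit=0
        for want in "$@"; do
            if [ "$want" -eq "$p" ]; then
                bit=1
            fi
        done
        mask="${mask}${mask:+,}${bit}"
    done
    echo "$mask"
}

pfc_mask 3      # prints 0,0,0,1,0,0,0,0
```

It could then be used as mlnx_qos -i eth1 --pfc "$(pfc_mask 3)", which avoids miscounting the position of the enabled priority.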


    2. Enable ECN on priority 3

    # echo 1 > /sys/class/net/eth1/ecn/roce_np/enable/3

    # echo 1 > /sys/class/net/eth1/ecn/roce_rp/enable/3

     

    Note: This command is not persistent.

     

    See HowTo Configure DCQCN (RoCE CC) for ConnectX-4 (Linux) for more details on ECN.
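Since the setting is lost on reboot and must be applied to both the notification point (roce_np) and the reaction point (roce_rp), wrapping the two writes in one function reduces the chance of enabling only one side. This is a minimal sketch; the function name enable_roce_ecn and the optional sysfs root argument are hypothetical additions for illustration.

```shell
# enable_roce_ecn IFACE PRIO [SYSFS_ROOT]
# Writes 1 to the roce_np and roce_rp enable files for one priority.
# SYSFS_ROOT defaults to /sys; it exists only to allow a dry run elsewhere.
enable_roce_ecn() {
    iface=$1; prio=$2; root=${3:-/sys}
    for role in roce_np roce_rp; do
        f="$root/class/net/$iface/ecn/$role/enable/$prio"
        if [ -w "$f" ]; then
            echo 1 > "$f"
        else
            echo "skipped (not writable): $f" >&2
            return 1
        fi
    done
}
```

On a server this would be called as enable_roce_ecn eth1 3, run as root. The function returns a non-zero status when a file is missing, which usually means the driver does not expose the ECN sysfs entries for that interface.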

     

    3. Set the CNP L2 egress priority to 6:

    # echo 6 > /sys/class/net/eth1/ecn/roce_np/cnp_802p_prio

     

    Note: This command is not persistent.

     

    4. Enable ECN for TCP traffic:

    # sysctl -w net.ipv4.tcp_ecn=1

    net.ipv4.tcp_ecn = 1

     

    Note: This command is not persistent.
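One common way to keep the TCP ECN setting across reboots is a sysctl drop-in file. This is a minimal sketch, assuming a distribution that reads /etc/sysctl.d at boot; the function name persist_tcp_ecn and the file name 90-tcp-ecn.conf are hypothetical.

```shell
# persist_tcp_ecn [CONF_FILE]
# Adds the setting to a sysctl drop-in file unless it is already present.
# The default file name is an illustrative choice, not a required one.
persist_tcp_ecn() {
    conf=${1:-/etc/sysctl.d/90-tcp-ecn.conf}
    line="net.ipv4.tcp_ecn = 1"
    if [ -f "$conf" ] && grep -qxF "$line" "$conf"; then
        return 0
    fi
    echo "$line" >> "$conf"
}
```

After running persist_tcp_ecn as root, sysctl --system reloads all drop-in files without a reboot. The sysfs settings in the previous steps are not covered by this mechanism and still need a startup script.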

     

    5. Set the RoCE mode to v2 for RDMA CM traffic:

    # cma_roce_mode -d mlx5_0 -p 1 -m 2

    See HowTo Set the Default RoCE Mode When Using RDMA CM for more info.

     

    6. Set the default ToS to 24 (DSCP 6), which maps to skprio 4:

    # cma_roce_tos -d mlx5_0 -t 24

    See HowTo Set Egress ToS/DSCP on RDMA-CM QPs for more info.
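The relation between ToS 24, DSCP 6 and skprio 4 can be checked with a little arithmetic. The sketch below assumes the default Linux IPv4 ToS-to-priority lookup (the kernel's ip_tos2prio table, indexed by (ToS & 0x1e) >> 1); the function names are hypothetical.

```shell
# tos_to_dscp TOS -> DSCP is the upper six bits of the ToS byte.
tos_to_dscp() {
    echo $(( $1 >> 2 ))
}

# tos_to_skprio TOS -> default Linux socket priority for an IPv4 ToS value.
# Assumption: mirrors the kernel ip_tos2prio table, where
# best effort = 0, bulk = 2, interactive = 6, interactive bulk = 4.
tos_to_skprio() {
    idx=$(( ($1 & 0x1e) >> 1 ))
    set -- 0 0 0 0 2 2 2 2 6 6 6 6 4 4 4 4
    shift "$idx"
    echo "$1"
}

tos_to_dscp 24      # prints 6
tos_to_skprio 24    # prints 4
```

This explains why the next step maps skprio 4, rather than skprio 3, to L2 priority 3.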

     

    7. Set the egress priority map (skprio 4 mapped to L2 priority 3) on the server's VLAN interface (eth1.5 on S5):

    # vconfig set_egress_map eth1.5 4 3

     

    In this example, TCP is sent over priority 0, so there is no need to change the default.

    See HowTo Set Egress Priority VLAN on Linux for more options.
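To confirm that the mapping took effect, the kernel lists the egress priority mappings of a VLAN device in /proc/net/vlan/<device>. This is a minimal sketch that extracts the L2 priority for one skprio from that file; the function name egress_pcp_for is hypothetical, and the parsing assumes the usual "EGRESS priority mappings: from:to" line format.

```shell
# egress_pcp_for SKPRIO -> reads /proc/net/vlan/<vlan-dev> content on stdin
# and prints the L2 priority (PCP) that SKPRIO is mapped to.
# Unmapped priorities are sent with PCP 0, so 0 is printed in that case.
egress_pcp_for() {
    awk -v want="$1" '
        /EGRESS priority mappings:/ {
            for (i = 1; i <= NF; i++) {
                if (split($i, kv, ":") == 2 && kv[2] != "" && kv[1] == want) {
                    print kv[2]
                    found = 1
                }
            }
        }
        END { if (!found) print 0 }'
}
```

For example, egress_pcp_for 4 < /proc/net/vlan/eth1.5 should print 3 after the vconfig command above. On systems without vconfig, ip link set dev eth1.5 type vlan egress-qos-map 4:3 from iproute2 sets the same mapping.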

     

    Setup QoS (Switch)

    1. Enable PFC on priority 3:

    switch (config) # dcb priority-flow-control enable force

    switch (config) # dcb priority-flow-control priority 3 enable

    switch (config) # interface ethernet 1/1-1/32 dcb priority-flow-control mode on force

     

    2. Enable ECN on traffic class 3 (RDMA) and traffic class 0 (TCP), and configure strict-priority egress scheduling for CNP (traffic class 6):

    switch (config) # interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

    switch (config) # interface ethernet 1/1-1/32 traffic-class 0 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

    switch (config) # interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict
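The minimum and maximum thresholds define the queue depth range over which packets get ECN-marked. The sketch below models this under an explicit assumption of RED-style behavior: no marking below the minimum, all packets marked above the maximum, and a linear ramp in between. The switch's actual marking curve may differ, and the function name ecn_mark_pct is hypothetical.

```shell
# ecn_mark_pct QUEUE MIN MAX -> approximate marking probability in percent.
# Assumption: linear RED-style ramp between MIN and MAX (same units for all).
ecn_mark_pct() {
    q=$1; min=$2; max=$3
    if [ "$q" -le "$min" ]; then
        echo 0
    elif [ "$q" -ge "$max" ]; then
        echo 100
    else
        echo $(( (q - min) * 100 / (max - min) ))
    fi
}

ecn_mark_pct 825 150 1500   # halfway between the thresholds: prints 50
```

Under this model, lowering the minimum makes senders slow down earlier, at the cost of some throughput, while raising the maximum lets queues grow deeper before every packet is marked.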

     

    3. Setup the buffer configuration:

    RDMA is mapped to the lossless pool (pool1) and TCP and CNP traffic are mapped to the lossy pool (pool0).

    switch (config) # pool ePool0 direction egress-mc size 5242880 type dynamic

    switch (config) # pool ePool1 direction egress size 16777000 type dynamic

    switch (config) # pool iPool0 direction ingress size 5242880 type dynamic

    switch (config) # pool iPool1 direction ingress size 5242880 type dynamic

    switch (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg6 map pool iPool0 type lossy reserved 20480 shared alpha 8

    switch (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg3 map pool iPool1 type lossless reserved 96256 xoff 20480 xon 20480 shared alpha 2

    switch (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg0 bind switch-priority 0          <This is the default configuration>

    switch (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg6 bind switch-priority 6

    switch (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg3 bind switch-priority 3

    switch (config) # interface ethernet 1/1-1/32 egress-buffer ePort.tc3 map pool ePool1 reserved 1500 shared alpha inf

     

    Note: Lossy traffic (TCP or any other background traffic) is buffered in the lossy pool (pool0), which does not require any additional reserved buffer.

    By default, there are 20 KB of reserved buffer per port (hidden).

     

    Test the RDMA Layer

    1. Get the GID index using show_gids

     

    In this example, we want to use eth1.5 over RoCE v2.

    To do that, we need to use GID INDEX 5.

    # show_gids

    DEV    PORT INDEX     GID                                        IPv4            VER DEV

    ---    ---- -----     ---                                        ------------    --- ---

    mlx5_0  1    0       fe80:0000:0000:0000:e61d:2dff:fef2:a488                     v1  eth1

    mlx5_0  1    1       fe80:0000:0000:0000:e61d:2dff:fef2:a488                     v2  eth1

    mlx5_0  1    2       0000:0000:0000:0000:0000:ffff:0101:0105       1.1.1.5       v1  eth1

    mlx5_0  1    3       0000:0000:0000:0000:0000:ffff:0101:0105       1.1.1.5       v2  eth1

    mlx5_0  1    4       0000:0000:0000:0000:0000:ffff:0101:0502       1.1.5.2       v1  eth1.5

    mlx5_0  1    5       0000:0000:0000:0000:0000:ffff:0101:0502       1.1.5.2       v2  eth1.5
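Instead of reading the table by eye, the index can be picked out by net device and RoCE version. This is a minimal sketch, assuming the column layout shown above (seven whitespace-separated fields on rows that carry an IPv4 address); the function name find_gid_index is hypothetical.

```shell
# find_gid_index NETDEV VERSION
# Reads show_gids output on stdin and prints the INDEX of the first row
# that has an IPv4 address and matches the net device and RoCE version.
find_gid_index() {
    awk -v dev="$1" -v ver="$2" '
        NF == 7 && $NF == dev && $(NF - 1) == ver { print $3; exit }'
}
```

For this setup, show_gids | find_gid_index eth1.5 v2 is expected to print 5, the value passed to the -x option of the perftest tools. Rows without an IPv4 address (the link-local GIDs) have only six fields and are skipped.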

     

    2. Run performance benchmarks. It is recommended to use multiple QPs (multiple threads).

     

    For example, run the ib_send_bw server (or any other test from the Perftest package) on the host with the 10Gb/s link.

     

    Use -S 3 to set the service level (SL) to 3, so that the traffic is sent on L2 priority 3:

     

    # for i in {0..7} ; do taskset -c $i ib_send_bw -R -x 5 -d mlx5_0 -F --report_gbits -f 2 -D 120 -S 3 -p $((10000+i)) & done

     

    Run the client on the other host:

    # for i in {0..7} ; do taskset -c $i ib_send_bw -R  -x 5 -d mlx5_0 -F --report_gbits -f 2 -D 120 -S 3 1.1.5.2 -p $((10000+i)) & done  | grep 65536 | awk '{sum+=$4} END {print sum}'
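The grep and awk stage at the end of the client command adds up the average bandwidth reported by the eight processes. A slightly stricter variant is sketched below: it matches the message size against the first column only, so a 65536 appearing elsewhere in the output is not counted. It assumes the usual perftest result row layout, with the message size in the first field and the average bandwidth in the fourth; the function name sum_avg_bw is hypothetical.

```shell
# sum_avg_bw MSG_SIZE -> reads perftest output on stdin and prints the sum
# of the "BW average" column (4th field) over the rows for MSG_SIZE.
sum_avg_bw() {
    awk -v size="$1" '$1 == size { sum += $4 } END { print sum + 0 }'
}
```

It would replace the tail of the pipeline, for example ... & done | sum_avg_bw 65536. With --report_gbits the result is in Gb/s and should stay close to the rate of the 10G link once congestion control is working.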

     

    3. To make sure everything works, it is worth adding a mirroring port on the switch and copying the traffic to a third server (see the first figure in this guide).
         See HowTo Configure Port Mirroring on Mellanox Ethernet Switches to learn more about switch mirroring.

    For example, run the following on one of the switches, on a relevant port:

    switch (config) # show monitor session 1

     

    Session 1

    Admin:  Enable

    Status: Up

    Truncate:   Disable

    Destination interface: eth1/3

    Congestion type: drop-excessive-frames

    Header format: local

               -switch priority: 0

     

    Source interfaces

    Interface  direction

    --------------------------

    eth1/1     both

     

    Run Wireshark on the monitor server and make sure the following applies:

    1. RDMA traffic is sent with a VLAN tag on priority 3 (as configured).

    2. RoCE v2 uses the RoCE UDP destination port (4791).

     

     

    3. If there is congestion in the network, you should see CNP traffic on priority 6 (as configured). To understand the packet format, see RoCEv2 CNP Packet Format Example.

     

    See attached wireshark images at the end of this guide.

     

    Other Considerations

    TCP Flows

    If you plan to run other traffic types such as TCP, make sure they use priority 0 on the VLAN. The switch is configured to map priority 0 to a different pool (a different buffer), so RDMA and TCP do not share the same switch resources.

     

    Egress Scheduling (QoS)

    When you send multiple types of traffic, you can set a different weight for each traffic flow (e.g. 60% RDMA, 40% TCP).

    Check the mlnx_qos tool for servers, and see Understanding TC Scheduling on Mellanox Spectrum Switches (WRR, SP) for the switch configuration.

     

    Other Switch Running Configuration Examples

    For more switch configuration examples, such as Cisco Nexus 3132Q, 3132C and Arista DCS 7050QX-32-F, see attached images at the end of this guide.

     

    Debugging ECN and PFC

     

    Switch Counters and Buffer Levels

    1. Check the port counters on the switch:

    # show interface ethernet 1/1 counters

    Check the pause frames and the Rx/Tx traffic. Expect a low number of pause frames compared to a configuration with PFC only (no ECN).

    The RoCE congestion control mechanism adjusts the sender's rate according to the egress port congestion threshold. This dramatically reduces the number of pauses sent towards the senders and lets the adapter queues release traffic smoothly.

     

    2. Check ECN marking:

    # show interfaces ethernet 1/1 congestion-control

    Expect ECN-marked packets on traffic class 3 on the port facing the 10G server, since that is where the congestion occurs.

     

    3. Check the buffer status:

    # show buffer status interfaces ethernet 1/1

     

    Make sure to send RDMA traffic on priority 3, CNP traffic on priority 6 and TCP traffic on priority 0.

    See the MaxUsage column for the corresponding priority groups (PGs).

     

    4. Check the QoS counters for PFC, PG, TC and Switch priority:

    Visit QoS Counters on Mellanox Spectrum Switches (PFC, PG, TC, Switch-priority) for examples.

     

    Server Counters

    1. Get port priority counters on priority 0 (TCP) and 3 (RDMA):

    # watch -n 1 "ethtool -S eth1 | grep prio"
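The grep above shows the counters of all eight priorities at once. To focus on a single priority, the output can be filtered further. This is a minimal sketch, assuming counter names of the form rx_prio3_bytes or tx_prio3_pause as exposed by the mlx5 driver; the function name prio_counters is hypothetical.

```shell
# prio_counters PRIO -> filters "ethtool -S" output (stdin) down to the
# counters of one priority and prints them as "name value" pairs.
prio_counters() {
    awk -v p="$1" '
        $1 ~ ("_prio" p "_") { sub(/:$/, "", $1); print $1, $2 }'
}
```

For example, ethtool -S eth1 | prio_counters 3 lists only the RDMA priority. Comparing the pause counters of priority 3 before and after a test run gives a quick indication of how often PFC was triggered.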

     

    2. To get CNP counters, refer to HowTo Read CNP Counters on Mellanox adapters.

     

     

    Startup-scripts and Running Config

     

    To make things simple, copy and paste the following startup scripts to the servers and switches and modify if required.

     

    Server startup-script example

    mlnx_qos -i eth1 --pfc 0,0,0,1,0,0,0,0

    echo 1 > /sys/class/net/eth1/ecn/roce_np/enable/3

    echo 1 > /sys/class/net/eth1/ecn/roce_rp/enable/3

    echo 6 > /sys/class/net/eth1/ecn/roce_np/cnp_802p_prio

    sysctl -w net.ipv4.tcp_ecn=1

    # route add -net <network> netmask <mask> gw <gateway IP> 

    cma_roce_mode -d mlx5_0 -p 1 -m 2

    cma_roce_tos -d mlx5_0 -t 24

    vconfig set_egress_map eth1.5 4 3

     

    Switch running-config

    See attached at the end of this guide.

     

    Troubleshooting

    1. Make sure that RoCE v2 is used: check with Wireshark that RoCE is sent over UDP.

    2. Make sure the packets are sent over the VLAN interface on the proper priority.

    3. If traffic crosses routers, make sure the priority is preserved from one subnet to the other (on the other VLAN).

    4. Use Trust L2 across the network (the priority is taken from the L2 header).

    5. Make sure CNP packets are being sent: check the counters on the servers, and look for the actual CNP packets in Wireshark.

    6. Verify that RDMA traffic is sent on the right priority.