HowTo Configure Mellanox Spectrum Switch for Lossless RoCE

Version 15

    This post discusses real-life configuration of Mellanox Spectrum based Ethernet switches for Lossless RoCE and TCP traffic. The switch will be enabled with PFC and ECN.

     

    Configuration Highlights

    • Trust L3 (DSCP)
    • 2 buffer pools (pool 0 lossy, pool 1 lossless)
    • 3 traffic classes (tc0,tc3,tc6)
    • 3 priority groups (pg0, pg3, pg6)
    • Traffic types
      • TCP: Uses DSCP value 0 (switch priority 0), WRR, pool 0
      • RDMA: Uses DSCP value 24 (switch priority 3), WRR, pool 1
      • CNP: Uses DSCP value 48 (switch priority 6), Strict Priority, pool 1
    • ECN/RED is configured on the switch
    • PFC enabled for pg3


    Configuration

     

    Switch Configuration

    1. Configure buffer pools.

     

    Each pool has two sub-types:

    • iPool - ingress pool
    • ePool - egress pool

    In this example, we have two pools (each of which has iPool and ePool configuration):

    • Pool 0 - will be used for TCP traffic
    • Pool 1 - will be used for RDMA/RoCE and CNP control frames

     

    Note: Because ePool1 is a lossless pool, configure it with the maximum possible size (16MB) to avoid drops on that pool.

    switch (config) # pool ePool0 direction egress-mc size 4194304 type dynamic

    switch (config) # pool ePool1 direction egress size 16777000 type dynamic

    switch (config) # pool iPool0 direction ingress size 4194304 type dynamic

    switch (config) # pool iPool1 direction ingress size 4194304 type dynamic

     

    2. Configure Trust level per port.

    In this case, we will configure Trust L3, since we want the switch to look at the DSCP field in the packet.

    switch (config) # interface ethernet 1/1 qos trust L3

     

    To learn more about Trust configuration, see Understanding QoS Classification (Trust) on Spectrum Switches.

     

    3. Map DSCP levels to switch-priority.

    In this case, we will use the following mapping (which is also the default mapping), assuming that:

    • TCP traffic is sent with DSCP 0
    • RDMA traffic is sent with DSCP 24
    • CNP control traffic is sent with DSCP 48

    switch (config) # interface ethernet 1/1 qos map dscp 24 to switch-priority 3 (default)

    switch (config) # interface ethernet 1/1 qos map dscp 48 to switch-priority 6 (default)

    switch (config) # interface ethernet 1/1 qos map dscp 0 to switch-priority 0 (default)
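
    The three mappings above are consistent with taking the switch priority from the three most significant bits of the 6-bit DSCP value. The sketch below assumes that this rule holds for the whole default table; the article only confirms it for DSCP 0, 24, and 48, and the function name is illustrative.

```python
def default_switch_priority(dscp: int) -> int:
    """Assumed default rule: switch priority = top 3 bits of the 6-bit DSCP."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp >> 3


# The three traffic types used in this article.
for name, dscp in (("TCP", 0), ("RDMA", 24), ("CNP", 48)):
    print(f"{name}: DSCP {dscp} -> switch priority {default_switch_priority(dscp)}")
```

    If a host marks traffic with a DSCP value outside these three, check the mapping on the switch rather than relying on this rule.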

     

    4. Map switch-priority to priority-groups.

    In this example, we use 3 switch priorities (sp) and map each one to a different priority group (pg).

    To make matters simple, we will map them so that:

    • sp0 is mapped to pg0 (this is the default mapping)
    • sp3 is mapped to pg3
    • sp6 is mapped to pg6

    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg0 bind switch-priority 0 (default)

    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg3 bind switch-priority 3

    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg6 bind switch-priority 6

     

    Note: It is possible to map several switch priorities to the same priority group. The default configuration is that all switch-priorities are mapped to priority group 0 (iPort.pg0).

     

    5. Map ingress priority group traffic to pools.

    In this example, we have three priority groups mapped to two buffer pools:

    • pg0 mapped to pool 0
    • pg3 mapped to pool 1
    • pg6 mapped to pool 1

     

    When mapping the priority groups to pools, we set the mapping type to either lossy or lossless.

    For Lossless RoCE, we will set the buffer type of the RDMA priority group (pg3) to lossless, since PFC is enabled for it. The TCP and CNP priority groups (pg0 and pg6) remain lossy.

     

    The suggested reserved buffer for Lossless RoCE is 70KB with xon=17000, xoff=17000 and alpha=2.

    The suggested reserved buffer for TCP traffic and CNP Control is 20KB with alpha=8.

    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20480 shared alpha 8

    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg3 map pool iPool1 type lossless reserved 70K xoff 17000 xon 17000 shared alpha 2

    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg6 map pool iPool1 type lossy reserved 20480 shared alpha 8
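
    To illustrate what the xoff and xon thresholds on the lossless priority group do, here is a minimal generic model of PFC pause and resume. It is only a sketch: it assumes the thresholds are byte counts compared against the priority group occupancy, which this article does not state, and it does not reproduce how the Spectrum ASIC actually accounts for its buffers. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PfcState:
    xoff: int             # occupancy above which a pause is requested (assumed bytes)
    xon: int              # occupancy below which the sender may resume (assumed bytes)
    paused: bool = False

    def update(self, occupancy: int) -> bool:
        """Return True while the upstream sender should stay paused."""
        if not self.paused and occupancy > self.xoff:
            self.paused = True
        elif self.paused and occupancy < self.xon:
            self.paused = False
        return self.paused


# Thresholds taken from the iPort.pg3 command above.
pg3 = PfcState(xoff=17000, xon=17000)
for occupancy in (10000, 18000, 17500, 16000):
    print(occupancy, "paused" if pg3.update(occupancy) else "running")
```

    With xon equal to xoff, as in this configuration, the model has no hysteresis: the pause is released as soon as the occupancy drops back under the same threshold.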

     

    6. Map egress traffic class to pools.

    In this example, we have three traffic classes mapped to two buffer pools (similar to the ingress pools).

    • tc0 mapped to pool 0
    • tc3 mapped to pool 1
    • tc6 mapped to pool 1

     

    The suggested reserved buffer for Lossless RoCE is 1.5KB with alpha inf.

    The suggested reserved buffer for TCP and CNP Control is 1.5KB with alpha 2.

     

    switch (config) # interface ethernet 1/1 egress-buffer ePort.tc0 map pool ePool0 reserved 1500 shared alpha 2

    switch (config) # interface ethernet 1/1 egress-buffer ePort.tc3 map pool ePool1 reserved 1500 shared alpha inf

    switch (config) # interface ethernet 1/1 egress-buffer ePort.tc6 map pool ePool1 reserved 1500 shared alpha 2

    To learn more about using the alpha parameter to tune the buffer size, please refer to Understanding the alpha parameter for buffer configuration on Mellanox Spectrum switches.
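
    As a rough intuition for alpha, the sketch below uses the common dynamic-threshold model in which a consumer may hold up to alpha times the currently free space of its shared pool. This model is an assumption here; the article referenced above is the authoritative description for Spectrum. The function names are illustrative.

```python
import math


def dynamic_threshold(alpha: float, pool_size: int, pool_used: int) -> float:
    """Assumed model: allowed occupancy = alpha * free space in the shared pool."""
    if math.isinf(alpha):
        return math.inf  # alpha inf: no dynamic limit in this model
    free = max(pool_size - pool_used, 0)
    return alpha * free


def steady_state_share(alpha: float, pool_size: int) -> float:
    """Occupancy x where one congested consumer stops growing: x = alpha * (pool_size - x)."""
    if math.isinf(alpha):
        return float(pool_size)
    return pool_size * alpha / (1 + alpha)


pool = 4194304  # size of iPool0 / iPool1 in this article
for alpha in (2, 8, math.inf):
    share = steady_state_share(alpha, pool)
    print(f"alpha={alpha}: a single congested consumer can reach about {share:.0f} bytes")
```

    In this model a larger alpha lets a single port take a larger fraction of the pool, which is why the lossless RDMA traffic class is given alpha inf on egress.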

     

    7. Map switch priority to traffic class.

    There are several types of traffic in each pool. For example, in our case, in pool1 we have two types of traffic (RDMA and CNP) and we need to make sure that each one of them gets a different traffic class.

    The packets in the pool are marked with the switch priority, and we map them to the right traffic class as follows:

    switch (config) # interface ethernet 1/1 traffic-class 0 bind switch-priority 0

    switch (config) # interface ethernet 1/1 traffic-class 3 bind switch-priority 3

    switch (config) # interface ethernet 1/1 traffic-class 6 bind switch-priority 6

     

    8. Configure the Scheduler.

    In our case, since we do not want to lose CNP control frames (congestion ACK), we will configure this type of traffic as strict priority (for tc6).

    switch (config) # interface ethernet 1/1 traffic-class 6 dcb ets strict

     

    Note: There is no need to change the scheduling options (WRR) for tc0 and tc3 (this is the default).

     

    Refer to Understanding TC Scheduling on Mellanox Spectrum Switches (WRR, SP) if you wish to understand and tune the scheduler with other WRR/SP options.

     

    9. Set ECN/RED on the switch.

    Configure ECN marking upon congestion for the RDMA traffic (tc3).

    In this example, we set the minimum and maximum thresholds as absolute values so that:

    • When the queue length reaches 150KB, some packets will be randomly marked as congested in the ECN bits of the IP header
    • When the queue length reaches 1500KB, all packets will be marked as congested in the ECN bits of the IP header

    switch (config) # interface ethernet 1/1 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

     

    Note: The values are in KB.
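
    The marking behavior between the two thresholds can be pictured with a simple RED-style ramp. The sketch below assumes the marking probability grows linearly from 0 at the minimum threshold to 1 at the maximum threshold; the exact curve used by the switch is not given in this article, and the function name is illustrative.

```python
def ecn_mark_probability(queue_kb: float, min_kb: float = 150, max_kb: float = 1500) -> float:
    """Assumed linear ramp between the minimum and maximum thresholds (values in KB)."""
    if queue_kb <= min_kb:
        return 0.0
    if queue_kb >= max_kb:
        return 1.0
    return (queue_kb - min_kb) / (max_kb - min_kb)


for queue_kb in (100, 150, 825, 1500, 2000):
    print(f"queue {queue_kb} KB -> marking probability {ecn_mark_probability(queue_kb):.2f}")
```

    Lowering the minimum threshold makes senders react earlier to congestion, at the cost of throttling them during short bursts.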

     

    For more information about ECN/RED refer to HowTo Configure ECN on Mellanox Ethernet Switches (Spectrum)

     

    Note: For each packet marked with congestion, the receiving host sends a CNP control packet back to the sender so that the sender lowers its rate.

     

    10. Enable PFC on the switch

     

    a. Create a VLAN and set the switch ports to trunk mode by running:

    switch (config) # vlan 100

    switch (config vlan 100) # exit

    switch (config) # interface ethernet 1/1 switchport mode trunk

    switch (config) # interface ethernet 1/2 switchport mode trunk

     

    Note: There is also an option to enable DSCP based PFC with no VLAN (untagged interface), see How To Configure DSCP-based PFC on Mellanox Spectrum Switches.

     

    b. Make sure global Flow Control is disabled (it is disabled by default) by running:

    switch (config) # interface ethernet 1/1-1/2 flowcontrol send off force

    switch (config) # interface ethernet 1/1-1/2 flowcontrol receive off force

     

    c. Enable PFC on the desired priority (3) by running:

    switch (config) # dcb priority-flow-control enable

    This action might cause traffic loss while shutting down a port with priority-flow-control mode on

    Type 'yes' to confirm  enable pfc globally: yes

    switch (config) # dcb priority-flow-control priority 3 enable

    switch (config) # interface ethernet 1/1 dcb priority-flow-control mode on force

    switch (config) # interface ethernet 1/2 dcb priority-flow-control mode on force

     

    For more information about PFC configuration on Mellanox Spectrum™ based switches, refer to How to Enable PFC on Mellanox Switches (Spectrum)
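
    The per-port commands in the steps above are shown for a single interface. When many ports need the same settings, the command list can be generated instead of typed. The helper below is only a convenience sketch: it templates the per-port commands exactly as they appear in this article, leaves out the global commands (pools, VLAN, global flow control and PFC enable), and does not validate anything against a real switch. The names PER_PORT_TEMPLATES and render_port_config are hypothetical.

```python
# Per-port commands copied from the steps above, with the port templated.
PER_PORT_TEMPLATES = [
    "interface ethernet {port} qos trust L3",
    "interface ethernet {port} ingress-buffer iPort.pg3 bind switch-priority 3",
    "interface ethernet {port} ingress-buffer iPort.pg6 bind switch-priority 6",
    "interface ethernet {port} ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20480 shared alpha 8",
    "interface ethernet {port} ingress-buffer iPort.pg3 map pool iPool1 type lossless reserved 70K xoff 17000 xon 17000 shared alpha 2",
    "interface ethernet {port} ingress-buffer iPort.pg6 map pool iPool1 type lossy reserved 20480 shared alpha 8",
    "interface ethernet {port} egress-buffer ePort.tc0 map pool ePool0 reserved 1500 shared alpha 2",
    "interface ethernet {port} egress-buffer ePort.tc3 map pool ePool1 reserved 1500 shared alpha inf",
    "interface ethernet {port} egress-buffer ePort.tc6 map pool ePool1 reserved 1500 shared alpha 2",
    "interface ethernet {port} traffic-class 0 bind switch-priority 0",
    "interface ethernet {port} traffic-class 3 bind switch-priority 3",
    "interface ethernet {port} traffic-class 6 bind switch-priority 6",
    "interface ethernet {port} traffic-class 6 dcb ets strict",
    "interface ethernet {port} traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500",
    "interface ethernet {port} switchport mode trunk",
    "interface ethernet {port} dcb priority-flow-control mode on force",
]


def render_port_config(ports):
    """Return the per-port command lines for every port, grouped port by port."""
    return [template.format(port=port) for port in ports for template in PER_PORT_TEMPLATES]


for line in render_port_config(["1/1", "1/2"]):
    print(line)
```

    Run the global commands from steps 1 and 10 first, then review the generated lines before pasting them into the switch CLI.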

     

    Host Configuration

    We need to make sure that the host is configured according to the following settings:

    • Enable ECN on the driver - to be able to send and accept CNP control frames to adjust the Tx rate
    • CNP control traffic to be sent with DSCP 48 - to be mapped to switch priority 6 on the switch
    • RDMA traffic to be sent with DSCP 24 - to be mapped to switch priority 3 on the switch
    • TCP traffic to be sent with DSCP 0 - to be mapped to switch priority 0 on the switch
    • Enable PFC on the host, RDMA traffic should be sent with priority 3
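
    Some host-side tools expect the full 8-bit ToS (Traffic Class) byte rather than the DSCP value. DSCP occupies the upper six bits of that byte and ECN the lower two, so the conversion is a simple shift. The helper below is a small sketch of that arithmetic; the function name is illustrative, and whether a given tool takes DSCP or ToS has to be checked in that tool's documentation.

```python
ECN_NOT_ECT = 0b00  # sender does not support ECN
ECN_ECT0 = 0b10     # ECN-capable transport, codepoint ECT(0)


def dscp_to_tos(dscp: int, ecn: int = ECN_NOT_ECT) -> int:
    """Build the 8-bit ToS byte: DSCP in the upper 6 bits, ECN in the lower 2."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must be in the range 0..3")
    return (dscp << 2) | ecn


# DSCP values used in this article.
for name, dscp in (("TCP", 0), ("RDMA", 24), ("CNP", 48)):
    print(f"{name}: DSCP {dscp} -> ToS {dscp_to_tos(dscp)} (with ECT(0): {dscp_to_tos(dscp, ECN_ECT0)})")
```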