HowTo Configure Mellanox Spectrum Switch for Resilient RoCE

Version 19

    This post describes a real-life configuration of Mellanox Spectrum-based Ethernet switches for Resilient RoCE and TCP traffic. ECN is enabled on the switch.

     

    Configuration Highlights

    • Trust L3 (DSCP)
    • 2 lossy buffer pools
    • 3 traffic classes (tc0,tc3,tc6)
    • 3 priority groups (pg0, pg3, pg6)
    • Traffic types
      • TCP: Uses DSCP value 0 (switch priority 0), WRR, pool 0
      • RDMA: Uses DSCP value 24 (switch priority 3), WRR, pool 1
      • CNP: Uses DSCP value 48 (switch priority 6), Strict Priority, pool 1
    • ECN/RED is configured on the switch

    Configuration

     

    Switch Configuration

    1. Configure buffer pools.

     

    Each pool has two sub-types:

    • iPool - ingress pool
    • ePool - egress pool

     

    In this example, we have two pools (each of which has iPool and ePool configuration):

    • Pool 0 - will be used for TCP traffic
    • Pool 1 - will be used for RDMA/RoCE and CNP control frames

     

    For both pools we will use 4MB (2^22=4194304) dynamic pools, for both iPool and ePool:

    switch (config) # pool ePool0 direction egress size 4194304 type dynamic

    switch (config) # pool ePool1 direction egress size 4194304 type dynamic

    switch (config) # pool iPool0 direction ingress size 4194304 type dynamic

    switch (config) # pool iPool1 direction ingress size 4194304 type dynamic
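
    The byte values used in this post can be checked with a short calculation. This is a minimal sketch; the helper names are hypothetical and not part of any switch tooling. It assumes binary units (1KB = 1024 bytes), which matches the 4MB pool size and the 20KB reserved value used for the ingress buffers. The 1.5KB egress reservation is written as a plain 1500 bytes in the commands.

```python
KIB = 1024          # bytes per KB, binary units (assumption stated above)
MIB = 1024 * KIB    # bytes per MB


def kib(n):
    """Convert KB (binary) to bytes."""
    return int(n * KIB)


def mib(n):
    """Convert MB (binary) to bytes."""
    return int(n * MIB)


pool_size = mib(4)       # size passed to the 'pool ... size' commands
pg_reserved = kib(20)    # reserved size used for the ingress priority groups

print("pool size:", pool_size)
print("ingress reserved:", pg_reserved)
```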

     

    2. Configure Trust level per port.

    In this case, we will configure Trust L3, since we want the switch to look at the DSCP field in the packet.

    switch (config) # interface ethernet 1/1 qos trust L3

     

    To learn more about Trust configuration, see Understanding QoS Classification (Trust) on Spectrum Switches.

     

    3. Map DSCP levels to switch-priority.

    In this case, we use the following mapping, which is also the default mapping. We assume that:

    • TCP traffic is sent with DSCP 0
    • RDMA traffic is sent with DSCP 24
    • CNP control traffic is sent with DSCP 48

    switch (config) # interface ethernet 1/1 qos map dscp 24 to switch-priority 3 (default)

    switch (config) # interface ethernet 1/1 qos map dscp 48 to switch-priority 6 (default)

    switch (config) # interface ethernet 1/1 qos map dscp 0 to switch-priority 0 (default)
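
    The three default mappings above are consistent with taking the upper three bits of the 6-bit DSCP value as the switch priority. The sketch below illustrates that rule. Treat it as an assumption for any DSCP value other than 0, 24 and 48, since this post only confirms those three; the function name is hypothetical.

```python
def default_switch_priority(dscp):
    """Map a 6-bit DSCP value to a switch priority (0-7).

    Assumption: the default mapping uses the top three bits of the DSCP,
    which agrees with the three mappings listed in this post.
    """
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp >> 3


for name, dscp in (("TCP", 0), ("RDMA", 24), ("CNP", 48)):
    print(name, "DSCP", dscp, "-> switch priority", default_switch_priority(dscp))
```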

     

    4. Map switch priority to priority groups.

    In this example, we use 3 switch priorities (sp), each mapped to a different priority group (pg). To keep things simple, we use the same numbers:

    • sp0 is mapped to pg0 (this is the default mapping)
    • sp3 is mapped to pg3
    • sp6 is mapped to pg6

    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg0 bind switch-priority 0 (default)

    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg3 bind switch-priority 3

    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg6 bind switch-priority 6

     

    Note: It is possible to map several switch priorities to the same priority group. The default is that all switch-priorities are mapped to priority group 0 (iPort.pg0).

     

    5. Map ingress priority group traffic to pools.

    In this example, we have three priority groups mapped to two buffer pools:

    • pg0 mapped to pool 0
    • pg3 mapped to pool 1
    • pg6 mapped to pool 1

     

    When mapping the priority groups to pools, we set the mapping type to either lossy or lossless.

    For Resilient RoCE, we will set the buffer type to lossy, since we don't configure flow control in the network.

     

    The suggested reserved buffer for resilient RoCE is 20KB with alpha 8.

    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20480 shared alpha 8

    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg3 map pool iPool1 type lossy reserved 20480 shared alpha 8

    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg6 map pool iPool1 type lossy reserved 20480 shared alpha 8

     

    6. Map egress traffic class to pools.

    In this example, we have three traffic classes mapped to two buffer pools (similar to the ingress pools).

    • tc0 mapped to pool 0
    • tc3 mapped to pool 1
    • tc6 mapped to pool 1

     

    The suggested reserved buffer for resilient RoCE is 1.5KB (1500 bytes) with alpha 2.

    switch (config) # interface ethernet 1/1 egress-buffer ePort.tc0 map pool ePool0 reserved 1500 shared alpha 2

    switch (config) # interface ethernet 1/1 egress-buffer ePort.tc3 map pool ePool1 reserved 1500 shared alpha 2

    switch (config) # interface ethernet 1/1 egress-buffer ePort.tc6 map pool ePool1 reserved 1500 shared alpha 2

     

    To learn more about the alpha parameter and how to use it to tune the buffer size, refer to Understanding the alpha parameter for buffer configuration on Mellanox Spectrum switches.
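
    To give a feel for what alpha does, the sketch below uses the common dynamic-threshold model, in which a queue may grow up to alpha times the shared space that is currently free in its pool. This model is an assumption for illustration only; the reserved space is ignored, the function names are hypothetical, and the exact Spectrum behavior is described in the article referenced above.

```python
def dynamic_limit(alpha, pool_bytes, used_bytes):
    """Current cap for one queue: alpha times the free shared space."""
    free = max(pool_bytes - used_bytes, 0)
    return alpha * free


def steady_state_share(alpha, pool_bytes, congested_queues):
    """Per-queue occupancy when N equally congested queues all sit at the cap.

    Solving q = alpha * (pool - N * q) for q gives
    q = alpha * pool / (1 + alpha * N).
    """
    return alpha * pool_bytes / (1 + alpha * congested_queues)


POOL = 4194304  # 4MB pool, as configured in step 1

for alpha in (8, 2):            # ingress and egress values used in this post
    for n in (1, 2, 4):         # number of congested queues (illustrative)
        share = steady_state_share(alpha, POOL, n)
        print("alpha", alpha, "queues", n, "share", round(share))
```

    Under this model, a larger alpha lets a single congested queue take a larger part of the pool, while a smaller alpha keeps more of the pool free for other queues.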

     

    7. Map Switch Priority to Traffic Class.

    There are several types of traffic in each pool. For example, in our case, in pool1 we have two types of traffic (RDMA and CNP) and we need to make sure that each one of them gets a different traffic class.

    The packets in the pool are marked with the switch priority, and we map them to the right traffic class as follows:

    switch (config) # interface ethernet 1/1 traffic-class 0 bind switch-priority 0

    switch (config) # interface ethernet 1/1 traffic-class 3 bind switch-priority 3

    switch (config) # interface ethernet 1/1 traffic-class 6 bind switch-priority 6
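
    The classification commands in steps 2, 3, 4 and 7 follow the same pattern for every port, so they can be generated rather than typed by hand. The sketch below only prints command text, using the syntax shown in this post. The function name and the second port (1/2) are hypothetical, and nothing is sent to the switch.

```python
from typing import List

# (traffic type, DSCP, switch priority) as used in this post
CLASSES = [
    ("TCP", 0, 0),
    ("RDMA", 24, 3),
    ("CNP", 48, 6),
]


def port_commands(port: str) -> List[str]:
    """Return the trust, DSCP, priority-group and traffic-class commands for one port."""
    prefix = "interface ethernet " + port
    cmds = [prefix + " qos trust L3"]
    for _name, dscp, sp in CLASSES:
        # The priority group and traffic class use the same number as the switch priority.
        cmds.append("%s qos map dscp %d to switch-priority %d" % (prefix, dscp, sp))
        cmds.append("%s ingress-buffer iPort.pg%d bind switch-priority %d" % (prefix, sp, sp))
        cmds.append("%s traffic-class %d bind switch-priority %d" % (prefix, sp, sp))
    return cmds


for port in ("1/1", "1/2"):  # 1/2 is a hypothetical second port
    print("\n".join(port_commands(port)))
```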

     

    8. Configure the Scheduler.

    In our case, since we don't want to lose CNP control frames (congestion ACK), we will configure this type of traffic as strict priority (for tc6).

    switch (config) # interface ethernet 1/1 traffic-class 6 dcb ets strict

     

    Note: There is no need to change the scheduling options (WRR) for tc0 and tc3 (this is the default).

     

    Refer to Understanding TC Scheduling on Mellanox Spectrum Switches (WRR, SP) in case you wish to tune the scheduler with other WRR/SP options.

     

    9. Set ECN/RED on the switch.

    Configure ECN marking upon congestion for the RDMA traffic (tc3).

    In this example, the minimum and maximum thresholds are set as absolute queue lengths, so that:

    When the queue length reaches 150KB, some packets are randomly marked as congested in the ECN bits of the IP header.

    When the queue length reaches 1500KB, all packets are marked as congested in the ECN bits of the IP header.

    switch (config) # interface ethernet 1/1 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

     

    Note: the values are in KB.
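
    The effect of the two thresholds can be pictured with a simple RED-style model. The sketch below assumes that the marking probability rises linearly from 0 at the minimum threshold to 1 at the maximum threshold. The linear ramp and the function name are assumptions for illustration; this post only states that some packets are marked from 150KB and all packets are marked from 1500KB.

```python
def ecn_mark_probability(queue_kb, min_kb=150, max_kb=1500):
    """Probability that a packet is ECN-marked at a given queue length (in KB).

    Assumption: linear ramp between the two thresholds.
    """
    if queue_kb < min_kb:
        return 0.0              # below the minimum: no marking
    if queue_kb >= max_kb:
        return 1.0              # at or above the maximum: every packet is marked
    return (queue_kb - min_kb) / (max_kb - min_kb)


for q in (100, 150, 825, 1500, 2000):
    print("queue", q, "KB -> mark probability", ecn_mark_probability(q))
```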

     

    For more information about ECN/RED refer to HowTo Configure ECN on Mellanox Ethernet Switches (Spectrum)

     

    Note: When ECN-marked packets reach the receiver, the receiving host sends a CNP control packet back to the sender, which then lowers its transmission rate.

     

    Host Configuration

    We need to make sure that the host is configured according to the following settings:

    • Enable ECN on the driver - to be able to send and accept CNP control frames to adjust the Tx rate
    • CNP control traffic to be sent with DSCP 48 - to be mapped to switch priority 6 on the switch
    • RDMA traffic to be sent with DSCP 24 - to be mapped to switch priority 3 on the switch
    • TCP traffic to be sent with DSCP 0 - to be mapped to switch priority 0 on the switch
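
    Some host-side settings take the full 8-bit ToS (Traffic Class) byte rather than the DSCP value. The DSCP occupies the upper six bits of that byte and the ECN field the lower two, so the conversion is a shift by two. The sketch below shows the arithmetic for the DSCP values used in this post; the function and constant names are hypothetical.

```python
# ECN field codepoints (lower two bits of the ToS byte)
ECN_NOT_ECT = 0b00   # not ECN-capable
ECN_ECT1 = 0b01      # ECN-capable transport, codepoint 1
ECN_ECT0 = 0b10      # ECN-capable transport, codepoint 0
ECN_CE = 0b11        # congestion experienced


def tos_byte(dscp, ecn=ECN_NOT_ECT):
    """Build the 8-bit ToS byte from a 6-bit DSCP and a 2-bit ECN field."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must be in the range 0..3")
    return (dscp << 2) | ecn


print("TCP  DSCP 0  -> ToS", tos_byte(0))
print("RDMA DSCP 24 -> ToS", tos_byte(24), "or", tos_byte(24, ECN_ECT0), "with ECT(0)")
print("CNP  DSCP 48 -> ToS", tos_byte(48))
```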