How To Configure DSCP-Based PFC on Mellanox Spectrum Switches

    This post describes how to enable DSCP-based priority flow control (PFC) between Mellanox Spectrum switches. The DSCP field is used to classify each arriving packet into a specific priority.

    Traditionally, packet classification for PFC relied only on the PCP field, which is carried in the VLAN tag of the packet header. Relying on the PCP field therefore requires that the user configure VLANs.

     

    Currently, Mellanox Spectrum switches (and soon Mellanox ConnectX NICs) allow users to enable priority flow control using the DSCP field. The advantage of relying on the DSCP field (compared to relying on the PCP field) is that the DSCP field is in the IP header. Because the DSCP field is present in every IP packet, users are not required to configure VLANs.
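
To make the difference between the two fields concrete, the following Python sketch reads PCP from the 802.1Q tag and DSCP from the IPv4 header of a raw Ethernet frame. The helper names (classify_fields, build_frame) and the sample frames are illustrative assumptions, not part of any Mellanox tool, and the sketch only handles IPv4 with at most one VLAN tag.

```python
import struct

ETHERTYPE_VLAN = 0x8100  # 802.1Q tag
ETHERTYPE_IPV4 = 0x0800


def classify_fields(frame: bytes):
    """Return (pcp, dscp) for an Ethernet frame; a field is None when absent."""
    pcp = None
    dscp = None
    offset = 12  # skip destination and source MAC addresses
    (ethertype,) = struct.unpack_from("!H", frame, offset)
    offset += 2
    if ethertype == ETHERTYPE_VLAN:
        (tci,) = struct.unpack_from("!H", frame, offset)
        pcp = tci >> 13  # top 3 bits of the tag control information
        (ethertype,) = struct.unpack_from("!H", frame, offset + 2)
        offset += 4
    if ethertype == ETHERTYPE_IPV4:
        tos = frame[offset + 1]  # second byte of the IPv4 header
        dscp = tos >> 2  # upper 6 bits; the lower 2 bits are ECN
    return pcp, dscp


def build_frame(dscp: int, vlan: int = None, pcp: int = 0) -> bytes:
    """Build a minimal hypothetical frame: MACs, optional tag, IPv4 header start."""
    frame = bytes(12)
    if vlan is not None:
        frame += struct.pack("!HH", ETHERTYPE_VLAN, (pcp << 13) | vlan)
    frame += struct.pack("!H", ETHERTYPE_IPV4)
    frame += bytes([0x45, dscp << 2])  # version/IHL, then the DSCP + ECN byte
    return frame


untagged = build_frame(dscp=32)
tagged = build_frame(dscp=32, vlan=1, pcp=4)
print(classify_fields(untagged))  # no VLAN tag, so PCP is None
print(classify_fields(tagged))
```

An untagged frame yields no PCP at all, while DSCP is available in both cases, which is why trust L3 works without VLANs.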

     

    The trust [L2|L3] command is used on the Mellanox Spectrum switch to define whether the classification is based on PCP (L2) or DSCP (L3).

    Note: The trust command has additional options (trust port, trust both). For full documentation, refer to the user manual.

     

    Highlights

    • PFC is enabled globally on priority 4.
    • DSCP-based classification is used between Switch A and Switch B (trust L3).
    • PCP-based classification is used between the switches and the hosts (trust L2).
    • There is a lossless buffer configuration in the switch.
    • DSCP 32 mapping is set to switch-priority 4.
    • PCP 4 mapping is set to switch-priority 4.
    • Synthetic congestion is created by reducing the speed of one link to 40G.
    • No VLAN is configured between the switches. VLAN 1 is configured between the switch and host. Note also that a router port (untagged) could be configured on those interfaces if the switches are enabled as IP routers.
    • Ingress traffic on port 1/1 (no VLAN, no L2 PCP bits) is mapped to switch-priority 4, and L2 PCP 4 is rewritten on the egress (port 1/2).

     

     

    Setup

     

     

     

     

    Switch to Switch Configurations

    Perform the following configuration on both switches.

     

    1. Enable PFC in the switch using the lossless priority (for example, priority 4).

    switch (config) # dcb priority-flow-control enable force

    switch (config) # dcb priority-flow-control priority 4 enable

     

    2. Enable the PFC for each interface.

    switch (config) # interface ethernet 1/1 dcb priority-flow-control mode on force

     

    3. Set trust mode to L3.

    switch (config) # interface ethernet 1/1 qos trust L3

     

    4. Map DSCP 32 to switch-priority 4 and DSCP 0 to switch-priority 0.

    switch (config) # interface ethernet 1/1 qos map dscp 32 to switch-priority 4
    switch (config) # interface ethernet 1/1 qos map dscp 0 to switch-priority 0
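
The show qos excerpt later in this post lists DSCP 31 under switch-priority 3 and DSCP 32 and 33 under switch-priority 4, which is consistent with the three most significant DSCP bits selecting the priority. The Python sketch below assumes that pattern holds for all 64 DSCP values; this is an assumption for illustration, and the actual table should be confirmed with show qos on the switch. The function names are hypothetical.

```python
def default_switch_priority(dscp: int) -> int:
    """Assumed default: the top 3 of the 6 DSCP bits select the priority."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit field (0-63)")
    return dscp >> 3


def render_map_commands(interface: str, mapping: dict) -> list:
    """Render 'qos map dscp' commands in the form used in step 4."""
    return [
        f"interface ethernet {interface} qos map dscp {dscp} to switch-priority {prio}"
        for dscp, prio in sorted(mapping.items())
    ]


mapping = {dscp: default_switch_priority(dscp) for dscp in (0, 32)}
for line in render_map_commands("1/1", mapping):
    print(line)
```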

     

    5. Change the buffering allocation as described below:

    • Configure Pool0 and Pool1.
    • Set ingress and egress buffer reserved size and thresholds.
    • Set mapping between priority-group to switch-priority.

    switch (config) # pool ePool0 direction egress-mc size 4194304 type dynamic

    switch (config) # pool ePool1 direction egress size 4194304 type dynamic

    switch (config) # pool iPool0 direction ingress size 4194304 type dynamic

    switch (config) # pool iPool1 direction ingress size 4194304 type dynamic

    switch (config)# interface ethernet 1/1 ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8

    switch (config)# interface ethernet 1/1 ingress-buffer iPort.pg4 map pool iPool1 type lossless reserved 70K xoff 17K xon 17K shared alpha 2

    switch (config)# interface ethernet 1/1 egress-buffer ePort.tc0 map pool ePool0 reserved 1500 shared alpha 2

    switch (config)# interface ethernet 1/1 egress-buffer ePort.tc4 map pool ePool1 reserved 1500 shared alpha inf

    switch (config)# interface ethernet 1/1 ingress-buffer iPort.pg4 bind switch-priority 4

    switch (config)# interface ethernet 1/1 ingress-buffer iPort.pg0 bind switch-priority 0

    More information about the buffering and QoS can be found here: HowTo Configure Mellanox Spectrum Switch for Lossless RoCE
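
Because the same six per-port buffer commands are repeated for every lossless interface, they can be generated rather than typed. The following Python sketch renders the commands from step 5 for a given interface. The function name and the lossless_priority parameter are hypothetical; the values (pool names, 20K/70K reservations, 17K thresholds, alphas) are copied from this post, and using a priority other than 4 is an untested assumption.

```python
def buffer_commands(interface: str, lossless_priority: int = 4) -> list:
    """Render the per-port buffer commands from step 5 for one interface."""
    prefix = f"interface ethernet {interface}"
    pg = f"iPort.pg{lossless_priority}"  # ingress priority group
    tc = f"ePort.tc{lossless_priority}"  # egress traffic class
    return [
        f"{prefix} ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8",
        f"{prefix} ingress-buffer {pg} map pool iPool1 type lossless reserved 70K xoff 17K xon 17K shared alpha 2",
        f"{prefix} egress-buffer ePort.tc0 map pool ePool0 reserved 1500 shared alpha 2",
        f"{prefix} egress-buffer {tc} map pool ePool1 reserved 1500 shared alpha inf",
        f"{prefix} ingress-buffer {pg} bind switch-priority {lossless_priority}",
        f"{prefix} ingress-buffer iPort.pg0 bind switch-priority 0",
    ]


for interface in ("1/1", "1/2"):
    for command in buffer_commands(interface):
        print(command)
```

Generating the lines per interface avoids the copy-paste slips that are easy to make when the interface number has to be edited by hand in each command.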

     

    6. Set the port type to access (untagged). Set PVID to 1. This is also the default.

    switch (config) # interface ethernet 1/1 switchport mode access

    switch (config) # interface ethernet 1/1 switchport access vlan 1

     

    Note: If the switch acts as a router on that interface, a router port with no VLAN is also an option. Use the no switchport command to set the interface as a router port.

     

    7. Make sure the DSCP priority is properly mapped to the egress PCP. In this example, ingress DSCP 32 is mapped to egress PCP 4 (this is also the default).

    This is important when untagged traffic on interface 1/1 is switched to tagged traffic on interface 1/2. Make sure that the L2 PCP priority is kept on the egress.

     

    switch (config) # interface ethernet 1/1 qos rewrite map switch-priority 4 pcp 4 dei 0

     

     

     

    Verify Configuration

    1. Use the show buffers option to get details of the Ethernet 1/1 interface.

    switch (config) # show buffers details interfaces ethernet 1/1

    Flags: Y - Lossy, L - Lossless

           S - Static, D - Dynamic

    Shared size is in Bytes for static pool and in alphas for dynamic pool.

     

    Interface: Eth1/1

     

      Buffer        Resv    Xoff    Xon     Shared  Pool       Description

                    [Byte]  [Byte]  [Byte]  [%/a]             

      ------        ------  ------  ------  ------  ----       -----------

      iPort(Y)      0       -       -       inf     iPool0(D) 

      iPort(Y)      0       -       -       inf     iPool1(D) 

      iPort(Y)      0       -       -       inf     iPool2(D) 

      iPort(Y)      0       -       -       inf     iPool3(D) 

      iPort.pg0(Y)  20.1K   -       -       8       iPool0(D) 

      iPort.pg1(Y)  0       -       -       0       iPool0(D) 

      iPort.pg2(Y)  0       -       -       0       iPool0(D) 

      iPort.pg3(Y)  0       -       -       0       iPool0(D) 

      iPort.pg4(L)  70.0K   17.1K   17.1K   2       iPool1(D) 

      iPort.pg5(Y)  0       -       -       0       iPool0(D) 

      iPort.pg6(Y)  0       -       -       0       iPool0(D) 

      iPort.pg7(Y)  0       -       -       0       iPool0(D) 

      iPort.pg9(Y)  0       -       -       inf     iPool0(D)  Control

      ePort         0       -       -       inf     ePool0(D) 

      ePort         0       -       -       inf     ePool1(D) 

      ePort         0       -       -       inf     ePool2(D) 

      ePort         0       -       -       inf     ePool3(D) 

      ePort.tc0     1.5K    -       -       2       ePool0(D) 

      ePort.tc1     1.5K    -       -       2       ePool0(D) 

      ePort.tc2     1.5K    -       -       2       ePool0(D) 

      ePort.tc3     1.5K    -       -       2       ePool0(D) 

      ePort.tc4     1.5K    -       -       inf     ePool1(D) 

      ePort.tc5     1.5K    -       -       2       ePool0(D) 

      ePort.tc6     1.5K    -       -       2       ePool0(D) 

      ePort.tc7     1.5K    -       -       2       ePool0(D) 

      ePort.tc16    1.5K    -       -       inf     ePool0(D)  Control

     

      Switch-priority  Buffer

      ---------------  ------

      0                iPort.pg0

      1                iPort.pg0

      2                iPort.pg0

      3                iPort.pg0

      4                iPort.pg4

      5                iPort.pg0

      6                iPort.pg0

      7                iPort.pg0

     

    switch-47b2e0 [standalone: master] (config) #
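
When checking many ports, the table above can be scanned programmatically. The following Python sketch extracts the lossless ingress priority groups from captured show buffers text. It assumes the column layout shown in this post; the function name and the way the text is captured are hypothetical.

```python
import re

# Matches ingress priority-group rows such as:
#   iPort.pg4(L)  70.0K   17.1K   17.1K   2       iPool1(D)
ROW = re.compile(
    r"^\s*(?P<name>iPort\.pg\d+)\((?P<flag>[YL])\)\s+(?P<resv>\S+)\s+"
    r"(?P<xoff>\S+)\s+(?P<xon>\S+)\s+(?P<shared>\S+)\s+(?P<pool>\w+)\("
)


def lossless_buffers(output: str) -> dict:
    """Map each lossless priority group to its pool and thresholds."""
    found = {}
    for line in output.splitlines():
        match = ROW.match(line)
        if match and match.group("flag") == "L":
            found[match.group("name")] = {
                "pool": match.group("pool"),
                "reserved": match.group("resv"),
                "xoff": match.group("xoff"),
                "xon": match.group("xon"),
            }
    return found


sample = """
  iPort(Y)      0       -       -       inf     iPool0(D)
  iPort.pg0(Y)  20.1K   -       -       8       iPool0(D)
  iPort.pg4(L)  70.0K   17.1K   17.1K   2       iPool1(D)
  ePort.tc4     1.5K    -       -       inf     ePool1(D)
"""
print(lossless_buffers(sample))
```

For the configuration in this post, the only entry expected is iPort.pg4 in iPool1.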

     

    2. Show Quality of Service (QoS) on interface 1/1.

    • Check that Trust mode is L3.
    • Check that DSCP priority 32 is mapped to switch-priority 4.
    • Verify that switch priority 4 is mapped to L2 PCP 4.

    switch (config) # show qos interface ethernet 1/1

    Eth1/1

    Trust mode: L3

    Default switch-priority: 0

    Default PCP: 0

    Default DEI: 0

    PCP,DEI rewrite: disabled

    IP PCP,DEI rewrite: preserve (router is disabled)

    DSCP rewrite: disabled

     

    ...

     

    DSCP to switch-priority mapping:

    DSCP  switch-priority

    ----  ---------------

     

    ...

    31    3

    32    4

    33    4

      ...

     

    PCP,DEI rewrite mapping (switch-priority to PCP,DEI):

    switch-priority  PCP,DEI

    ---------------  -------

    0                0,0

    1                1,0

    2                2,0

    3                3,0

    4                4,0

    5                5,0

    6                6,0

    7                7,0

     

    ...

     

    3. Check the PFC configuration.

    switch (config) # show dcb priority-flow-control interface ethernet 1/1

     

    PFC enabled

    Priority Enabled List    :4

    Priority Disabled List   :0 1 2 3 5 6 7

     

    Interface      PFC admin        PFC oper

    ------------   --------------   -------------

    Eth1/1           On               Enabled
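
The PFC output can be checked the same way. The following Python sketch parses captured show dcb priority-flow-control text and reports the enabled priorities and the per-interface state. It assumes the layout shown above; the function name is hypothetical.

```python
def parse_pfc(output: str) -> dict:
    """Extract enabled priorities and per-interface PFC admin/oper state."""
    result = {"enabled": [], "interfaces": {}}
    for line in output.splitlines():
        text = line.strip()
        if text.startswith("Priority Enabled List"):
            # Text after the colon is a space-separated list of priorities.
            result["enabled"] = [int(p) for p in text.split(":", 1)[1].split()]
        elif text.startswith("Eth"):
            name, admin, oper = text.split()
            result["interfaces"][name] = (admin, oper)
    return result


sample = """
PFC enabled
Priority Enabled List    :4
Priority Disabled List   :0 1 2 3 5 6 7

Interface      PFC admin        PFC oper
------------   --------------   -------------
Eth1/1           On               Enabled
"""
state = parse_pfc(sample)
print(state)
```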

     

    4. Show the VLAN configuration.

    switch (config) # show interfaces switchport

    Interface       Mode         Access vlan        Allowed vlans

    ---------------------------------------------------------------------------------

    Eth1/1          access       1                 

    Eth1/2          access       1                 

    Eth1/3          access       1                 

    ...

     

    Switch to Host Configuration

    1. Configure PFC on the link to the host (interface 1/2).

    switch (config) # interface ethernet 1/2 dcb priority-flow-control mode on force

     

    2. Specify the required trust mode. For example, use trust L2 (the default).

    switch (config) # interface ethernet 1/2 qos trust L2

     

    3. Map L2 priority 4 to switch-priority 4 and L2 priority 0 to switch-priority 0 on the interface:

    switch (config)# interface ethernet 1/2 qos map pcp 4 dei 0 to switch-priority 4

    switch (config)# interface ethernet 1/2 qos map pcp 0 dei 0 to switch-priority 0

     

    4. Change the buffering allocation as described below:

    • Configure Pool0 and Pool1 (already done in the previous section).
    • Set the ingress and egress buffer reserved sizes and thresholds.
    • Set mapping between priority-group to switch-priority.

    switch (config)# interface ethernet 1/2 ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8

    switch (config)# interface ethernet 1/2 ingress-buffer iPort.pg4 map pool iPool1 type lossless reserved 70K xoff 17K xon 17K shared alpha 2

    switch (config)# interface ethernet 1/2 egress-buffer ePort.tc0 map pool ePool0 reserved 1500 shared alpha 2

    switch (config)# interface ethernet 1/2 egress-buffer ePort.tc4 map pool ePool1 reserved 1500 shared alpha inf

    switch (config)# interface ethernet 1/2 ingress-buffer iPort.pg4 bind switch-priority 4

    switch (config)# interface ethernet 1/2 ingress-buffer iPort.pg0 bind switch-priority 0

     

    5. Set the port type to trunk.

    switch (config) # interface ethernet 1/2 switchport mode trunk

    switch (config) # interface ethernet 1/2 switchport trunk allowed-vlan all

     

    6. Make sure the DSCP priority is properly mapped to the egress PCP. In this example, ingress DSCP 32 is mapped to egress PCP 4 (this is also the default).

    When untagged traffic on interface 1/1 is switched to tagged traffic on interface 1/2, we need to make sure that the L2 PCP priority is kept on the egress.

    switch (config) # interface ethernet 1/2 qos rewrite map switch-priority 4 pcp 4 dei 0

     

    7. To create synthetic congestion, reduce the speed of one port (for example, the port from Switch A to Host A) to 40G.

    Run the following only on Switch A, on the interface to Host A (1/2).

    switch (config) # interface ethernet 1/2 speed 40G force

    Verify Configuration

    1. Use the show buffers option to get details about interface Ethernet 1/2.

    switch (config) # show buffers details interfaces ethernet 1/2

    Flags: Y - Lossy, L - Lossless

           S - Static, D - Dynamic

    Shared size is in Bytes for static pool and in alphas for dynamic pool.

     

    Interface: Eth1/2

     

      Buffer        Resv    Xoff    Xon     Shared  Pool       Description

                    [Byte]  [Byte]  [Byte]  [%/a]             

      ------        ------  ------  ------  ------  ----       -----------

      iPort(Y)      0       -       -       inf     iPool0(D) 

      iPort(Y)      0       -       -       inf     iPool1(D) 

      iPort(Y)      0       -       -       inf     iPool2(D) 

      iPort(Y)      0       -       -       inf     iPool3(D) 

      iPort.pg0(Y)  20.1K   -       -       8       iPool0(D) 

      iPort.pg1(Y)  0       -       -       0       iPool0(D) 

      iPort.pg2(Y)  0       -       -       0       iPool0(D) 

      iPort.pg3(Y)  0       -       -       0       iPool0(D) 

      iPort.pg4(L)  70.0K   17.1K   17.1K   2       iPool1(D) 

      iPort.pg5(Y)  0       -       -       0       iPool0(D) 

      iPort.pg6(Y)  0       -       -       0       iPool0(D) 

      iPort.pg7(Y)  0       -       -       0       iPool0(D) 

      iPort.pg9(Y)  0       -       -       inf     iPool0(D)  Control

      ePort         0       -       -       inf     ePool0(D) 

      ePort         0       -       -       inf     ePool1(D) 

      ePort         0       -       -       inf     ePool2(D) 

      ePort         0       -       -       inf     ePool3(D) 

      ePort.tc0     1.5K    -       -       2       ePool0(D) 

      ePort.tc1     1.5K    -       -       2       ePool0(D) 

      ePort.tc2     1.5K    -       -       2       ePool0(D) 

      ePort.tc3     1.5K    -       -       2       ePool0(D) 

      ePort.tc4     1.5K    -       -       inf     ePool1(D) 

      ePort.tc5     1.5K    -       -       2       ePool0(D) 

      ePort.tc6     1.5K    -       -       2       ePool0(D) 

      ePort.tc7     1.5K    -       -       2       ePool0(D) 

      ePort.tc16    1.5K    -       -       inf     ePool0(D)  Control

     

      Switch-priority  Buffer

      ---------------  ------

      0                iPort.pg0

      1                iPort.pg0

      2                iPort.pg0

      3                iPort.pg0

      4                iPort.pg4

      5                iPort.pg0

      6                iPort.pg0

      7                iPort.pg0

     

    switch-47b2e0 [standalone: master] (config) #

     

    2. Show the QoS on interface 1/2.

    switch (config) # show qos interface ethernet 1/2

    Eth1/2

    Trust mode: L2

    Default switch-priority: 0

    Default PCP: 0

    Default DEI: 0

    PCP,DEI rewrite: disabled

    IP PCP,DEI rewrite: preserve (router is disabled)

    DSCP rewrite: disabled

     

     

    PCP,DEI to switch-priority mapping:

    PCP,DEI  switch-priority

    -------  ---------------

    ...

    3,0      3

    4,0      4

    5,0      5

     

    ...

     

    PCP,DEI rewrite mapping (switch-priority to PCP,DEI):

    switch-priority  PCP,DEI

    ---------------  -------

    0                0,0

    1                1,0

    2                2,0

    3                3,0

    4                4,0

    5                5,0

    6                6,0

    7                7,0

    ...

     

    3. Show the PFC configuration.

    switch (config) # show dcb priority-flow-control

     

    PFC enabled

    Priority Enabled List    :4

    Priority Disabled List   :0 1 2 3 5 6 7

     

    Interface      PFC admin        PFC oper

    ------------   --------------   -------------

    Eth1/1           On               Enabled

    Eth1/2           On               Enabled

    Eth1/3           Disabled         Disabled

    ...

     

    4. Check the interface speed.

    switch (config) # show interfaces ethernet 1/2

     

    Eth1/2

      Admin state: Enabled

      Operational state: Down

      Last change in operational status: Never

      Description: N\A

      Mac address: 7c:fe:90:fb:82:7d  

      MTU: 1500 bytes(Maximum packet size 1522 bytes)

      Fec: auto

      Flow-control: receive off send off

      Actual speed: 40 Gbps            

      Width reduction mode: Not supported

      Switchport mode: access

      MAC learning mode: Enabled

      Last clearing of "show interface" counters : Never              

      60 seconds ingress rate: 0 bits/sec, 0 bytes/sec, 0 packets/sec

      60 seconds egress rate: 0 bits/sec, 0 bytes/sec, 0 packets/sec

    ...

     

    Switch A Configuration Example

    Review the following configuration for Switch A, adjust it if needed, and apply the settings to Switch A using the CLI.

    dcb priority-flow-control enable force

    dcb priority-flow-control priority 4 enable

    pool ePool0 direction egress-mc size 4194304 type dynamic

    pool ePool1 direction egress size 16000000 type dynamic

    pool iPool0 direction ingress size 4194304 type dynamic

    pool iPool1 direction ingress size 4194304 type dynamic

     

    interface ethernet 1/1 dcb priority-flow-control mode on force

    interface ethernet 1/1 qos trust L3

    interface ethernet 1/1 qos map dscp 32 to switch-priority 4

    interface ethernet 1/1 qos map dscp 0 to switch-priority 0

    interface ethernet 1/1 ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8

    interface ethernet 1/1 ingress-buffer iPort.pg4 map pool iPool1 type lossless reserved 70K xoff 17K xon 17K shared alpha 2

    interface ethernet 1/1 egress-buffer ePort.tc4 map pool ePool1 reserved 1500 shared alpha inf

    interface ethernet 1/1 egress-buffer ePort.tc0 map pool ePool0 reserved 1500 shared alpha 2

    interface ethernet 1/1 ingress-buffer iPort.pg4 bind switch-priority 4

    interface ethernet 1/1 ingress-buffer iPort.pg0 bind switch-priority 0

    interface ethernet 1/1 switchport mode access

    interface ethernet 1/1 switchport access vlan 1

    interface ethernet 1/1 qos rewrite map switch-priority 4 pcp 4 dei 0

     

    interface ethernet 1/2 dcb priority-flow-control mode on force

    interface ethernet 1/2 qos trust L2

    interface ethernet 1/2 qos map pcp 4 dei 0 to switch-priority 4

    interface ethernet 1/2 qos map pcp 0 dei 0 to switch-priority 0

    interface ethernet 1/2 ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8

    interface ethernet 1/2 ingress-buffer iPort.pg4 map pool iPool1 type lossless reserved 70K xoff 17K xon 17K shared alpha 2

    interface ethernet 1/2 egress-buffer ePort.tc4 map pool ePool1 reserved 1500 shared alpha inf

    interface ethernet 1/2 egress-buffer ePort.tc0 map pool ePool0 reserved 1500 shared alpha 2

    interface ethernet 1/2 ingress-buffer iPort.pg4 bind switch-priority 4

    interface ethernet 1/2 ingress-buffer iPort.pg0 bind switch-priority 0

    interface ethernet 1/2 switchport mode trunk

    interface ethernet 1/2 switchport trunk allowed-vlan all

    interface ethernet 1/2 qos rewrite map switch-priority 4 pcp 4 dei 0

    interface ethernet 1/2 speed 40G
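
Before applying a listing like this, a quick sanity check can catch lines that reference the wrong interface or an undefined pool. The following Python sketch is a hypothetical offline linter, not a Mellanox tool; it only understands the command forms used in this post and checks two things: that every pool referenced by an ingress buffer is defined, and that every interface with PFC turned on has a lossless ingress buffer mapped.

```python
import re


def check_config(config: str) -> list:
    """Return a list of human-readable problems found in a config listing."""
    pools = set(re.findall(r"^\s*pool (\S+) direction", config, re.M))
    pfc_ports = set(re.findall(
        r"^\s*interface ethernet (\S+) dcb priority-flow-control mode on",
        config, re.M))
    lossless_ports = set()
    problems = []
    for port, pool, kind in re.findall(
            r"^\s*interface ethernet (\S+) ingress-buffer \S+ map pool (\S+) type (\S+)",
            config, re.M):
        if pool not in pools:
            problems.append(f"{port}: pool {pool} is not defined")
        if kind == "lossless":
            lossless_ports.add(port)
    for port in sorted(pfc_ports - lossless_ports):
        problems.append(f"{port}: PFC is on but no lossless ingress buffer is mapped")
    return problems


# Illustrative input: port 1/2 has PFC on but only a lossy buffer.
sample = """
pool iPool0 direction ingress size 4194304 type dynamic
pool iPool1 direction ingress size 4194304 type dynamic
interface ethernet 1/1 dcb priority-flow-control mode on force
interface ethernet 1/1 ingress-buffer iPort.pg4 map pool iPool1 type lossless reserved 70K xoff 17K xon 17K shared alpha 2
interface ethernet 1/2 dcb priority-flow-control mode on force
interface ethernet 1/2 ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8
"""
for problem in check_config(sample):
    print(problem)
```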

     

    Host Configuration

    1. Enable PFC on the host on VLAN 1 on priority 4.

    Refer to:

     

    2. Run RDMA or TCP traffic on VLAN 1 from Host A to Host B (100G > 40G). For example, use ib_send_bw. The traffic should carry L2 priority 4 and DSCP 32.

     

    3. Verify that there are no drops over the setup and check port counters. Note that the traffic rate should be close to 40G.
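
For TCP test traffic, DSCP 32 corresponds to a ToS/DS byte of 128, because DSCP occupies the upper six bits of that byte. The following Python sketch shows the conversion and how a socket could be marked using the standard IP_TOS socket option. The function names are hypothetical, and the sketch covers only the DSCP marking; the L2 priority is configured separately on the host and is not shown here.

```python
import socket


def dscp_to_tos(dscp: int) -> int:
    """Place a 6-bit DSCP value in the upper bits of the ToS/DS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit field (0-63)")
    return dscp << 2  # the lower 2 bits are left as 0 (no ECN marking)


def open_marked_socket(dscp: int) -> socket.socket:
    """Create a TCP socket whose outgoing IPv4 packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


print(dscp_to_tos(32))  # prints 128
```

A sender would call open_marked_socket(32) and then connect to Host B as usual.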