How to Enable PFC on Mellanox Switches (Spectrum)


    This post shows how to configure Priority Flow Control (PFC) on Mellanox Spectrum-based switches.

     


     

    Overview

    This post demonstrates how to enable Priority Flow Control (PFC) on priority 4 on two ports (1/1 and 1/2) of a Mellanox Spectrum SN2700 switch.

     

    Configuration

    1. Create a VLAN and set the switch ports to trunk mode by running:

    switch (config) # vlan 100

    switch (config vlan 100) # exit

    switch (config) # interface ethernet 1/1 switchport mode trunk

    switch (config) # interface ethernet 1/2 switchport mode trunk

     

    2. Make sure that global flow control (IEEE 802.3x pause) is disabled on the ports (it is disabled by default) by running:

    switch (config) # interface ethernet 1/1-1/2 flowcontrol send off force

    switch (config) # interface ethernet 1/1-1/2 flowcontrol receive off force

     

    3. Enable PFC globally, on the desired priority, and on the ports by running:

    switch (config) # dcb priority-flow-control enable

    This action might cause traffic loss while shutting down a port with priority-flow-control mode on

    Type 'yes' to confirm  enable pfc globally: yes

    switch (config) # dcb priority-flow-control priority 4 enable
    switch (config) # interface ethernet 1/1 dcb priority-flow-control mode on force

    switch (config) # interface ethernet 1/2 dcb priority-flow-control mode on force
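    When steps 1-3 need to be repeated on more ports, the commands can be generated instead of typed. The following is a minimal sketch, not an NVIDIA/Mellanox tool. pfc_base_commands is a hypothetical Python helper, and its command strings are copied from the steps above. The sketch assumes that issuing the flow-control commands once per port is equivalent to the 1/1-1/2 range form used in step 2. The global enable command still asks for the 'yes' confirmation when the output is pasted into the CLI.

```python
# Hypothetical helper (not an NVIDIA/Mellanox tool): builds the MLNX-OS
# commands from steps 1-3 for a list of ports. Command text is copied from
# this post; only the port numbers, VLAN and priority are substituted.

def pfc_base_commands(ports, vlan=100, priority=4):
    """Return the step 1-3 commands, in order, for the given ports."""
    cmds = [f"vlan {vlan}", "exit"]  # step 1: create the VLAN
    for p in ports:
        cmds.append(f"interface ethernet {p} switchport mode trunk")
    for p in ports:  # step 2: keep 802.3x global pause disabled
        cmds.append(f"interface ethernet {p} flowcontrol send off force")
        cmds.append(f"interface ethernet {p} flowcontrol receive off force")
    # step 3: enable PFC globally (the CLI asks for 'yes'), on the priority,
    # and on each port
    cmds.append("dcb priority-flow-control enable")
    cmds.append(f"dcb priority-flow-control priority {priority} enable")
    for p in ports:
        cmds.append(f"interface ethernet {p} dcb priority-flow-control mode on force")
    return cmds


if __name__ == "__main__":
    for cmd in pfc_base_commands(["1/1", "1/2"]):
        print(cmd)
```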

     

    4. Change the buffer configuration as described below. The commands are shown for port 1/1; apply the same configuration to port 1/2.

     

    The changes relative to the default configuration are summarized as follows:

      • Ingress
        • The reserved size of the default Port Group (PG) buffer (PG0), used for lossy traffic, is reduced.
        • The freed space is allocated to a new PG buffer (PG4) for lossless traffic.
      • Egress
        • TC4 shared alpha is configured as infinite, as recommended for TCs with lossless traffic.
      • Switch priority 4 is mapped to PG buffer 4 (iPort.pg4).

     

    a. Reduce the size of the default PG (PG0)

    switch (config)# interface ethernet 1/1 ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8


    b. Set the lossless ingress buffer PG4 and the lossless egress TC4

    switch (config)# interface ethernet 1/1 ingress-buffer iPort.pg4 map pool iPool0 type lossless reserved 70K xoff 17K xon 17K shared alpha 2
    switch (config)# interface ethernet 1/1 egress-buffer ePort.tc4 map pool ePool0 reserved 1500 shared alpha inf

     

    c. Map switch priority 4 to the lossless ingress PG4 buffer

    switch (config)# interface ethernet 1/1 ingress-buffer iPort.pg4 bind switch-priority 4
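    Steps 4a-4c are per-port commands, so one way to keep both ports identical is to generate them. The following is a minimal sketch. buffer_commands is a hypothetical Python helper, and it substitutes only the port number into the exact commands shown above. The sizes (20K, 70K, 17K, 1500) are this post's example values, not general recommendations, and PG4/TC4/priority 4 are fixed as in this post.

```python
# Hypothetical helper: reproduces the step 4a-4c buffer commands for one
# port. The command text and sizes are copied verbatim from this post.

def buffer_commands(port: str) -> list[str]:
    """Return the ingress/egress buffer commands for a port such as '1/2'."""
    intf = f"interface ethernet {port}"
    return [
        # a. shrink the reserved size of the default lossy PG0
        f"{intf} ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8",
        # b. lossless ingress PG4 and lossless egress TC4
        f"{intf} ingress-buffer iPort.pg4 map pool iPool0 type lossless reserved 70K xoff 17K xon 17K shared alpha 2",
        f"{intf} egress-buffer ePort.tc4 map pool ePool0 reserved 1500 shared alpha inf",
        # c. bind switch priority 4 to the lossless PG4
        f"{intf} ingress-buffer iPort.pg4 bind switch-priority 4",
    ]


if __name__ == "__main__":
    for port in ("1/1", "1/2"):
        for cmd in buffer_commands(port):
            print(f"switch (config) # {cmd}")
```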

     

    Verification

    1. Verify that the PFC configuration is correct by running:

    switch (config)# show dcb priority-flow-control

    PFC enabled

    Priority Enabled List   :4

    Priority Disabled List  0 1 2 3 5 6 7

     

    Interface      PFC admin        PFC oper

    ------------  --------------   -------------

    1/1            On               Enabled

    1/2            On               Enabled

    switch (config) #
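    This check can also be scripted against captured CLI text, for example output saved from an SSH session. The following is a minimal sketch. parse_pfc and pfc_ok are hypothetical helpers, and SAMPLE is copied from the output above. The parsing relies only on the layout shown in this post, so other MLNX-OS versions may need adjustments.

```python
import re

# Copied from the 'show dcb priority-flow-control' output above.
SAMPLE = """\
PFC enabled
Priority Enabled List   :4
Priority Disabled List  0 1 2 3 5 6 7

Interface      PFC admin        PFC oper
------------  --------------   -------------
1/1            On               Enabled
1/2            On               Enabled
"""


def parse_pfc(output: str) -> tuple[set[int], dict[str, tuple[str, str]]]:
    """Return (enabled priorities, {port: (admin, oper)})."""
    enabled: set[int] = set()
    ports: dict[str, tuple[str, str]] = {}
    for line in output.splitlines():
        line = line.strip()
        m = re.match(r"Priority Enabled List\s*:?\s*(.*)$", line)
        if m:
            enabled = {int(p) for p in re.findall(r"\d+", m.group(1))}
            continue
        m = re.match(r"(\d+/\d+)\s+(\S+)\s+(\S+)$", line)  # e.g. '1/1  On  Enabled'
        if m:
            ports[m.group(1)] = (m.group(2), m.group(3))
    return enabled, ports


def pfc_ok(output: str, priority: int = 4, expected_ports=("1/1", "1/2")) -> bool:
    """True if the priority is PFC-enabled and every port is On/Enabled."""
    enabled, ports = parse_pfc(output)
    return priority in enabled and all(
        ports.get(p) == ("On", "Enabled") for p in expected_ports
    )


if __name__ == "__main__":
    print(pfc_ok(SAMPLE))
```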

     

    2. Check the buffer configuration per port. Make sure that ingress port group 4 (iPort.pg4) is configured as expected (flag L, lossless) and that switch priority 4 is mapped to this group:

    switch (config) # show buffers details interfaces ethernet 1/1

    Flags: Y - Lossy, L - Lossless

           S - Static, D - Dynamic

    Shared size is in Bytes for static pool and in alphas for dynamic pool.

     

    Interface: Eth1/1

     

      Buffer        Resv    Xoff    Xon     Shared  Pool       Description

                    [Byte]  [Byte]  [Byte]  [%/a]

      ------        ------  ------  ------  ------  ----       -----------

      iPort(Y)      0       -       -       inf     iPool0(D)

      iPort(Y)      0       -       -       0       iPool1(D)

      iPort(Y)      0       -       -       0       iPool2(D)

      iPort(Y)      0       -       -       0       iPool3(D)

      iPort.pg0(Y)  20.0K   -       -       inf     iPool0(D)   Data      --> Lossy

      iPort.pg1(Y)  0       -       -       0       iPool0(D)

      iPort.pg2(Y)  0       -       -       0       iPool0(D)

      iPort.pg3(Y)  0       -       -       0       iPool0(D)

      iPort.pg4(L)  70K     16.7K   16.7K   2       iPool0(D)            --> Lossless

      iPort.pg5(Y)  0       -       -       0       iPool0(D)

      iPort.pg6(Y)  0       -       -       0       iPool0(D)

      iPort.pg7(Y)  0       -       -       0       iPool0(D)

      iPort.pg9(Y)  20.0K   -       -       inf     iPool0(D)   Control

      ePort         0       -       -       inf     ePool0(D)

      ePort         0       -       -       inf     ePool1(D)

      ePort         0       -       -       inf     ePool2(D)

      ePort         0       -       -       inf     ePool3(D)

      ePort.tc0     1.5K    -       -       2       ePool0(D)

      ePort.tc1     1.5K    -       -       2       ePool0(D)

      ePort.tc2     1.5K    -       -       2       ePool0(D)

      ePort.tc3     1.5K    -       -       2       ePool0(D)

      ePort.tc4     1.5K    -       -       inf     ePool0(D)             --> Lossless

      ePort.tc5     1.5K    -       -       2       ePool0(D)

      ePort.tc6     1.5K    -       -       2       ePool0(D)

      ePort.tc7     1.5K    -       -       2       ePool0(D)

      ePort.tc16    96      -       -       inf     ePool0(D)  Control

     

      Switch-priority  Buffer

      ---------------  ------

      0                iPort.pg0

      1                iPort.pg0

      2                iPort.pg0

      3                iPort.pg0

      4                iPort.pg4

      5                iPort.pg0

      6                iPort.pg0

      7                iPort.pg0
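    The same buffer checks can be scripted. The following is a minimal sketch over an abridged copy of the output above. parse_buffers and lossless_ok are hypothetical helpers. The sketch reads only the buffer name/flag, the Shared column of the ePort.tc rows, and the Switch-priority table. It also assumes, as in this post, that the lossless traffic class has the same number as the switch priority (TC4 for priority 4).

```python
import re

# Abridged from the 'show buffers details' output above.
SAMPLE = """\
  iPort.pg0(Y)  20.0K   -       -       inf     iPool0(D)   Data
  iPort.pg4(L)  70K     16.7K   16.7K   2       iPool0(D)
  ePort.tc4     1.5K    -       -       inf     ePool0(D)
  Switch-priority  Buffer
  ---------------  ------
  0                iPort.pg0
  3                iPort.pg0
  4                iPort.pg4
  5                iPort.pg0
"""


def parse_buffers(output: str):
    """Return (PG flags, switch-priority -> PG mapping, egress TC shared alpha)."""
    flags: dict[str, str] = {}         # e.g. {'iPort.pg4': 'L'}
    mapping: dict[int, str] = {}       # e.g. {4: 'iPort.pg4'}
    egress_alpha: dict[str, str] = {}  # e.g. {'ePort.tc4': 'inf'}
    for line in output.splitlines():
        parts = line.split()
        if not parts:
            continue
        m = re.fullmatch(r"(iPort\.pg\d+)\(([YL])\)", parts[0])
        if m:
            flags[m.group(1)] = m.group(2)
        elif re.fullmatch(r"ePort\.tc\d+", parts[0]) and len(parts) >= 5:
            egress_alpha[parts[0]] = parts[4]  # 5th column is Shared
        elif len(parts) == 2 and parts[0].isdigit() and parts[1].startswith("iPort.pg"):
            mapping[int(parts[0])] = parts[1]
    return flags, mapping, egress_alpha


def lossless_ok(output: str, priority: int = 4) -> bool:
    """True if the priority maps to a lossless PG and its TC has alpha inf."""
    flags, mapping, egress_alpha = parse_buffers(output)
    pg = mapping.get(priority)
    tc = f"ePort.tc{priority}"  # assumption: TC number equals priority
    return pg is not None and flags.get(pg) == "L" and egress_alpha.get(tc) == "inf"


if __name__ == "__main__":
    print(lossless_ok(SAMPLE))
```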

     

    3. Run traffic between the servers on this priority; for example, use ib_write_bw to send RoCE traffic.

    Check the usage level (Usage column) of ingress port group 4 (iPort.pg4); in this example, it is 4.1K.

     

    4. Check port priority counters.

    Make sure that pause packets are being sent or received on the proper priority. Typically, pause packets are sent from the receiver side, through the switch, back to the sender side.

    In this example, Server 1 is the sender and is connected to switch port 1/1, while Server 2 is the receiver and is connected to switch port 1/2.

    You can see that Server 2 is sending pause frames to the switch on port 1/2, and that the switch in turn sends pause frames to Server 1 via port 1/1. PFC operates hop by hop, so the switch generates these pauses itself when its ingress buffer for the priority fills, rather than forwarding Server 2's frames. This is why the two counters differ.

    switch (config) # show interfaces ethernet 1/1 counters priority 4

     

    Rx

      0                    packets

      0                    unicast packets

      0                    multicast packets

      0                    broadcast packets

      2450879446344        bytes

      0                    pause packets

      0                    pause duration milliseconds

     

    Tx

      0                    packets

      0                    unicast packets

      0                    multicast packets

      0                    broadcast packets

      4426111528           bytes

      19907096             pause packets  <--- Expect pauses to be sent to the sender (Server 1); the switch generates them in response to the pauses received on port 1/2

     

    switch (config) # show interfaces ethernet 1/2 counters priority 4

     

    Rx

      0                    packets

      0                    unicast packets

      0                    multicast packets

      0                    broadcast packets

      4426111528           bytes

      14454050             pause packets  <--- Expect pauses to be received from the receiver server (Server 2)

      7817                 pause duration milliseconds

     

    Tx

      0                    packets

      0                    unicast packets

      0                    multicast packets

      0                    broadcast packets

      2450879446344        bytes

      0                    pause packets
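    The pause check in this step compares two counters on two ports, which a short script can do on captured output. The following is a minimal sketch. parse_counters and pauses_seen are hypothetical helpers, and PORT_1_1 and PORT_1_2 are abridged copies of the two outputs above. The helper treats "Tx pause packets" on the sender's port and "Rx pause packets" on the receiver's port as the signals to look for, following the explanation above.

```python
import re

# Abridged copies of the two outputs above (annotations kept).
PORT_1_1 = """\
Rx
  0                    packets
  0                    pause packets
Tx
  4426111528           bytes
  19907096             pause packets  <--- Expect pauses to be sent to the sender
"""

PORT_1_2 = """\
Rx
  4426111528           bytes
  14454050             pause packets  <--- Expect pauses to be received from the receiver
  7817                 pause duration milliseconds
Tx
  0                    pause packets
"""


def parse_counters(output: str) -> dict[str, dict[str, int]]:
    """Map 'Rx'/'Tx' to {counter name: value} for CLI counter output."""
    counters: dict[str, dict[str, int]] = {}
    section = None
    for raw in output.splitlines():
        line = raw.split("<---")[0].strip()  # drop trailing annotations
        if line in ("Rx", "Tx"):
            section = line
            counters[section] = {}
        elif section:
            m = re.match(r"(\d+)\s+(.+)$", line)
            if m:
                counters[section][m.group(2)] = int(m.group(1))
    return counters


def pauses_seen(sender_port_out: str, receiver_port_out: str) -> bool:
    """True if pauses went switch -> sender and receiver -> switch."""
    tx_to_sender = parse_counters(sender_port_out).get("Tx", {}).get("pause packets", 0)
    rx_from_receiver = parse_counters(receiver_port_out).get("Rx", {}).get("pause packets", 0)
    return tx_to_sender > 0 and rx_from_receiver > 0


if __name__ == "__main__":
    print(pauses_seen(PORT_1_1, PORT_1_2))
```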

     

    Note: For MLNX-OS 3.5.1002, not all port priority counters are supported.

     

    5. Check the port counters and verify that there are no drops (Rx discard packets).

    switch (config) # show interfaces ethernet 1/1

     

    Eth1/1

      Admin state: Enabled

      Operational state: Up

      Description: N\A

      Mac address: 7c:fe:90:eb:71:22

      MTU: 9216 bytes(Maximum packet size 9238 bytes)

      Fec: auto

      Flow-control: receive off send off

      Actual speed: 100 Gbps

      Width reduction mode: Not supported

      Switchport mode: trunk

      MAC learning mode: Enabled

      Last clearing of "show interface" counters : Never

      60 seconds ingress rate: 183238616 bits/sec, 22904827 bytes/sec, 279318 packets/sec

      60 seconds egress rate: 99511589040 bits/sec, 12438948630 bytes/sec, 2979390 packets/sec

     

    Rx

      163590266            packets

      163590204            unicast packets

      53                   multicast packets

      9                    broadcast packets

      14341363408          bytes

      0                    error packets

      0                    discard packets

     

    Tx

      1787005034           packets

      1787001201           unicast packets

      3827                 multicast packets

      6                    broadcast packets

      7460729568324        bytes

      0                    discard packets
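    To confirm the result programmatically, the discard counters can be read from the same output. The egress rate can also be compared with the port speed to show that the link was loaded while no drops occurred. The following is a minimal sketch over an abridged copy of the output above. parse_interface, no_drops and egress_utilization are hypothetical helpers. With the values shown, the egress rate is about 99.5% of the 100 Gbps port speed.

```python
import re

# Abridged from the 'show interfaces ethernet 1/1' output above.
SAMPLE = """\
Eth1/1
  Actual speed: 100 Gbps
  60 seconds egress rate: 99511589040 bits/sec, 12438948630 bytes/sec, 2979390 packets/sec
Rx
  163590266            packets
  0                    error packets
  0                    discard packets
Tx
  1787005034           packets
  0                    discard packets
"""


def parse_interface(output: str):
    """Return (speed in Gbps, egress bits/sec, {'Rx': {...}, 'Tx': {...}})."""
    speed = re.search(r"Actual speed:\s*(\d+)\s*Gbps", output)
    egress = re.search(r"egress rate:\s*(\d+)\s*bits/sec", output)
    counters: dict[str, dict[str, int]] = {}
    section = None
    for raw in output.splitlines():
        line = raw.strip()
        if line in ("Rx", "Tx"):
            section = line
            counters[section] = {}
        elif section:  # counter lines only appear after Rx/Tx headers
            m = re.match(r"(\d+)\s+(.+)$", line)
            if m:
                counters[section][m.group(2)] = int(m.group(1))
    return (int(speed.group(1)) if speed else None,
            int(egress.group(1)) if egress else None,
            counters)


def no_drops(counters) -> bool:
    """Require the discard/error counters to be present and zero."""
    rx, tx = counters.get("Rx", {}), counters.get("Tx", {})
    return (rx.get("discard packets") == 0
            and rx.get("error packets") == 0
            and tx.get("discard packets") == 0)


def egress_utilization(speed_gbps: int, egress_bps: int) -> float:
    """Fraction of line rate used by egress traffic."""
    return egress_bps / (speed_gbps * 1e9)


if __name__ == "__main__":
    speed, egress, counters = parse_interface(SAMPLE)
    print(no_drops(counters), f"{egress_utilization(speed, egress):.1%}")
```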