HowTo Configure PFC on ConnectX-4

Version 11

    This post describes the procedure used to configure Priority Flow Control (PFC) on ConnectX-4 (mlx5) drivers.

    This feature requires MLNX_OFED version 3.3, or later.

     

    References

    • MLNX_OFED Users Manual

     

    Overview

    PFC can be enabled on ConnectX-4 drivers in two ways:

    1. By enabling PFC locally, without using the Link Layer Discovery Protocol (LLDP) and the Data Center Bridging Capabilities Exchange Protocol (DCBX).

    2. By letting the switch that the host is connected to configure PFC on the host, using LLDP and DCBX.

     

    Note: The ConnectX-4 (mlx5) configuration process is different from that which is used for ConnectX-3 (mlx4). For ConnectX-3 configuration instructions, refer to HowTo Run RoCE over L2 Enabled with PFC for examples.

     

    To make PFC work, configure the host and the switch as follows:

    1. Enable PFC on one or more priorities in the adapter firmware, using the mlnx_qos tool.

    2. Create a VLAN interface and specify that egress mapping should be used.

    3. Enable PFC on the same priorities on the switch.

     

    Setup

    Establish two servers equipped with ConnectX-4, and connect them to a switch that supports PFC.

     

    Configuration

    For this example, we will enable PFC on priority 4.

     

    Enable PFC on the firmware

    1. Run the mlnx_qos tool with the -h option to list the available configuration options.

     

    Note: This option is available only on MLNX_OFED version 3.3 and later.

     

    In the example, we use flag -f or --pfc.

    # mlnx_qos -h

    Usage: mlnx_qos -i <interface> [options]

    Options:

      --version                 show program's version number and exit

      -h, --help                show this help message and exit

      -f LIST, --pfc=LIST       Set priority flow control for each priority. LIST is comma separated value for each priority starting from 0 to 7. Example: 0,0,0,0,1,1,1,1 enable PFC on TC4-7

      -p LIST, --prio_tc=LIST   maps UPs to TCs. LIST is 8 comma seperated TC numbers. Example: 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1

      -s LIST, --tsa=LIST       Transmission algorithm for each TC. LIST is comma seperated algorithm names for each TC. Possible algorithms: strict, etc. Example: ets,strict,ets sets TC0,TC2 to ETS and TC1 to strict. The rest are unchanged.

      -t LIST, --tcbw=LIST      Set minimal guaranteed %BW for ETS TCs. LIST is comma seperated percents for each TC. Values set to TCs that are not configured to ETS algorithm are ignored, but must be present. Example: if TC0,TC2 are set to ETS, then 10,0,90 will set TC0 to 10% and TC2 to 90%. Percents must sum to 100.

      -r LIST, --ratelimit=LIST Rate limit for TCs (in Gbps). LIST is a comma seperated Gbps limit for each TC. Example: 1,8,8 will limit TC0 to 1Gbps, and TC1,TC2 to 8 Gbps each.

      -i INTF, --interface=INTF Interface name

      -a                        Show all interface's TCs

     

    2. Display the interface PFC configuration as shown below. The output shows that none of the priorities are enabled (they are shown as all zeroes).

    # mlnx_qos -i eth35

    PFC configuration:

      priority    0   1   2   3   4   5   6   7

      enabled     0   0   0   0   0   0   0   0  

     

    tc: 0 ratelimit: unlimited, tsa: vendor

      priority:  1

    tc: 1 ratelimit: unlimited, tsa: vendor

      priority:  0

      skprio: 0

      skprio: 1

      skprio: 2 (tos: 8)

      skprio: 3

      skprio: 4 (tos: 24)

      skprio: 5

      skprio: 6 (tos: 16)

      skprio: 7

      skprio: 8

      skprio: 9

      skprio: 10

      skprio: 11

      skprio: 12

      skprio: 13

      skprio: 14

      skprio: 15

    tc: 2 ratelimit: unlimited, tsa: vendor

      priority:  2

    tc: 3 ratelimit: unlimited, tsa: vendor

      priority:  3

    tc: 4 ratelimit: unlimited, tsa: vendor

      priority:  4

    tc: 5 ratelimit: unlimited, tsa: vendor

      priority:  5

    tc: 6 ratelimit: unlimited, tsa: vendor

      priority:  6

    tc: 7 ratelimit: unlimited, tsa: vendor

      priority:  7

     

    3. Enable PFC on priority 4, as shown in this example.

    # mlnx_qos -i eth35 --pfc 0,0,0,0,1,0,0,0

    PFC configuration:

      priority    0   1   2   3   4   5   6   7

      enabled     0   0   0   0   1   0   0   0  

     

    tc: 0 ratelimit: unlimited, tsa: vendor

      priority:  1

    tc: 1 ratelimit: unlimited, tsa: vendor

      priority:  0

      skprio: 0

      skprio: 1

      skprio: 2 (tos: 8)

      skprio: 3

      skprio: 4 (tos: 24)

      skprio: 5

      skprio: 6 (tos: 16)

      skprio: 7

      skprio: 8

      skprio: 9

      skprio: 10

      skprio: 11

      skprio: 12

      skprio: 13

      skprio: 14

      skprio: 15

    tc: 2 ratelimit: unlimited, tsa: vendor

      priority:  2

    tc: 3 ratelimit: unlimited, tsa: vendor

      priority:  3

    tc: 4 ratelimit: unlimited, tsa: vendor

      priority:  4

    tc: 5 ratelimit: unlimited, tsa: vendor

      priority:  5

    tc: 6 ratelimit: unlimited, tsa: vendor

      priority:  6

    tc: 7 ratelimit: unlimited, tsa: vendor

      priority:  7


     

    Note: This configuration survives a driver restart.
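
    The --pfc argument is one 0/1 flag per priority, so it can be generated and the resulting configuration checked by a script. The following Python sketch is illustrative only: the helper names pfc_list and enabled_priorities are hypothetical, and the parser assumes the "PFC configuration" layout shown in the output above.

```python
def pfc_list(priorities, num_priorities=8):
    """Build the comma-separated LIST for `mlnx_qos --pfc` (one 0/1 flag per priority)."""
    wanted = set(priorities)
    bad = sorted(p for p in wanted if not 0 <= p < num_priorities)
    if bad:
        raise ValueError(f"priorities out of range: {bad}")
    return ",".join("1" if p in wanted else "0" for p in range(num_priorities))


def enabled_priorities(mlnx_qos_output):
    """Return the priorities whose 'enabled' flag is 1.

    Assumes a 'priority' row followed by an 'enabled' row, as printed
    under 'PFC configuration:' in this post.
    """
    prio_row, flag_row = None, None
    for line in mlnx_qos_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # 'priority:' (with a colon) belongs to the per-TC section and is skipped.
        if fields[0] == "priority" and prio_row is None:
            prio_row = [int(x) for x in fields[1:]]
        elif fields[0] == "enabled" and flag_row is None:
            flag_row = [int(x) for x in fields[1:]]
    if prio_row is None or flag_row is None:
        raise ValueError("PFC configuration table not found")
    return [p for p, flag in zip(prio_row, flag_row) if flag == 1]


# Sample text copied from the output in this post (truncated).
sample = """PFC configuration:
  priority    0   1   2   3   4   5   6   7
  enabled     0   0   0   0   1   0   0   0
tc: 0 ratelimit: unlimited, tsa: vendor
  priority:  1
"""

print(pfc_list([4]))               # value to pass to --pfc
print(enabled_priorities(sample))  # priorities reported as enabled
```

    Passing the result of pfc_list([4]) to mlnx_qos --pfc corresponds to the command used in step 3.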

    Create a VLAN Interface and Set Egress Mapping

    Follow the procedure described in HowTo Set Egress Priority VLAN on Linux, or use the vconfig command to set the egress mapping (refer to the examples in HowTo Run RoCE over L2 Enabled with PFC).

     

    Run the tc_wrap.py command to verify that UP (user priority) 4 is mapped to the VLAN.

    Note: At this point, only traffic that passes through the kernel is tagged with priority 4. RDMA over Converged Ethernet (RoCE) traffic is not tagged, because it bypasses the kernel.

    # tc_wrap.py -i eth35

    UP  0

      skprio: 0

      skprio: 1

      skprio: 2 (tos: 8)

      skprio: 3

      skprio: 4 (tos: 24)

      skprio: 5

      skprio: 6 (tos: 16)

      skprio: 7

      skprio: 8

      skprio: 9

      skprio: 10

      skprio: 11

      skprio: 12

      skprio: 13

      skprio: 14

      skprio: 15

    UP  1

    UP  2

    UP  3

    UP  4

      skprio: 0 (vlan 100)

      skprio: 1 (vlan 100)

      skprio: 2 (vlan 100 tos: 8)

      skprio: 3 (vlan 100)

      skprio: 4 (vlan 100 tos: 24)

      skprio: 5 (vlan 100)

      skprio: 6 (vlan 100 tos: 16)

      skprio: 7 (vlan 100)

    UP  5

    UP  6

    UP  7
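
    When vconfig is used, the egress mapping is set one kernel priority (skprio) at a time. The Python sketch below only prints the commands that would map skprio 0 to 7 to VLAN priority 4. The helper name egress_map_commands is hypothetical, and the VLAN interface name eth35.100 is an assumption based on the VLAN 100 shown in the output above.

```python
def egress_map_commands(vlan_if, vlan_prio, skprios=range(8)):
    """Return one `vconfig set_egress_map` command per kernel priority (skprio)."""
    if not 0 <= vlan_prio <= 7:
        raise ValueError("VLAN priority must be in the range 0..7")
    return [f"vconfig set_egress_map {vlan_if} {sk} {vlan_prio}" for sk in skprios]


# eth35.100 is an assumed VLAN interface name (VLAN 100 on eth35).
commands = egress_map_commands("eth35.100", 4)
for cmd in commands:
    print(cmd)
```

    The commands are printed rather than executed, so they can be reviewed before being run as root.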

     

    Set Egress Mapping on Kernel Bypass Traffic (RoCE)

    Use the tc_wrap.py command to set the required priority. In this example, all 16 skprio (kernel priority) values are mapped to L2 priority 4.

     

    # tc_wrap.py -i eth35 -u 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4

    UP  0

    UP  1

    UP  2

    UP  3

    UP  4

      skprio: 0

      skprio: 1

      skprio: 2 (tos: 8)

      skprio: 3

      skprio: 4 (tos: 24)

      skprio: 5

      skprio: 6 (tos: 16)

      skprio: 7

      skprio: 8

      skprio: 9

      skprio: 10

      skprio: 11

      skprio: 12

      skprio: 13

      skprio: 14

      skprio: 15

      skprio: 0 (vlan 100)

      skprio: 1 (vlan 100)

      skprio: 2 (vlan 100 tos: 8)

      skprio: 3 (vlan 100)

      skprio: 4 (vlan 100 tos: 24)

      skprio: 5 (vlan 100)

      skprio: 6 (vlan 100 tos: 16)

      skprio: 7 (vlan 100)

    UP  5

    UP  6

    UP  7
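
    The -u argument takes one UP value per skprio (16 values), and the resulting mapping can be checked by parsing the tc_wrap.py output. The following Python sketch is a rough illustration: the helper names up_list and parse_tc_wrap are hypothetical, and the parser assumes the "UP" / "skprio:" layout shown above.

```python
def up_list(up, num_skprio=16):
    """Build the `tc_wrap.py -u` argument that maps every skprio to one UP."""
    if not 0 <= up <= 7:
        raise ValueError("UP must be in the range 0..7")
    return ",".join([str(up)] * num_skprio)


def parse_tc_wrap(output):
    """Map each UP number to the list of skprio lines printed under it."""
    mapping = {}
    current = None
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "UP":
            current = int(fields[1])
            mapping[current] = []
        elif fields and fields[0] == "skprio:" and current is not None:
            mapping[current].append(line.strip())
    return mapping


# Sample text in the same layout as the output in this post (truncated).
sample = """UP  0
UP  4
  skprio: 0
  skprio: 2 (tos: 8)
  skprio: 0 (vlan 100)
UP  5
"""

mapping = parse_tc_wrap(sample)
# UPs other than 4 that still have skprio entries mapped to them.
stray = [up for up, entries in mapping.items() if entries and up != 4]

print(up_list(4))
print(stray)
```

    An empty stray list indicates that every listed skprio is mapped to UP 4.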


    For more examples about using the tc_wrap command, refer to the following posts:

     

    Verification

     

    To test RoCE traffic, use the perftest package as follows:

    1. Make sure to align the MTU on the adapter and on the switch to the same value. To ensure high throughput, set the MTU on the adapter and the switch to 9000.

     

    2. Use the perftest package (e.g. ib_write_bw) with the following flags:

     

    • -x    Use this flag to select the GID index to use (RoCEv1 or RoCEv2) for the required interface. To find the GID index, run the script described in Understanding show_gids Script.
    • -S    Use this flag to set the service level (SL), which selects the requested priority, in our case 4.

     

    Use other flags as needed.

     

    For example:

    Run the following command with flags on the server:

    # ib_write_bw --report_gbits -D5 -d mlx5_1   -F -x 6 -S 4

     

    ************************************

    * Waiting for client to connect... *

    ************************************

     

    Run the following command with flags on the client:

    # ib_write_bw --report_gbits -D5 -d mlx5_1   -F  13.13.13.6 -x 6 -S 4

    ---------------------------------------------------------------------------------------

                        RDMA_Write BW Test

    Dual-port       : OFF          Device         : mlx5_1

    Number of qps   : 1            Transport type : IB

    Connection type : RC           Using SRQ      : OFF

    TX depth        : 128

    CQ Moderation   : 100

    Mtu             : 4096[B]

    Link type       : Ethernet

    Gid index       : 6

    Max inline data : 0[B]

    rdma_cm QPs     : OFF

    Data ex. method : Ethernet

    ---------------------------------------------------------------------------------------

    local address: LID 0000 QPN 0x01e5 PSN 0x3df409 RKey 0x005a00 VAddr 0x007f786a310000

    GID: 00:00:00:00:00:00:00:00:00:00:255:255:13:13:13:05

    remote address: LID 0000 QPN 0x01f1 PSN 0xef0b8e RKey 0x012fa6 VAddr 0x007f846f710000

    GID: 00:00:00:00:00:00:00:00:00:00:255:255:13:13:13:06

    ---------------------------------------------------------------------------------------

    #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]

    65536      223500           0.00               39.06     0.074493

    ---------------------------------------------------------------------------------------
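
    As a sanity check, the average bandwidth in the report can be reproduced from the message size and the message rate in the same row. The following Python sketch is illustrative (the function name bandwidth_gbps is hypothetical) and assumes that MsgRate is given in millions of messages per second.

```python
def bandwidth_gbps(msg_size_bytes, msg_rate_mpps):
    """Bandwidth in Gb/s from message size (bytes) and message rate (Mpps)."""
    bits_per_second = msg_size_bytes * 8 * msg_rate_mpps * 1e6
    return bits_per_second / 1e9


# Values taken from the result row above: 65536-byte messages at 0.074493 Mpps.
bw = bandwidth_gbps(65536, 0.074493)
print(f"{bw:.2f} Gb/s")  # prints "39.06 Gb/s", matching the BW average column
```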

     

    Switch Port Priority Counters

    In this example, the counters were captured on an SX1700 switch with 40GbE links.

     

    If you are using other switches, refer to the vendor user manual. See RDMA/RoCE Solutions for additional procedures.

     

    If you are running MLNX-OS, check that the priority counters (traffic and pause) increase as expected, and that pause is propagated from one port to the other.

     

    In the example that follows, pause frames were received on port 1/16 (Rx) and propagated to port 1/15 (Tx).

    # show interfaces ethernet 1/16 counters priority 4

     

    Rx

      10537402             packets

      10537402             unicast packets

      0                    multicast packets

      0                    broadcast packets

      864067556            bytes

      183                  pause packets

      0                    pause duration milliseconds

     

    Tx

      119667486            packets

      119667486            unicast packets

      0                    multicast packets

      0                    broadcast packets

      502005101516         bytes

      0                    pause packets

     

    # show interfaces ethernet 1/15 counters priority 4

     

    Rx

      119667504            packets

      119667504            unicast packets

      0                    multicast packets

      0                    broadcast packets

      499611830110         bytes

      0                    pause packets

      0                    pause duration milliseconds

     

    Tx

      10537403             packets

      10537403             unicast packets

      0                    multicast packets

      0                    broadcast packets

      1074815616           bytes

      16                   pause packets
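
    To compare the pause counters of the two ports without reading the listings by eye, the switch output can be parsed into a dictionary. The following Python sketch is illustrative only: the function name parse_priority_counters is hypothetical, and it assumes the "Rx" / "Tx" layout shown above, with the value first and the counter name after it.

```python
def parse_priority_counters(output):
    """Parse per-priority counter output into {"Rx": {...}, "Tx": {...}}."""
    counters = {}
    section = None
    for line in output.splitlines():
        text = line.strip()
        if text in ("Rx", "Tx"):
            section = text
            counters[section] = {}
        elif section and text:
            # Each counter line is "<value> <counter name>".
            value, _, name = text.partition(" ")
            if value.isdigit():
                counters[section][name.strip()] = int(value)
    return counters


# Excerpts of the two listings above (port 1/16 and port 1/15).
port_16 = """Rx
  10537402             packets
  183                  pause packets
Tx
  119667486            packets
  0                    pause packets
"""
port_15 = """Rx
  119667504            packets
  0                    pause packets
Tx
  10537403             packets
  16                   pause packets
"""

rx_pause = parse_priority_counters(port_16)["Rx"]["pause packets"]
tx_pause = parse_priority_counters(port_15)["Tx"]["pause packets"]
print(rx_pause, tx_pause)
```

    Non-zero values for both counters indicate that pause frames arrive on one port and are sent out on the other.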

     

    Adapter Port Priority Counters

    The MLNX_OFED driver supports several ingress and egress counters per priority.

    Run ethtool -S to get the full list of port counters.

     

    The following port priority counters are available (per port, per priority):

     

    Rx Counters:

    • Octets
    • Frames
    • Pause frames
    • Pause duration (in microseconds)
    • Pause transition (the number of transitions from Xoff to Xon)

     

    Tx Counters:

    • Octets
    • Frames
    • Pause frames
    • Pause duration (in microseconds)

     

    For example:

    # ethtool -S eth35 | grep prio4

         prio4_rx_octets: 62147780800

         prio4_rx_frames: 14885696

         prio4_tx_octets: 0

         prio4_tx_frames: 0

         prio4_rx_pause: 0

         prio4_rx_pause_duration: 0

         prio4_tx_pause: 26832

         prio4_tx_pause_duration: 14508

         prio4_rx_pause_transition: 0

     

    Note: The pause priority counters are visible only when PFC is enabled. Otherwise, only the per-priority traffic counters are visible.
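
    For monitoring, the per-priority counters can be collected into a dictionary. The following Python sketch is illustrative only: the function names are hypothetical, and the parser assumes the prio<N>_<name>: <value> layout shown in the example above. The read_prio_counters helper calls ethtool and is not run here.

```python
import re
import subprocess


def parse_prio_counters(ethtool_output, prio):
    """Extract the prio<N>_* counters from `ethtool -S` output."""
    pattern = re.compile(rf"^\s*prio{prio}_(\w+):\s*(\d+)\s*$")
    counters = {}
    for line in ethtool_output.splitlines():
        match = pattern.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def read_prio_counters(interface, prio):
    """Run `ethtool -S <interface>` and parse the result (needs ethtool installed)."""
    result = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True, text=True, check=True,
    )
    return parse_prio_counters(result.stdout, prio)


# Sample text copied from the example above (truncated).
sample = """
     prio4_rx_octets: 62147780800
     prio4_rx_frames: 14885696
     prio4_tx_pause: 26832
     prio4_tx_pause_duration: 14508
"""

counters = parse_prio_counters(sample, 4)
print(counters["tx_pause"], counters["tx_pause_duration"])
```

    On a host with the adapter installed, read_prio_counters("eth35", 4) would return the same kind of dictionary from live counters.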

     

     

    Troubleshooting

    1. If you are using MLNX_OFED version 3.2, you can download the newer mlnx_qos script as follows:

    # cd /usr/bin

    # mv mlnx_qos mlnx_qos.old

    # wget http://www.mellanox.com/downloads/solutions/temp/qos/mlnx_qos

    # wget http://www.mellanox.com/downloads/solutions/temp/qos/dcbnetlink.py

    # chmod 755 mlnx_qos