End-to-End QoS Configuration for Mellanox Switches (SwitchX) and Adapters

Version 19

    This post discusses end-to-end QoS considerations and configuration for Mellanox SwitchX-based Ethernet switches and ConnectX-3 Pro adapters.

     

    Note: A newer procedure using Mellanox Spectrum switches and ConnectX-4 adapters can be found here: HowTo Configure Lossless RoCE (PFC + ECN) End-to-End Using ConnectX-4 and Spectrum (with QoS)

     


     

    High level

    • The example in this post assumes three networks: Management, Services and Storage.
    • The requirement is to allocate a different share of the bandwidth to each network, for example:
      • Management: 10% of the bandwidth
      • Services: 40% of the bandwidth
      • Storage: 50% of the bandwidth
    • This example assumes that no RoCE applications run on the network, only TCP/IP or similar applications that go through the Linux kernel.
    • It is possible to run several priorities on the same VLAN. However, this example assumes that each VLAN is mapped to a single, specific priority.

     

    Setup

    • Set up a small lab in a non-blocking topology. The minimum is one switch and several servers.
    • Networks (example)
      • Management network - VLAN 10
      • Services network - VLAN 30
      • Storage network - VLAN 50
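The post assumes the VLAN interfaces already exist on each host. As a minimal sketch, assuming iproute2 is installed and the parent port is passed as an argument (eth1 in the vconfig examples of this post), the three example VLANs could be created as follows. The names create_vlans, run, EXAMPLE_VLANS and the DRY_RUN switch are hypothetical; by default the helper only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the example VLAN interfaces with iproute2.
# DRY_RUN=1 (the default) prints the commands; DRY_RUN=0 runs them (root).

# VLAN IDs of the example networks: management, services, storage.
EXAMPLE_VLANS="10 30 50"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# create_vlans PARENT: add PARENT.<vid> for every example VLAN and bring it up.
create_vlans() {
    local parent="$1" vid
    for vid in $EXAMPLE_VLANS; do
        run ip link add link "$parent" name "$parent.$vid" type vlan id "$vid"
        run ip link set dev "$parent.$vid" up
    done
}
```

After sourcing it, `DRY_RUN=0 create_vlans eth1` would create eth1.10, eth1.30 and eth1.50.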

     

    How do I achieve QoS End-to-End?

    • Each host should be configured to set the L2 priority (PCP) bits on each VLAN it uses.
    • Mellanox switches have 4 TCs (traffic classes), which means that up to 4 QoS levels can be used on a network built with Mellanox switches.
    • L2 QoS uses the priority bits (PCP) in the VLAN tag, and the switch maps each priority to a specific TC.
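To make the PCP field concrete: in an 802.1Q tag, the 16-bit Tag Control Information (TCI) carries the 3-bit PCP in its top bits, followed by the 1-bit DEI and the 12-bit VLAN ID. The sketch below packs and unpacks these fields with bash arithmetic, so you can see which TCI value each example network puts on the wire. The helper names tci, pcp_of and vid_of are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: pack/unpack the 802.1Q TCI (PCP:3 | DEI:1 | VID:12).

# tci PCP VID [DEI]: print the 16-bit TCI in hex.
tci() {
    local pcp="$1" vid="$2" dei="${3:-0}"
    if (( pcp < 0 || pcp > 7 || vid < 0 || vid > 4095 || dei < 0 || dei > 1 )); then
        echo "invalid PCP, VID or DEI" >&2
        return 1
    fi
    printf '0x%04X\n' $(( (pcp << 13) | (dei << 12) | vid ))
}

# pcp_of TCI: extract the priority (top 3 bits) from a TCI value.
pcp_of() {
    echo $(( ($1 >> 13) & 7 ))
}

# vid_of TCI: extract the VLAN ID (low 12 bits) from a TCI value.
vid_of() {
    echo $(( $1 & 0xFFF ))
}
```

For the networks in this post, `tci 1 10`, `tci 3 30` and `tci 5 50` print 0x200A, 0x601E and 0xA032, and `pcp_of 0xA032` recovers priority 5, which is the value the switch uses to pick the TC.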

     

    Host configuration

    VLAN mapping

     

    Configure the relevant VLANs on the server and map the egress priority:

    For example:

         - map management VLAN 10 to priority 1

         - map services VLAN 30 to priority 3

         - map storage VLAN 50 to priority 5

    You can do this with the vconfig set_egress_map command if the applications you run do not bypass the kernel. Otherwise, you need to use the tc_wrap command (see examples in HowTo Run RoCE and TCP over L2 Enabled with PFC).

    Run the command for every network (VLAN 10, 30 and 50).

    (Note: more than one VLAN can be mapped to the same priority.)

    # for i in {0..7}; do vconfig set_egress_map eth1.10 $i 1 ; done

    # for i in {0..7}; do vconfig set_egress_map eth1.30 $i 3 ; done

    # for i in {0..7}; do vconfig set_egress_map eth1.50 $i 5 ; done

    Once this is done, each packet will egress with the relevant priority (1,3 or 5).
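vconfig is deprecated in favor of iproute2. A hedged alternative, assuming an iproute2 version whose `ip link set ... type vlan` accepts `egress-qos-map`, is to set the same skb-priority-to-PCP mapping with ip. The helper names egress_map and set_vlan_priority are hypothetical, and DRY_RUN=1 (the default) only prints the command.

```shell
#!/usr/bin/env bash
# Hypothetical alternative to the vconfig loops using iproute2.
# DRY_RUN=1 (the default) prints the command; DRY_RUN=0 runs it (root).

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# egress_map PCP: map skb priorities 0..7 to PCP, as the vconfig loop does.
egress_map() {
    local pcp="$1" map="" i
    for i in {0..7}; do
        map+="${map:+ }$i:$pcp"
    done
    echo "$map"
}

# set_vlan_priority DEV PCP: apply the map to an existing VLAN device.
set_vlan_priority() {
    local dev="$1" pcp="$2"
    # The map is left unquoted on purpose: each FROM:TO pair is one argument.
    run ip link set dev "$dev" type vlan egress-qos-map $(egress_map "$pcp")
}
```

`set_vlan_priority eth1.10 1`, `set_vlan_priority eth1.30 3` and `set_vlan_priority eth1.50 5` correspond to the three vconfig loops above.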

     

    QoS configuration

    All QoS-related modifications on the host are done with the mlnx_qos command.

    You can check the current configuration of a port by running:

    # mlnx_qos -i eth2

    tc: 0 ratelimit: unlimited, tsa: strict

             up:  0

                     skprio: 0

                     skprio: 1

                     skprio: 2 (tos: 8)

                     skprio: 3

                     skprio: 4 (tos: 24)

                     skprio: 5

                     skprio: 6 (tos: 16)

                     skprio: 7

                     skprio: 8

                     skprio: 9

                     skprio: 10

                     skprio: 11

                     skprio: 12

                     skprio: 13

                     skprio: 14

                     skprio: 15

             up:  1

                    skprio: 0 (vlan 10)

                    skprio: 1 (vlan 10)

                    skprio: 2 (vlan 10 tos: 8)

                    skprio: 3 (vlan 10)

                    skprio: 4 (vlan 10 tos: 24)

                    skprio: 5 (vlan 10)

                    skprio: 6 (vlan 10 tos: 16)

                    skprio: 7 (vlan 10)

     

             up:  2

             up:  3

                     skprio: 0 (vlan 30)

                     skprio: 1 (vlan 30)

                     skprio: 2 (vlan 30 tos: 8)

                     skprio: 3 (vlan 30)

                     skprio: 4 (vlan 30 tos: 24)

                     skprio: 5 (vlan 30)

                     skprio: 6 (vlan 30 tos: 16)

                     skprio: 7 (vlan 30)

             up:  4

             up:  5

                     skprio: 0 (vlan 50)

                     skprio: 1 (vlan 50)

                     skprio: 2 (vlan 50 tos: 8)

                     skprio: 3 (vlan 50)

                     skprio: 4 (vlan 50 tos: 24)

                     skprio: 5 (vlan 50)

                     skprio: 6 (vlan 50 tos: 16)

                     skprio: 7 (vlan 50)

             up:  6

             up:  7

    #   

      

     

    By default, only one TC (traffic class) is used on the host: all priorities are mapped to TC0, as the output above shows.

    Note: the TC configuration applies only to Tx traffic egressing the adapter card.

    This example has three QoS levels, so it needs three TCs.

     

    The mlnx_qos command has the following parameter options:

     

    -p: maps each priority (UP) to a TC. In this example, priority i is mapped to TC i (priority 1 to TC 1, priority 3 to TC 3 and priority 5 to TC 5), while all other priorities stay on TC 0: "-p 0,1,0,3,0,5,0,0".

    -s: sets the transmission selection algorithm of each TC, given as a list of "ets" or "strict". In this example, all TCs use ETS: "-s ets,ets,ets,ets,ets,ets,ets,ets".
    -t: sets the bandwidth weight of each ETS TC (not relevant for TCs in "strict" mode). In this example: "-t 0,10,0,40,0,50,0,0". Note that the values in the list must sum to 100 (percent).

    -r: sets a rate limit (in Gb/s) for each TC, so that the TC cannot exceed the specified rate; 0 means unlimited. In this example, for 10GbE adapters, you can use "-r 0,1,0,4,0,5,0,0".
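The four lists follow mechanically from the per-network table. The sketch below assumes, as in this example, that priority i is mapped to TC i and that each rate limit equals the TC's ETS share of the link speed; it checks that the weights sum to 100 and prints the resulting mlnx_qos command instead of running it. The function name build_mlnx_qos_cmd and the NETWORKS and LINK_GBPS variables are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the mlnx_qos lists from a per-network table.
# Assumes priority i -> TC i and rate limit = ETS share of the link speed.

LINK_GBPS=10                 # adapter speed in Gb/s (10GbE in this example)
NETWORKS="1:10 3:40 5:50"    # priority:ETS% for management, services, storage

# build_mlnx_qos_cmd IFACE: print the mlnx_qos command (does not run it).
build_mlnx_qos_cmd() {
    local iface="$1" entry prio pct i sum=0
    local -a p t r s
    # Defaults: unused priorities stay on TC 0 with 0% and no rate limit.
    for i in {0..7}; do
        p[i]=0; t[i]=0; r[i]=0; s[i]=ets
    done
    for entry in $NETWORKS; do
        prio="${entry%%:*}"
        pct="${entry##*:}"
        p[prio]="$prio"                          # priority i -> TC i
        t[prio]="$pct"                           # ETS weight of TC i
        r[prio]=$(( LINK_GBPS * pct / 100 ))     # cap TC i at its share
        sum=$(( sum + pct ))
    done
    if (( sum != 100 )); then
        echo "ETS weights sum to ${sum}%, expected 100%" >&2
        return 1
    fi
    local IFS=,                                  # join the arrays with commas
    echo "mlnx_qos -i $iface -p ${p[*]} -s ${s[*]} -t ${t[*]} -r ${r[*]}"
}
```

With the example values, `build_mlnx_qos_cmd eth2` prints the same command as the full run that follows. Note the design trade-off: ETS alone lets a TC use bandwidth that other TCs leave idle, whereas a rate limit equal to the share caps the TC even on an otherwise idle link; passing 0 in -r keeps a TC unlimited.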

     

    Here is the full run of this command:

    # mlnx_qos -i eth2 -p 0,1,0,3,0,5,0,0 -s ets,ets,ets,ets,ets,ets,ets,ets -t 0,10,0,40,0,50,0,0 -r 0,1,0,4,0,5,0,0

    tc: 0 ratelimit: unlimited, tsa: ets, bw: 0%

             up:  0

                     skprio: 0

                     skprio: 1

                     skprio: 2 (tos: 8)

                     skprio: 3

                     skprio: 4 (tos: 24)

                     skprio: 5

                     skprio: 6 (tos: 16)

                     skprio: 7

                     skprio: 8

                     skprio: 9

                     skprio: 10

                     skprio: 11

                     skprio: 12

                     skprio: 13

                     skprio: 14

                     skprio: 15

             up:  2

             up:  4

             up:  6

             up:  7

    tc: 1 ratelimit: 1 Gbps, tsa: ets, bw: 10%

             up:  1

                     skprio: 0 (vlan 10)

                     skprio: 1 (vlan 10)

                     skprio: 2 (vlan 10 tos: 8)

                     skprio: 3 (vlan 10)

                     skprio: 4 (vlan 10 tos: 24)

                     skprio: 5 (vlan 10)

                     skprio: 6 (vlan 10 tos: 16)

                     skprio: 7 (vlan 10)

    tc: 3 ratelimit: 4 Gbps, tsa: ets, bw: 40%

             up:  3

                     skprio: 0 (vlan 30)

                     skprio: 1 (vlan 30)

                     skprio: 2 (vlan 30 tos: 8)

                     skprio: 3 (vlan 30)

                     skprio: 4 (vlan 30 tos: 24)

                     skprio: 5 (vlan 30)

                     skprio: 6 (vlan 30 tos: 16)

                     skprio: 7 (vlan 30)

    tc: 5 ratelimit: 5 Gbps, tsa: ets, bw: 50%

             up:  5

                     skprio: 0 (vlan 50)

                     skprio: 1 (vlan 50)

                     skprio: 2 (vlan 50 tos: 8)

                     skprio: 3 (vlan 50)

                     skprio: 4 (vlan 50 tos: 24)

                     skprio: 5 (vlan 50)

                     skprio: 6 (vlan 50 tos: 16)

                     skprio: 7 (vlan 50)

    #
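When checking many hosts, the verbose output can be condensed to one line per TC. A minimal sketch, assuming the output layout shown above (a `tc:` header line followed by its `up:` lines; the format may differ between mlnx_qos versions). The function name summarize_mlnx_qos is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: condense mlnx_qos output (read on stdin) into one
# line per TC, e.g. "tc 1: bw 10%, up 1". Relies on the layout shown above.
summarize_mlnx_qos() {
    awk '
        /^[ \t]*tc:/ {
            if (tc != "") print line            # flush the previous TC
            tc = $2
            bw = "n/a"                          # strict TCs have no bw field
            if (match($0, /bw: [0-9]+%/))
                bw = substr($0, RSTART + 4, RLENGTH - 4)
            line = "tc " tc ": bw " bw ", up"
            next
        }
        /^[ \t]*up:/ { line = line " " $2 }     # collect the UPs of this TC
        END { if (tc != "") print line }
    '
}
```

For the configuration above, `mlnx_qos -i eth2 | summarize_mlnx_qos` should print `tc 0: bw 0%, up 0 2 4 6 7` followed by one line each for TC 1, 3 and 5.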

    Important note:

    These settings are not persistent: they are lost whenever the server reboots or the driver is restarted. It is therefore recommended to add the command to a startup script or similar.
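A minimal sketch of such a startup script, assuming it is called from rc.local, a systemd unit or a similar mechanism after the driver and the VLAN interfaces are up. It reuses the values from this post: VLANs on eth1 as in the vconfig examples and mlnx_qos on eth2 as in the mlnx_qos examples (adjust both if your VLANs sit on the port you configure with mlnx_qos). The function and variable names are hypothetical, and DRY_RUN=1 (the default) only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical startup script: re-apply the host QoS settings after a reboot
# or driver restart. DRY_RUN=1 (default) prints; DRY_RUN=0 applies (root).

VLAN_PARENT=eth1              # port carrying the VLANs (vconfig examples)
QOS_IFACE=eth2                # port configured with mlnx_qos (examples above)
VLAN_PRIO="10:1 30:3 50:5"    # VLAN ID:priority pairs

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

apply_qos() {
    local entry vid prio i
    # Egress priority mapping, same as the vconfig loops above.
    for entry in $VLAN_PRIO; do
        vid="${entry%%:*}"
        prio="${entry##*:}"
        for i in {0..7}; do
            run vconfig set_egress_map "$VLAN_PARENT.$vid" "$i" "$prio"
        done
    done
    # TC mapping, ETS weights and rate limits, same as the full run above.
    run mlnx_qos -i "$QOS_IFACE" -p 0,1,0,3,0,5,0,0 \
        -s ets,ets,ets,ets,ets,ets,ets,ets \
        -t 0,10,0,40,0,50,0,0 -r 0,1,0,4,0,5,0,0
}

# Run only when executed directly, not when sourced.
if [ "${BASH_SOURCE[0]}" = "$0" ]; then
    apply_qos
fi
```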

     

    Switch Configuration (SwitchX)

    On the switch, make sure that each priority in use is mapped to a different TC. By default, priorities 0 and 1 are mapped to TC0, priorities 2 and 3 to TC1, priorities 4 and 5 to TC2, and priorities 6 and 7 to TC3. With priorities 1, 3 and 5, the default mapping already places each network in its own TC, so no change is needed in this case.

    • Make sure ETS is enabled (it is enabled by default).
    • Give each TC the weight it needs:
      • Management TC (TC0, based on priority 1): 10%
      • Services TC (TC1, based on priority 3): 40%
      • Storage TC (TC2, based on priority 5): 50%
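The default mapping and the weights can be cross-checked before touching the switch. The sketch below encodes the default SwitchX mapping described above (TC = priority / 2) and builds the `dcb ets tc bandwidth` line for TC0..TC3, failing if two networks would share a TC or the weights do not sum to 100. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check of the SwitchX default priority -> TC mapping:
# priorities 0,1 -> TC0; 2,3 -> TC1; 4,5 -> TC2; 6,7 -> TC3.
switch_tc_for_prio() {
    echo $(( $1 / 2 ))
}

# ets_bandwidth_cmd PRIO:PCT...: print the switch ETS command for TC0..TC3.
ets_bandwidth_cmd() {
    local -a bw=(0 0 0 0) used=(0 0 0 0)
    local entry prio pct tc sum=0
    for entry in "$@"; do
        prio="${entry%%:*}"
        pct="${entry##*:}"
        tc=$(switch_tc_for_prio "$prio")
        if (( used[tc] )); then
            echo "priority $prio shares TC$tc with another network" >&2
            return 1
        fi
        used[tc]=1
        bw[tc]="$pct"
        sum=$(( sum + pct ))
    done
    if (( sum != 100 )); then
        echo "ETS weights sum to ${sum}%, expected 100%" >&2
        return 1
    fi
    echo "dcb ets tc bandwidth ${bw[*]}"
}
```

`ets_bandwidth_cmd 1:10 3:40 5:50` prints `dcb ets tc bandwidth 10 40 50 0`, matching the switch command used in this section, while `ets_bandwidth_cmd 0:10 1:40 5:50` fails because priorities 0 and 1 both land in TC0.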

     

    To enable QoS globally, run:

    switch (config) # dcb priority-flow-control enable force 

     

    Then enable QoS on the required interfaces, for example:

    switch (config interface ethernet 1/1) # dcb priority-flow-control mode on force
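When many ports need the same setting, the per-interface commands can be generated rather than typed. A minimal sketch, assuming the switch CLI accepts `interface ethernet 1/N` to enter interface configuration mode and `exit` to leave it (check against your switch software version). The function name pfc_port_config and the port range are hypothetical, and the script only prints the CLI lines for you to paste or push with your own tooling.

```shell
#!/usr/bin/env bash
# Hypothetical generator: print the CLI lines that enable QoS (PFC mode on,
# as in the command above) on ports 1/FIRST..1/LAST. Prints only.
pfc_port_config() {
    local first="$1" last="$2" port
    if (( first < 1 || last < first )); then
        echo "usage: pfc_port_config FIRST LAST" >&2
        return 1
    fi
    for (( port = first; port <= last; port++ )); do
        echo "interface ethernet 1/$port"
        echo "dcb priority-flow-control mode on force"
        echo "exit"
    done
}
```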

     

    To change the ETS bandwidth configuration run:

    switch (config) # dcb ets tc bandwidth 10 40 50 0


    Additional information on the switch configuration can be found in HowTo Configure QoS on Mellanox Switches (SwitchX).