HowTo Run RoCE and TCP over L2 Enabled with PFC

Version 19

    This post shows how to configure two traffic flows over an L2 Ethernet network enabled with PFC:

    • RoCE  (Lossless L2 traffic)
    • TCP (Lossy L2 traffic)

     



    Note: Make sure you have the latest MLNX-OFED installed (MLNX_EN is not enough)

     

    In the example, priority 3 is enabled and used for the RoCE application only. TCP will be sent over priority 0.

     

    Setup

    • 4x Hosts
    • 4x ConnectX-3, MLNX_OFED 2.1, RH6.4 (or latest)
    • 1x Switch SX1036 (or any other Mellanox Ethernet switch), MLNX-OS 3.3.4304 (or latest)

     

    Networks

    1. RoCE Network, VLAN100 (lossless) - 11.11.100.0
    2. TCP Network, VLAN200 (lossy) - 11.11.200.0

     

    Hosts functions

    1. 2x Application Server (connected to VLAN100 and VLAN200)
    2. 1x Web Server (Connected to VLAN200)
    3. 1x Storage backend server (Connected to VLAN100)

     

    1.png

     

     

     

     

    Network Flows

    1. Web -  App-1 (TCP) on VLAN200
    2. Web -  App-2 (TCP) on VLAN200
    3. App-1 -  Storage (RoCE) on VLAN100
    4. App-2 -  Storage (RoCE) on VLAN100

     

    Switch Configuration

     

     

    Create VLAN and set switchport hybrid (or trunk) mode:

     

    switch (config) # vlan 100

    switch (config vlan 100) # exit

    switch (config) # vlan 200

    switch (config vlan 200) # exit

    switch (config) # interface ethernet 1/1-1/4 switchport mode hybrid

    switch (config) # interface ethernet 1/1 switchport hybrid allowed-vlan all

    switch (config) # interface ethernet 1/2 switchport hybrid allowed-vlan all

    switch (config) # interface ethernet 1/3 switchport hybrid allowed-vlan all

    switch (config) # interface ethernet 1/4 switchport hybrid allowed-vlan all

                 

     

    Enable PFC:

     

    switch (config) # dcb priority-flow-control enable

    switch (config) # dcb priority-flow-control priority 3 enable

    switch (config) # interface ethernet 1/1-1/4 dcb priority-flow-control mode on force

    To verify the PFC configuration run:

    switch (config) # show dcb priority-flow-control

    PFC enabled

    Priority Enabled List   :3

    Priority Disabled List  :0 1 2 4 5 6 7

    TC     Lossless

    ---    ----------

    0           N

    1           Y

    2           Y

    3           N

    Interface      PFC admin        PFC oper

    ------------  --------------   -------------

     

    1/1            On               Enabled

    1/2            On               Enabled

    1/3            On               Enabled

    1/4            On               Enabled

    switch (config) #

     

     

    Server Configuration

    Add the following line to the file: /etc/modprobe.d/mlx4_en.conf

    Note: 0x8 is 00001000b, which means that only priority 3 is enabled on that host.

    options mlx4_en pfctx=0x08 pfcrx=0x08

     

    Note: This setting enables PFC on the host. The parameters pfctx (PFC TX) and pfcrx (PFC RX) apply to the whole host: if the server has more than one card, PFC is enabled on all ports (global pause is disabled even if it is configured).

    The value is an 8-bit bitmap, one bit per priority. To enable PFC on all priorities, configure 0xff. The best practice is to enable PFC only on the specific priority used by lossless applications such as RoCE.

    So, to run more than one flow type on the server (e.g. TCP and RoCE), it is best to turn on a single priority (e.g. priority 3). For that, set both parameters to 0x08 = 00001000b (binary): only the 4th bit is on (counting priorities 0, 1, 2, 3, priority 3 is the 4th bit).
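    The bitmap arithmetic described above can be sketched as a small shell function. This is only an illustration: pfc_mask is a hypothetical helper name and is not part of MLNX_OFED. It sets bit N for each priority N passed as an argument.

```shell
# pfc_mask: hypothetical helper (not part of MLNX_OFED).
# Prints the pfctx/pfcrx bitmap for the given list of priorities (0-7).
pfc_mask() {
    local mask=0 prio
    for prio in "$@"; do
        # Bit N of the bitmap corresponds to priority N.
        mask=$(( mask | (1 << prio) ))
    done
    printf '0x%02x\n' "$mask"
}

pfc_mask 3      # priority 3 only: prints 0x08
pfc_mask 3 5    # priorities 3 and 5: prints 0x28
```

    The printed value is what would go into the pfctx and pfcrx options in /etc/modprobe.d/mlx4_en.conf.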

     

    Restart openibd:

    # /etc/init.d/openibd restart

    To verify, run:

    # RX=`cat /sys/module/mlx4_en/parameters/pfcrx`;printf "0x%x\n" $RX

    0x8
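    To read the verified value back as a list of priorities, the bitmap can be decoded bit by bit. The sketch below assumes the value is given as a decimal or 0x-prefixed number; pfc_priorities is a hypothetical helper name used only for illustration.

```shell
# pfc_priorities: hypothetical helper.
# Prints the priorities (0-7) whose bit is set in a PFC bitmap.
pfc_priorities() {
    local mask=$(( $1 )) prio out=""
    for prio in 0 1 2 3 4 5 6 7; do
        # Test bit number "prio" of the bitmap.
        if [ $(( (mask >> prio) & 1 )) -eq 1 ]; then
            out="${out:+$out }$prio"
        fi
    done
    echo "$out"
}

pfc_priorities 0x08    # prints 3
# On a configured host, the sysfs parameter from the step above can be decoded directly:
# pfc_priorities "$(cat /sys/module/mlx4_en/parameters/pfcrx)"
```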

    Create a VLAN interface on each of the interfaces of the hosts, for example:

    # modprobe 8021q

    # vconfig add eth1 100

    # ifconfig eth1.100 11.11.100.1/24 up

    # vconfig add eth1 200

    # ifconfig eth1.200 11.11.200.1/24 up

     

     

    Set Egress priority of the VLAN:

    • VLAN100 (RoCE ) should egress with priority 3
    • VLAN200 (TCP) should egress with priority 0

    Note: vconfig is a Linux command that configures the egress priority per VLAN. The mapping applies to flows that pass through the kernel, such as TCP/IP and applications built on it.

    # for i in {0..7}; do vconfig set_egress_map eth1.100 $i 3 ; done

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    #

    # for i in {0..7}; do vconfig set_egress_map eth1.200 $i 0 ; done

    Set egress mapping on device -:eth1.200:- Should be visible in /proc/net/vlan/eth1.200

    Set egress mapping on device -:eth1.200:- Should be visible in /proc/net/vlan/eth1.200

    Set egress mapping on device -:eth1.200:- Should be visible in /proc/net/vlan/eth1.200

    Set egress mapping on device -:eth1.200:- Should be visible in /proc/net/vlan/eth1.200

    Set egress mapping on device -:eth1.200:- Should be visible in /proc/net/vlan/eth1.200

    Set egress mapping on device -:eth1.200:- Should be visible in /proc/net/vlan/eth1.200

    Set egress mapping on device -:eth1.200:- Should be visible in /proc/net/vlan/eth1.200

    Set egress mapping on device -:eth1.200:- Should be visible in /proc/net/vlan/eth1.200

    #

     

     

    Note: On RHEL/CentOS 7 and above, the vconfig command is obsolete. Refer to HowTo Set Egress Priority VLAN on Linux for an alternative configuration option.
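    As a rough sketch of one such alternative, the same mapping can be expressed with the iproute2 vlan option egress-qos-map. This assumes an iproute2 version that supports that option and is not taken from the referenced post; egress_qos_map is a hypothetical helper name. The commands are printed rather than executed, so they can be reviewed first.

```shell
# egress_qos_map: hypothetical helper.
# Prints "skb_prio:vlan_prio" pairs mapping skb priorities 0-7 to one VLAN priority.
egress_qos_map() {
    local up=$1 i map=""
    for i in 0 1 2 3 4 5 6 7; do
        map="${map:+$map }$i:$up"
    done
    echo "$map"
}

# Print the commands for the two VLAN interfaces used in this post.
echo "ip link set dev eth1.100 type vlan egress-qos-map $(egress_qos_map 3)"
echo "ip link set dev eth1.200 type vlan egress-qos-map $(egress_qos_map 0)"
```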

     

    Map skb_prio to UP (User Priority):

    tc_wrap.py is an MLNX_OFED script that sets the egress priority for traffic types, such as RoCE, that bypass the kernel stack. It maps the kernel priority, called “skb_prio”, to the user priority (UP) carried in the VLAN tag.

    As you can see in the output, priority 0 is mapped for VLAN200 (TCP), while priority 3 is mapped for VLAN100 (RoCE).

    # tc_wrap.py -i eth1 -u 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3

    UP  0

            skprio: 0 (vlan 200)                        --> 

            skprio: 1 (vlan 200)                        -->

            skprio: 2 (vlan 200 tos: 8)                 -->

            skprio: 3 (vlan 200)                        -->  All this output is due to the set_egress_map command (TCP traffic) of VLAN 200 priority 0

            skprio: 4 (vlan 200 tos: 24)                -->

            skprio: 5 (vlan 200)                        -->

            skprio: 6 (vlan 200 tos: 16)                -->

            skprio: 7 (vlan 200)                        -->

    UP  1

    UP  2

    UP  3

            skprio: 0                                         -->

            skprio: 1                                         -->

            skprio: 2 (tos: 8)                                -->

            skprio: 3                                         -->

            skprio: 4 (tos: 24)                               -->

            skprio: 5                                         -->

            skprio: 6 (tos: 16)                               -->

            skprio: 7                                         --> All this output is due to the tc_wrap command for RoCE traffic

            skprio: 8                                         --> 16 skprio priorities are mapped to user priority 3

            skprio: 9                                         -->

            skprio: 10                                        -->

            skprio: 11                                        -->

            skprio: 12                                        -->

            skprio: 13                                        -->

            skprio: 14                                        -->

            skprio: 15                                        -->

            skprio: 0 (vlan 100)                      -->

            skprio: 1 (vlan 100)                      -->

            skprio: 2 (vlan 100 tos: 8)               -->

            skprio: 3 (vlan 100)                      -->

            skprio: 4 (vlan 100 tos: 24)              -->  All this output is due to the set_egress_map command (kernel traffic) of VLAN 100 priority 3

            skprio: 5 (vlan 100)                      -->

            skprio: 6 (vlan 100 tos: 16)              -->

            skprio: 7 (vlan 100)                      -->

    UP  4

    UP  5

    UP  6

    UP  7

    #

     

     

    Verification Procedure

     

    1. The following commands create two RoCE flows to the storage node:

    // Run on storage node

    # ib_write_bw -R --report_gbits --port=12500 -D 10 & ib_write_bw -R --report_gbits --port=12510 -D 10

     

    // Run on app-1 host

    # ib_write_bw -R --report_gbits 11.11.100.50  --port=12500 -D 10

     

    // Run on app-2 host

    # ib_write_bw -R --report_gbits 11.11.100.50  --port=12510 -D 10

     

    2. To simulate several TCP flows to the web server, you can use netperf (or any other TCP application)

     

    // Run on web server

    # netserver &

     

    // Run on app-1 host

    # for I in {0..1} ; do ( netperf -H 11.11.200.50 -t TCP_STREAM -l 10 -P 0 -- -m 65536 -o throughput & ) ; done | awk '{SUM+=$1} END { print SUM}'

     

    // Run on app-2 host

    # for I in {0..1} ; do ( netperf -H 11.11.200.50 -t TCP_STREAM -l 10 -P 0 -- -m 65536 -o throughput & ) ; done | awk '{SUM+=$1} END { print SUM}'

     

     

    3. Read port priority counters (for priority 3) for the specific interface from the Mellanox switch (via MLNX-OS CLI):

    switch (config) # show interfaces ethernet 1/1 counters priority 3

     

    Rx

    333364              packets

    333364              unicast packets

    0                   multicast packets

    0                   broadcast packets

    362177148           bytes

    14814               pause packets

    8                   pause duration seconds

     

    Tx

    333371               packets

    333362               unicast packets

    6                    multicast packets

    3                    broadcast packets

    368845148            bytes

    0                    pause packets

     

    Read port priority counters from the application server:

    # ethtool -S eth1 | grep prio_3

         rx_prio_3_packets: 5152
         rx_prio_3_bytes: 424080
         tx_prio_3_packets: 328209
         tx_prio_3_bytes: 361752914
         rx_pause_prio_3: 14812
         rx_pause_duration_prio_3: 0
         rx_pause_transition_prio_3: 0
         tx_pause_prio_3: 0
         tx_pause_duration_prio_3: 47848
         tx_pause_transition_prio_3: 7406
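    When checking several hosts, it can help to pull individual counters out of the ethtool output. The sketch below is one way to do that; prio_counter is a hypothetical helper name, and the sample data is copied from the output above so the example runs without a ConnectX adapter.

```shell
# prio_counter: hypothetical helper.
# Reads `ethtool -S` style output on stdin and prints the value of one counter.
prio_counter() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*\([0-9][0-9]*\).*/\1/p"
}

# Sample lines taken from the ethtool output shown in this post.
sample='     rx_pause_prio_3: 14812
     rx_pause_duration_prio_3: 0
     tx_pause_prio_3: 0'

rx=$(printf '%s\n' "$sample" | prio_counter rx_pause_prio_3)
tx=$(printf '%s\n' "$sample" | prio_counter tx_pause_prio_3)
echo "rx_pause=$rx tx_pause=$tx"    # prints rx_pause=14812 tx_pause=0

# On a live host:
# ethtool -S eth1 | prio_counter rx_pause_prio_3
```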

     

    Note: You can run netperf on VLAN100 and compare it to VLAN200 to see that packets on VLAN100 are sent with priority 3, while packets on VLAN200 are sent with priority 0.


     

    Advanced QoS:

    If you wish to use two priorities (for example, priorities 3 and 5) on the same VLAN, and the RoCE application can be configured to set the sk_prio field, you can map individual sk_prio values to user priorities.

    For example, the application uses priorities 3 and 5, selected per sk_prio (the kernel priority, 0..15). Say the application uses sk_prio 3 for priority 3 and sk_prio 5 for priority 5, and all other sk_prio values are unused (priority 0).

     

    You should run the following command (only sk_prio 3 and 5 are mapped):

     

    # tc_wrap.py -i eth1 -u 0,0,0,3,0,5,0,0,0,0,0,0,0,0,0,0

        

    In this case, the application can select the priority it needs, and its packets will egress with the matching priority bits in the VLAN tag.
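    Writing the 16-entry list by hand is error-prone, so it may be convenient to generate it. The sketch below builds the list from sk_prio=UP pairs, with every unlisted sk_prio defaulting to user priority 0. up_map is a hypothetical helper name; only tc_wrap.py and its -i and -u options come from this post. The command is printed rather than executed.

```shell
# up_map: hypothetical helper.
# Prints 16 comma-separated user priorities (one per sk_prio 0..15).
# Arguments have the form sk_prio=UP; unlisted sk_prio values map to 0.
up_map() {
    local i pair up out=""
    for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
        up=0
        for pair in "$@"; do
            # The part before "=" is the sk_prio, the part after is the UP.
            if [ "${pair%%=*}" = "$i" ]; then
                up=${pair#*=}
            fi
        done
        out="${out:+$out,}$up"
    done
    echo "$out"
}

# sk_prio 3 -> UP 3, sk_prio 5 -> UP 5, everything else -> UP 0.
echo "tc_wrap.py -i eth1 -u $(up_map 3=3 5=5)"
```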

    You may also want to configure the switches on the network to give higher bandwidth for each TC in the network (the priority is mapped to TC). Refer to HowTo Configure QoS on Mellanox Switches (SwitchX) for additional information about that.