HowTo Run RoCE and TCP over L2 Enabled with PFC (2016)

    This post shows how to configure two traffic flows over an L2 Ethernet network enabled with Priority Flow Control (PFC):

    • RDMA over Converged Ethernet (RoCE) (lossless L2 traffic)
    • Transmission Control Protocol (TCP) (lossy L2 traffic)

     

    References

    • HowTo Configure PFC on Cisco 3K C3232C (NX-OS 7.X)
    • HowTo Configure PFC on ConnectX-4
    • HowTo Auto-Config PFC and ETS on ConnectX-4 via LLDP DCBX
    • HowTo Set Egress Priority VLAN on Linux

    Note: Make sure you have the latest version of MLNX-OFED installed (MLNX_EN does not support all of the features you will need).
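
    To check which MLNX-OFED version is installed, you can use the ofed_info utility that ships with it (the version string below is illustrative):

    # ofed_info -s
    MLNX_OFED_LINUX-3.3-1.0.0.0: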

     

    In this example, priority 4 is enabled with PFC and used only for the RoCE application; TCP is sent over priority 0.

     

    Overview

     

    Setup

    • 2x Hosts (1x ConnectX-3 40G, 1x ConnectX-4 50G)
    • 1x Cisco C3232C Switch
      • port 1/20 on the switch is connected to port eth35 (ConnectX-3); server VLAN interface IP 12.12.12.9
      • port 1/22/1 on the switch is connected to port eth2 (ConnectX-4); server VLAN interface IP 12.12.12.6

     

    Networks

    In this example, we will use one network, VLAN 100, for all traffic. Each host runs two traffic flows.

     

    Traffic Flows:

    1. TCP Flow - iperf, for example (passes through the kernel)
    2. RoCE Flow (bypasses the kernel)

     

    Host Functions

    1. Host with ConnectX-3 is a compute node and storage client (40G)
    2. Host with ConnectX-4 is the storage server (50G using a split port on the Cisco switch)

     

    Switch Configuration

     

    Refer to HowTo Configure PFC on Cisco 3K C3232C (NX-OS 7.X) for detailed explanations.

    Here is the running config:

    version 7.0(3)I3(1)

    interface breakout module 1 port 22 map 50g-2x

     

    # Network QoS Configuration

    class-map type network-qos pfc_class

    match qos-group 4

     

    policy-map type network-qos pfc_policy

      class type network-qos pfc_class

        pause pfc-cos 4

        mtu 9000

     

    system qos

      service-policy type network-qos pfc_policy

     

     

    # QoS Configuration

    class-map type qos match-all qos_class

      match cos 4

    policy-map type qos qos_policy   # The mapping and activation of the policy is per interface

      class qos_class

        set qos-group 4

        set cos 4

     

     

    vlan 1,100

     

    interface Ethernet1/20

      switchport mode trunk

      priority-flow-control mode on

      mtu 9000

      service-policy type qos input qos_policy   # QoS Policy mapping (per interface)

     

    interface Ethernet1/22/1

      switchport mode trunk

      priority-flow-control mode on

      mtu 9000

      service-policy type qos input qos_policy
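
    To confirm that the QoS policy is attached and matching traffic, NX-OS can display per-interface policy-map statistics; a command along these lines should list qos_policy with its class-map match counters (exact output varies by NX-OS release):

    switch# show policy-map interface ethernet 1/20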

     

    Server Configuration

     

    Follow these steps to configure the host:

    1. Enable PFC on priority 4 (depends on the adapter).

    2. Create a VLAN interface and specify the IP address.

    3. Set egress mapping for TCP traffic (priority 0).

    4. Set egress mapping for RoCE traffic (priority 4).

    1. Enable PFC on priority 4.

    There are two procedures, depending on the adapter installed in the server.

     

    Enable PFC on priority 4 for the ConnectX-3 adapter (Storage Client and Compute Node Server)

     

    1. Add the following line to the file /etc/modprobe.d/mlx4_en.conf:

    Note: 0x10 is 00010000b, which means that only priority 4 is enabled on that host.

    options mlx4_en pfctx=0x10 pfcrx=0x10
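
    If you need a different priority, the bitmask can be derived with shell arithmetic. As a sketch, with PRIO set to the priority you want enabled, this one-liner prints the matching options line:

    # PRIO=4; printf 'options mlx4_en pfctx=0x%02x pfcrx=0x%02x\n' $((1<<PRIO)) $((1<<PRIO))
    options mlx4_en pfctx=0x10 pfcrx=0x10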

     

    2. Restart the driver:

    # /etc/init.d/openibd restart

     

    3. To verify your configuration, run:

    # RX=`cat /sys/module/mlx4_en/parameters/pfcrx`;printf "0x%x\n" $RX

    0x10
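
    The same check can cover both directions. Assuming the pfctx parameter is exposed under the same sysfs path as pfcrx, a small loop prints both masks:

    # for P in pfctx pfcrx; do printf "%s=0x%x\n" $P $(cat /sys/module/mlx4_en/parameters/$P); done
    pfctx=0x10
    pfcrx=0x10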

    Enable PFC on priority 4 for the ConnectX-4 adapter (Storage Server)

    For detailed information, refer to HowTo Configure PFC on ConnectX-4.

     

    Set priority 4 to be enabled with PFC. Note that in this setup the ConnectX-4 interface is eth2:

    # mlnx_qos -i eth2 --pfc 0,0,0,0,1,0,0,0

    PFC configuration:

      priority    0   1   2   3   4   5   6   7

      enabled     0   0   0   0   1   0   0   0 

    ...
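
    Running mlnx_qos with only the interface argument should print the current settings, which is a quick way to re-check the configuration later:

    # mlnx_qos -i eth2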

     

    Another option is to use LLDP DCBX TLVs and allow the switch to configure PFC on the adapter. See HowTo Auto-Config PFC and ETS on ConnectX-4 via LLDP DCBX.

     

    2. Create a VLAN Interface and Set the IP Address.

     

    There are several ways to create a VLAN interface; here is one of them. Refer to your OS distribution's documentation for the specifics of your release.

    Create a VLAN interface on each host, as this example for the ConnectX-3 host shows (the ConnectX-4 host equivalent follows below):

    # modprobe 8021q

    # vconfig add eth35 100

    # ifconfig eth35.100 12.12.12.9/24 up
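
    On the ConnectX-4 host, the same steps apply with its interface and IP address (eth2 and 12.12.12.6 in this setup):

    # modprobe 8021q
    # vconfig add eth2 100
    # ifconfig eth2.100 12.12.12.6/24 up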

     

    Note: On RHEL/CentOS 7 and above, the vconfig command is not supported. Refer to HowTo Set Egress Priority VLAN on Linux for an alternative way to configure the VLAN interface.
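
    As a sketch, the iproute2 equivalent (standard ip-link VLAN syntax, shown for the ConnectX-3 host) is:

    # ip link add link eth35 name eth35.100 type vlan id 100
    # ip addr add 12.12.12.9/24 dev eth35.100
    # ip link set dev eth35.100 up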

     

    3. Set Egress Priority 0 for the TCP traffic.

     

    Note: vconfig is a Linux command used to configure the egress priority mapping of a VLAN interface. It applies to flows that pass through the kernel, such as TCP/IP protocols and the applications based on them.

    # for i in {0..7}; do vconfig set_egress_map eth35.100 $i 0 ; done

    Set egress mapping on device -:eth35.100:- Should be visible in /proc/net/vlan/eth35.100

    (The same line is printed once for each of the eight priorities.)

     

    Note: On RHEL/CentOS 7 and above, the vconfig command is obsolete; refer to HowTo Set Egress Priority VLAN on Linux for an alternative configuration command, sketched below.
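
    With iproute2, the equivalent egress mapping can be applied to the existing VLAN interface; as a sketch for the ConnectX-3 host:

    # ip link set dev eth35.100 type vlan egress-qos-map 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0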

     

    4. Set Egress Priority 4 for the kernel bypass traffic (RoCE).

    RoCE traffic bypasses the kernel, so vconfig and other kernel-related commands do not affect it.

    There are up to 16 skprios (kernel priorities) that can be mapped to the L2 priorities (UPs). In this case, we map all of them to L2 priority 4.

    # tc_wrap.py -i eth35 -u 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4

    UP 0

            skprio: 0 (vlan 100)

            skprio: 1 (vlan 100)

            skprio: 2 (vlan 100 tos: 8)

            skprio: 3 (vlan 100)

            skprio: 4 (vlan 100 tos: 24)            <<--- This block results from vconfig set_egress_map (kernel flow).

            skprio: 5 (vlan 100)

            skprio: 6 (vlan 100 tos: 16)

            skprio: 7 (vlan 100)

    UP 1

    UP 2

    UP 3

    UP 4

            skprio: 0

            skprio: 1

            skprio: 2 (tos: 8)

            skprio: 3

            skprio: 4 (tos: 24)

            skprio: 5

            skprio: 6 (tos: 16)

            skprio: 7                                <<--- This block results from the tc_wrap script (kernel bypass flow, e.g. RoCE).

            skprio: 8

            skprio: 9

            skprio: 10

            skprio: 11

            skprio: 12

            skprio: 13

            skprio: 14

            skprio: 15

     

    UP 5

    UP 6

    UP 7

    #
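
    To re-check the mapping later without changing it, running tc_wrap.py with only the interface argument should print the current skprio-to-UP table:

    # tc_wrap.py -i eth35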

     

    Verification

     

    To verify the configuration, run two types of traffic from the 50G ConnectX-4 interface to the 40G ConnectX-3 interface.

    For TCP traffic, use the iperf tool.

     

    For example:

    For the server:

    # iperf -s

     

    For the client:

    # iperf -c 12.12.12.9 -P8 -t100

     

    For RoCE traffic, use the ib_write_bw command.

     

    For example:

    Be sure to use the -S (service level, i.e. priority) and -x (GID index) flags as needed.

    For the server:

    # ib_write_bw --report_gbits -D5 -d mlx4_0 -F -x 2 -S 4

    For the client:

    # ib_write_bw --report_gbits -D5 -d mlx5_0 -F -x 6 -S 4 12.12.12.9

    For more information, refer to HowTo Configure PFC on ConnectX-4.
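
    To pick the right value for the -x flag, you can list the GID table of the device; for example, ibv_devinfo in verbose mode prints the GID entries per port, and the index shown in brackets is the value to pass to -x:

    # ibv_devinfo -v -d mlx4_0 | grep -i gid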

     

    Make sure that you clear the port counters and QoS statistics before each test run (and again between tests).

    switch (config)# clear counters

    switch (config)# clear qos statistics

     

    1. As there are no per-priority traffic counters on the Cisco switch, check the switch output discard counters while sending only iperf traffic.

    Verify that this counter advances:

    # show interface ethernet 1/20

    ...

      RX

        1472449 unicast packets  0 multicast packets  0 broadcast packets

        1472449 input packets  108980224 bytes

        0 jumbo packets  0 storm suppression packets

        0 runts  0 giants  0 CRC  0 no buffer

        0 input error  0 short frame  0 overrun   0 underrun  0 ignored

        0 watchdog  0 bad etype drop  0 bad proto drop  0 if down drop

        0 input with dribble  0 input discard

        0 Rx pause

      TX

        59065584 unicast packets  717 multicast packets  26 broadcast packets

        59066327 output packets  89883192913 bytes

        59047659 jumbo packets

        0 output error  0 collision  0 deferred  0 late collision

        0 lost carrier  0 no carrier  0 babble  444 output discard

        0 Tx pause

     

    2. Run only the ib_write_bw test and make sure that there are no drops (the output discard counter stays at 0). Also check whether any pause frames were counted.

    Check port counters as follows:

    # show interface ethernet 1/20

    ...

      RX

        1076151 unicast packets  0 multicast packets  0 broadcast packets

        1076151 input packets  88301122 bytes

        0 jumbo packets  0 storm suppression packets

        0 runts  0 giants  0 CRC  0 no buffer

        0 input error  0 short frame  0 overrun   0 underrun  0 ignored

        0 watchdog  0 bad etype drop  0 bad proto drop  0 if down drop

        0 input with dribble  0 input discard

        0 Rx pause

      TX

        46143630 unicast packets  104 multicast packets  7 broadcast packets

        46143741 output packets  50861825632 bytes

        0 jumbo packets

        0 output error  0 collision  0 deferred  0 late collision

        0 lost carrier  0 no carrier  0 babble  0 output discard

        0 Tx pause

     

    Check port PFC counters:

    On the 40G port (eth 1/20) you should see Rx PFC pause frames on priority 4, while on the 50G port (eth 1/22/1) you should see Tx PFC frames on priority 4.

    switch (config)# show interface ethernet 1/20 priority-flow-control detail

     

    Ethernet1/20

        Admin Mode: On 

        Oper Mode: On 

        VL bitmap: (10)     

        Total Rx PFC Frames: 6820     

        Total Tx PFC Frames: 0        

        ---------------------------------------------------------------------------------------------------------------------

            |  Priority0  |  Priority1  |  Priority2  |  Priority3  |  Priority4  |  Priority5  |  Priority6  |  Priority7  |

        ---------------------------------------------------------------------------------------------------------------------

        Rx  |0            |0            |0            |0            |6820         |0            |0            |0           

        ---------------------------------------------------------------------------------------------------------------------

        Tx  |0            |0            |0            |0            |0            |0            |0            |0           

     

     

    3. Run both tests together (iperf and ib_write_bw).

    Check that the PFC counters advance on priority 4 (RoCE traffic), and check for output discards in the port counters (these represent dropped iperf traffic).

    The output looks like the examples above.

     

    4. Check the QoS group mapping. Make sure that TCP traffic is mapped to QoS group 0, while RoCE traffic is mapped to QoS group 4.

    switch(config)# show queuing interface ethernet 1/20

     

    slot  1

    =======

     

    Egress Queuing for Ethernet1/20 [System]

    ------------------------------------------------------------------------------

    QoS-Group# Bandwidth% PrioLevel                Shape                   QLimit

                                       Min          Max        Units  

    ------------------------------------------------------------------------------

          3             -         1           -            -     -            6(D)

          2             0         -           -            -     -            6(D)

          1             0         -           -            -     -            6(D)

          0           100         -           -            -     -            6(D)

    +-------------------------------------------------------------------+

    |                              QOS GROUP 0                          |

    +-------------------------------------------------------------------+

    |                |  Unicast       | OOBFC Unicast  |  Multicast     |

    +-------------------------------------------------------------------+

    |        Tx Pkts |         2324649|               0|               0|

    |        Tx Byts |       172065321|               0|               0|

    |   Dropped Pkts |               0|               0|               0|

    |   Dropped Byts |               0|               0|               0|

    |   Q Depth Byts |               0|               0|               0|

    +-------------------------------------------------------------------+

    |                              QOS GROUP 1                          |

    +-------------------------------------------------------------------+

    |                |  Unicast       | OOBFC Unicast  |  Multicast     |

    +-------------------------------------------------------------------+

    |        Tx Pkts |               0|               0|               0|

    |        Tx Byts |               0|               0|               0|

    |   Dropped Pkts |               0|               0|               0|

    |   Dropped Byts |               0|               0|               0|

    |   Q Depth Byts |               0|               0|               0|

    +-------------------------------------------------------------------+

    |                              QOS GROUP 2                          |

    +-------------------------------------------------------------------+

    |                |  Unicast       | OOBFC Unicast  |  Multicast     |

    +-------------------------------------------------------------------+

    |        Tx Pkts |               0|               0|               0|

    |        Tx Byts |               0|               0|               0|

    |   Dropped Pkts |               0|               0|               0|

    |   Dropped Byts |               0|               0|               0|

    |   Q Depth Byts |               0|               0|               0|

    +-------------------------------------------------------------------+

    |                              QOS GROUP 3                          |

    +-------------------------------------------------------------------+

    |                |  Unicast       | OOBFC Unicast  |  Multicast     |

    +-------------------------------------------------------------------+

    |        Tx Pkts |               0|               0|               0|

    |        Tx Byts |               0|               0|               0|

    |   Dropped Pkts |               0|               0|               0|

    |   Dropped Byts |               0|               0|               0|

    |   Q Depth Byts |               0|               0|               0|

    +-------------------------------------------------------------------+

    |                              QOS GROUP 4                          |

    +-------------------------------------------------------------------+

    |                |  Unicast       | OOBFC Unicast  |  Multicast     |

    +-------------------------------------------------------------------+

    |        Tx Pkts |        64575649|               0|               0|

    |        Tx Byts |      5295203218|               0|               0|

    |   Dropped Pkts |               0|               0|               0|

    |   Dropped Byts |               0|               0|               0|

    |   Q Depth Byts |               0|               0|               0|

    +-------------------------------------------------------------------+

    |                              QOS GROUP 5                          |

    +-------------------------------------------------------------------+

    |                |  Unicast       | OOBFC Unicast  |  Multicast     |

    +-------------------------------------------------------------------+

    |        Tx Pkts |               0|               0|               0|

    |        Tx Byts |               0|               0|               0|

    |   Dropped Pkts |               0|               0|               0|

    |   Dropped Byts |               0|               0|               0|

    |   Q Depth Byts |               0|               0|               0|

    +-------------------------------------------------------------------+

    |                              QOS GROUP 6                          |

    +-------------------------------------------------------------------+

    |                |  Unicast       | OOBFC Unicast  |  Multicast     |

    +-------------------------------------------------------------------+

    |        Tx Pkts |               0|               0|               0|

    |        Tx Byts |               0|               0|               0|

    |   Dropped Pkts |               0|               0|               0|

    |   Dropped Byts |               0|               0|               0|

    |   Q Depth Byts |               0|               0|               0|

    +-------------------------------------------------------------------+

    |                              QOS GROUP 7                          |

    +-------------------------------------------------------------------+

    |                |  Unicast       | OOBFC Unicast  |  Multicast     |

    +-------------------------------------------------------------------+

    |        Tx Pkts |               0|               0|               0|

    |        Tx Byts |               0|               0|               0|

    |   Dropped Pkts |               0|               0|               0|

    |   Dropped Byts |               0|               0|               0|

    |   Q Depth Byts |               0|               0|               0|

    +-------------------------------------------------------------------+

    |                      CONTROL QOS GROUP                            |

    +-------------------------------------------------------------------+

    |                |  Unicast       | OOBFC Unicast  |  Multicast     |

    +-------------------------------------------------------------------+

    |        Tx Pkts |            1307|               0|               0|

    |        Tx Byts |          101916|               0|               0|

    |   Dropped Pkts |               0|               0|               0|

    |   Dropped Byts |               0|               0|               0|

    |   Q Depth Byts |               0|               0|               0|

    +-------------------------------------------------------------------+

    |                         SPAN QOS GROUP                            |

    +-------------------------------------------------------------------+

    |                |  Unicast       | OOBFC Unicast  |  Multicast     |

    +-------------------------------------------------------------------+

    |        Tx Pkts |               0|               0|               0|

    |        Tx Byts |               0|               0|               0|

    |   Dropped Pkts |               0|               0|               0|

    |   Dropped Byts |               0|               0|               0|

    |   Q Depth Byts |               0|               0|               0|

    +-------------------------------------------------------------------+

     

    Port Egress Statistics

    --------------------------------------------------------

    WRED Drop Pkts                              0

     

    Ingress Queuing for Ethernet1/1

    ------------------------------------------------------------------

    QoS-Group#                 Pause                        QLimit

               Buff Size       Pause Th      Resume Th     

    ------------------------------------------------------------------

          7              -            -            -           10(D)

          6              -            -            -           10(D)

          5              -            -            -           10(D)

          4         206246       116623        89623            4(D)

          3              -            -            -           10(D)

          2              -            -            -           10(D)

          1              -            -            -           10(D)

          0              -            -            -           10(D)

     

     

    Port Ingress Statistics

    --------------------------------------------------------

    Ingress MMU Drop Pkts                              36

    Ingress MMU Drop Bytes                          40072

     

     

    PFC Statistics

    ----------------------------------------------------------------------------

    TxPPP:                39019, RxPPP:                    0

    ----------------------------------------------------------------------------

    COS QOS Group        PG   TxPause   TxCount         RxPause         RxCount

       0         -         -  Inactive         0        Inactive               0

       1         -         -  Inactive         0        Inactive               0

       2         -         -  Inactive         0        Inactive               0

       3         -         -  Inactive         0        Inactive               0

       4         4         1  Inactive     39019        Inactive               0

       5         -         -  Inactive         0        Inactive               0

       6         -         -  Inactive         0        Inactive               0

       7         -         -  Inactive         0        Inactive               0

    ----------------------------------------------------------------------------

    switch(config)#   

     

    5. Additionally, you can mirror one of the ports and send the traffic to a third server for inspection with Wireshark.

    Here is the running config of the switch used to create a monitor session:

    monitor session 1

      source interface Ethernet1/18 both

      source interface Ethernet1/20 both

      destination interface Ethernet1/2/2

      no shut
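
    On the third server, a capture along these lines (the interface name eth4 is illustrative) saves the mirrored traffic to a file that Wireshark can open:

    # tcpdump -i eth4 -w mirrored_traffic.pcap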

     

    Troubleshooting and Open Issues

    1. Make sure that you use the latest software version on the switch; older versions do not provide the option to map a QoS policy per interface.