HowTo Run RoCE over L2 Enabled with PFC

Version 36
    This post shows how to configure RoCE over an L2 network based on MLNX-OS with Priority Flow Control (PFC) enabled.


    Note: Make sure you have the latest version of MLNX-OFED installed (having MLNX_EN is not enough).


    In the example, priority 3 is enabled, but it could be any other priority number.


    • 3x Hosts
    • 3x ConnectX-3, MLNX_OFED 2.1, RH6.4 (or latest)
    • 1x Switch SX1036 (or any other Mellanox Ethernet switch), MLNX-OS 3.3.4304 (or latest).

    Network Connectivity


    MLNX-OS Switch Configuration

    Create a VLAN and specify the switchport as "hybrid" (or trunk) mode:

    switch (config) # vlan 100

    switch (config vlan 100) # exit

    switch (config) # interface ethernet 1/1-1/3 switchport mode hybrid

    switch (config) # interface ethernet 1/1 switchport hybrid allowed-vlan all

    switch (config) # interface ethernet 1/2 switchport hybrid allowed-vlan all

    switch (config) # interface ethernet 1/3 switchport hybrid allowed-vlan all



    Enable PFC:
    Note: The configuration example below applies to SX switches and not to Spectrum switches.

    switch (config) # dcb priority-flow-control enable

    switch (config) # dcb priority-flow-control priority 3 enable

    switch (config) # interface ethernet 1/1-1/3 dcb priority-flow-control mode on force

    To verify the PFC configuration run:

    switch (config)# show dcb priority-flow-control

    PFC enabled

    Priority Enabled List   :3

    Priority Disabled List  :0 1 2 4 5 6 7

    TC     Lossless

    ---    ----------

    0           N

    1           Y

    2           Y

    3           N

    Interface      PFC admin        PFC oper

    ------------  --------------   -------------


    1/1            On               Enabled

    1/2            On               Enabled

    1/3            On               Enabled

    switch (config) #

    For more information about QoS, refer to HowTo Configure QoS on Mellanox Switches (SwitchX).

    Server Configuration

    Add the following line to the file: /etc/modprobe.d/mlx4_en.conf:

    options mlx4_en pfctx=0x08 pfcrx=0x08              


    Note: When PFC is enabled, global pause is operationally disabled, regardless of what is configured for global pause flow control.


    Note: This command enables PFC on the host. The parameters, pfctx (PFC TX) and pfcrx (PFC RX), are specified per host. If you have more than one card on the server, all ports will be enabled with PFC (global pause will be disabled even if configured).

    The value is an 8-bit bitmap, one bit per priority. To enable PFC on all priorities, configure 0xff. We recommend dedicating a specific priority to lossless applications such as RoCE and enabling PFC only on that priority.

    If you want to run more than one flow type (e.g. TCP and RoCE) on the server, enable PFC on only one priority (e.g. priority 3) by setting both parameters to 0x08 = 00001000b (binary). Only the 4th bit is set, because priorities are counted from 0 (priorities 0, 1, 2, 3 -> 4th bit).
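
    The bitmap arithmetic described above can be sketched as a small shell helper. The function name pfc_mask is hypothetical and is not part of MLNX_OFED; it only computes the value to place in pfctx/pfcrx for a given list of priorities.

```shell
# pfc_mask: print the 8-bit PFC bitmap for the given priorities (0-7).
# Hypothetical helper for illustration only; not an MLNX_OFED tool.
pfc_mask() {
    pfc_mask_value=0
    for pfc_mask_prio in "$@"; do
        case "$pfc_mask_prio" in
            [0-7])
                # Priority N corresponds to bit N of the bitmap.
                pfc_mask_value=$(( pfc_mask_value | (1 << pfc_mask_prio) ))
                ;;
            *)
                echo "invalid priority: $pfc_mask_prio" >&2
                return 1
                ;;
        esac
    done
    printf '0x%02x\n' "$pfc_mask_value"
}

pfc_mask 3      # prints 0x08
pfc_mask 3 5    # prints 0x28
```

    The printed value can then be used for both parameters, for example: options mlx4_en pfctx=0x08 pfcrx=0x08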

    Restart the driver as follows:
    # /etc/init.d/openibd restart
    To verify your configuration, run:

    # RX=`cat /sys/module/mlx4_en/parameters/pfcrx`;printf "0x%x\n" $RX
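
    To read the reported value as a list of priorities rather than a raw bitmap, a decoding helper along the following lines may be useful. This is a sketch; the function name pfc_priorities is hypothetical, and it assumes the value is passed in decimal (as in the sysfs file above) or as 0x-prefixed hex.

```shell
# pfc_priorities: list the priorities (0-7) whose bits are set in a PFC bitmap.
# Hypothetical helper for illustration only; accepts decimal (8) or hex (0x08).
pfc_priorities() {
    pfc_value=$(( $1 ))
    pfc_list=""
    pfc_prio=0
    while [ "$pfc_prio" -le 7 ]; do
        # Test bit number pfc_prio of the bitmap.
        if [ $(( (pfc_value >> pfc_prio) & 1 )) -eq 1 ]; then
            pfc_list="${pfc_list:+$pfc_list }$pfc_prio"
        fi
        pfc_prio=$(( pfc_prio + 1 ))
    done
    echo "$pfc_list"
}

pfc_priorities 0x08    # prints 3

# On a configured host (not executed here):
#   pfc_priorities "$(cat /sys/module/mlx4_en/parameters/pfcrx)"
```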


    Create a VLAN interface on each of the interfaces of the hosts, for example:

    # modprobe 8021q

    # vconfig add eth1 100

    # ifconfig eth1.100 up



    Set the egress priority for the VLAN:

    Note: vconfig is a Linux command used here to configure the egress priority per VLAN. It applies only to flows that pass through the kernel, such as TCP/IP traffic and the applications built on it.

    # for i in {0..7}; do vconfig set_egress_map eth1.100 $i 3 ; done

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100

    Set egress mapping on device -:eth1.100:- Should be visible in /proc/net/vlan/eth1.100




    Note: For RHEL/CentOS 7 and above, the vconfig command is obsolete. Refer to HowTo Set Egress Priority VLAN on Linux for an alternative configuration option.
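
    One commonly used replacement for vconfig is the iproute2 "ip link" command, whose vlan type accepts an egress-qos-map option. The sketch below only prints the commands instead of running them. The function name vlan_egress_cmds is hypothetical, and the referenced post may describe a different method.

```shell
# vlan_egress_cmds: print iproute2 commands that create a VLAN interface and
# map kernel priorities 0-7 to a single egress priority.
# Hypothetical helper; it prints the commands and does not execute them.
vlan_egress_cmds() {
    vlan_dev=$1     # parent interface, e.g. eth1
    vlan_id=$2      # VLAN ID, e.g. 100
    vlan_up=$3      # egress priority, e.g. 3
    vlan_map=""
    vlan_skprio=0
    while [ "$vlan_skprio" -le 7 ]; do
        # Each entry has the form FROM:TO (kernel priority:VLAN priority).
        vlan_map="${vlan_map:+$vlan_map }$vlan_skprio:$vlan_up"
        vlan_skprio=$(( vlan_skprio + 1 ))
    done
    echo "ip link add link $vlan_dev name $vlan_dev.$vlan_id type vlan id $vlan_id egress-qos-map $vlan_map"
    echo "ip link set dev $vlan_dev.$vlan_id up"
}

vlan_egress_cmds eth1 100 3
```

    Review the printed commands before running them as root on the host.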



    Map skb_prio to UP (User Priority):

    This step uses an MLNX_OFED script to set the egress priority for traffic types that bypass the kernel stack (such as RoCE). It maps the kernel priority, called “skb_prio”, to a User Priority (UP) that is carried in the VLAN tag.

    As you can see in the output, all skb_prio values, including those of VLAN 100 (RoCE), are mapped to UP 3.

    # tc_wrap.py -i eth1 -u 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3

    UP 0

    UP 1

    UP 2

    UP 3

            skprio: 0

            skprio: 1

            skprio: 2 (tos: 8)

            skprio: 3

            skprio: 4 (tos: 24)

            skprio: 5

            skprio: 6 (tos: 16)

            skprio: 7

            skprio: 8

            skprio: 9

            skprio: 10

            skprio: 11

            skprio: 12

            skprio: 13

            skprio: 14

            skprio: 15

            skprio: 0 (vlan 100)

            skprio: 1 (vlan 100)

            skprio: 2 (vlan 100 tos: 8)

            skprio: 3 (vlan 100)

            skprio: 4 (vlan 100 tos: 24)

            skprio: 5 (vlan 100)

            skprio: 6 (vlan 100 tos: 16)

            skprio: 7 (vlan 100)

    UP 4

    UP 5

    UP 6

    UP 7




    Verification Procedure



    The following commands create two RoCE flows to host S1, as shown in the diagram in the Network Connectivity section.


    // Run on host S1

    # ib_write_bw -R --report_gbits --port=12500 -D 10 & ib_write_bw -R --report_gbits --port=12510 -D 10


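    // Run on host S2 (client; replace <S1_IP> with the IP address of host S1)

    # ib_write_bw -R --report_gbits --port=12500 -D 10 <S1_IP>


    // Run on host S3 (client; replace <S1_IP> with the IP address of host S1)

    # ib_write_bw -R --report_gbits --port=12510 -D 10 <S1_IP>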



    Read port priority counters from the Mellanox switch (via MLNX-OS CLI) as follows:

    switch (config) # show interfaces ethernet 1/1 counters priority 3



    333364              packets

    333364              unicast packets

    0                   multicast packets

    0                   broadcast packets

    362177148           bytes

    14814               pause packets

    8                   pause duration seconds



    333371               packets

    333362               unicast packets

    6                    multicast packets

    3                    broadcast packets

    368845148            bytes

    0                    pause packets



    Read port priority counters from the host as follows:

    # ethtool -S eth1 | grep prio_3


         rx_prio_3_packets: 5152
         rx_prio_3_bytes: 424080
         tx_prio_3_packets: 328209
         tx_prio_3_bytes: 361752914
         rx_pause_prio_3: 14812
         rx_pause_duration_prio_3: 0
         rx_pause_transition_prio_3: 0
         tx_pause_prio_3: 0
         tx_pause_duration_prio_3: 47848
         tx_pause_transition_prio_3: 7406
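
    To pick out only the pause frame counters for one priority, the ethtool output can be filtered with a short awk wrapper. This is a sketch; the function name pause_counters is hypothetical, and it assumes the counter names shown above (rx_pause_prio_N and tx_pause_prio_N), which may differ between driver versions.

```shell
# pause_counters: read "ethtool -S" output on stdin and print the PFC pause
# frame counters for one priority.
# Hypothetical helper; assumes counters named rx_pause_prio_N / tx_pause_prio_N.
pause_counters() {
    awk -v p="$1" -F':' '
        {
            name = $1
            value = $2
            gsub(/[ \t]/, "", name)     # strip indentation around the name
            gsub(/[ \t]/, "", value)    # strip padding around the value
            if (name == ("rx_pause_prio_" p)) rx = value
            if (name == ("tx_pause_prio_" p)) tx = value
        }
        END { printf "prio %s: rx_pause=%d tx_pause=%d\n", p, rx, tx }
    '
}

# On a configured host (not executed here):
#   ethtool -S eth1 | pause_counters 3
```

    For the sample output above, this prints "prio 3: rx_pause=14812 tx_pause=0". A nonzero pause counter on the RoCE priority suggests that PFC is active on that priority.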



    Advanced QoS

    When you want to have two priorities (for example, priorities 3 and 5) on the same VLAN, the RoCE application can be configured to use the sk_prio field.

    In the following example, the application selects priority 3 or 5 through sk_prio (the kernel priority, 0..15): sk_prio 3 is mapped to priority 3 and sk_prio 5 is mapped to priority 5. All other sk_prio values are unused and are mapped to priority 0.


    You should run the following command (only sk_prio 3 and 5 are mapped):

    # tc_wrap.py -i eth1 -u 0,0,0,3,0,5,0,0,0,0,0,0,0,0,0,0
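
    The 16-entry list passed with -u has one entry per sk_prio value (0..15), in order. Writing it by hand is error-prone, so a helper along these lines can generate it from sk_prio:UP pairs. This is a sketch; the function name up_list is hypothetical and is not part of MLNX_OFED.

```shell
# up_list: build the 16-entry comma-separated list for the -u option from
# "skprio:up" pairs. Unlisted sk_prio values are mapped to UP 0.
# Hypothetical helper for illustration only.
up_list() {
    up_result=""
    up_skprio=0
    while [ "$up_skprio" -le 15 ]; do
        up_value=0
        for up_pair in "$@"; do
            # The part before ":" is the sk_prio, the part after is the UP.
            if [ "${up_pair%%:*}" = "$up_skprio" ]; then
                up_value=${up_pair##*:}
            fi
        done
        up_result="${up_result:+$up_result,}$up_value"
        up_skprio=$(( up_skprio + 1 ))
    done
    echo "$up_result"
}

up_list 3:3 5:5    # prints 0,0,0,3,0,5,0,0,0,0,0,0,0,0,0,0
```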


    In this case, the application can select the appropriate priority, and its packets egress with the corresponding priority bits set in the VLAN tag.

    You might also want to configure the switches on the network to give higher bandwidth for each Traffic Class (TC) in the network (the priority is mapped to TC). Refer to HowTo Configure QoS on Mellanox Switches for additional information.