HowTo Run RoCE over L2 Enabled with PFC (ESXi)

Version 11

    This post describes how to configure Priority Flow Control (PFC) for RoCE on ESXi Rel. 6.0 (native driver) with a Mellanox Ethernet switch.

    Note: ESXi Rel. 5.5 does not include the RDMA driver.

     

     


     

    Setup

    • 2x Hosts
    • 2x ConnectX-3 or ConnectX-4 adapters, in any combination
    • 1x Mellanox Ethernet Spectrum Switch SN2700

     

    switch-conf.png

     

    Switch Configuration

    Depending on the switch you have installed, refer to the following links for instructions for configuring PFC on the switches:

     

    For PFC configuration on third-party switches, refer to RDMA/RoCE and Storage Solutions.

     

    Driver Configuration

    1. Configure PFC

    Configure PFC on Mellanox drivers (nmlx drivers). Note that there is a different driver for each adapter.

    In this example, PFC is enabled on priority 3 in both the receive (Rx) and transmit (Tx) directions.

     

    The following command enables PFC on the host. The parameters, pfctx (PFC TX) and pfcrx (PFC RX), are specified per host. If you have more than one card on the server, all ports must be enabled with PFC. Note that global pause will be disabled even if it is configured.

    The value is an 8-bit bitmap, one bit per priority. Bit 0 (the least significant bit) corresponds to priority 0, so priority 3 maps to the fourth bit (counting priorities 0, 1, 2, 3). With only that bit set, the value is 0x08 = 00001000b (binary).
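The bitmap arithmetic can be sketched in plain shell. The helper names pfc_bitmap and pfc_params are hypothetical (they are not part of esxcli or the Mellanox drivers); the sketch only computes the value and the parameter string, and uses one bitmap for both pfctx and pfcrx so the two stay equal.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any Mellanox or VMware tool.

# Build the PFC bitmap for a list of priorities (0-7).
# Bit N of the bitmap corresponds to priority N.
pfc_bitmap() {
    mask=0
    for prio in "$@"; do
        case "$prio" in
            [0-7]) mask=$(( mask | (1 << prio) )) ;;
            *) echo "invalid priority: $prio" >&2; return 1 ;;
        esac
    done
    printf '0x%02x\n' "$mask"
}

# Build the string passed to "esxcli system module parameters set -p",
# using the same bitmap for pfctx and pfcrx.
pfc_params() {
    bitmap=$(pfc_bitmap "$@") || return 1
    printf 'pfctx=%s pfcrx=%s\n' "$bitmap" "$bitmap"
}

pfc_bitmap 3      # prints 0x08
pfc_params 3      # prints pfctx=0x08 pfcrx=0x08
pfc_bitmap 3 4    # prints 0x18
```

The output of pfc_params can be pasted into the -p argument of the esxcli commands for either driver module.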

     

    For ConnectX-3 specify:

    # esxcli system module parameters set -m nmlx4_en -p "pfctx=0x08 pfcrx=0x08"

    For ConnectX-4 specify:

    # esxcli system module parameters set -m nmlx5_core -p "pfctx=0x08 pfcrx=0x08"

     

     

    Note: When PFC is enabled, global pause will be operationally disabled (no matter what is configured for global pause flow control).

     

    Note: pfctx and pfcrx must be equal.

     

    To read the current module configuration, run:

     

    For ConnectX-3:

    # esxcli system module parameters list -m nmlx4_en

     

    For ConnectX-4:

    # esxcli system module parameters list -m nmlx5_core

     

    2. Configure Global RDMA PCP (L2 Egress Priority) and DSCP Values (Optional)

    The service level (sl) field of the RDMA address handle carries the user priority and is mapped to the PCP portion of the VLAN tag.

    The traffic class (tc) field in the GRH of the address handle is mapped to the DSCP bits of the IP header.

     

    You can force PCP and DSCP values (for RDMA traffic only).

    The RDMA driver (nmlx5_rdma) supports global settings for the PCP (sl) and DSCP (traffic class) through the following module parameters:

    1. pcp_force: values: (-1) - 7, default: (-1 = off)

      The specified value will be set as the PCP for all outgoing RoCE traffic, regardless of the sl value specified. This parameter cannot be enabled when dscp_to_pcp is enabled.

    2. dscp_force: values: (-1) - 63, default: (-1 = off)

      The specified value will be set as the DSCP portion (6 bits) of the Type of Service (ToS) (8 bits) for all outgoing RoCE traffic, regardless of the traffic class specified.

    3. dscp_to_pcp: values 0 (off) - 1 (on), default: 0

      When enabled, the three MSBs of the DSCP value will be considered as the PCP for all outgoing RoCE traffic. If dscp_force is not used, then the DSCP value used for mapping is taken from the traffic class field in the GRH header. Otherwise, it takes the value set in dscp_force.

      This parameter cannot be enabled when pcp_force is enabled.
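The bit relationships described for dscp_force and dscp_to_pcp can be checked with shell arithmetic. This is a minimal sketch: the function names are hypothetical, and the ToS helper assumes the two low-order (ECN) bits of the ToS byte are zero.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# PCP derived from a DSCP value when dscp_to_pcp is on:
# the three most significant bits of the 6-bit DSCP field.
dscp_to_pcp_value() {
    echo $(( ($1 >> 3) & 7 ))
}

# ToS byte carrying a given DSCP value: DSCP occupies the upper six
# bits; the lower two bits are assumed to be zero here.
dscp_to_tos_byte() {
    printf '0x%02x\n' $(( ($1 & 63) << 2 ))
}

dscp_to_pcp_value 24    # prints 3 (24 = 011000b, MSBs 011)
dscp_to_tos_byte 24     # prints 0x60
```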

     

     

    For example, to force the PCP value to egress with a value of 3:

     

    For ConnectX-3:

    # esxcli system module parameters set -m nmlx4_rdma -p "pcp_force=3"

     

    For ConnectX-4:

    # esxcli system module parameters set -m nmlx5_rdma -p "pcp_force=3"

     

    The following table illustrates how these module parameters interact (using pcp_force=3 and dscp_force=24 in the rows where those parameters are enabled):

    pcp_force | dscp_force | dscp_to_pcp | Egress PCP value | Egress DSCP value
    -1 (off)  | -1 (off)   | 0 (off)     | uses the value of ‘sl’ | uses the value of ‘traffic class’
    -1 (off)  | -1 (off)   | 1 (on)      | uses the 3 MSBs of ‘traffic class’ | uses the value of ‘traffic class’
    -1 (off)  | 24 (on)    | 0 (off)     | uses the value of ‘sl’ | 24 (uses the value of dscp_force)
    -1 (off)  | 24 (on)    | 1 (on)      | 3 (uses the 3 MSBs of dscp_force, 24 = 011000b) | 24 (uses the value of dscp_force)
    3 (on)    | -1 (off)   | 0 (off)     | 3 (uses the value of pcp_force) | uses the value of ‘traffic class’
    3 (on)    | -1 (off)   | 1 (on)      | Invalid | Invalid
    3 (on)    | 24 (on)    | 0 (off)     | 3 (uses the value of pcp_force) | 24 (uses the value of dscp_force)
    3 (on)    | 24 (on)    | 1 (on)      | Invalid | Invalid
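The decision logic in the table can be written out as a small shell function. This is a sketch of the rules as stated above, not driver code: egress_values is a hypothetical name, and the traffic class argument is assumed to already hold the 6-bit DSCP value.

```shell
#!/bin/sh
# Hypothetical model of the table above, for illustration only.
# Usage: egress_values PCP_FORCE DSCP_FORCE DSCP_TO_PCP SL TRAFFIC_CLASS
# Prints "pcp=<n> dscp=<n>", or "invalid" for a rejected combination.
# Assumption: TRAFFIC_CLASS is given as a 6-bit DSCP value.
egress_values() {
    pcp_force=$1 dscp_force=$2 dscp_to_pcp=$3 sl=$4 tclass=$5

    # pcp_force and dscp_to_pcp cannot be enabled together.
    if [ "$pcp_force" -ge 0 ] && [ "$dscp_to_pcp" -eq 1 ]; then
        echo "invalid"
        return 1
    fi

    # Egress DSCP: dscp_force when set, otherwise the traffic class.
    if [ "$dscp_force" -ge 0 ]; then
        dscp=$dscp_force
    else
        dscp=$tclass
    fi

    # Egress PCP: pcp_force, else the 3 MSBs of the DSCP, else sl.
    if [ "$pcp_force" -ge 0 ]; then
        pcp=$pcp_force
    elif [ "$dscp_to_pcp" -eq 1 ]; then
        pcp=$(( (dscp >> 3) & 7 ))
    else
        pcp=$sl
    fi
    echo "pcp=$pcp dscp=$dscp"
}

egress_values -1 24 1 0 0    # prints pcp=3 dscp=24
egress_values 3 -1 1 0 0     # prints invalid
```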

     

    3. Reboot the servers so that the new module parameters take effect.

     

     

    ESXi VLAN Configuration

    The topology below describes two machines. Both of them have vmnic5 as the adapter uplink.

    dvswitch-top.jpg

    To set the VLAN ID for the traffic to 100, perform the following steps:

    1. Edit the distributed port group settings.

    2. Choose "VLAN" from the left panel.

    3. Set the VLAN type to "VLAN".

    4. Set the VLAN tag to "100".

    5. Click "OK".

     

    For more information, refer to the VMware documentation.

     

    Verification

    For verification purposes, when you are using Mellanox switches you can lower the speed of one of the switch ports, forcing the use of PFC pause frames due to insufficient bandwidth:

    switch (config) # interface ethernet 1/1 shutdown
    switch (config) # interface ethernet 1/1 speed 10000
    switch (config) # no interface ethernet 1/1 shutdown 

     

    Note: You can also create congestion by other means to trigger PFC pause frames. A simple option is to have two hosts send traffic to a third host.

     

    Run traffic between the hosts on priority 3.

     

    Note: The final PCP and DSCP values will be decided by the pcp_force, dscp_force, and dscp_to_pcp module parameters as described above.

     

    Note: If PFC for a priority is not enabled by pfctx and pfcrx, the HCA counters for that priority will not increment, and the data will be counted on priority 0 instead.

     

    Verify that both the HCA and the switch transmitted and received pause frames on priority 3. On the host, run:

    # vsish -e cat /net/pNics/vmnic5/stats | grep -e "Pause\|PerPrio"

       PerPrio[0]
       rxPause : 0
       txPause : 0
       PerPrio[1]
       rxPause : 0
       txPause : 0
       PerPrio[2]
       rxPause : 0
       txPause : 0
       PerPrio[3]
       rxPause : 3348591
       txPause : 12217
       PerPrio[4]
       rxPause : 0
       txPause : 0
       PerPrio[5]
       rxPause : 0
       txPause : 0
       PerPrio[6]
       rxPause : 0
       txPause : 0
       PerPrio[7]
       rxPause : 0
       txPause : 0
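To avoid scanning the counters by eye, the output can be piped through a small filter that prints only the priorities with nonzero pause counters. This is a sketch that assumes the output format shown above and that awk is available in the shell where it runs; pause_summary is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical filter: reads the grep output on stdin and prints each
# priority whose rxPause or txPause counter is nonzero.
pause_summary() {
    awk '
        # "PerPrio[N]" starts a new priority block; keep only the digits.
        /PerPrio/ { prio = $0; gsub(/[^0-9]/, "", prio) }
        /rxPause/ { rx[prio] = $NF }
        /txPause/ { tx[prio] = $NF }
        END {
            for (p = 0; p < 8; p++)
                if (rx[p] + tx[p] > 0)
                    printf "priority %d: rxPause=%d txPause=%d\n", p, rx[p], tx[p]
        }
    '
}
```

For example, appending "| pause_summary" to the vsish command above would print a single line for priority 3 given the counters shown.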