Understanding QoS Configuration for RoCE (Profile Selection)

Version 7

    This post explains the classification and QoS features used in RoCE deployments.

     


     

    The need for QoS in network for RoCE

    RDMA was initially designed for InfiniBand networks that run HPC applications. InfiniBand networks are lossless by design, and HPC traffic is comparatively network-friendly, so these networks place lower demands on network QoS.

    Data center networks run arbitrary traffic mixes, and therefore require QoS enforcement to cope with the various cases.

     

    What do we mean by Network QoS?

    Network QoS is the ability to classify flows in the network and to give each of them different characteristics, such as buffer allocation, flow control, RED/ECN and scheduling.

    Each flow can then be treated differently by the network switches.

    See also Network Considerations for Global Pause, PFC and QoS with Mellanox Switches and Adapters.

     

    Network Flow Classification

    There are two ways to classify packets in the network, based on the Ethernet and IP headers:

    1. Using the L2 Priority bits in the Ethernet header

    2. Using the DSCP bits in the IP header

     

    What are the L2 Priority bits (PCP)?

    The L2 priority bits, also known as the PCP (Priority Code Point) bits, are 3 bits in the 2-byte Tag Control Information field of the VLAN header, which is part of the Ethernet header.
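
    To make the bit layout concrete, here is a minimal sketch that splits the 16-bit VLAN tag control value into its fields, assuming the standard 802.1Q layout (3 PCP bits, 1 DEI bit, 12 VLAN ID bits). The function name parse_vlan_tci is ours and is used only for illustration.

```python
def parse_vlan_tci(tci: int) -> dict:
    """Split a 16-bit 802.1Q Tag Control Information value into its fields."""
    if not 0 <= tci <= 0xFFFF:
        raise ValueError("TCI must fit in 16 bits")
    return {
        "pcp": (tci >> 13) & 0x7,  # 3 most significant bits: L2 priority
        "dei": (tci >> 12) & 0x1,  # 1 bit: drop eligible indicator
        "vid": tci & 0x0FFF,       # 12 least significant bits: VLAN ID
    }


# Illustrative value: L2 priority 3 on VLAN 100
tci = (3 << 13) | 100
print(parse_vlan_tci(tci))
```

    A frame without a VLAN tag carries no PCP bits at all, which is why Trust L2 depends on VLANs being present on the link.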

     

    [Figure: L2 VLAN header]

     

    What are the ToS/DSCP/ECN bits?

     

    Those bits are part of an 8-bit field in the IP header.

    • ToS byte: 8 bits
    • DSCP bits: the 6 MSBs of the ToS byte
    • ECN bits: the 2 LSBs of the ToS byte

     

     

    [Figure: the IP header, showing the ToS/DSCP/ECN field]

     

    For example, ToS 105 maps to DSCP 26, ECN 01
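
    The arithmetic behind this example can be sketched as follows. The helper names split_tos and make_tos are ours and serve only to illustrate the bit layout described above.

```python
from typing import Tuple


def split_tos(tos: int) -> Tuple[int, int]:
    """Return (dscp, ecn) for an 8-bit ToS value."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS must fit in 8 bits")
    return tos >> 2, tos & 0b11  # 6 MSBs are DSCP, 2 LSBs are ECN


def make_tos(dscp: int, ecn: int = 0) -> int:
    """Build a ToS byte from a 6-bit DSCP and a 2-bit ECN value."""
    if not 0 <= dscp <= 0x3F or not 0 <= ecn <= 0b11:
        raise ValueError("DSCP is 6 bits and ECN is 2 bits")
    return (dscp << 2) | ecn


dscp, ecn = split_tos(105)
print(dscp, format(ecn, "02b"))  # prints: 26 01
```

    Going the other way, make_tos(26, 0b01) gives back 105.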

     

     

    What is Trust L2/L3?

     

    The network switches and the network adapters on the servers must be configured with where in the packet to look in order to classify it.

    The trust options include the following (the switches offer more options):

    • Trust Layer-2 (L2) - Trust the PCP bits in the VLAN header. The network element takes the priority from the L2 priority bits and maps it to the right buffer/queue/priority.
    • Trust Layer-3 (L3) - Trust the DSCP bits in the IP header. The network element takes the priority from the DSCP bits and maps it to the right buffer/queue/priority.
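
    The two trust modes can be sketched as a small classification function. This is an illustration only: the function name classify is ours, and the DSCP-to-priority rule used here (the 3 most significant DSCP bits) is an assumption about a commonly used default mapping. The actual mapping is configurable and should be checked on your adapters and switches.

```python
from typing import Optional


def classify(trust: str,
             pcp: Optional[int] = None,
             dscp: Optional[int] = None,
             default_priority: int = 0) -> int:
    """Pick a priority (0-7) for a packet according to the trust mode."""
    if trust == "L2":
        # An untagged frame has no PCP bits, so fall back to the default.
        return pcp if pcp is not None else default_priority
    if trust == "L3":
        # ASSUMPTION: priority = DSCP >> 3 (hypothetical default mapping).
        return (dscp >> 3) if dscp is not None else default_priority
    raise ValueError("trust must be 'L2' or 'L3'")


print(classify("L2", pcp=3))    # priority taken from the VLAN header
print(classify("L3", dscp=26))  # priority derived from the IP header
```

    Under the assumed mapping, DSCP 26 lands on priority 3.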

     


     

     

    Traffic Control Mechanisms

    There are two mechanisms for traffic control; both can be enabled, or only one of them.

     

    1. Using Flow Control or PFC

    2. Using DCQCN (aka ECN) - RoCE Congestion Control

     

    Flow Control and PFC

    Flow Control is a link-layer protocol. Enabling Link Level Flow Control or PFC in the network creates a lossless network: no packets are dropped because of congestion.

    PFC does this per priority, while Link Level Flow Control does it per port.

    When a network element becomes congested, it sends pause frames on the link to pause the traffic until the congestion is released.

    PFC can be enabled with Trust L2 or Trust L3.
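
    The per-priority pause behavior can be illustrated with a toy model. This is a sketch under our own assumptions, not how any particular switch implements it: the class name PfcPort, the XOFF/XON thresholds and the "PAUSE"/"RESUME" return values are all hypothetical.

```python
class PfcPort:
    """Toy model of per-priority pause on one port (thresholds are hypothetical)."""

    def __init__(self, xoff: int, xon: int, priorities: int = 8):
        if xon >= xoff:
            raise ValueError("xon must be lower than xoff")
        self.xoff, self.xon = xoff, xon
        self.depth = [0] * priorities       # buffered packets per priority
        self.paused = [False] * priorities  # pause state per priority

    def enqueue(self, prio: int, count: int = 1):
        """Buffer packets; return "PAUSE" when the priority crosses XOFF."""
        self.depth[prio] += count
        if not self.paused[prio] and self.depth[prio] >= self.xoff:
            self.paused[prio] = True
            return "PAUSE"
        return None

    def dequeue(self, prio: int, count: int = 1):
        """Drain packets; return "RESUME" when the priority falls to XON."""
        self.depth[prio] = max(0, self.depth[prio] - count)
        if self.paused[prio] and self.depth[prio] <= self.xon:
            self.paused[prio] = False
            return "RESUME"
        return None


port = PfcPort(xoff=4, xon=2)
port.enqueue(3, 3)
print(port.enqueue(3))  # priority 3 reaches the XOFF threshold
print(port.paused[0])   # other priorities are not affected
```

    With Link Level Flow Control the same idea applies, but to the whole port rather than to a single priority.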

     

    DCQCN

    DC-QCN is a congestion control protocol for RoCE. It relies on the Explicit Congestion Notification (ECN) feature of the network switches to signal congestion, so that the sender is told to throttle its injection rate.
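
    The switch side of ECN can be sketched as a RED-style marking decision. This is a simplified illustration under our own assumptions: the function names and the threshold values (k_min, k_max, p_max) are hypothetical, and real switches expose their own parameters for this.

```python
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11  # 2-bit ECN codepoints


def mark_probability(queue_len: float, k_min: float,
                     k_max: float, p_max: float) -> float:
    """RED-style probability of marking a packet, given the queue length."""
    if k_min >= k_max:
        raise ValueError("k_min must be lower than k_max")
    if queue_len <= k_min:
        return 0.0  # no congestion: never mark
    if queue_len >= k_max:
        return 1.0  # heavy congestion: always mark
    return p_max * (queue_len - k_min) / (k_max - k_min)


def mark_congestion(ecn_bits: int) -> int:
    """Set Congestion Experienced only on ECN-capable packets."""
    return CE if ecn_bits in (ECT_0, ECT_1) else ecn_bits


# Hypothetical thresholds, for illustration only
print(mark_probability(200, k_min=100, k_max=300, p_max=0.5))  # prints: 0.25
print(format(mark_congestion(ECT_1), "02b"))                   # prints: 11
```

    A packet sent with ToS 105 carries ECN bits 01, so it is ECN-capable and eligible for marking.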

    For more info about DC-QCN read: Understanding RoCEv2 Congestion Management.

     

    RDMA QoS Mapping on the Adapter and Linux Kernel

     

    Linux applications written over RDMA have several ways to set the DSCP or L2 priority bits.

    Linux uses skprio (socket priority), which needs to be mapped to an L2 priority (e.g. using the vconfig command).

     

    RDMA CM Considerations

    The RDMA CM API doesn't have the ability to set the Service Level (SL), which maps to the L2 priority in the Ethernet header; it can set only the ToS bits. Applications written on top of RDMA CM can configure the ToS.

    If an application is already written and doesn't let the user set the ToS, MLNX_OFED provides an option to change the default ToS for RDMA CM applications.

     

    ToS to skprio mapping

    The ToS to skprio mapping follows a default table; see Default ToS to skprio mapping on Linux.

    ToS 105, for example, is mapped to skprio 2.
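
    The default lookup can be sketched as below. The table reflects our understanding of the IPv4 ToS-to-priority table in the Linux kernel (indexed by bits 1 to 4 of the ToS byte) and should be treated as an assumption; verify it against the referenced mapping for your kernel version. The function name tos_to_skprio is ours.

```python
# Linux traffic-control priority classes used by the default table
TC_PRIO_BESTEFFORT = 0
TC_PRIO_BULK = 2
TC_PRIO_INTERACTIVE_BULK = 4
TC_PRIO_INTERACTIVE = 6

# ASSUMPTION: 16-entry default table, four consecutive entries per class
IP_TOS2PRIO = (
    [TC_PRIO_BESTEFFORT] * 4
    + [TC_PRIO_BULK] * 4
    + [TC_PRIO_INTERACTIVE] * 4
    + [TC_PRIO_INTERACTIVE_BULK] * 4
)


def tos_to_skprio(tos: int) -> int:
    """Map an 8-bit ToS value to the default socket priority."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS must fit in 8 bits")
    return IP_TOS2PRIO[(tos & 0x1E) >> 1]  # bits 1-4 select the table entry


print(tos_to_skprio(105))  # prints: 2
```

    Note that only four bits of the ToS byte take part in the lookup, so many different ToS values share the same skprio.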

     

    Example

     

    An application that uses RDMA CM sets ToS 105, which is mapped by default to skprio 2.

    If an L2 priority is needed, use vconfig to map skprio 2 to egress L2 priority 3.

     

    ToS 105 > skprio 2 > L2 priority 3

    # cma_roce_tos -d mlx5_0 -t 105                  # 105 is mapped to skprio 2

    # vconfig set_egress_map <vlan-interface> 2 3    # Map sk_prio=2 to SL=3 (L2 priority 3)
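
    The effect of the egress map can be modeled as a simple lookup. This is only a sketch: the function name egress_l2_priority is ours, and the fallback to L2 priority 0 for unmapped skprio values is our understanding of the Linux VLAN driver, so treat it as an assumption.

```python
def egress_l2_priority(egress_map: dict, skprio: int) -> int:
    """Return the L2 priority written into the VLAN header for a given skprio."""
    # ASSUMPTION: skprio values without an explicit mapping go out with priority 0.
    return egress_map.get(skprio, 0)


# Equivalent of the set_egress_map example above: skprio 2 -> L2 priority 3
egress_map = {2: 3}

print(egress_l2_priority(egress_map, 2))  # prints: 3
print(egress_l2_priority(egress_map, 6))  # prints: 0
```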

     

     

    Getting Started

    First, determine which network characteristics suit you best.

    Mellanox recommends enabling both ECN and PFC on the network adapters and the switches for best performance. However, some customers may not want to enable flow control in the network, or may not want to add VLANs, for various reasons. If the traffic passes through routers, it is better to configure the network to trust L3 (classify by the DSCP field); if the network is a pure L2 network, trust L2 should be fine.

    We built several network profiles to help you select the one that suits you best. See Getting Started with RoCE Configuration to select the most suitable network profile, and set the adapters and switches accordingly.

     

    Example 1 (Fast and Easy)

    Assume we wish to configure two servers connected via one switch in order to run storage benchmarks.

    Let's write down some possible network considerations, so that we can select the most suitable network profile.

    Assume the adapters are ConnectX-4 and the network switches are Spectrum switches.

    Network Considerations / Implication / Decision

    The network is, for example, only one switch.

    We don't want to do any major switch configuration; we wish to keep it as close to out-of-the-box (OOB) as possible, with minimum configuration.

    We wish to keep a mostly OOB configuration.

    There is only one class of traffic, as only RDMA traffic is sent over the switches; there is no other traffic.

    We can enable ECN on all priorities.

    PFC is disabled.

    No specific QoS configuration (switch buffers, priority mapping and so on).

     

     

    Going over the profiles defined in Getting Started with RoCE Configuration, we can see that the suitable profile is Profile 1.

     

    Profile 1: No QoS, lossy

    • No VLAN
    • ECN enabled on all priorities (priority 0 can be used)
    • No PFC
    • No DSCP marking
    • All default configuration

     

    Here are the Configuration examples for the adapter and switch:

     

    Example 2 (Lossy, multi-tier network)

    Assume we wish to configure a storage server installed with NVMe devices and several hosts (NVMe clients).

    Let's write down some possible network considerations, so that we can select the most suitable network profile.

    Assume the adapters are ConnectX-4 and the network switches are Spectrum switches.

    Network Considerations / Implication / Decision

    The network is, for example, a two-tier network consisting of routers (router ports) with a routing protocol such as OSPF in between.

    It is better to use Trust L3 network classification, as it is not common to add VLANs between the router ports.

    In addition, the IP header doesn't change between the routers, while the Ethernet header is replaced.

    If, for other reasons, we still wish to set the network to Trust L2, we need to make sure that:

    - All the links carry VLANs.

    - The L2 priority is mapped from one subnet to the other.

    The network contains many traffic flows: RDMA, TCP or others.

    Classification is important. We need to configure the network to differentiate between the RDMA traffic and the other traffic flows.

    RDMA traffic should be marked with ToS 105/DSCP 26 (for example).

    Assume also that the customer requires that flow control not be enabled in the network, without going into the reasons.

    FC/PFC should be disabled.

    ECN-only configuration; the network will be lossy.

    VLANs are optional between the servers and the ToR switches.

    No need to add L2 priority to this VLAN. When selecting Trust L3, classification uses the DSCP field of the packets.

     

     

    Going over the profiles defined in Getting Started with RoCE Configuration, we can see that the suitable profile is Profile 3.

     

    Profile 3: L3 based QoS, lossy

    • Trust L3
    • No VLAN
    • ECN priority 3
    • No PFC
    • DSCP marking 26
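
    The reasoning in the two examples can be summarized as a small decision sketch. It covers only the two profiles discussed in this post; the names Profile and select_profile, and the three yes/no inputs, are ours and are not part of any Mellanox tool. Cases that need flow control or other combinations return None and should be looked up in Getting Started with RoCE Configuration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Profile:
    name: str
    trust: Optional[str]             # "L3", or None when nothing is trusted
    vlan: bool
    ecn_priorities: Tuple[int, ...]  # priorities with ECN enabled
    pfc: bool
    dscp: Optional[int]              # DSCP marking, or None for no marking


PROFILE_1 = Profile("Profile 1: No QoS, lossy", None, False,
                    tuple(range(8)), False, None)
PROFILE_3 = Profile("Profile 3: L3 based QoS, lossy", "L3", False,
                    (3,), False, 26)


def select_profile(routed: bool, mixed_traffic: bool,
                   flow_control_allowed: bool) -> Optional[Profile]:
    """Return the matching profile for the two examples, else None."""
    if flow_control_allowed:
        return None  # lossless profiles are outside the two examples here
    if not routed and not mixed_traffic:
        return PROFILE_1  # Example 1: single switch, RDMA traffic only
    if routed and mixed_traffic:
        return PROFILE_3  # Example 2: routed multi-tier, mixed traffic
    return None


print(select_profile(routed=False, mixed_traffic=False,
                     flow_control_allowed=False).name)
print(select_profile(routed=True, mixed_traffic=True,
                     flow_control_allowed=False).name)
```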

     

    Here are the Configuration examples for the adapter and switch: