Understanding QoS Configuration for RoCE

Version 10

    This post explains classification and QoS configuration for RoCE deployments.




    The need for QoS in RoCE Networks

    RDMA was initially designed for InfiniBand networks that run HPC applications. InfiniBand networks are lossless by specification, and HPC applications are usually optimized for network performance, so their traffic is more network-friendly. HPC networks therefore have less need for QoS configuration.

    Data center networks, on the other hand, run arbitrary traffic scenarios. They place higher demands on QoS so that the network can cope with varied traffic patterns.


    What is Network QoS?

    QoS provides the ability to classify flows into priorities (or classes) and to assign each priority characteristics such as buffer allocation, flow control, RED/ECN, queuing, and scheduling. For more information, see Network Considerations for Global Pause, PFC and QoS with Mellanox Switches and Adapters.


    Network Flow Classification

    In the IP/Ethernet headers, there are two ways to classify packets in the network:

    1. By using PCP bits on the VLAN header

    2. By using DSCP bits on the IP header


    What are the PCP Bits?

    The PCP bits are 3 bits that are part of the VLAN header.


    [Figure: L2 VLAN header]
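
    As a minimal sketch of where the PCP bits sit, the snippet below splits the 16-bit tag field of an 802.1Q VLAN header into priority, drop-eligible bit, and VLAN ID. The function name and the example tag (priority 3 on VLAN 100) are illustrative assumptions, not values from this post.

```python
def parse_vlan_tci(tci: int) -> dict:
    """Split a 16-bit 802.1Q Tag Control Information value into its fields."""
    if not 0 <= tci <= 0xFFFF:
        raise ValueError("TCI must fit in 16 bits")
    return {
        "pcp": (tci >> 13) & 0x7,  # 3 priority bits (top of the field)
        "dei": (tci >> 12) & 0x1,  # 1 drop-eligible bit
        "vid": tci & 0x0FFF,       # 12-bit VLAN ID
    }


# Hypothetical tag: priority 3 on VLAN 100.
example = parse_vlan_tci((3 << 13) | 100)
print(example)  # {'pcp': 3, 'dei': 0, 'vid': 100}
```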


    What are the DSCP Bits?

    The DSCP bits are 6 bits of the 8-bit ToS field in the IP header.



    [Figure: ToS field in the IP header, showing the DSCP and ECN bits]
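
    The relation between a ToS value and its DSCP bits can be sketched as below: the upper 6 bits are the DSCP and the lower 2 bits are the ECN field. The helper name is an assumption for illustration; ToS 105 is the value used later in this post.

```python
def split_tos(tos: int) -> tuple:
    """Return (dscp, ecn) for an 8-bit ToS value."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS must fit in 8 bits")
    return tos >> 2, tos & 0x3  # upper 6 bits, lower 2 bits


dscp, ecn = split_tos(105)
print(dscp, ecn)  # 26 1
```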





    What is Trust L2/L3?

    The network switches and the network adapters on the servers must be configured with the packet header field they should use to classify packets.

    • Trust Layer-2 (L2) - Trust the PCP bits in the VLAN header. The network element takes the priority from the L2 priority bits and maps it to the right buffer/queue/priority.
    • Trust Layer-3 (L3) - Trust the DSCP bits in the IP header. The network element takes the priority from the DSCP bits and maps it to the right buffer/queue/priority.
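
    The two trust modes can be sketched as a lookup that reads either the PCP or the DSCP bits, depending on the configuration. Both mapping tables, the fallback to priority 0, and the function name are hypothetical; real switches and adapters let you configure these mappings.

```python
# Hypothetical priority tables; real devices make these configurable.
PCP_TO_PRIORITY = {pcp: pcp for pcp in range(8)}             # identity map
DSCP_TO_PRIORITY = {dscp: dscp >> 3 for dscp in range(64)}   # 8 DSCP values per priority


def classify(trust: str, pcp=None, dscp=None) -> int:
    """Pick the priority for a packet according to the trust mode."""
    if trust == "L2":
        if pcp is None:
            return 0  # assumption: untagged packets fall back to priority 0
        return PCP_TO_PRIORITY[pcp]
    if trust == "L3":
        if dscp is None:
            return 0  # assumption: non-IP packets fall back to priority 0
        return DSCP_TO_PRIORITY[dscp]
    raise ValueError("trust must be 'L2' or 'L3'")


# Same packet, different trust modes: PCP 5 in the VLAN tag, DSCP 26 in the IP header.
print(classify("L2", pcp=5, dscp=26))  # 5
print(classify("L3", pcp=5, dscp=26))  # 3
```

    The point of the sketch is that the same packet can land in different queues depending on which header the network element trusts, so the trust mode should be consistent along the path.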





    Traffic Control Mechanisms

    The following two traffic control mechanisms can be enabled together or separately:

    1. Flow Control (PFC)

    2. Congestion Control (DCQCN)


    1. Flow Control

    Flow control is a link layer protocol. Enabling link level flow control or PFC in the network creates a lossless network, ensuring that no packets are dropped.

    PFC pauses traffic per priority, while link level flow control pauses traffic per port.

    When a network element becomes congested, it sends a pause frame on the link, asking the sender to pause traffic until the congestion is released.
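
    The per-priority pause behavior can be illustrated with a toy model: each priority has its own queue, and a queue asks the sender to pause when it fills past a threshold and to resume once it drains. The class name and the threshold values are made up for illustration and do not reflect any real switch defaults.

```python
class PfcQueue:
    """Toy model of one priority queue with pause/resume thresholds."""

    def __init__(self, xoff: int = 8, xon: int = 4):
        self.depth = 0        # packets currently buffered
        self.xoff = xoff      # pause the sender at or above this depth
        self.xon = xon        # resume the sender at or below this depth
        self.paused = False   # state signalled to the upstream sender

    def enqueue(self) -> None:
        self.depth += 1
        if not self.paused and self.depth >= self.xoff:
            self.paused = True    # would send a pause frame for this priority

    def dequeue(self) -> None:
        if self.depth > 0:
            self.depth -= 1
        if self.paused and self.depth <= self.xon:
            self.paused = False   # would tell the sender to resume


queues = {prio: PfcQueue() for prio in range(8)}  # one queue per priority

for _ in range(8):            # a burst arrives on priority 3 only
    queues[3].enqueue()
print(queues[3].paused, queues[0].paused)  # True False

for _ in range(4):            # the queue drains to the resume threshold
    queues[3].dequeue()
print(queues[3].paused)       # False
```

    With link level flow control there would be a single pause state for the whole port, so the burst on priority 3 would also stop traffic on the other priorities.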


    2. DCQCN

    DC-QCN is a congestion control protocol for RoCE. It relies on the Explicit Congestion Notification (ECN) feature of network switches to tell the sender to throttle its injection rate when congestion occurs.

    For more information about DC-QCN, see Understanding RoCEv2 Congestion Management.
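
    To give a feel for how a sender throttles, here is a simplified sketch of the rate-decrease step, loosely following the published DC-QCN description. The class and method names, the gain value, and the 100 Gb/s line rate are assumptions for illustration; rate recovery and all timers are omitted, so this is not a complete or authoritative implementation.

```python
G = 1 / 256  # gain for the congestion estimate; value assumed for illustration


class DcqcnSender:
    """Rate-decrease half of a DC-QCN style sender (recovery omitted)."""

    def __init__(self, line_rate_gbps: float):
        self.rate = line_rate_gbps    # current sending rate
        self.target = line_rate_gbps  # rate to recover toward later
        self.alpha = 1.0              # congestion estimate between 0 and 1

    def on_congestion_notification(self) -> None:
        """The receiver reported ECN-marked packets: cut the rate."""
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - G) * self.alpha + G

    def on_quiet_period(self) -> None:
        """No notification arrived for a while: decay the estimate."""
        self.alpha *= 1 - G


sender = DcqcnSender(100.0)
sender.on_congestion_notification()
print(sender.rate)  # 50.0
```

    The longer the sender goes without a notification, the smaller alpha becomes, and the gentler the next rate cut is.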


    RDMA QoS Mapping on the Adapter and Linux Kernel

    Linux applications written over RDMA have several ways to set the DSCP or L2 priority bits.

    Linux uses skprio (socket priority), which needs to be mapped to an L2 priority (e.g. using the vconfig command).


    RDMA CM Considerations

    The RDMA CM API cannot set the Service Level (SL) that maps to the L2 priority in the Ethernet header; it can only set the ToS bits. Applications written on top of RDMA CM can configure the ToS.

    If an application is already written and does not offer an option to set the ToS, the ToS can be set through MLNX_OFED.


    ToS to skprio Mapping

    The ToS to skprio mapping uses a default table; see Default ToS to skprio mapping on Linux.

    ToS 105, for example, is mapped to skprio 2.
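
    As a hedged sketch, the default lookup can be reproduced as below. The table is written from my understanding of the mainline Linux kernel default (indexed by ToS bits 1 to 4); the linked default table remains the authoritative reference, and the function name is an assumption.

```python
# Linux traffic-control priority classes used by the default table.
BESTEFFORT, BULK, INTERACTIVE_BULK, INTERACTIVE = 0, 2, 4, 6

# Default table as understood here, indexed by ToS bits 1..4.
TOS2PRIO = [BESTEFFORT] * 4 + [BULK] * 4 + [INTERACTIVE] * 4 + [INTERACTIVE_BULK] * 4


def tos_to_skprio(tos: int) -> int:
    """Map an 8-bit ToS value to the default socket priority."""
    return TOS2PRIO[(tos & 0x1E) >> 1]


print(tos_to_skprio(105))  # 2
```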



    An application that uses RDMA CM sets ToS 105, which is mapped by default to skprio 2.

    If an L2 priority is needed, vconfig is used to map skprio 2 to egress L2 priority 3.

    ToS 105 > skprio 2 > L2 priority 3

    # cma_roce_tos -d mlx5_0 -t 105                  # 105 is mapped to skprio 2

    # vconfig set_egress_map <vlan-interface> 2 3    # Map sk_prio=2 to SL=3 (L2 priority 3)
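
    The effect of the egress map can be sketched as a dictionary lookup that chooses the PCP bits written into the VLAN tag. The function name and VLAN ID 100 are hypothetical; the sketch also assumes that socket priorities without an entry in the map leave with L2 priority 0.

```python
def vlan_tci(skprio: int, egress_map: dict, vlan_id: int) -> int:
    """Build the 16-bit VLAN tag field for a packet leaving a VLAN interface."""
    pcp = egress_map.get(skprio, 0)          # assumption: unmapped skprio uses priority 0
    return (pcp << 13) | (vlan_id & 0x0FFF)  # PCP in the top 3 bits, VLAN ID in the low 12


egress_map = {2: 3}  # mirrors: vconfig set_egress_map <vlan-interface> 2 3
tci = vlan_tci(skprio=2, egress_map=egress_map, vlan_id=100)  # VLAN 100 is hypothetical
print(tci >> 13)  # 3
```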


    Getting Started

    As a start, you need to determine which network characteristics suit you best.

    Mellanox recommends enabling both ECN and PFC on the network adapters and switches for best performance. However, some customers may not want to enable flow control in the network, or may not want to add VLANs, for various reasons. If traffic passes through routers, it is recommended to configure the network to Trust L3 (classify by the DSCP field). If the network is an L2 network, it is recommended to configure it to Trust L2.
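
    The guidance above can be condensed into a small decision helper. This is only a sketch of the rules stated in this section; the function name, parameters, and returned keys are hypothetical and do not correspond to any Mellanox tool.

```python
def recommend(crosses_routers: bool, flow_control_allowed: bool = True) -> dict:
    """Summarize the recommended settings for a RoCE network."""
    return {
        # Routed networks classify on DSCP; pure L2 networks classify on PCP.
        "trust": "L3 (DSCP)" if crosses_routers else "L2 (PCP)",
        "ecn": True,                       # ECN is recommended in every case
        "pfc": flow_control_allowed,       # PFC only if flow control is acceptable
        "lossless": flow_control_allowed,  # without flow control the network is lossy
    }


print(recommend(crosses_routers=True, flow_control_allowed=False))
```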

    For all network profiles, see Recommended Network Configuration Examples for RoCE Deployment and select the profile that matches your network to help you set it up.


    Example 1 (Fast and Easy)

    Assume two servers, connected via one switch, are set up to run storage benchmarks. Let's write down some possible network considerations in order to select the most suitable networking profile.

    Assume the adapter cards are ConnectX-4 and the network switches are Spectrum switches.

    Network Considerations and Implied Decisions

    Network consideration: The network consists, for example, of only one switch. We do not want to do any major switch configuration; we want to keep the switch as close to out-of-the-box (OOB) as possible, with minimal configuration.
    Implied decision: Keep a mostly OOB configuration.

    Network consideration: There is only one class of traffic, because only RDMA traffic is sent over the switch. There is no other traffic.
    Implied decision: ECN can be enabled on all priorities. PFC is disabled. No specific QoS configuration (switch buffers, priority mapping, and so on) is needed.



    Example 2 (Lossy, Multi-tier Network)

    Assume a storage server with NVMe devices and several hosts (NVMe clients). Let's write down some possible network considerations in order to select the most suitable networking profile.

    Assume the adapter cards are ConnectX-4 and the network switches are Spectrum switches.

    Network Considerations and Implied Decisions

    Network consideration: The network is, for example, a two-tier network consisting of routers (router ports) with a routing protocol such as OSPF in between.
    Implied decision: Trust L3 classification is the better choice, because it is not common to add VLANs between router ports. In addition, the IP header does not change between routers, while the Ethernet header is replaced. If, for other reasons, you still want to set the network to Trust L2, make sure that:
    - all the links contain VLANs
    - the L2 priority is mapped from one subnet to the other

    Network consideration: The network carries many traffic flows: RDMA, TCP, and others.
    Implied decision: Classification is important. Configure the network to differentiate RDMA traffic from the other traffic flows. RDMA traffic should be marked with, for example, ToS 105 / DSCP 26.

    Network consideration: Assume also that the customer requires that flow control not be enabled in the network, for whatever reason.
    Implied decision: FC/PFC should be disabled. With an ECN-only configuration, the network will be lossy.

    Network consideration: VLANs are optional between the servers and the ToR switches.
    Implied decision: There is no need to add an L2 priority to this VLAN. With Trust L3, packets are classified by their DSCP field.