This post discusses design considerations for networks that require QoS and loss-less (no packet drops) behavior.
- Priority Flow Control: Build Reliable Layer 2 Infrastructure
- Introduction to 802.1Qbb (Priority-based Flow Control — PFC)
- Introduction to 802.1Qaz (Enhanced Transmission Selection – ETS)
- Global Pause: IEEE 802.3x, port-based flow control.
- Priority Flow Control (PFC): IEEE 802.1Qbb, priority-based flow control.
- Quality of Service (QoS): The ability to give better service to some customers (traffic flows) than to others.
- Traffic Classes (TC) or Class of Service (CoS): A traffic class (or class of service) is a group of flows that receive the same service characteristics (e.g. buffer size, scheduling). Flows with different priorities may be mapped to the same traffic class.
- Egress Scheduling: The mechanism that decides the order in which an egress port transmits packets. It can be strict priority, round robin, or a mixture of the two.
- Enhanced Transmission Selection (ETS): IEEE 802.1Qaz, a standard for weighted round robin egress scheduling.
What is 802.3x Flow Control (Global Pause)?
The Ethernet standard (802.3) was designed to be unreliable. It gives no guarantee that packets reach their destination; reliability was left to upper layer protocols (e.g. TCP).
Later on, the IEEE 802.3x (Annex 31B of 802.3) flow control standard was defined for applications that cannot rely on upper layer protocols for reliability. It lets a receiver send feedback about its buffer state (e.g. imminent overflow) to the sender.
The pause action (XOFF) is a control frame sent by the receiver to alert the sender that the receiver buffer is under stress and may overflow. The sender responds by stopping transmission of new packets until the receiver is ready to accept them again. The pause frame carries a timeout value. The sender resumes when this timeout expires or when it receives an XON message (a pause frame with a timeout of zero), whichever comes first.
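To make the XOFF/XON exchange concrete, here is a minimal sketch of how an 802.3x PAUSE frame is laid out. It assumes the standard MAC Control encoding (destination 01:80:C2:00:00:01, EtherType 0x8808, opcode 0x0001, and a 16-bit timeout counted in quanta of 512 bit times). The source MAC address and the function names are hypothetical and chosen only for illustration.

```python
import struct

PAUSE_DST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
QUANTUM_BITS = 512  # one pause quantum equals 512 bit times


def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Return a PAUSE frame without the 4-byte FCS; quanta=0 acts as XON."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in 16 bits")
    header = PAUSE_DST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH", PAUSE_OPCODE, quanta)
    # Pad to the 60-byte minimum frame size (64 bytes once the FCS is added).
    return (header + payload).ljust(60, b"\x00")


def pause_seconds(quanta: int, link_bps: float) -> float:
    """Convert a pause timeout in quanta to seconds for a given link speed."""
    return quanta * QUANTUM_BITS / link_bps


src = bytes.fromhex("020000000001")  # hypothetical, locally administered address
xoff = build_pause_frame(src, 0xFFFF)  # pause for the maximum time
xon = build_pause_frame(src, 0)        # resume immediately
print(len(xoff), xoff[12:18].hex())    # 60 88080001ffff
print(pause_seconds(0xFFFF, 10e9))     # maximum pause on a 10 Gb/s link, in seconds
```

On a 10 Gb/s link the maximum timeout works out to roughly 3.4 ms, which is why a stressed receiver typically has to keep refreshing XOFF for as long as its buffer stays full.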
IEEE 802.3x has a basic disadvantage: once a link is paused, the sender cannot transmit any packets on it. A port that uses flow control (global pause) therefore cannot carry multiple traffic flows that require different QoS behavior, because a pause stops all types of traffic on that port, including the flows that require high QoS. Moreover, if the link is between two switches, the pause may block server flows that do not need to be paused.
What is 802.1Qbb Priority Flow Control (PFC)?
IEEE 802.1Qbb PFC extends the basic IEEE 802.3x pause mechanism to eight classes, each of which can be paused independently. This lets applications that require flow control coexist on the same link with applications that can manage without it. In an L2 network, PFC uses the priority bits within the VLAN tag (IEEE 802.1p) to differentiate the eight types of flows that can be subject to flow control.
Note: PFC and Global Pause flow control cannot run together on the same interface. Only one of them can be enabled at a time.
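The per-priority behavior described above shows up directly in the PFC frame layout. The following is a minimal sketch, assuming the standard 802.1Qbb encoding: the same MAC Control destination and EtherType as a PAUSE frame, opcode 0x0101, a 16-bit priority-enable vector, and eight 16-bit timeout fields (one per priority). The function name and the source MAC address are hypothetical.

```python
import struct
from typing import Mapping

MAC_CONTROL_DST = bytes.fromhex("0180c2000001")
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101


def build_pfc_frame(src_mac: bytes, pause_quanta: Mapping[int, int]) -> bytes:
    """Build a PFC frame (without FCS).

    pause_quanta maps a priority (0-7) to its pause time in quanta.
    Priorities that are not listed are left untouched by the frame.
    """
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    enable_vector = 0
    times = [0] * 8
    for priority, quanta in pause_quanta.items():
        if not 0 <= priority <= 7:
            raise ValueError("priority must be in 0..7")
        if not 0 <= quanta <= 0xFFFF:
            raise ValueError("quanta must fit in 16 bits")
        enable_vector |= 1 << priority  # mark this priority's time field as valid
        times[priority] = quanta
    header = MAC_CONTROL_DST + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *times)
    return (header + payload).ljust(60, b"\x00")


# Pause only priority 3; traffic on the other seven priorities keeps flowing.
frame = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0xFFFF})
print(frame[14:18].hex())  # 01010008
```

Compared with a global PAUSE frame, the only structural change is that the single timeout becomes a vector, which is what allows one congested class to be stopped without stalling the rest of the link.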
What is Quality of Service (QoS)?
QoS is the ability to give different levels of service to different types of applications (e.g. better service for more important traffic flows and lower service for less important ones).
QoS does not require PFC. A network can have multiple traffic classes, none of which requires no-drop (loss-less) behavior.
When QoS is disabled in the network, the switches use a single ingress buffer. When QoS is enabled, the switches must use multiple ingress buffers (one for each traffic class) to differentiate between the traffic flows.
For example, suppose QoS is enabled with three traffic classes: TC1, TC2, and TC3. Each one has its own buffer. If the TC1 buffer is full (packets are being dropped or traffic is paused), the TC2 and TC3 buffers and their QoS policy are not affected.
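The buffer isolation in the three-class example can be illustrated with a toy model. This is only a sketch under simplifying assumptions: buffer sizes are counted in packets, a full buffer drops instead of pausing, and the class `IngressPort` and the packet names are hypothetical.

```python
from collections import deque


class IngressPort:
    """Toy ingress port with one independent buffer per traffic class."""

    def __init__(self, capacities):
        # capacities: {traffic class name: buffer size in packets}
        self.capacities = dict(capacities)
        self.buffers = {tc: deque() for tc in capacities}
        self.drops = {tc: 0 for tc in capacities}

    def enqueue(self, tc, packet):
        """Store the packet in its class buffer; drop it if that buffer is full."""
        buf = self.buffers[tc]
        if len(buf) >= self.capacities[tc]:
            self.drops[tc] += 1
            return False
        buf.append(packet)
        return True


port = IngressPort({"TC1": 2, "TC2": 2, "TC3": 2})
for i in range(5):
    port.enqueue("TC1", f"tc1-pkt-{i}")    # overflows the TC1 buffer
accepted = port.enqueue("TC2", "tc2-pkt-0")  # TC2 still has room
print(port.drops, accepted)  # {'TC1': 3, 'TC2': 0, 'TC3': 0} True
```

With a single shared buffer (QoS disabled), the same burst on TC1 would have consumed the space that TC2 needed.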
When do I need to enable QoS? When do I need to enable PFC?
If there is more than one traffic class in the network, QoS should be enabled so that each traffic class receives the proper service.
If one of the traffic classes must not drop packets, PFC should be enabled on the priority that is mapped to that traffic class.
For example, if priority 3 is mapped to TC1 and an RDMA application runs on this priority, TC1 should be loss-less (no drops on the TC1 buffer). Therefore, PFC is enabled on priority 3.
Refer to HowTo Run RoCE and TCP over L2 Enabled with PFC for an example.
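The rule above (enable PFC on every priority that maps to a loss-less traffic class) can be written down as a small helper. The priority-to-TC table below is a hypothetical configuration that only mirrors the example of priority 3 mapped to TC1; it is not a recommended or default mapping.

```python
# Hypothetical mapping for illustration: priority 3 carries RDMA and maps to TC1.
PRIORITY_TO_TC = {0: "TC0", 1: "TC0", 2: "TC0", 3: "TC1",
                  4: "TC2", 5: "TC2", 6: "TC2", 7: "TC2"}
LOSSLESS_TCS = {"TC1"}


def pfc_priorities(priority_to_tc, lossless_tcs):
    """Return the priorities that need PFC because their TC must not drop."""
    return sorted(p for p, tc in priority_to_tc.items() if tc in lossless_tcs)


def pfc_enable_bitmap(priorities):
    """Encode the priorities as an 8-bit mask (bit n set means priority n)."""
    bitmap = 0
    for p in priorities:
        bitmap |= 1 << p
    return bitmap


priorities = pfc_priorities(PRIORITY_TO_TC, LOSSLESS_TCS)
print(priorities, hex(pfc_enable_bitmap(priorities)))  # [3] 0x8
```

If several priorities were mapped to TC1, all of them would need PFC, since a drop on any one of them would make the shared traffic class lossy.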
What is egress scheduling?
Egress scheduling is the algorithm according to which a port transmits packets. There are two basic scheduling modes:
1. Weighted Round Robin (WRR) - Traffic classes are served in round robin order, each according to its configured weight. ETS is an example of WRR.
2. Strict Priority (SP) - Packets are transmitted strictly in order of priority: a lower priority is served only when no higher-priority packet is waiting.
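The difference between the two modes is easiest to see on a small example. The sketch below is a simplified, packet-count model (real ETS allocates bandwidth rather than packet counts), and the function and packet names are hypothetical.

```python
from collections import deque


def strict_priority(queues):
    """queues: list of deques ordered from highest to lowest priority."""
    order = []
    while any(queues):
        for q in queues:
            if q:  # always serve the highest-priority non-empty queue
                order.append(q.popleft())
                break
    return order


def weighted_round_robin(queues, weights):
    """Serve up to `weight` packets from each queue per round."""
    order = []
    while any(queues):
        for q, weight in zip(queues, weights):
            for _ in range(weight):
                if not q:
                    break
                order.append(q.popleft())
    return order


def make_queues():
    return [deque(["A1", "A2", "A3"]), deque(["B1", "B2", "B3"])]


sp_order = strict_priority(make_queues())
wrr_order = weighted_round_robin(make_queues(), weights=[2, 1])
print(sp_order)   # ['A1', 'A2', 'A3', 'B1', 'B2', 'B3']
print(wrr_order)  # ['A1', 'A2', 'B1', 'A3', 'B2', 'B3']
```

Under strict priority, queue B waits until queue A is empty, so a busy high-priority class can starve the others. Under WRR, queue B is served in every round, in proportion to its weight.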
What is the effect of egress scheduling configuration on QoS?
The egress scheduling configuration affects the bandwidth that each traffic class actually receives, in particular when the port is congested.
Refer to HowTo Configure QoS on Mellanox Switches (SwitchX) for examples.
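As a rough illustration of how WRR weights translate into bandwidth, the sketch below computes each class's share of a congested link in an idealized model. The weights, the 40 Gb/s link speed, and the function name are assumptions chosen for the example, not values taken from any switch configuration.

```python
def wrr_bandwidth_shares(link_gbps, weights, active=None):
    """Split link bandwidth among backlogged classes in proportion to weight.

    weights: {traffic class: WRR weight}
    active:  classes that currently have traffic to send (default: all)
    """
    if active is None:
        active = set(weights)
    total = sum(weights[tc] for tc in active)
    return {tc: link_gbps * weights[tc] / total
            for tc in weights if tc in active}


weights = {"TC1": 50, "TC2": 30, "TC3": 20}
print(wrr_bandwidth_shares(40, weights))
# {'TC1': 20.0, 'TC2': 12.0, 'TC3': 8.0}
print(wrr_bandwidth_shares(40, weights, active={"TC1", "TC2"}))
# {'TC1': 25.0, 'TC2': 15.0}
```

The second call shows that the weights set a minimum share rather than a cap in this model: when TC3 is idle, its bandwidth is redistributed to the classes that still have traffic.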
What is the recommendation for non-green field installations?
Assuming I already have a running setup that I don't want to touch, and I would like to add more servers with an application that requires a loss-less network (such as RDMA), what can I do?
Will PFC work along with Global Pause (Flow Control) on the same network?
In general, PFC and Global Pause can be enabled on the same network, on different ports. This is not the preferred setup, because some of the applications running on the globally paused ports (e.g. TCP traffic) may be affected.
In a large network, it is highly recommended to enable QoS and PFC (on the proper priority) on the links between the switches. Otherwise, all traffic going through those switches will be paused.
What is DCBX?
DCBX is an extension to the Link Layer Discovery Protocol (LLDP) that allows Data Center Bridging (DCB) parameters (e.g. PFC) to be configured automatically from the network to the servers.
The use of DCBX is optional. In most cases, the server is configured manually with the various settings, including PFC.