InfiniBand was designed as a lossless network. When congestion occurs, InfiniBand does not drop packets to relieve it; instead, a sender does not transmit data unless the receiver has room for it.
This post discusses InfiniBand (IB) credit loops and the roles of topology and routing algorithm choice in preventing credit loops.
- Understanding Up/Down InfiniBand Routing Algorithm
- Cabling Considerations for Clos-5 Networks
- VPI Gateway Considerations
- InfiniBand, Gateway and Long Haul Solutions
Understanding Credit Loops
Every HCA port and Switch port implements a credit mechanism between the senders and the receivers on the link. It is implemented for each link direction, and for each Virtual Lane (VL) that is implemented by the hardware at each end of the physical link.
Multiple credit ‘state machines’ and separate buffers per VL contribute to having:
- Traffic on one VL that will not interfere with traffic on another VL
- Within a VL, traffic in one direction that will not interfere with traffic in the reverse direction
While a physical link is active, the sender and receiver on each direction of each VL are continually communicating, so the sender knows how much buffer space remains for that VL on the receiving side. The amount of available space is expressed in terms of ‘credits’, where a credit represents a certain number of bytes. If the receiving VL buffer becomes full, the sender stops sending data until more space (credit) becomes available. When a receiver reports that there are zero available credits, it begins applying ‘backpressure’ to the sender.
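The credit exchange described above can be sketched as a toy model. This is only an illustration of the idea, not the wire protocol: the class name `CreditedLink`, the one-credit-per-packet cost, and the buffer size are all assumptions made for the example.

```python
from collections import deque


class CreditedLink:
    """One direction of one Virtual Lane: a sender gated by receiver credits."""

    def __init__(self, buffer_credits):
        self.credits = buffer_credits  # free receive-buffer space, in credits
        self.rx_buffer = deque()       # packets currently held by the receiver

    def try_send(self, packet, cost=1):
        """Send only if the receiver has advertised enough credits."""
        if cost > self.credits:
            return False               # back pressure: the packet stays with the sender
        self.credits -= cost
        self.rx_buffer.append((packet, cost))
        return True

    def consume(self):
        """The receiver drains one packet and returns its credits to the sender."""
        packet, cost = self.rx_buffer.popleft()
        self.credits += cost
        return packet


link = CreditedLink(buffer_credits=2)
results = [link.try_send(f"pkt{i}") for i in range(3)]
print(results)                    # the third send is refused, not dropped
link.consume()                    # the receiver frees one credit
print(link.try_send("pkt2"))      # the refused packet can now be sent
```

Note that the refused packet is never discarded; it simply waits at the sender until credit returns, which is what makes the link lossless.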
For HCA ports, the credit mechanism allows the adjacent switch to stop sending data to the server if the IB application is not consuming data fast enough. This can occur, for example, because of inefficient code, a kernel panic, or overloaded host CPUs. For outbound traffic from an HCA port, the credit mechanism enables data to stay in host memory until the adjacent switch is ready for it.
Within the fabric, if Port P of Switch S is receiving back pressure from a neighboring Switch N on Virtual Lane X, packets destined for Port P accumulate in its VL X transmit queue. If enough packets accumulate, Switch S runs out of receive buffer resources for VL X. It then reports that no credits are available for VL X, exerting back pressure on VL X on all of its ports. In this manner, back pressure from one port can propagate through the fabric, all the way from the destination HCA to the source HCA(s) if necessary.
If the source of back pressure persists, packets that have been waiting for too long in transmit queues will time out and be dropped. This Head of Queue (HOQ) timeout is specified via the Subnet Manager. Traffic running on reliable transport services is automatically retransmitted by the source HCAs. This timeout process is designed to prevent fabric deadlock. It is also the only time IB packets are deliberately dropped.
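As a rough aid for reasoning about how long a stalled packet may wait, the sketch below converts an exponent-style lifetime setting into seconds. It assumes the encoding we understand the IB specification to use for HOQ lifetime (4.096 µs × 2^n, with values of 20 or more meaning "never time out"); the function name is hypothetical, and the exact parameter names and defaults should be checked against your Subnet Manager documentation.

```python
def hoq_timeout_seconds(exponent):
    """Convert an exponent-encoded HOQ lifetime into seconds.

    Assumption: lifetime = 4.096 us * 2**exponent, and an exponent
    of 20 or more disables the timeout (infinite lifetime).
    """
    if exponent >= 20:
        return float("inf")
    return 4.096e-6 * (2 ** exponent)


for n in (12, 16, 18):
    print(n, f"{hoq_timeout_seconds(n) * 1e3:.1f} ms")
```

Under this assumption each step of the exponent doubles the time a packet may sit at the head of a queue before it is dropped.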
Like many other networks, InfiniBand dislikes loops. Specifically, it dislikes logical loops where link back pressure can create a deadlock situation. These are called credit loops. Although the HOQ timers periodically clear such a deadlock, performance suffers. A credit loop only represents a potential deadlock, which depends on the traffic at each link in the loop. However, at InfiniBand speeds such a deadlock can occur very quickly given the right traffic pattern.
The following diagram illustrates a very simple credit loop.
In this example:
- Four Switch ASICs are connected in a ring topology.
- A host adapter (HCA) connects a server (Node) to each Switch, with an additional HCA on Switch 4.
- Routes have been set by the Minhop routing algorithm.
- All traffic is sent on the same Virtual Lane.
- Node A streams messages to node C.
- Node B streams to node D.
- Node C streams to node A.
- Node E streams to node B.
With this traffic pattern, each IB cable carries two data streams.
If for some reason Node D’s application becomes unable to accept packets from Node B:
- HCA D exerts back pressure on Switch 4
- Switch 4 exerts back pressure on its link from Switch 3
- Switch 3 puts back pressure on its link from Switch 2, and also on HCA C
- Switch 2 exerts back pressure on HCA B and, because the A->C stream passes through Switch 2, also on its link from Switch 1
- Switch 1 puts back pressure on HCA A and on its link from Switch 4, which in turn stalls HCAs D and E
At this point, none of the HCAs can send data. If Node D does not recover in time, the Head of Queue timers in the switches expire and all packets are dropped. Packets using reliable transport services will be retried. Overall, both latency and bandwidth will suffer.
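The propagation sequence above can be reproduced with a deliberately coarse model in which a switch is treated as congested on the shared VL as soon as it forwards any stream into a congested switch. The switch names `S1` to `S4` and the direction each stream takes around the ring are assumptions inferred from the description, since the diagram is not reproduced here.

```python
# Streams from the example, as the sequence of switches each one traverses.
# Assumption: Minhop routed each stream in this direction around the ring.
streams = {
    "A->C": ["S1", "S2", "S3"],
    "B->D": ["S2", "S3", "S4"],
    "C->A": ["S3", "S4", "S1"],
    "E->B": ["S4", "S1", "S2"],
}


def propagate(streams, stalled_stream):
    """Return the switches in the order they run out of VL buffer space."""
    # The switch that delivers to the stalled destination fills up first.
    congested = [streams[stalled_stream][-1]]
    changed = True
    while changed:
        changed = False
        for path in streams.values():
            # A switch fills up once it forwards a stream into a congested switch.
            for here, nxt in zip(path, path[1:]):
                if nxt in congested and here not in congested:
                    congested.append(here)
                    changed = True
    return congested


order = propagate(streams, "B->D")
print(order)
# A stream is blocked at its source once its first switch is congested.
blocked = [name for name, path in streams.items() if path[0] in order]
print(blocked)
```

In this model every switch ends up congested and every stream is blocked at its source, which matches the outcome described in the text.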
Here are the necessary ingredients for an IB credit loop:
- A physical loop.
- A set of node-to-node traffic paths (on the same VL) in which there is a circular dependency among the paths. In the previous diagram, four paths set up by the Subnet Manager have a mutual dependency, which can create a deadlock if one destination stalls and traffic on all paths is sufficiently high.
Note: Every path that contributes to a circular dependency will contain at least three switches. The previous diagram shows four paths, each of which includes three switches.
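One way to test a set of routes for the circular dependency described above is to build a link dependency graph (traffic holding one inter-switch link waits for the next link on its path) and search it for a cycle. The following is a minimal sketch under stated assumptions: all paths share one VL, only switch-to-switch links are modelled, and the switch names and function names are hypothetical.

```python
def channel_dependencies(paths):
    """Map each directed link (u, v) to the links that traffic on it waits for."""
    deps = {}
    for path in paths:
        links = list(zip(path, path[1:]))
        for held, wanted in zip(links, links[1:]):
            deps.setdefault(held, set()).add(wanted)
            deps.setdefault(wanted, set())
    return deps


def has_credit_loop(paths):
    """Depth-first search for a cycle in the link dependency graph."""
    deps = channel_dependencies(paths)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {link: WHITE for link in deps}

    def visit(link):
        color[link] = GREY
        for nxt in deps[link]:
            if color[nxt] == GREY:
                return True          # back edge: circular dependency found
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[link] = BLACK
        return False

    return any(color[link] == WHITE and visit(link) for link in deps)


# The four 3-switch paths from the ring example (switch hops only).
ring_paths = [
    ["S1", "S2", "S3"],
    ["S2", "S3", "S4"],
    ["S3", "S4", "S1"],
    ["S4", "S1", "S2"],
]
print(has_credit_loop(ring_paths))      # all four paths together close the loop
print(has_credit_loop(ring_paths[:3]))  # removing any one path breaks it
```

A path that touches only two switches uses a single inter-switch link and therefore adds no dependency, which is consistent with the note that every contributing path contains at least three switches.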
Credit Loops in IB Fabrics
There are physical loops in all but the smallest practical IB fabrics. This can be more obvious for topologies such as torus, mesh, and hypercube, but it is equally true for the standard ‘fat tree’. Technically a tree topology has no loops, but IB ‘trees’ have many physical loops unless they are 2-tier (aka Clos-3) with only one L2 (spine/core) switch, that is, unless they are a true tree.
The following diagram shows a small Clos-3 fabric. It contains multiple physical 4-switch loops, similar to the previous diagram. One loop is highlighted. Our goal is to prevent physical loops from becoming credit loops.
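To see how quickly physical loops appear, the number of independent loops in a connected graph can be computed as E - V + 1 (edges minus vertices plus one). The sketch below applies this to a Clos-3 of switches only, assuming exactly one cable between every L1 switch and every L2 switch; the function name and the fabric sizes are illustrative.

```python
def independent_loops(num_leaf, num_spine):
    """Cycle rank (E - V + 1) of a Clos-3 with one link per leaf/spine pair."""
    vertices = num_leaf + num_spine
    edges = num_leaf * num_spine
    return edges - vertices + 1


print(independent_loops(4, 1))  # 0: a single spine gives a true tree
print(independent_loops(4, 2))  # 3: a second spine already adds loops
```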
Preventing Credit Loops
Although lightly-loaded IB fabrics containing credit loops can run for long periods without problems, the best practice is to eliminate credit loops in the fabric design.
The most common techniques for eliminating credit loops involve using:
- A good physical topology, including good node placement
- Better routing algorithms
- Multiple Virtual Lanes
Some of these methods impact system performance or increase overhead costs.
Loop Elimination via Physical Topology
The previous trivial loop example can be made free of credit loops in several ways. Four ways are shown in the following diagram:
Description of Each Sample Topology (a through d):
- (a) Removes one inter-switch connection, such as the one between Switches S2 and S3. This breaks the physical loop, so no credit loop is possible.
- (b) Adds a cable between opposite Switches, for example between Switches S2 and S4. This actually creates more physical loops; in this case, two new 3-Switch loops. However, because all IB routing algorithms are designed to use the shortest path first, all traffic streams in this fabric are ‘point-to-point’, involving only two Switches. No data stream goes through a third Switch. Even if a Switch runs out of VL buffer space, credit back pressure cannot propagate to form a circular deadlock.
- (c) Removes all nodes from one Switch (S2 in this example); Node B is moved from S2 to S3. This eliminates the 3-Switch paths that pass through Switches S1 and S3. In this design, 3-Switch traffic still travels through S2 (in this case A <-> B or C) and through S4 (the alternate paths for A <-> B or C). However, because of shortest-path-first routing, the Subnet Manager does not provision 3-Switch (or 4-Switch) paths through S1 or S3.
- (d) Leaves the physical topology unchanged, but uses Up/Down routing instead of Minhop routing. One Switch, in this case S1, is defined as the root switch. The Up/Down algorithm breaks the logical loop by disallowing the paths S2->S3->S4 and S4->S3->S2. Refer to the article Understanding Up/Down InfiniBand Routing Algorithm for more information.
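The Up/Down case can be illustrated with a short sketch of the core rule: rank each switch by its distance from the root, and reject any path that turns back "up" toward the root after it has gone "down". This is a simplified illustration rather than the Subnet Manager's implementation; the function names are hypothetical, and breaking rank ties by switch name is an assumption made only to keep the example complete.

```python
from collections import deque


def ranks_from_root(links, root):
    """Breadth-first distance of every switch from the root switch."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    rank = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in sorted(neighbours[node]):
            if nxt not in rank:
                rank[nxt] = rank[node] + 1
                queue.append(nxt)
    return rank


def is_legal_updn(path, rank):
    """A path is legal if it never goes 'up' after it has gone 'down'."""
    gone_down = False
    for here, nxt in zip(path, path[1:]):
        # Rank ties are broken by name here; an illustrative choice only.
        going_up = (rank[nxt], nxt) < (rank[here], here)
        if going_up and gone_down:
            return False
        gone_down = gone_down or not going_up
    return True


ring = [("S1", "S2"), ("S2", "S3"), ("S3", "S4"), ("S4", "S1")]
rank = ranks_from_root(ring, root="S1")
for path in (["S2", "S3", "S4"], ["S4", "S3", "S2"], ["S4", "S1", "S2"]):
    print("->".join(path), is_legal_updn(path, rank))
```

With S1 as the root, S3 is the lowest switch in the ring, so the only 3-Switch paths rejected are the two that pass through S3 between S2 and S4, which is enough to break the dependency cycle.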
In all of these cases, credit loops can be prevented by:
- Eliminating physical loops, or
- Leaving too few paths with three or more switches for a dependency loop to form
A practical example of this approach applies to 2-tier (Clos-3) fabrics, where the best practice is not to attach servers to L2 Switches. Here is an example of a Clos-3 that has credit loops. Note that removing Nodes X and Y breaks all loops:
Note: IB Gateways at L2 in a Clos-3 fabric are an exception, because the amount of traffic they exchange is too small to complete a back-pressure loop. Other nodes that do not exchange much data among themselves can also be placed at L2, even though their connections to L1 nodes might be high-bandwidth. See also, VPI Gateway Considerations.
Loop Elimination by Using Routing Algorithms
There are three common IB routing algorithms that can be used:
- Minimum hop (minhop)
- Up/down (updn)
- Fat tree (ftree)
Note: Other routing algorithms are much less common and are outside the scope of this post.
Minhop does not eliminate credit loops. Like all IB routing engines, it begins with a shortest-path spanning tree and attempts to statically balance all possible logical paths (between every node pair) across the available physical links. It fully connects all reachable nodes. Minhop is therefore a good default algorithm when the physical network is first being brought up, is being modified, or has been degraded by failed links or switches. For properly designed Clos-3 networks (two tiers of switch ASICs), where credit loops have been eliminated by the physical connections, Minhop is a viable and simple choice for a production cluster, but not for anything beyond Clos-3.
The Up/Down algorithm is a common choice for larger clusters for a few reasons:
- It guarantees there can be no credit loops.
- It is fairly tolerant of topologies that are not classical ‘fat trees’, for example when combining IB clusters that previously were independent. In that case, the Routing Chains feature is recommended to provide optimal routing for each of the combined IB clusters.
If you use the Up/Down algorithm, you are required to use a small SM configuration text file that defines the ‘root’ switches of the fabric. Because the algorithm disallows certain logical paths in order to break loops, it is possible that certain nodes might not be able to "ping" each other unless best practices for the Up/Down algorithm are followed.
Details about how Up/down routing works, and best practices for fabric design, are covered in Understanding Up/Down InfiniBand Routing Algorithm.
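As an illustration of the root switch file mentioned above, the sketch below writes a plain text file with one switch GUID per line. The layout is an assumption based on how OpenSM's root GUID file is commonly described, and the GUID values and file name are hypothetical placeholders.

```python
import tempfile
from pathlib import Path

# Hypothetical GUIDs for illustration only; substitute the node GUIDs
# of the switches you have chosen as roots.
root_guids = ["0x0002c90000000001", "0x0002c90000000002"]


def write_root_guid_file(path, guids):
    """Write one switch GUID per line (assumed root GUID file layout)."""
    Path(path).write_text("".join(f"{guid}\n" for guid in guids))


root_file = Path(tempfile.mkdtemp()) / "root_guids.conf"
write_root_guid_file(root_file, root_guids)
print(root_file.read_text(), end="")
```

With OpenSM, such a file is typically supplied through the `root_guid_file` option (`-a` on the command line) together with `routing_engine updn` (`-R updn`); confirm the exact option names against the documentation for your Subnet Manager version.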
Fat Tree routing is a cousin of Up/Down routing. Like the Up/Down algorithm, it accepts a root switch list and guarantees a loop-free logical topology. However, Fat Tree can only route symmetrical (pure) and near-symmetrical fat trees, so it is not useful for more creative tree-like fabrics. Because it focuses on nearly pure fat trees, the Fat Tree algorithm can balance logical paths more globally, so its routing is more efficient than that of Up/Down, which balances more locally.
Note: Both the Up/Down and Fat Tree algorithms attempt to operate in the absence of a root switch list. Using Up/Down without a root switch list is not recommended, because correct auto-detection of the root switches is not guaranteed. Fat Tree can run without a root switch list only on a pure, theoretical fat tree topology; otherwise it fails. When the Fat Tree algorithm explicitly knows the roots, it reduces restrictions on the topologies that it can route.
Other Loop Elimination Methods
Other techniques are outside the scope of this post. Note that DOR (for Hypercube and Enhanced HyperCube topologies) and Torus2QoS (for 2D and 3D mesh and torus topologies) are also credit-loop-free routing engines.
Other Routing Topics
This post only covers unicast routing. Applications that rely on efficient or large-scale multicast operations require attention to the multicast routing algorithm, which is outside the scope of this post.