This post discusses the up/down InfiniBand routing algorithm.
This post is fairly basic. However, the reader should have a good understanding of networking and familiarity with InfiniBand concepts.
- opensm(8) - Linux man page
- Understanding the GUID Routing Order File (SM Configuration)
- Understanding the Root GUID File (SM Configuration)
- HowTo Prevent InfiniBand Credit Loops
- VPI Gateway Considerations
Several InfiniBand routing engines can be configured on a network, including Min Hop, Up/Down, Down/Up, Fat Tree, and others (see opensm). Up/Down (UpDn) and Fat Tree are the most commonly used InfiniBand routing algorithms for Clos/fat tree networks.
Note: This includes trees built using director switches and 1U switches. The two levels of physical switch enclosure represent 3 tiers of switch ASICs because each director switch contains 2 tiers of ASICs.
Like most IB routing algorithms, UpDn uses the shortest path(s) available between any two endpoints. It can route any collection of IB-connected switches and HCAs. Most importantly and unlike MinHop, UpDn guarantees credit-loop free routing in the fabric. UpDn begins with a list of the switch ASICs that form the ‘root’ or top level of the fabric. This list is set with the Subnet Manager (SM) flag --root_guid_file. It is a simple text file with a line for each globally unique ID (GUID) of a root ASIC. Although UpDn has an option to auto-discover the root ASICs, it is strongly recommended that a root GUID list be supplied. The root GUID list must be updated if a root switch ASIC is replaced or if the topology is expanded, and every SM must have an identical copy of the GUID list.
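As an illustration of the file format described above, the following minimal Python sketch writes a root GUID file with one GUID per line. The function name, the file name root_guids.conf, and the choice of the 0x-prefixed 16-digit form are assumptions made for this example; the two GUIDs are placeholders and must be replaced with the GUIDs of the actual root ASICs.

```python
import tempfile
from pathlib import Path


def write_root_guid_file(path, guids):
    """Write one GUID per line, normalised to '0x' plus 16 hex digits.

    The helper and the normalised form are illustrative assumptions,
    not part of opensm itself.
    """
    lines = []
    for guid in guids:
        value = int(guid, 16)  # accepts "0x..." or bare hex digits
        if not 0 <= value < 2**64:
            raise ValueError(f"not a 64-bit GUID: {guid!r}")
        lines.append(f"0x{value:016x}")
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


# Demo with placeholder GUIDs, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "root_guids.conf"  # hypothetical file name
    written = write_root_guid_file(
        demo_path, ["0xf45214030011e4f0", "0002C903007FBBE0"]
    )
    contents = demo_path.read_text()
```

Generating the file from a single source list makes it easier to keep an identical copy on every SM.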
To begin routing the fabric, the UpDn algorithm starts with the root switch ASICs, which we will refer to as Distance 0 (zero). The algorithm then finds every switch ASIC that is one hop (one link) away from the roots. These ASICs can be thought of as Distance 1, because they are one hop away from the root switches. The algorithm then discovers all switch ASICs that are two hops from the root switches; these can be thought of as Distance 2. The process continues until every switch ASIC has been assigned a distance from the roots. The following diagram shows an example 3-tier fabric with the distances assigned.
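The distance assignment described above is a multi-source breadth-first search. The following Python sketch shows the idea on a small hypothetical fabric; the switch names (S1, M1, L1, and so on) and the adjacency-list representation are assumptions for illustration and do not reflect how opensm stores the topology.

```python
from collections import deque


def assign_distances(links, roots):
    """Return {switch: hop distance to the nearest root} using BFS.

    links maps each switch to the switches it is cabled to.
    """
    distance = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        switch = queue.popleft()
        for neighbor in links.get(switch, ()):
            if neighbor not in distance:  # first visit is the shortest
                distance[neighbor] = distance[switch] + 1
                queue.append(neighbor)
    return distance


# Hypothetical 3-tier fabric: two roots, two middle switches, two leaves.
links = {
    "S1": ["M1", "M2"],
    "S2": ["M1", "M2"],
    "M1": ["S1", "S2", "L1", "L2"],
    "M2": ["S1", "S2", "L1", "L2"],
    "L1": ["M1", "M2"],
    "L2": ["M1", "M2"],
}
distances = assign_distances(links, roots=["S1", "S2"])
```

Seeding the queue with all roots at once is what allows multiple roots to share Distance 0.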
This process generates a Breadth-First Spanning Tree (BFSP), which is analogous to the approach of the Spanning Tree Protocol (STP) used in Ethernet. Unlike STP, UpDn allows multiple roots and strives to provision as many paths as possible between each pair of end nodes. The UpDn algorithm then finds all of the possible shortest paths between every pair of endpoints. Next, UpDn discards any path that contains a hop from a Distance N ASIC to a Distance N+1 ASIC, followed by a hop back to Distance N. That is, it discards any path that goes "down" (away from the roots) and then "up" (toward the roots). Legal paths can go up, or down, or up and then down, or stay at the same level, but never down and then up. By discarding these paths and not provisioning them in the switches, UpDn guarantees that the routing contains no logical loops and no credit loops, which could otherwise lead to traffic stoppage.
The following diagram shows examples of allowed and disallowed paths.
Note: The two potential paths between nodes E and F are both the same length (same number of hops) but only one obeys the UpDn rule. The disallowed path contains a DnUp segment.
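The UpDn rule can be expressed as a simple check over the distances of the switches along a path. The Python sketch below is an illustration under assumptions: the switch names and distances are hypothetical, and hops between switches at the same distance are treated as neutral, which may differ from how opensm handles them.

```python
def is_legal_updn_path(path, distance):
    """Return True unless the path goes down and later goes up again.

    path is the ordered list of switches traversed; distance maps each
    switch to its hop distance from the roots.
    """
    gone_down = False
    for here, there in zip(path, path[1:]):
        if distance[there] > distance[here]:      # hop away from the roots
            gone_down = True
        elif distance[there] < distance[here] and gone_down:
            return False                          # an up hop after a down hop
    return True


# Hypothetical distances for a small fabric.
distance = {"S1": 0, "M1": 1, "M2": 1, "L1": 2, "L2": 2}

up_then_down = is_legal_updn_path(["L1", "M1", "S1", "M2", "L2"], distance)
down_then_up = is_legal_updn_path(["M1", "L1", "M2"], distance)
```

Here the first path climbs to the root and then descends, so it is allowed; the second descends to a leaf and climbs again, so it is discarded.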
The credit loop-free property of UpDn (and Fat Tree) routed topology is critical for reliable network operation.
However, since some potential paths are discarded, there are cases where a pair of end nodes can become disconnected and unable to communicate with one another.
When the calculate_missing_routes option is set to TRUE (the default value) in the opensm configuration file, it guarantees connectivity between all endpoints in the fabric in a credit loop-free manner with UpDn and Fat Tree routing.
For example, consider a different fabric that has nodes connected ‘above’ the leaf switches (nodes G, H, and J). Nodes connected to L1 switches (A, B, C, etc.) have legal UpDn paths to nodes G, H, and J. There is a legal UpDn path between nodes G and H. However, there is no legal path between G and J, and these nodes will not be able to communicate with each other. Setting calculate_missing_routes to TRUE will provide credit-loop free routing between all endpoints.
There may be cases where nodes do not need to communicate with each other (e.g. storage nodes that do not communicate among themselves). However, this is rare. The best practice for a Clos-5 3-tier fabric is not to connect nodes to the L2 switches.
Note: The diagrams above apply equally well to two different cases: A fabric built from 3 tiers of 1U switches, and a fabric that uses two director switches with 1U switches below them. In the latter case, nodes E, F, and G represent nodes cabled to the leaf modules of the director switches.
When assigning logical paths to physical links, the UpDn algorithm tries to map the same number of paths per link to maximize use of the available bandwidth. This balancing is done statically, without knowledge of actual workloads and traffic patterns. Path balancing decisions are made locally, at each switch, without assuming anything about the physical topology. The resulting path assignments may not be optimal for typical Clos/Fat Tree workloads.
A routing option called ‘scatter-ports’ is available for MinHop and UpDn routing engines. It instructs the routing algorithm to randomize the local assignments of paths to links, which often results in better link utilization. The scatter-ports option requires an integer argument, which is the seed for the random number generator. It is recommended to use a prime number for the seed; a seed of zero turns off randomization.
Note: The scatter-ports configuration is available only when the SM runs on a host (or UFM). It is not supported when the SM runs on a switch.
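To make the balancing and the effect of scatter-ports more concrete, here is a simplified Python sketch. It is not the opensm implementation: the function name, the port numbers, the destination LIDs, and the tie-breaking scheme are assumptions chosen to illustrate the idea of keeping the number of paths per link equal while optionally randomizing which link gets which path.

```python
import random


def assign_paths(destinations, candidate_ports, seed=0):
    """Assign each destination to one of several equal-cost ports.

    Each destination goes to a port that currently carries the fewest
    paths. With seed == 0 ties go to the first listed port; with any
    other seed, ties are broken by a seeded random number generator.
    """
    rng = random.Random(seed) if seed else None
    load = {port: 0 for port in candidate_ports}
    assignment = {}
    for dest in destinations:
        least = min(load.values())
        tied = [port for port in candidate_ports if load[port] == least]
        port = rng.choice(tied) if rng else tied[0]
        assignment[dest] = port
        load[port] += 1
    return assignment, load


# Six hypothetical destination LIDs spread over three uplink ports.
plain, plain_load = assign_paths(range(1, 7), [1, 2, 3])
scattered, scattered_load = assign_paths(range(1, 7), [1, 2, 3], seed=7919)
```

In both cases every port ends up with the same number of paths; only the mapping of individual destinations to ports changes when a nonzero seed is used.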
1. The routing engine algorithm is configured with the flag --routing_engine of the opensm command. The supported engines are: minhop, updn, dnup, file, ftree, lash, dor, torus-2QoS, dfsssp, sssp, pqft, chain.
If the SM is running on an InfiniBand switch, run the following command in the MLNX-OS CLI:
switch (config) # ib sm routing-engines ftree updn minhop
With this ordering, if there is an issue in the fabric and Fat Tree routing fails, the SM falls back to updn rather than directly to minhop. Only if neither ftree nor updn can converge does it fall back to minhop.
2. The list of roots for the UpDn routing algorithm is configured with the flag --root_guid_file of the opensm command.
If the SM is running on an InfiniBand switch, use the following command to set the list of root GUIDs:
switch (config) # ib sm root-guid <root-guid>
This forces the routing algorithm to use those specific switches as roots.
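For an SM running on a host, the two opensm flags from steps 1 and 2 can be combined on one command line. The Python sketch below only builds and prints such a command; it does not start opensm. The helper name and the file path /etc/opensm/root_guids.conf are assumptions for illustration. The comma-separated engine list follows the opensm man page, which tries the engines in the order given.

```python
import shlex


def opensm_command(routing_engines, root_guid_file=None):
    """Build an opensm argument list from the flags discussed above."""
    argv = ["opensm", "--routing_engine", ",".join(routing_engines)]
    if root_guid_file:
        argv += ["--root_guid_file", root_guid_file]
    return argv


argv = opensm_command(
    ["ftree", "updn", "minhop"],
    root_guid_file="/etc/opensm/root_guids.conf",  # hypothetical path
)
command_line = shlex.join(argv)
print(command_line)
```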
How Do I find the root GUIDs?
a. Run ibswitches on the network (from a switch or from the host) to get the list of switches and their GUIDs. The GUID is the hexadecimal value that follows "Switch :" on each line.
mti-mar-sx21 [my-sm-cluster: master] (config fae) # ibswitches
Switch : 0xf45214030011e4f0 ports 36 "MF0;mti-mar-sx22:SX6036/U1" enhanced port 0 lid 2 lmc 0
Switch : 0x0002c903007fbbe0 ports 36 "MF0;mti-mar-sx21:SX6036/U1" enhanced port 0 lid 1 lmc 0
b. Identify which of these switches are the spine switches in the cluster, and note their GUIDs.
c. Run the following command on the switch once for each spine GUID:
Note: Separate the bytes of the GUID with ':', as in a MAC address.
switch (config) # ib sm root-guid 0x00:02:c9:03:00:7f:bb:e0
switch (config) # ib sm root-guid 0xf4:52:14:03:00:11:e4:f0
For example, if you have 18 spines and 36 leafs, it is recommended to run this command 18 times on the SM switch, once for each of the 18 spine GUIDs.
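With many spines, converting GUIDs by hand is error-prone. The following Python sketch parses ibswitches output in the format shown above, converts each GUID to the colon-separated form, and produces the corresponding root-guid commands. The regular expression, the function names, and the assumption that every listed switch is a spine are illustrative only; in practice the list must first be filtered down to the actual spine switches.

```python
import re

# Matches lines such as:
# Switch : 0x0002c903007fbbe0 ports 36 "MF0;name:SX6036/U1" ...
SWITCH_LINE = re.compile(
    r'^Switch\s*:\s*(0x[0-9a-fA-F]{16})\s+ports\s+\d+\s+"([^"]*)"'
)


def parse_ibswitches(output):
    """Return a list of (guid, description) tuples from ibswitches output."""
    switches = []
    for line in output.splitlines():
        match = SWITCH_LINE.match(line.strip())
        if match:
            switches.append((match.group(1).lower(), match.group(2)))
    return switches


def colon_guid(guid):
    """Turn 0x0002c903007fbbe0 into 0x00:02:c9:03:00:7f:bb:e0."""
    digits = guid[2:]
    return "0x" + ":".join(digits[i:i + 2] for i in range(0, 16, 2))


sample = '''
Switch : 0xf45214030011e4f0 ports 36 "MF0;mti-mar-sx22:SX6036/U1" enhanced port 0 lid 2 lmc 0
Switch : 0x0002c903007fbbe0 ports 36 "MF0;mti-mar-sx21:SX6036/U1" enhanced port 0 lid 1 lmc 0
'''

switches = parse_ibswitches(sample)
# Assumption for this demo: both listed switches are spines.
commands = [f"ib sm root-guid {colon_guid(guid)}" for guid, _ in switches]
for command in commands:
    print(command)
```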