This post provides recommendations for "good cabling" of InfiniBand (IB) director switches in a Clos-5 network.
When cabling IB switches, consistency and symmetry are always recommended, but for director switches in a Clos-5 or higher network, attention to port-level cabling details is critical. A Clos-5, or 3-level fat tree, is usually implemented with director switches as core switches and 1U switches as the L1 (edge) switches.
In a Clos-5 fabric, the basic model is that each IB L1 switch below the directors sends the same number of uplinks to each director switch. The exact number of uplinks from an L1 switch to a director depends on the desired blocking level and the number of directors. As an example, for non-blocking with 36-port switch ASIC technology, there will be a total of 18 uplinks per L1 switch and these will be evenly divided among 1, 2, 3, 6, 9, or 18 directors depending on the fabric size. It is important not to treat a director as a ‘black box’ where every port is identical. The same caution applies to many Fibre Channel and Ethernet switches, for example an Ethernet switch whose line modules are non-blocking but whose backplane is blocking.
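The even division of uplinks can be checked with a few lines of Python. This is only a minimal sketch: the function name `uplinks_per_director` is hypothetical, and it assumes the non-blocking case in which half of the ports of the L1 switch ASIC (18 of 36) face the directors.

```python
def uplinks_per_director(asic_ports: int = 36) -> dict[int, int]:
    """Map each usable director count to the uplinks one L1 switch sends to each director.

    Assumption: the L1 switch is non-blocking, so half of its ASIC ports are uplinks.
    """
    uplinks = asic_ports // 2
    # Only director counts that divide the uplinks evenly keep the cabling symmetric.
    return {d: uplinks // d for d in range(1, uplinks + 1) if uplinks % d == 0}


if __name__ == "__main__":
    for directors, per_director in uplinks_per_director().items():
        print(f"{directors:2d} director(s): {per_director:2d} uplink(s) from each L1 switch to each director")
```

With the default of 36 ports, the director counts produced are exactly the 1, 2, 3, 6, 9, and 18 listed above.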
An IB director switch is a 2-level fat tree (Clos-3) network in a box, consisting of spine modules and leaf modules, and this internal topology is directly visible to the Subnet Manager. A leaf module contains one or more switch ASICs. Each of these ASICs presents a number of external ports, for example via QSFP connectors, and is also connected to the spine ASICs via ports that are internal to the director chassis.
For example, a leaf module in a Mellanox FDR director contains one switch ASIC and presents 18 external ports, while an EDR leaf module includes dual ASICs and presents 36 external ports (18 per ASIC).
For each uplink coming from a switch below the director, attention must be paid to the choice of switch ASIC within a leaf module. It is important to maximize bandwidth and to minimize both latency and congestion.
The question is: For the N cables coming from a given L1 switch to a given director, where should you connect them?
The next section describes a scheme for doing this, plus a few variations. This approach:
- Is field-proven
- Is easy to analyze
- Avoids inadvertent blocking that will otherwise be created and may be difficult to debug
- Minimizes traffic, and therefore dynamic congestion, on the director spine ASICs
Identical Division to Groups
The basic idea is to divide the L1 switches into groups (call them low-latency neighborhoods) of N switches and connect them so that traffic within the neighborhood stays within the same leaf module ASIC (introducing only one switch hop inside the director). Traffic between neighborhoods traverses the director spine ASICs (involving three switch hops inside the director). The next diagram shows a highly simplified view of a fabric with two small directors, with the L1 switches divided into three neighborhoods of two L1 switches each.
As a more realistic example, assume a non-blocking fabric with three directors, with the usual 18 external ports per director leaf ASIC. Each L1 switch can send six uplinks to each director. The natural choice for a neighborhood size is 18 L1 switches. To cable the first of these neighborhoods, we allocate a set of six leaf ASICs on each director and cable each leaf ASIC to each of the 18 L1 switches. On each director, this neighborhood of 18 L1 switches occupies six leaf ASICs equaling 108 external ports. The following diagram shows one way to accomplish this for L1 (edge) switches specified as E1 through E18.
We can continue to add neighborhoods by selecting another 18 L1 switches and allocating another 6 leaf ASICs on each of the three directors. The final neighborhood might be incomplete because there are fewer than 18 L1 switches left, but the cabling scheme will be identical.
The next diagram continues the three-director example and shows two ways of cabling each 648-port director assuming six full neighborhoods (108 L1 switches).
- The choice of ports within a given leaf ASIC doesn’t matter, but consistency within each neighborhood and across all neighborhoods will make troubleshooting and expansion easier.
- The choice of which 6 leaf ASICs are used for a given neighborhood doesn’t matter, but having them adjacent (or some other obvious pattern) will make things simpler later.
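The arithmetic of the three-director example can be reproduced with a short Python sketch. The names `NeighborhoodPlan` and `plan_neighborhoods` are hypothetical and used only for illustration; the sketch assumes 18 uplinks per L1 switch, 18 external ports per leaf ASIC, and one cable from every L1 switch in a neighborhood to each of its leaf ASICs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeighborhoodPlan:
    uplinks_per_director: int        # cables from one L1 switch to one director
    leaf_asics_per_director: int     # leaf ASICs a full neighborhood occupies per director
    ports_per_director: int          # external ports a full neighborhood occupies per director
    neighborhoods_per_director: int  # full neighborhoods that fit on one director
    max_l1_switches: int             # L1 switches supported by that many neighborhoods


def plan_neighborhoods(directors: int, director_ports: int,
                       uplinks_per_l1: int = 18,
                       ports_per_leaf_asic: int = 18) -> NeighborhoodPlan:
    if uplinks_per_l1 % directors:
        raise ValueError("uplinks must divide evenly among the directors")
    per_director = uplinks_per_l1 // directors
    # Each leaf ASIC takes one cable from every L1 switch in the neighborhood,
    # so the neighborhood size equals the external port count of a leaf ASIC.
    neighborhood_size = ports_per_leaf_asic
    # One cable per (L1 switch, leaf ASIC) pair means one leaf ASIC per uplink.
    leaf_asics = per_director
    ports = leaf_asics * ports_per_leaf_asic
    neighborhoods = director_ports // ports
    return NeighborhoodPlan(per_director, leaf_asics, ports,
                            neighborhoods, neighborhoods * neighborhood_size)


if __name__ == "__main__":
    print(plan_neighborhoods(directors=3, director_ports=648))
```

For three 648-port directors this gives six uplinks per director, six leaf ASICs and 108 ports per neighborhood on each director, and six full neighborhoods (108 L1 switches), matching the figures in the text.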
The Last Neighborhood Example with Two Directors
One implication of the scheme shown above is that the number of leaf ASICs should be a multiple of 6. Suppose the last neighborhood contains only three L1 switches, each with 18 ports toward the director leafs. In this example we assume two directors, so each director handles half of the 18 cables from each L1 switch: 3 L1 switches * 9 ports each = 27 ports per director. Following the same logic as the example in the section above, the standard choice for the number of leaf ASICs for a neighborhood would have been 9 for two directors. However, provisioning 9 leaf ASICs leaves many ports empty, which adds cost. This section discusses some options that require fewer leaf ASICs for the last neighborhood.
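To make the cost of the standard layout concrete, the following Python sketch works through the port counts for a partial neighborhood. The function name `last_neighborhood_ports` and the dictionary keys are hypothetical; the sketch assumes 18 uplinks per L1 switch and 18 external ports per leaf ASIC, as in the rest of this post.

```python
import math


def last_neighborhood_ports(l1_switches: int, directors: int = 2,
                            uplinks_per_l1: int = 18,
                            ports_per_leaf_asic: int = 18) -> dict[str, int]:
    """Per-director port usage for a partial (last) neighborhood."""
    per_director = uplinks_per_l1 // directors      # cables from each L1 to one director
    used = l1_switches * per_director               # director ports actually cabled
    standard_asics = per_director                   # one leaf ASIC per uplink, as for a full neighborhood
    provisioned = standard_asics * ports_per_leaf_asic
    return {
        "ports_used_per_director": used,
        "standard_leaf_asics": standard_asics,
        "empty_ports_with_standard_layout": provisioned - used,
        # Lower bound from port count alone; it ignores the balancing rule.
        "minimum_leaf_asics": math.ceil(used / ports_per_leaf_asic),
    }


if __name__ == "__main__":
    print(last_neighborhood_ports(3))
```

For three leftover L1 switches and two directors, the standard layout would cable 27 ports on each director but provision 9 leaf ASICs (162 ports), leaving 135 ports empty, while two leaf ASICs are enough by port count alone.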
Suppose you have three L1 switches left over after forming multiple 18-switch neighborhoods. You needed 27 ports per director for the uplinks from these three switches, so you only bought two extra leaf ASICs (36 ports) per director. Instead of adding a third ASIC (or more), you can divide the uplinks from the three L1 switches according to the following rule:
- Each leaf ASIC must connect the same number of cables from every L1 switch.
Following the rule ensures that there is no intra-neighborhood blocking on any leaf ASIC.
Wiring Example A
You could connect all 9 cables to a single ASIC; this would create a neighborhood of 2 leaf switches. This will work, but it results in inconsistent latency between E1-E2 communications and communications between E1 and everyone else. The second issue with Example A is that you have removed redundancy: the leaf blade is a single point of failure (for E3).
Wiring Example B
Wiring asymmetrically in InfiniBand creates issues that can be very hard to diagnose. On the surface, you simply have 9 ports from each L1 switch connected over 2 ASICs. However, the asymmetry introduces hot spots in the fabric across the 2 leafs. The top leaf has E1=5 wires, E2=4 wires, E3=5 wires, and the bottom leaf has E1=4, E2=5, E3=4. This creates a 5:4 ratio between E1 and E2, and between E2 and E3. The 5-wire connections allow more traffic into the ASIC than can be delivered out through the 4 wires.
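The rule that each leaf ASIC must connect the same number of cables from every L1 switch is easy to check mechanically. The Python sketch below applies it to the cable counts quoted for Example B. The function name `unbalanced_leaf_asics` and the dictionary layout are hypothetical; an L1 switch that is not cabled to a leaf ASIC at all should be listed with a count of 0.

```python
def unbalanced_leaf_asics(cabling: dict[str, dict[str, int]]) -> dict[str, dict[str, int]]:
    """Return the leaf ASICs whose per-L1 cable counts are not all equal."""
    return {
        leaf: counts
        for leaf, counts in cabling.items()
        if len(set(counts.values())) > 1
    }


# Cable counts per leaf ASIC for wiring Example B, as quoted in the text.
example_b = {
    "top leaf": {"E1": 5, "E2": 4, "E3": 5},
    "bottom leaf": {"E1": 4, "E2": 5, "E3": 4},
}

if __name__ == "__main__":
    # Every L1 switch still has 9 cables in total, which is why the problem is easy to miss.
    for l1 in ("E1", "E2", "E3"):
        print(l1, "total cables:", sum(counts[l1] for counts in example_b.values()))
    for leaf, counts in unbalanced_leaf_asics(example_b).items():
        print(f"{leaf} violates the rule: {counts}")
```

Both leaf ASICs in Example B are reported, even though every L1 switch has the expected 9 cables toward the director.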
Wiring Example C
Wiring example C is not pretty, but it does meet all of the requirements. It keeps symmetry within each ASIC and also provides redundancy. However, the last 9 ports could not be used without introducing asymmetry again.
Wiring Example D
The wiring of example D meets all of the requirements. It keeps symmetry within each ASIC and also provides redundancy. However, the last 9 ports could not be used without introducing asymmetry again.
Follow balanced cabling patterns instead of being creative. By changing the example above slightly, you can have three additional leaf ASICs (54 ports) and three L1 switches (see the next diagram). The design shown in the next diagram provides a much easier solution, since you can simply connect three cables from each L1 switch to each leaf ASIC. Expanding to six L1 switches is equally simple. Beyond that, expansion requires re-cabling.
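A balanced layout like this is simple enough to generate programmatically. The following Python sketch produces a cable list for one director with three cables from each L1 switch to each of three leaf ASICs. The function name `balanced_cabling`, the `leafN` labels, and the port numbering scheme are all assumptions made for illustration, not a vendor convention.

```python
def balanced_cabling(l1_switches: list[str], leaf_asics: int = 3,
                     cables_per_pair: int = 3,
                     ports_per_leaf_asic: int = 18) -> list[tuple[str, int, str, int]]:
    """Return (L1 switch, L1 uplink index, leaf ASIC, leaf port) for every cable."""
    if len(l1_switches) * cables_per_pair > ports_per_leaf_asic:
        raise ValueError("leaf ASIC ports exhausted; further expansion requires re-cabling")
    plan = []
    for s, l1 in enumerate(l1_switches):
        for leaf in range(leaf_asics):
            for c in range(cables_per_pair):
                # The same pattern is repeated for every L1 switch and every leaf ASIC,
                # which keeps troubleshooting and later expansion simple.
                l1_uplink = leaf * cables_per_pair + c + 1
                leaf_port = s * cables_per_pair + c + 1
                plan.append((l1, l1_uplink, f"leaf{leaf + 1}", leaf_port))
    return plan


if __name__ == "__main__":
    for cable in balanced_cabling(["E1", "E2", "E3"]):
        print(cable)
```

With three L1 switches the plan uses 27 of the 54 ports, and with six L1 switches it fills all 18 ports of each leaf ASIC. A seventh switch raises an error, which mirrors the point that further expansion requires re-cabling.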
Specialized Nodes and Shared Resources
The previous discussion treats all L1 switches as being interchangeable. Obviously each neighborhood can be tailored to contain end nodes that benefit from being ‘close’ to each other. For example, a neighborhood might contain compute and GPU nodes, or compute and storage nodes. It is generally good to configure neighborhoods as large as possible, because this minimizes the need for topology-aware job scheduling, although sometimes a smaller neighborhood scheme is a better fit.
A Clos-5 fabric often contains nodes that will be accessed by multiple neighborhoods, for example when you include storage nodes and storage gateways (SwitchX gateways are a special case).
The following figure is a variation of the highly-simplified neighborhood diagram shown earlier. It now includes a ‘special’ neighborhood for a shared resource, such as storage. This neighborhood has its own leaf ASIC on each director; in reality it could occupy multiple leaf ASICs per director. There could also be more than one L1 switch connected to these ASICs.
Shared Resource Node Properties
- Every other neighborhood can access it via the director spine ASICs.
- Each director provides a large amount of spine bandwidth for traffic to/from the special neighborhood, which only competes with other inter-neighborhood traffic.
- Traffic to/from the shared resource nodes is typically limited by the shared nodes themselves, not by the directors. In the diagram this bandwidth is represented by the cable (bundles) W and X and the capabilities of the nodes (S) themselves.
- The size of this special neighborhood is often smaller than that of the other, traditional neighborhoods.
The next diagram shows the connector face of a director connected to four compute neighborhoods plus a small ‘I/O neighborhood’.
Access to storage from all other neighborhoods goes through the director spines. Access is uniform from all compute neighborhoods, as is the bandwidth between each compute neighborhood and storage.
- If the aggregate storage bandwidth ‘below’ an L1 switch is low, for example due to rotating media or front-end limitations, the usual number of director uplinks might be overkill. Instead of 18 uplinks per L1 switch as in our non-blocking example, nine uplinks per L1 switch might provide sufficient bandwidth and resiliency for the storage nodes. The number of uplinks must still be the same for each director, but one can consume fewer director ports.
- Unlike compute nodes, the bandwidth needed among the resource nodes themselves can be relatively low. In these cases, concerns about blocking within the special neighborhood could be relaxed.
Lowering Latency to Shared Resources
The previous diagram showed a robust way to provide universal access to shared resources, but it requires five switch hops between a shared resource and its client. For low-latency resources such as SSDs or NVMe storage, eliminating two switch hops might be significantly beneficial.
The diagram above shows three neighborhoods:
- The rightmost neighborhood is a SharedResources neighborhood as described above.
- Neighborhood 1 is the classical compute neighborhood, which accesses the shared resources via the director spines as previously described.
- Neighborhood 2 is a compute neighborhood, but its leaf ASICs are also connected to the L1 switch(es) of the shared resources, via Y and Z.
Neighborhood 2 now has a shorter path to the shared nodes: three switch hops from S nodes to client nodes (for example C and D).
- Bandwidth from Neighborhood 2 to the S nodes is completely determined by the number of ‘direct’ cables Y and Z. Because InfiniBand routing always uses the shortest paths, the other potential paths to S from clients C and D will never be used (they involve five switch hops).
- Nodes in Neighborhood 1 can also use the ‘direct’ path, because this five-hop path is just as short as paths through the director spines. This sharing might not be what was intended, but could be prevented by storage masking:
- One option is to separate the S nodes onto different L1 switches (e.g. putting them in Neighborhood 1).
- A second option is to connect Neighborhood 1 nodes directly to the L1 switch (the rightmost switch) that is connected to the SharedResources neighborhood.
- To provide leaf ASIC ports for cables Y and Z, the number of ‘compute’ L1 switches in Neighborhood 2 must be reduced. For example, Neighborhood 2 might include only 17 compute L1 switches (and the compute nodes attached to them) instead of 18.
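The hop counts above can be reproduced on a toy model of the fabric. The Python sketch below uses a plain breadth-first search over a hypothetical, heavily simplified topology: one director with a single spine, one leaf ASIC and one L1 switch per neighborhood, and made-up node names (`A` is a client in Neighborhood 1, `C` a client in Neighborhood 2, `S` a shared resource node). It counts plain shortest paths only and does not model the restrictions of Up/Down or Fat Tree routing.

```python
from collections import deque

# Hypothetical simplified fabric; all names are illustrative only.
BASE_LINKS = [
    ("A", "L1_n1"), ("C", "L1_n2"), ("S", "L1_s"),                     # end nodes to L1 switches
    ("L1_n1", "leaf_n1"), ("L1_n2", "leaf_n2"), ("L1_s", "leaf_s"),    # L1 uplinks to leaf ASICs
    ("leaf_n1", "spine"), ("leaf_n2", "spine"), ("leaf_s", "spine"),   # links inside the director
]
# The 'direct' cables (Y and Z in the text) from the shared-resource L1 switch
# to the leaf ASIC of Neighborhood 2.
DIRECT_LINKS = [("L1_s", "leaf_n2")]


def switch_hops(links, src, dst):
    """Return (switch hops, number of equally short paths) between two end nodes."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    dist, paths = {src: 0}, {src: 1}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:                    # first visit: record shortest distance
                dist[nxt] = dist[node] + 1
                paths[nxt] = paths[node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:      # another path of the same length
                paths[nxt] += paths[node]
    # A path with k links between two end nodes traverses k - 1 switches.
    return dist[dst] - 1, paths[dst]


if __name__ == "__main__":
    print("C to S without direct cables:", switch_hops(BASE_LINKS, "C", "S"))
    print("C to S with direct cables:   ", switch_hops(BASE_LINKS + DIRECT_LINKS, "C", "S"))
    print("A to S with direct cables:   ", switch_hops(BASE_LINKS + DIRECT_LINKS, "A", "S"))
```

In this toy model the direct cables shorten the path from `C` to `S` from five switch hops to three, while `A` still needs five hops but gains a second, equally short path through the Neighborhood 2 leaf ASIC. That second path is the unintended sharing described above.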
There are other clever (and not so clever) possibilities for connecting shared resources that will not be covered here. Keep these points in mind when analyzing alternatives:
- IB routing always uses the shortest paths.
- Clos-5 fabrics must use either Up/Down or Fat Tree routing, to avoid credit loops. This will disallow many potential shortest paths between nodes. Refer to Understanding Up/Down InfiniBand Routing Algorithm.