This post discusses the considerations an IT architect needs to take into account in gateway deployment scenarios.
- HowTo Configure InfiniBand Gateway HA (Proxy ARP)
- HowTo Configure IPoIB Networks with Gateway and Multiple PKEYs
A VPI gateway looks exactly like a switch from the outside, but its logical behavior is more complex. This additional complexity must be taken into consideration during fabric design. Unless this is done, serious issues can result. For example, aggregate bandwidth may be low, or disruptive fabric changes may be required, or unexpected credit loops may be created.
From the InfiniBand fabric perspective, ‘in band’, a gateway consists of an InfiniBand switch plus an internal HCA. The HCA is sometimes referred to as a TCA, and is connected to port N+1 on the switch ASIC.
For example, the TCA is connected to port 37 on a 36-port switch (N=36). It is easy to see this using fabric tools such as ibnetdiscover.
The gateway TCA is of course capable of much higher throughput than normal HCAs. It can drive many ports’ worth of bandwidth between the InfiniBand side and the Ethernet side.
The SwitchX-2 implementation of the gateway provides a single LID for the TCA. This fact, plus the fact that a gateway has an HCA ‘hidden’ in the box, are the reasons care must be taken when gateways are used in a fabric.
The following terms are used for this discussion:
- Switch: for this discussion, 'switch' refers to an InfiniBand switch ASIC, not to an enclosure. Recall that an InfiniBand director, for example, is an enclosure containing multiple switch ASICs.
- Gateway: a switch running the VPI gateway function. A gateway may or may not be connected to nodes (servers).
- Target Channel Adapter (TCA): a specialized Host Channel Adapter (HCA).
- Clos-3: A richly-connected fabric built from two tiers of IB switch ASICs. The '3' refers to the maximum number of switch hops. Named after Charles Clos of Bell Labs, Clos networks are often referred to as 'fat trees'.
Consideration 1: There’s an HCA in the Box
Because a Gateway is logically a switch plus an HCA, its placement within a fabric is somewhat more restricted than if it were simply a Switch.
Gateways can be used at L1 (leaf level), with or without nodes connected to them. Any nodes connected to a Gateway at L1 will have full-bandwidth, low-latency access to the Gateway. This can be useful, for example, if there are storage nodes that must move a lot of data through the Gateway.
For FDR fabrics, it often makes sense to place Gateways at the spine level. Some or all of the spines can be Gateways, depending on the desired resiliency and aggregate Gateway bandwidth. The use of spine Gateways does reduce the number of InfiniBand spine ports, which reduces the maximum size of the InfiniBand fabric. The Subnet Manager has a feature called “calculate_missing_routes” which is TRUE by default; when configured with UPDN or FTREE routing engines it will create routing without credit loops even when internal Gateway HCAs are at the spine level.
For EDR fabrics, Gateways can also be placed as spines, but this requires using a Routing Chain in the opensm configuration to avoid using FDR links for compute-to-compute traffic. The Routing Chain configuration should be as follows:
- UPDN routing for the whole cluster (all spine switches are roots)
- UPDN or FTREE routing for only the EDR part of the fabric with only the EDR spine switches as roots
The following diagram shows a small Clos-3 fabric with two gateways (G1, G2) as Level 2 switches and four switches (E1, ..., E4), running the MinHop routing algorithm.
Diagnostics will report a potential credit loop, but because inter-gateway traffic is low there won't be deadlocks.
Consideration 2: One LID per Gateway
If more than one InfiniBand port’s worth of bandwidth is required through a gateway, its single-LID property becomes a significant consideration.
This is due to InfiniBand’s destination-based forwarding mechanism. An InfiniBand packet destined for the gateway will have the gateway TCA’s Local Identifier (LID) as its Destination LID (DLID).
When the InfiniBand packet arrives at a switch, the DLID is used as an index into the switch’s Linear Forwarding Table (LFT). The LFT returns a port number through which the incoming packet will be sent to the Gateway.
Here is an example of an LFT, in which each DLID has its own egress port:
|DLID Index||Egress Port|
Because all InfiniBand packets sent to a gateway will have the same DLID (remember, there is only one TCA), the InfiniBand traffic to a gateway has the following key property:
When traffic bound for a given gateway passes through an InfiniBand switch, it can only ever use one port of the switch to get to the next hop.
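The lookup described above can be sketched in a few lines of Python. This is only an illustration of destination-based forwarding, not switch firmware: the LID values, port numbers, and the `egress_port` helper are all hypothetical.

```python
# Hypothetical LIDs and port numbers, chosen only for illustration.
GATEWAY_LID = 10

# Linear Forwarding Table of one switch: DLID -> egress port.
lft = {
    4: 1,            # an ordinary node
    5: 2,            # another ordinary node
    GATEWAY_LID: 7,  # the gateway TCA, reachable under its single LID
}

def egress_port(lft, dlid):
    """Return the egress port the switch uses for a packet with this DLID."""
    return lft[dlid]

# Packets from three different sources, all destined for the gateway.
packets = [
    {"slid": 4, "dlid": GATEWAY_LID},
    {"slid": 5, "dlid": GATEWAY_LID},
    {"slid": 6, "dlid": GATEWAY_LID},
]

# The source LID plays no part in the lookup, so every packet
# bound for the gateway leaves through the same port.
ports_used = {egress_port(lft, p["dlid"]) for p in packets}
print(ports_used)
```

Whatever the sources are, the set of ports used for gateway-bound traffic on this switch has exactly one element.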
Inbound InfiniBand traffic from a gateway has a much richer set of paths, because the gateway HCA knows a different destination LID for each of the IB nodes. The usual considerations of InfiniBand routing, such as the choice of routing algorithm, apply.
This simple property has significant implications for InfiniBand-to-Ethernet traffic.
- Parallel cables from a switch to a gateway won’t add bandwidth. Only one cable will be used.
- From a leaf switch, nodes sending traffic to a gateway will all contend for a single InfiniBand uplink port.
- Having a gateway connected to N switches does not always mean that N paths will be available for gateway traffic. For example, consider a Clos-3 having 6 leafs and 2 spines, with a gateway connecting to 3 leafs. For Gateway traffic sent from the other 3 leafs, there will be only 2 (spine) paths to the gateway.
The following diagram will illustrate some of these implications. It shows a Clos-3 fabric with a gateway connected to 3 Level 1 switches, and 5 nodes connected at various points.
In the example fabric, for traffic going to gateway G1:
- One cable from each of switches E1, E2, and E3 will be used
- All nodes connected to switches E4, E5, and E6 will contend for two of G1's IB cables (not three cables), because there are only two Level 2 switches (C1 and C2)
- All nodes connected to E1 (for example) will contend for a single gateway cable
- All nodes connected to E4 (for example N2 and N3) will contend for one uplink to Level 2, to either C1 or C2
- All nodes connected to G1 (for example N1) will have full bandwidth through the gateway
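The list above can be checked with a small model of the example fabric. The sketch below assumes one forwarding entry per switch for G1's single LID; the particular next-hop choices are hypothetical, since a real Subnet Manager picks them according to its routing algorithm. The names `next_hop`, `path_to_gateway`, and `gateway_cables` are invented for this illustration.

```python
# Example fabric: gateway G1 is cabled to leaves E1-E3, and every leaf
# is cabled to both Level 2 switches C1 and C2.
GATEWAY_LEAVES = ["E1", "E2", "E3"]
FAR_LEAVES = ["E4", "E5", "E6"]

# Next hop toward G1's single LID: exactly one entry per switch.
# These specific choices are hypothetical.
next_hop = {
    "E1": "G1", "E2": "G1", "E3": "G1",  # direct cable to the gateway
    "C1": "E1", "C2": "E2",              # each spine picks one leaf
    "E4": "C1", "E5": "C2", "E6": "C1",  # each far leaf picks one spine
}

def path_to_gateway(leaf):
    """Follow the single forwarding entry at each hop until G1 is reached."""
    path = [leaf]
    while path[-1] != "G1":
        path.append(next_hop[path[-1]])
    return path

def gateway_cables(leaves):
    """Leaf-to-G1 cables carrying traffic that starts at the given leaves."""
    return {path_to_gateway(leaf)[-2] for leaf in leaves}

far = gateway_cables(FAR_LEAVES)       # traffic from E4, E5, E6
near = gateway_cables(GATEWAY_LEAVES)  # traffic from E1, E2, E3
print(sorted(far), sorted(near))
```

Under these assumptions, traffic from E4, E5, and E6 reaches G1 over at most two of its three cables, one per spine, while each of E1, E2, and E3 uses its own cable.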
If aggregate InfiniBand-to-Ethernet gateway bandwidth needs to exceed 56Gb/s, it’s necessary to:
- Connect a gateway to multiple IB switches, and/or
- Use multiple smaller gateways
- Connect I/O-intensive nodes such as data movers directly to gateways
- Use the scatter_ports option (for MinHop or Up/Down routing) to ensure path distribution across the available paths
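As a rough planning aid, the arithmetic behind the 56Gb/s threshold can be written out as follows. This is a back-of-the-envelope sketch that treats 56Gb/s as the nominal rate of every link and ignores protocol overhead and contention; the function names are hypothetical.

```python
import math

LINK_GBPS = 56  # per-link figure used in the text; assumed nominal

def min_links_needed(target_gbps, link_gbps=LINK_GBPS):
    """Smallest number of distinct gateway-facing links whose combined
    nominal rate reaches the target."""
    return math.ceil(target_gbps / link_gbps)

def aggregate_ceiling(distinct_links, link_gbps=LINK_GBPS):
    """Upper bound on gateway-bound bandwidth over the given number of
    distinct links actually selected by the forwarding tables."""
    return distinct_links * link_gbps

print(min_links_needed(100))  # links required for a 100 Gb/s target
print(aggregate_ceiling(2))   # ceiling when only two links are in use
```

The number that matters is the count of distinct links the forwarding tables actually use toward the gateway's LID, not the number of cables installed.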
Conclusion: Gateways require planning!
Whether designing a new InfiniBand fabric or extending an existing one:
- Remember that a Gateway is a switch with a high-capacity HCA built in.
- Be sure to understand the aggregate InfiniBand-to-Ethernet bandwidth requirements, initially and in the future.
- In a multi-Gateway scenario, consider the bandwidth impact of a Gateway failure.
- If a Gateway needs to handle more than 56Gb/s, pay careful attention to the implications of its single-LID property.