IB Router Architecture and Functionality

Version 32

    InfiniBand (IB) routers are intended to be used to segment a very large network into smaller subnets connected by an IB router.

    The segmentation may be useful for isolating some of the subnets from each other, or for building a very large network.

    This post discusses IB router architecture and functionality.



    • SM: Subnet Manager. The SDN controller of the InfiniBand network
    • SA: Subnet Administration. The software handling the in-band northbound interface of the SM. It implements a service by which InfiniBand client software can query and interact with the SM
    • OpenSM: InfiniBand-compliant Subnet Manager and Administration software
    • OpenMPI: Open Message Passing Interface implementation
    • SRQ: Shared Receive Queue. A method to reduce receive buffer resources by sharing the receive buffers by multiple QPs
    • Per Peer QP: Per Peer Queue Pair (QP)
    • LID: Local Identifier. The L2 address used by InfiniBand (allocated by the SM)
    • DLID: Destination LID
    • multi-swid: Multi Switch-ID. A virtualization of multiple switches on top of a single InfiniBand switch. 
    • P_Key: Partition Key. The InfiniBand way to restrict send/receive or forwarding of particular traffic (similar to, but different from, VLANs)






    IB routers are mainly used to address the following needs:

    • Subnet isolation, which lets you build smaller subnets separated by routers in order to gain faster SM response time and, optionally, to prevent traffic from crossing between all nodes. A common use case is a storage subnet shared by multiple compute subnets that are isolated from each other.
    • Clusters with more than 42K hosts

    Mellanox IB routers perform algorithmic routing, which derives the last-hop L2 address directly from the L3 address and thus avoids the overhead of L3 to L2 table lookups. The routing is therefore simple and fast, and can be performed at line rate with very small latency overhead. In this post we describe the architectural aspects of the solution.

    To learn how to configure an IB router, refer to HowTo Configure IB Routers.


    Important Notes/Limitations

    • The first Mellanox OpenSM version with router support is release 4.7.0, which comes with UFM 5.6 and MLNX_OFED version 3.3
    • Switch-based OpenSMs starting with MLNX-OS version 3.6.200 are router aware, as is the SM that comes with MLNX_OFED version 3.3 or with UFM 5.6
    • Only single hop routing is supported. In other words, routed traffic may cross at most a single router
    • No multicast traffic across subnets. It is planned for a later phase
    • ConnectX-3, Connect-IB and ConnectX-4 are the only router-aware HCAs. Older HCAs can still be used within each of the IB subnets, but they cannot send or receive routed traffic
    • ConnectX-3 (and Pro) does not support the case where the path from client to server uses a different router than the path from server to client. This is because these devices implement IBTA specification release 1.2 and perform an SLID check on incoming traffic. Release 1.3 modified the same compliance statement to require that the SLID check be ignored when a GRH is present. This limitation does not prevent the devices from sending or receiving traffic that crosses the router, as long as the traffic is "router reversible", i.e. the same router is used for both directions. Connections established by librdmacm are "router reversible" by nature.
    • IB subnet size is limited by the number of LIDs. With LMC=0 the limit is 0xBFFF (49151, roughly 48K). With LMC>0, divide that number by 2^LMC
    • The IB Router system SB7780, which is based on Switch-IB, uses MLNX-OS version 3.6.0502 or later. It has the following limitations:
      • An IB router cannot run an embedded SM or the related IB tools
      • An IB router cannot run switch-based MPI collectives aggregation nodes (Mellanox SHARP technology)
      • An IB router cannot run Adaptive Routing
    • Storage isolation is now supported with latest MLNX_OFED 3.4
    • Running MPI at scale is targeted for 2017.
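
    The LID arithmetic in the subnet-size limitation above can be sketched as follows. This is a minimal illustration that assumes the 0xBFFF limit quoted above and that each port consumes 2^LMC LIDs; the function name max_ports is hypothetical. Switches and routers also consume LIDs, so the number of hosts that fit is lower than the figure computed here.

```python
# Upper bound on LID-addressable ports in one subnet, per the limit quoted above.
UNICAST_LIDS = 0xBFFF  # 49151

def max_ports(lmc: int) -> int:
    """Each port consumes 2**lmc LIDs, so the LID space is divided by that."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC is a 3-bit value (0..7)")
    return UNICAST_LIDS // (2 ** lmc)

for lmc in range(4):
    print(f"LMC={lmc}: at most {max_ports(lmc)} ports")
```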


    Single Hop Topologies


    Single-hop topologies are network topologies in which any two subnets that require L3 connectivity are directly connected by at least one router, as shown in Figure 1.


    Figure 1- Single Hop Topology




    When two subnets are not connected to each other by a common router, so that multiple router hops are required for traffic to reach one from the other, we say the topology is multi-hop.

    Under IB Routing as of May 2016, these subnets will not be able to communicate with each other.


    Figure 2- A Multi-Hop Topology with Two Subnets
    L3 routing between these subnets is Not Supported





    Network Topology Design

    In this section we provide some basic rules for designing a topology that incorporates multiple subnets connected by IB Routers.

    1.    Credit-Loop freedom:

    Because the routers are lossless, once L3 traffic is introduced we must make sure that traffic crossing the routers does not form buffer dependency loops (credit-loops).
    Within each subnet, credit-loop freedom is guaranteed by the SM, which prevents credit-loops from being formed. However, when subnets are connected to each other, there is a risk that such a dependency loop is created by multiple traffic flows that cross the routers. Avoiding credit-loops in the general case requires a detailed and accurate design, which may involve using InfiniBand Virtual Lanes and Service Levels to support a diverse set of topologies.
    However, a simple rule that relies on the principle of Up/Dn routing avoids the problem without any advanced features, by restricting the space of possible topologies. According to this rule, the topology must maintain the concept of a "level" so that an "up" direction can be clearly defined.
    Once such a direction is defined, traffic through routers may not perform any "down and then up" turn, which is enough to avoid any credit loop.
    We provide two optional simple schemes for such topologies: a) for the case of a new cluster, and b) for the case where a common subnet is connected to multiple, possibly pre-existing, subnets.

    a.    One type of topology that preserves this rule requires the IB Routers to be placed at the top of the topology. Figure 3a shows such a topology.

    Note that in this case the routers are connected to switches that are placed at the "top" of each subnet.
    Since this option requires free ports at the top of the subnets, where the routers are connected, it is a good fit when the entire topology is designed at the same time.

    Figure 3a - First optional simple topology: routers placed at the "top"




    b.    An alternative topology may allow for a single subnet to connect to a set of subnets that are isolated from each other.

    It is a simple solution for the case where existing subnets are to be connected to a common storage subnet.

    Only the new common subnet is required to provide "top" ports. The Up/Dn direction is maintained since the old subnets are placed at the top of the topology and connect to the routers with ports that were probably previously connected to hosts. The new common subnet connects to the old subnets via ports at the top of the subnet. Since no traffic enters the common subnet (going down) and then leaves to the other subnets (going up again), no credit loop is possible.


    Figure 3b - Second optional simple topology: routers placed at the "top" of the common subnet and below the old subnets



    NOTE: Figures 3a and 3b show a case where all routers connect to all subnets. This is NOT a requirement: a router may connect to a subset of the subnets.


    2.    Ensure the ports used by each subnet are in the same group of router ports (with the same subnet_prefix)

    The IB Router system requires configuration of the grouping of ports and subnets

    3.    Ensure that you have a sufficient number of routers between the subnets to maintain the desired bandwidth

    4.    OpenSM routing engine chains provide many options for routing topologies that a single engine cannot support


    Note: Routers can connect fat-tree, torus, and mesh topologies without the use of routing chains, but within each subnet the routers do need to be a valid part of each local topology.
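
    The "no down and then up turn" rule from item 1 above can be checked mechanically. The following is a minimal sketch, not part of any Mellanox tool: it assumes each hop of a routed path has been assigned a numeric level (higher means closer to the "top"), and the function name and example paths are hypothetical.

```python
from typing import Sequence

def has_down_then_up_turn(levels: Sequence[int]) -> bool:
    """Return True if the path descends at some hop and ascends at a later one."""
    gone_down = False
    for prev, cur in zip(levels, levels[1:]):
        if cur < prev:
            gone_down = True
        elif cur > prev and gone_down:
            return True
    return False

# Routers at the top: host -> leaf -> spine -> router -> spine -> leaf -> host.
top_router_path = [0, 1, 2, 3, 2, 1, 0]
# A path that dips down into a subnet and climbs back up through another router.
dip_path = [4, 3, 2, 3, 4]

print(has_down_then_up_turn(top_router_path))
print(has_down_then_up_turn(dip_path))
```

    A path for which the function returns True violates the rule and would need either a different topology or the Virtual Lane / Service Level based design mentioned above.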



    If you want administrative control over which subnets can talk to each other, you can prevent communication between subnets even when a router connects them. This may be a cost-effective solution, since it lets you use a single router while preventing communication between some of the subnets connected to it. For example, see the three subnets S1, S2 and S3 illustrated below. You need to decide which subnets are allowed to communicate and allocate a single globally unique P_Key to be used for that communication. Make sure that subnets that should not communicate either have no common router or have no common P_Keys assigned to the router ports. The actual P_Key assignments are performed by the SMs and are configured via the partitions.conf file on each subnet SM.


    Note 1: If you want to have two subnets talk to each other, they must share the same P_Key number. The IB Specification does not allow changing the P_Key across subnets.

    Note 2: It is not possible to route packets on two different P_Keys on the same subnet, or different subnets.


    Figure 4- P_Key Number Sharing
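
    The P_Key sharing rule can be illustrated with a small sketch. It assumes a hypothetical assignment of P_Keys to the router ports facing subnets S1, S2 and S3 (the values are made up and do not come from Figure 4) and lists the subnet pairs that share at least one P_Key and can therefore exchange routed traffic.

```python
from itertools import combinations

# Hypothetical P_Keys assigned to the router ports that face each subnet.
router_port_pkeys = {
    "S1": {0x0010},
    "S2": {0x0010, 0x0020},
    "S3": {0x0020},
}

def routable_pairs(pkeys_by_subnet):
    """Map each subnet pair to the P_Keys common to both router port groups."""
    pairs = {}
    for a, b in combinations(sorted(pkeys_by_subnet), 2):
        shared = pkeys_by_subnet[a] & pkeys_by_subnet[b]
        if shared:
            pairs[(a, b)] = shared
    return pairs

for (a, b), shared in routable_pairs(router_port_pkeys).items():
    print(a, "<->", b, "via", ", ".join(hex(p) for p in sorted(shared)))
```

    With this assignment S1 and S3 have no P_Key in common, so they stay isolated even though they share a router.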




    As of May 2016, IB Routers do not include an internal IP/IPoIB router (they are IB routers only). However, it is very common for management and storage applications to rely on IP connectivity for establishing connections.

    In order to support IPoIB communication between subnets you will need to rely on a secondary Ethernet network or use IP routers.

    If you do not want to have a secondary network, you first need to set up a dedicated IPoIB subnet on each IB subnet (created by selecting a different range of IPs for each), and then place IP routers between the subnets. Each IP router can connect several subnets. Since the IP routers do not carry bandwidth-critical or latency-critical traffic, they can be built from Linux boxes with an IPoIB interface on each subnet. You may want to refer to the tutorial on how to turn a Linux box into an IP router, which can be found at http://www.tecmint.com/setup-linux-as-router.



    • IPoIB traffic does not cross the IB Router since it does not carry the GRH header
    • It is possible, although not recommended, to make all the IB subnets a single IPoIB subnet. Instead, we recommend that users set up a different IPoIB subnet on each IB subnet
    • IP routers that connect to all subnets will perform the IP routing
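
    The recommendation of one IPoIB subnet per IB subnet can be sanity-checked before deployment. The sketch below uses Python's standard ipaddress module; the addressing plan and the helper names are hypothetical and serve only as an illustration.

```python
import ipaddress
from itertools import combinations

# Hypothetical addressing plan: one IPoIB (IP) subnet per IB subnet.
ipoib_plan = {
    "S1": ipaddress.ip_network("10.10.1.0/24"),
    "S2": ipaddress.ip_network("10.10.2.0/24"),
    "storage": ipaddress.ip_network("10.10.100.0/24"),
}

def overlapping(plan):
    """Return the pairs of IB subnets whose IP ranges overlap."""
    return [(a, b) for a, b in combinations(sorted(plan), 2)
            if plan[a].overlaps(plan[b])]

def ib_subnet_of(ip, plan):
    """Return the IB subnet whose IPoIB range contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, net in plan.items():
        if addr in net:
            return name
    return None

print(overlapping(ipoib_plan))
print(ib_subnet_of("10.10.2.17", ipoib_plan))
```

    An empty overlap list means the IP routers can tell the IB subnets apart by destination address alone.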


    Algorithmic Router Architecture

    In order to simplify the router implementation and provide full wire speed and lowest latency, Mellanox introduced the concept of the Algorithmic Router.

    The main idea is that the router does not need to look up or learn the L2 address (LID) that corresponds to an L3 address (GID).

    Such a lookup is normally required on the last hop of L3 forwarding, when packets reach the final subnet and must go through L2 forwarding to their final destination.

    The Algorithmic Router performs a simplified GID (L3) to LID (L2) mapping.

    Switch-IB implements a simple algorithmic mapping function that extracts the LID directly from the GID: the LID is set to the 16 least significant bits of the GID.

    The GIDs used for traffic that has to cross the router are therefore called "Algorithmic Routable GIDs" and are described in Figure 6.

    Other parameters of the L2 address vector, such as the P_Key, SL, MTU and Rate, are not flexible in the Switch-IB based algorithmic router.

    For these fields the outgoing packet uses the same values provided by the incoming packet's L2 header.


    Figure 6- Routable GID Format


    See also: LRH and GRH InfiniBand Headers.

    The algorithmic router uses the subnet prefix and the LID value extracted from the GID to perform a simple lookup of the egress port.
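
    The GID to LID mapping described above amounts to two bit operations. The sketch below assumes only what the text states (the LID is the 16 least significant bits of the GID) plus the standard placement of the subnet prefix in the upper 64 bits of a GID; the example GID is hypothetical, and its remaining bits are left at zero purely for illustration.

```python
import ipaddress

def split_routable_gid(gid: str):
    """Return (subnet_prefix, lid) for a GID written in IPv6 notation."""
    value = int(ipaddress.IPv6Address(gid))
    subnet_prefix = value >> 64   # upper 64 bits of the GID
    lid = value & 0xFFFF          # 16 least significant bits of the GID
    return subnet_prefix, lid

prefix, lid = split_routable_gid("fec0:0:0:2::1a")
print(hex(prefix), hex(lid))
```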


    How does IB Routing Work? A step by step description

    1. Network Setup

    1. During network setup, the OpenSM of each subnet has to allocate both LIDs and Routable GIDs to the end-ports.
    2. As of May 2016 the MOFED solution relies on ibacm to provide the IP to GID resolution.
      Pre-populated ibacm caches, containing the mapping of IPs to the routable GIDs, must be distributed to all end hosts.
    3. Name to IP resolution can be performed using DNS or the /etc/hosts file. In both cases the mapping of name to IP must be defined.


    To support these three setup tasks, MOFED provides a set of scripts, ib2ib*, that support a methodology for collecting the GUIDs and IPs from each subnet and preparing the SM guid2lid and ibacm cache files as well as the /etc/hosts and dhcp.db files.


    2. Name Resolution

    As described above, the name to IP resolution can be performed using DNS or the /etc/hosts file.


    3. Connection Establishment

    Once the IP address of the destination is obtained, the application should invoke librdmacm, which in turn uses the ibacm service. The kernel is likewise provided with a hook to resolve using ibacm, if it exists.

    The information provided in the connection request has to hold a Path Record from the local source HCA port, through the router, to the destination host port.

    So the first step is to resolve the routable GID of the destination, and the second is to resolve the L2 address of the router to forward the traffic to.

    Once these are resolved, a connection request can be sent to the remote node CM (over QP1) to initiate the connection.

    The connection manager (CM) residing on the node in the other subnet normally requires the reverse PathRecord, from its node to the request originator, to be embedded in the connection request.

    However, when the originator port is not on the same subnet as the CM node, the CM ignores these fields and uses the information provided in the packet headers instead, so the reverse PathRecord is not required.


    4. IP to GID Address Resolution

    For the May 2016 release, resolving IP to GID is based on the ibacm cache. The cache file is populated and provided to all the cluster nodes in the setup stage.

    When librdmacm is invoked, it first tries to call ibacm to perform the resolution, and ibacm then tries to locate the IP to GID record in its cache.


    5. Next Hop (L2) Address Resolution

    Before sending any InfiniBand traffic the client application or Kernel module has to obtain a PathRecord which describes the L2 address of the destination.

    A PathRecord is obtained from the Subnet Administration (SA) by providing the source and destination GIDs.

    It is important that the provided destination GID includes the subnet prefix of the destination as well as its GUID.

    The router-ready OpenSM inspects the routers that connect to the destination subnet and may further filter them by criteria provided in the Router policy file or within the PathRecord query.

    For example, a query that provides a specific P_Key only allows routing via routers that support that P_Key on the ports of both subnets.

    The SM then performs destination-based routing, selects which of the possible routers will carry the traffic, and provides its LID as the DLID in the returned PathRecord.
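
    The router selection described above can be pictured as a filter followed by a destination-based choice. The sketch below is illustrative only and is not the OpenSM implementation: the Router class, its fields, and the modulo-based selection policy are all assumptions, since the text only says that the choice is destination based.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Router:
    lid: int                    # LID of the router port on the source subnet
    remote_prefixes: frozenset  # subnet prefixes this router connects to
    pkeys: frozenset            # P_Keys present on both of its subnet ports

def select_router_dlid(routers, dgid, pkey=None):
    """Return the DLID for the PathRecord, or None if no router qualifies."""
    dest_prefix = dgid >> 64
    candidates = sorted(
        (r for r in routers
         if dest_prefix in r.remote_prefixes
         and (pkey is None or pkey in r.pkeys)),
        key=lambda r: r.lid,
    )
    if not candidates:
        return None
    # Assumed policy: spread destinations over the candidate routers by GID.
    return candidates[dgid % len(candidates)].lid

PREFIX_B = 0xFEC0000000000002  # hypothetical destination subnet prefix
routers = [
    Router(lid=0x10, remote_prefixes=frozenset({PREFIX_B}), pkeys=frozenset({0x10})),
    Router(lid=0x11, remote_prefixes=frozenset({PREFIX_B}), pkeys=frozenset({0x10, 0x20})),
]
dgid = (PREFIX_B << 64) | 0x001A

print(select_router_dlid(routers, dgid))
print(select_router_dlid(routers, dgid, pkey=0x20))
```

    A query that names a P_Key narrows the candidate set first, so it can return a different router than an unrestricted query for the same destination.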


    6. Sending routable traffic to the network

    The traffic must be sent with the correct routable SGID so that the receiver node on the other side of the router can perform the PathRecord query and reply.

    The InfiniBand specification provides means for the SM to configure the subnet prefix of each port. It also allows the SM to associate multiple GUIDs with a port.

    The question, though, is how the device knows which of these GUIDs to use when sending packets.

    The answer is that, for librdmacm and other kernel clients to use the correct GUID, the IPoIB interface of the IB port must be associated with that particular routable GID.

    This setting is performed during the setup phase.


    7. Forwarding through the Router

    For single-hop routing, the router itself performs the minimal task of replacing the DLID with the destination LID, which is extracted directly from the DGID available in the packet GRH.
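
    The forwarding step can be sketched as a header rewrite. This is a simplified model, not the Switch-IB data path: the Packet class, its fields, and the prefix-to-port table are hypothetical, and the egress lookup is reduced to the subnet prefix alone. It shows the DLID being replaced by the low 16 bits of the DGID while the other fields are carried over unchanged.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    dlid: int  # destination LID (the router's LID when the packet arrives)
    dgid: int  # destination GID from the GRH, as a 128-bit integer
    pkey: int  # carried over unchanged
    sl: int    # carried over unchanged

def forward(packet, egress_port_by_prefix):
    """Return (egress_port, rewritten_packet), or None if not routable."""
    prefix = packet.dgid >> 64
    port = egress_port_by_prefix.get(prefix)
    if port is None:
        return None  # destination subnet is not directly attached (single hop only)
    return port, replace(packet, dlid=packet.dgid & 0xFFFF)

PREFIX_B = 0xFEC0000000000002  # hypothetical subnet prefix
incoming = Packet(dlid=0x10, dgid=(PREFIX_B << 64) | 0x001A, pkey=0x10, sl=0)

print(forward(incoming, {PREFIX_B: 3}))
```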