HowTo Monitor InfiniBand Fabric for Unplanned Topology Changes

Version 2

    There are several methods to monitor the InfiniBand cluster topology. This post shows one of the easiest ways to detect topology changes, such as:

    • Additions of nodes
    • Node replacement
    • Lost Links
    • Changes in speed

     

    Overall Process Overview

    • Take a snapshot of the fabric when it's in a 'known good' state, and cache the results in a file
    • On a periodic basis, run a utility to re-scan the topology and compare it with the cached one
    • Update the 'known good' topology after a deliberate topology change is made

     

    The ibnetdiscover utility is the basis for this process: it creates the 'known good' snapshot and later compares the current topology with the cached version.

    The specific flags for ibnetdiscover are:

    --cache <file>    scan the fabric topology and save it in <file>

    --diff <file>     scan the fabric topology and compare it with the version cached in <file>
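
The third step of the process (updating the 'known good' topology after a deliberate change) can be sketched with the --cache flag alone. The function name refresh_baseline, the IBNETDISCOVER and BASELINE variables, and the timestamped backup naming are illustrative assumptions, not part of ibnetdiscover:

```shell
#!/bin/sh
# Hypothetical settings; adjust for the local environment.
IBNETDISCOVER="${IBNETDISCOVER:-ibnetdiscover}"
BASELINE="${BASELINE:-/tmp/ibnet_orig}"

# Re-scan the fabric and replace the 'known good' cache file,
# keeping the previous baseline under a timestamped name.
refresh_baseline() {
    new="$BASELINE.new"
    # Scan into temporary files first so a failed scan leaves the
    # old baseline untouched. The textual description is kept too.
    "$IBNETDISCOVER" --cache "$new" > "$new.txt" || return 1
    if [ -f "$BASELINE" ]; then
        mv "$BASELINE" "$BASELINE.$(date +%Y%m%d-%H%M%S)" || return 1
    fi
    mv "$new" "$BASELINE" && mv "$new.txt" "$BASELINE.txt"
}
```

Calling refresh_baseline after planned maintenance would leave the new snapshot in $BASELINE and a human-readable copy in $BASELINE.txt.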

     

    IB Fabric Monitoring Example

    The subject fabric for this example originally contains several switches and HCAs.

     

    Step 1: To capture the initial topology, run ibnetdiscover from one of the IB-connected servers. Note: any IB host with the IB utilities (ibutils2) installed can be used.

     

    The 'cache' flag takes the name of a file (in this case /tmp/ibnet_orig) that will hold the topology information:

    # ibnetdiscover --cache /tmp/ibnet_orig

     

    While creating the topology cache file, ibnetdiscover also prints its usual textual topology description. The first few lines from the lab example, enough to show the first IB switch, are presented below:

    vendid=0x2c9            

    devid=0xc738

    sysimgguid=0x2c903005dd6b0

    switchguid=0x2c903005dd6b0(2c903005dd6b0)

    Switch  36 "S-0002c903005dd6b0"         # "MF0;switch-5eaf50:SX6036/U1" enhanced port 0 lid 10 lmc 0

    [16]    "H-0002c9030021f8f0"[1](2c9030021f8f1)          # "nissim2 HCA-1" lid 7 4xFDR     

    [19]    "H-0002c9030021f980"[1](2c9030021f981)          # "nissim1 HCA-1" lid 11 4xFDR

    [20]    "H-0002c9030021f9f0"[1](2c9030021f9f1)          # "nissim3 HCA-1" lid 12 4xFDR

    [21]    "S-0002c90300663c80"[5]         # "MF0;switch-5e1436:SX6506/L01/U1" lid 1 4xFDR10

    [25]    "H-0002c90300455ff0"[1](2c90300455ff1)          # "appliance2 HCA-1" lid 15 4xFDR

    [26]    "H-0002c90300e600c0"[1](2c90300e600c1)          # "appliance1 HCA-1" lid 16 4xFDR

    <snip>

     

    This IB switch has been assigned a Local ID (LID) of 10 by the Subnet Manager, and has connections to one other IB switch and 5 HCAs.
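
The neighbor count above can be derived mechanically from the textual output. The following is a minimal sketch, assuming the line layout shown in the listing (a "Switch" header line followed by "[port]" lines whose peer ID starts with "S-" for a switch or "H-" for an HCA); count_peers is a hypothetical helper name:

```shell
#!/bin/sh
# Count the switch and HCA neighbors of one switch.
# $1: node ID of the switch, e.g. S-0002c903005dd6b0
# The ibnetdiscover text output is read from stdin.
count_peers() {
    awk -v id="$1" '
        # Enter the block of the requested switch, leave it at any other node.
        $1 == "Switch"           { in_block = ($3 == "\"" id "\""); next }
        $1 == "Ca" || $1 == "Rt" { in_block = 0; next }
        in_block && /^[ \t]*\[[0-9]+\][ \t]+"S-/ { sw++ }
        in_block && /^[ \t]*\[[0-9]+\][ \t]+"H-/ { hca++ }
        END { printf "switches=%d hcas=%d\n", sw, hca }
    '
}
```

For example, `ibnetdiscover | count_peers S-0002c903005dd6b0` would report the totals for every connected port of that switch, not only those in the excerpt above.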

     

    Step 2: Use ibnetdiscover with the 'diff' flag to see if anything has changed.

    To make the example more interesting, first disable port 16 on the IB switch shown above (the one whose LID is 10):

    # ibportstate 10 16 disable

    From the partial ibnetdiscover output above, we know that switch port 16 is physically connected to an HCA named "nissim2 HCA-1" with a LID of 7.
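
Rather than reading the listing by eye, the peer on a given switch port can be looked up from the saved text output. This is a sketch under the same layout assumptions as the listing in Step 1; peer_on_port is a hypothetical helper name:

```shell
#!/bin/sh
# Print the description of whatever is connected to one switch port.
# $1: switch node ID, $2: port number; topology text is read from stdin.
peer_on_port() {
    awk -v id="$1" -v port="$2" '
        $1 == "Switch"           { in_block = ($3 == "\"" id "\""); next }
        $1 == "Ca" || $1 == "Rt" { in_block = 0; next }
        in_block && $1 == "[" port "]" {
            sub(/^[^#]*#[ \t]*/, "")   # keep only the trailing comment
            print
        }
    '
}
```

With the lab listing as input, `peer_on_port S-0002c903005dd6b0 16` would print the comment portion of the port 16 line, which names the HCA and its LID.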

    Now, run ibnetdiscover with the 'diff' flag against the topology file previously cached:

     

    # ibnetdiscover --diff /tmp/ibnet_orig  

     

    This results in the following text output:

    vendid=0x2c9                                    

    devid=0xc738

    sysimgguid=0x2c903005dd6b0

    switchguid=0x2c903005dd6b0(2c903005dd6b0)

    Switch  36 "S-0002c903005dd6b0"         # "MF0;switch-5eaf50:SX6036/U1" enhanced port 0 lid 10 lmc 0

    < [16]  "H-0002c9030021f8f0"[1](2c9030021f8f1)          # "nissim2 HCA-1" lid 7 4xFDR

    < vendid=0x2c9                   

    < devid=0x1003

    < sysimgguid=0x2c9030021f8f3

    < caguid=0x2c9030021f8f0

    < Ca    2 "H-0002c9030021f8f0"          # "nissim2 HCA-1"

    < [1](2c9030021f8f1)    "S-0002c903005dd6b0"[16]                # lid 7 lmc 0 "MF0;switch-5eaf50:SX6036/U1"  lid 10 4xFDR

     

    Insight

    1. The first portion of the above output shows that the switch with LID 10 is still part of the IB subnet: its description is not preceded by an angle bracket ('<'). Its port 16 connection, however, is gone, as indicated by the angle bracket in front of that line. None of the other switch ports changed status, so they are not listed.

    2. The second portion of the 'diff' output shows that the HCA with LID 7 is no longer part of the IB subnet: every line of its description is preceded by an angle bracket. Its port 1 connection to the switch is gone.

    All of these topology changes are expected, as the result of manually disabling switch Port 16.

    In general, whenever a link (cable) disappears, ibnetdiscover reports two topology changes: one for the port at each end of the affected link.
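
The "two changes per lost link" rule can be checked by counting the port lines in the diff output. In this sketch, '<' marks entries present only in the cached topology, as in the output above; treating '>' as the marker for entries present only in the current fabric is an assumption, since this example shows no additions. summarize_diff is a hypothetical helper name:

```shell
#!/bin/sh
# Count port entries that disappeared from, or appeared in, the fabric.
# Reads the output of "ibnetdiscover --diff <file>" from stdin.
summarize_diff() {
    awk '
        /^[ \t]*< *\[[0-9]+\]/ { removed++ }   # port line only in the cache
        /^[ \t]*> *\[[0-9]+\]/ { added++ }     # assumed: port line only in the current scan
        END { printf "removed_ports=%d added_ports=%d\n", removed, added }
    '
}
```

For the diff shown in this example, the port 16 line on the switch and the port 1 line on the HCA would each be counted once as removed.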

     

    To finish the example, re-enable port 16 on the switch:

    # ibportstate 10 16 enable

    Now run ibnetdiscover with the 'diff' flag again:

    # ibnetdiscover --diff /tmp/ibnet_orig 

    The output is a single line indicating that the current topology exactly matches the cached one:

    <ALL GOOD>

     

    A script that runs ibnetdiscover periodically can easily check for this single line of output to determine whether the topology has changed.
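
A minimal sketch of such a check follows. It assumes the "no change" output is exactly the single line shown above; if the installed ibnetdiscover version reports a match differently, MATCH_LINE would need adjusting. The function name check_fabric and the IBNETDISCOVER, BASELINE, and MATCH_LINE variables are illustrative, not part of the tool:

```shell
#!/bin/sh
# Hypothetical settings; adjust for the local environment.
IBNETDISCOVER="${IBNETDISCOVER:-ibnetdiscover}"
BASELINE="${BASELINE:-/tmp/ibnet_orig}"
MATCH_LINE="<ALL GOOD>"   # the single "no change" line from the example

# Return 0 if the fabric matches the baseline, 1 otherwise.
# Any other output (differences or error messages) is printed
# so that the caller can log it or send it in an alert.
check_fabric() {
    out=$("$IBNETDISCOVER" --diff "$BASELINE" 2>&1)
    if [ "$out" = "$MATCH_LINE" ]; then
        return 0
    fi
    printf 'Topology check against %s did not match:\n%s\n' "$BASELINE" "$out"
    return 1
}
```

A cron job or similar scheduler could source this file, call check_fabric, and raise an alert on a non-zero status. Treating tool errors the same as topology differences is a deliberate choice here: a scan that cannot run is also worth investigating.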