Scalable Subnet Administration (SSA)

Version 25

    This post discusses the Scalable Subnet Administration (SSA) solution for InfiniBand clusters. It assumes familiarity with InfiniBand (IB) terminology, especially terms related to IB management.

     



    Introduction

    Subnet Administration (SA) is defined in the IB architecture, volume 1, chapter 15 of the IBA 1.3 and 1.2.1 specifications. SA includes both query and event forwarding subsystems. The query subsystem responds to various queries from end ports in the subnet, most notably path record (PR) queries, but many other record types as well. The event forwarding subsystem forwards SM-received traps and notices to subscribed parties. SA is logically part of the SM and is typically implemented tightly coupled with the SM, as in the OpenSM implementation.

     

    SA has scalability issues due to its centralized nature. The classic issue arises when all nodes want to communicate with all other nodes, which creates an O(N²) load on the SA. This is what happens with MPI all-to-all connectivity, and it is even worse when every CPU core within a node does this. On a 40,000 node subnet, all-to-all is 1.6 billion path records assuming single-core CPUs (40,000² = 1.6 billion). Assuming the SA can service 50K path records per second, this would take roughly 9 hours (1.6 billion / 50K ≈ 32,000 seconds).

     

    Scalable SA (SSA) turns this into a distributed problem by distributing the data needed to perform the path record calculation for a node to connect to another node, and by caching the results locally on the compute (client) nodes.

     

    SSA is composed of several user space software modules.

     

    SSA forms a distribution tree with up to 4 layers. At the top of the tree is the core layer, which is co-resident with the OpenSM. The next layer in the tree is the distribution layer, which fans out to the access layer. Consumer/compute nodes (running ACM) are at the lowest layer of the tree and connect to access layer nodes. The size of the distribution tree depends on the number of compute nodes.

     

    SSA distributes the SM database down the distribution tree to the access nodes. The access nodes compute the SA path record "half-world" database for their client (compute) nodes and populate the caches in the ACMs. Half-world means paths from the client (compute) node to every other node in the IB subnet.

     

    The SSA tree architecture is defined in the following figure:

    [Figure: SSA tree architecture (SSA arch slide.jpg)]

    The distribution tree fanout for 40K compute nodes will be as follows (10 × 20 × 200 = 40,000 consumers):

    • 10 distribution nodes per core node
    • 20 access nodes per distribution node
    • 200 consumer nodes per access node

    [Figure: SSA distribution tree fanout example (SSA arch slide 2.jpg)]

    For smaller configurations, the core and access layers can be combined, as can the distribution and access layers; the distribution layer can also be omitted entirely.

     

    The recent SSA release (0.0.9 for upstream and 0.0.9.1 for MOFED 3.1) adds the following features:

    • IP address support
    • admin support

    in addition to previous features.

     

    Kernel IP support allows the IPv4 ARP and/or IPv6 neighbor caches in the kernel to be prepopulated by the SSA ACM.
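
    As a quick sanity check (a generic iproute2 check, not an SSA-specific procedure), the prepopulated entries can be inspected on a consumer node; ib0 is assumed here to be the IPoIB interface name:

    ip -4 neigh show dev ib0
    ip -6 neigh show dev ib0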

     

    Admin support consists of the ssadmin utility and admin support in the SSA nodes. The ssadmin utility is used to monitor, debug, and configure the SSA layers: core, distribution, access, and ACM.

     

    0.0.9 also now supports ConnectX-4 (as well as Connect-IB). This was a limitation in the previous 0.0.8 release(s).

     


    The previous SSA release (0.0.8 for upstream and 0.0.8.1 for MOFED 3.0) includes the following features:

    • Initial SSA bring up in arbitrary order based on distribution tree rebalancing

    • OpenSM failover/handover

    • Non-core node resilience based on rsocket keepalive

    • Distribution tree rejoin/reconnect

    • Multithreaded PathRecord computation in access layer

     

     

    When to use SSA

     

    SSA was designed for large subnets (40K compute nodes) but can also be deployed in smaller subnet configurations that have a high SA query rate.

     

    SSA should be considered whenever the SA load is too much for the SM in use. This is indicated by timeouts on SA path record queries. On a host-based SM, the limit is typically around 50K path records per second; on a managed switch SM, it is much lower.

     

    Using the nominal example of MPI all-to-all communication in a 40K node subnet: where a centralized SA handles 50K path records per second in total, each ACM can handle at least that rate locally, so the subnet-wide aggregate is 40K * 50K, or 2 billion path records per second.

     

     

    Requirements

     

    There are two key requirement areas for SSA: kernel and user space. The kernel and user space packages needed differ depending on whether upstream or MOFED 3.1/3.0 is being used.

     

    Upstream Requirements

     

    Kernel

     

    SSA requires a kernel that contains AF_IB address family support. This support is present in the upstream kernel as of 3.11, so any kernel/distro based on 3.11 or later is sufficient.
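
    A quick way to check whether a given system meets this requirement (a generic check, not from the original post):

    uname -r
    # any 3.11 or later kernel includes AF_IB address family support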

     

    Known distros with recent enough kernels for SSA:

    • Fedora Core (Rawhide, FC19 or later)
    • OpenSuSE 13.2 is 3.16 based (moving to 3.17)
    • SLES 12.0 is 3.12 based
    • Ubuntu 14.04 is 3.13 based
    • Ubuntu 14.10 is 3.16 based
    • Ubuntu 15.04 is 3.19 based

     

    Note that both RHEL 7.0 and RHEL 7.1 use kernel 3.10, so they do not support SSA.

     

    SSA was tested on Ubuntu 12.04.1 with kernel version 3.12.0-031200-generic.

     

    Stable kernels that include the following recent patch needed by SSA include 3.14, 3.18, and 3.19:

     

    commit c2be9dc0e0fa59cc43c2c7084fc42b430809a0fe

    Author: Ilya Nelkenbaum <ilyan@mellanox.com>

    Date:   Thu Feb 5 13:53:48 2015 +0200

     

    IB/core: When marshaling ucma path from user-space, clear unused fields

    When marshaling a user path to the kernel struct ib_sa_path, we need to zero smac and dmac and set the vlan id to the "no vlan" value.

     

    This is to ensure that Ethernet attributes are not used with InfiniBand QPs.

     

    Fixes: dd5f03beb4f7 ("IB/core: Ethernet L2 attributes in verbs/cm structures")

     

    Signed-off-by: Ilya Nelkenbaum <ilyan@mellanox.com>

    Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>

    Signed-off-by: Roland Dreier <roland@purestorage.com>
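
    If building the kernel from a git tree, one generic way (an assumption on our part, not from the original post) to confirm that the tree or a given tag contains this fix:

    # list release tags that already contain the fix
    git tag --contains c2be9dc0e0fa59cc43c2c7084fc42b430809a0fe
    # or check whether the currently checked-out tree includes it
    git merge-base --is-ancestor c2be9dc0e0fa59cc43c2c7084fc42b430809a0fe HEAD && echo "fix present"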

     

     

     

    User space packages

     

    Upstream packages are available from the OFA management downloads page and the SSA GitHub repositories.

     

    • libibumad 1.3.10.2 or later
    • OpenSM 3.3.17 or later
      • If not running PerfMgr, OpenSM 3.3.17 or later is sufficient.
      • If running PerfMgr, OpenSM 3.3.19 is needed.
    • libibverbs 1.1.8
    • librdmacm 1.0.20 (AF_IB and keepalive support) or later.
      • Note that librdmacm contains 4 AF_IB capable examples: rstream, ucmatose, riostream, and udaddy.
    • glib-2.0
    • HCA libraries
      • libmlx4 1.0.6
      • libmlx5 1.0.2
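
    On an RPM-based system, one quick way to confirm that installed versions meet these minimums is shown below (the package names are an assumption and vary by distro; the libraries may also be built from source instead):

    rpm -q libibumad libibverbs librdmacm opensm libmlx4 libmlx5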

     

     

     

    MLNX_OFED 3.1/3.0 Requirements

     

    SSA MLNX_OFED 3.1/3.0 has only been tested on SLES 12.

     

    MLNX_OFED 3.1/3.0 contains versions of all the upstream packages that include SSA support.

    Note that ACM is named ibacm_ssa to distinguish it from the original ACM (ibacm).

     

    SSA Packages

     

    SSA contains the following packages:

    • OpenSM SSA plugin (libopenssa)
    • ibssa executable (for distribution and access nodes)
    • ibacm executable (for consumer/compute nodes)
    • ssadmin executable (for any SSA node) - starting with 0.0.9 release

    Also included are scripts and configuration files.

     

     

    OpenMPI with AF_IB Support

     

    OpenMPI with AF_IB support is not included with SSA, so it needs to be built and installed separately. It is part of the upcoming OpenMPI 2.0 release and is available on the mainline of the OpenMPI GitHub tree.

    To build OpenMPI for use with SSA, configure as follows before building:

     

    ./configure --enable-openib-rdmacm-ibaddr --enable-mpirun-prefix-by-default --with-verbs=/usr/local --disable-openib-connectx-xrc

     


    For MLNX_OFED 3.1/3.0, the following should be used:

     

    ./configure --enable-openib-rdmacm-ibaddr --enable-mpirun-prefix-by-default --disable-openib-connectx-xrc
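
    Once built, a job launch that selects the openib BTL with the rdmacm connection manager might look like the sketch below (the host names and application binary are placeholders, and the exact MCA parameters depend on the OpenMPI version in use):

    mpirun -np 2 -H node1,node2 \
        --mca btl openib,self \
        --mca btl_openib_cpc_include rdmacm \
        ./my_mpi_app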

     

     

    Building and Installing SSA

     

    On core nodes, libibumad, OpenSM, libibverbs, librdmacm, and the HCA-specific library must be built and installed prior to libopensmssa.

     

    On distribution or access nodes, libibumad, libibverbs, librdmacm, and the HCA-specific library must be built and installed prior to SSA (ibssa).

     

    On consumer nodes, libibumad, libibverbs, librdmacm, and the HCA-specific library must be built and installed prior to ACM.

     

    Once the prerequisites are built and installed, the relevant SSA tarball(s) are then built and installed via:

    ./autogen.sh && ./configure && make && make install

     

    in libopensmssa, distrib, and acm directories.
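
    For example, a minimal sketch (assuming the three component directories have been unpacked side by side; in practice only the components relevant to a node's role need to be built on that node, and make install typically requires root privileges):

    for d in libopensmssa distrib acm; do
        (cd "$d" && ./autogen.sh && ./configure && make && make install)
    done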

     

    OpenSM (on core nodes) needs to be configured as follows in the OpenSM configuration file (typically opensm.conf):

     

    # Event plugin name(s)

    event_plugin_name opensmssa

     

    # Options string that would be passed to the plugin(s)
    event_plugin_options (null)
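
    One way (an assumption on our part; the log location and messages depend on the OpenSM configuration) to confirm the plugin was loaded is to check the OpenSM log after restarting OpenSM:

    grep -i ssa /var/log/opensm.log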

     

    SSA configuration is then performed as follows:

    • Core nodes use the ibssa_core_opts.cfg file.
    • Distribution nodes use the ibssa_opts.cfg file.
    • ACM/consumer nodes use the ibacm_opts.cfg file.
    • IP support is configured on core nodes via the ibssa_hosts.data file.

     

    Follow instructions in those files.

     

    On ACM nodes, ib_acme can be run with the -A and -O options to generate the ibacm_opts.cfg and ibacm_addr.data files for that machine/system. This only needs to be done once (at initial install time).
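
    For example (a sketch; the generated files should then be placed wherever the installed ibacm expects its configuration, commonly /etc/rdma):

    ib_acme -A -O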

     

    Kernel IP Support

     

    The maximum size of the kernel neighbor cache may need to be increased, depending on the size of the hosts file being used. gc_thresh3 is the hard maximum number of entries that can be kept in the neighbor (ARP) table.

    See http://blog.lachmann.org/?p=204 and http://linux.die.net/man/7/arp

    The default value (in /proc/sys/net/ipv4/neigh/default/gc_thresh3) is 1024.

    To increase it: sysctl -w net.ipv4.neigh.default.gc_thresh3=49152
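
    To make a larger limit persist across reboots (a generic sysctl sketch, not from the original post; the IPv6 counterpart is shown for the case where the IPv6 neighbor cache is also synchronized), the settings can be added to /etc/sysctl.conf or a file under /etc/sysctl.d/:

    net.ipv4.neigh.default.gc_thresh3 = 49152
    net.ipv6.neigh.default.gc_thresh3 = 49152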

     

    In ACM, there is neigh_mode option in ibacm_opts.cfg:

    # neigh_mode:

    # Specifies whether IPv4 and/or IPv6 user space cache

    # is synchronized with kernel neighbor cache

    # 0 - no sync with kernel (default)

    # 1 - sync IPv4 neighbor (ARP) cache

    # 2 - sync IPv6 neighbor cache

    # 3 - sync both IPv4 and IPv6 neighbor caches

     

    There is also a support_ips_in_addr_cfg option there:

     

    # support_ips_in_addr_cfg:

    # 1 -  read IP addresses from ibacm_addr.cfg

    # Default is 0 (no)
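
    For example (illustrative values matching the option descriptions above, not shipped defaults), to synchronize both the IPv4 and IPv6 kernel neighbor caches and read IP addresses from ibacm_addr.cfg, the following could be set in ibacm_opts.cfg:

    neigh_mode 3
    support_ips_in_addr_cfg 1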

     

    On core nodes, there are the following options:

    # addr_preload:

    # Specifies if the address resolution records should be preloaded

    # and attached to generated SMDB, that will be further pushed to

    # SSA downstream nodes. Preloaded records will be stored in ACM

    # clients address cache.

    # Supported preload values are:

    # 0 - don't preload

    # any non-zero value - preload addr_data_file

     

    # acm_preload 1

     

    # addr_data_file:

    # Specifies the location of the address data file to use when preloading

    # address resolution records.  This option is only valid if addr_preload

    # option is on (non-zero value).

    # Default is RDMA_CONF_DIR/ibssa_hosts.data

     

    # addr_data_file /etc/rdma/ibssa_hosts.data

     

    The format of the ibssa_hosts.data file is identical to that of the ibacm_hosts.data file:

    #

    # Entry format is:

    # address IB GID [<QPN> [<flags>]]

    #

    # The address may be one of the following:

    # host_name - ascii character string, up to 31 characters

    # address - IPv4 or IPv6 formatted address

    #

    # There can be multiple entries for a single IB GID

    #

    # QPN    - remote QP number (optional)

    #

    # flags  - 8 bits that indicate the connected modes supported

    #          by the remote interface:

    #              bit 7 specifies "reliable connected" (RC) mode

    #              bit 6 specifies "unreliable connected" (UC) mode

    #              bits 5-0 are reserved and MUST be set to 0

    #

    #          * if no QPN was specified, flags should not be specified as well

    #          * in case of only QPN specified, flags will get default 0x80 value

    #

    # All entries are divided into pkey sections. Before each section '[pkey=xxxx]'

    # will specify the pkey value (should be in hex) for section entries.

    # If empty '[]' appears then subsequent entries (till the next pkey section)

    # will have the default pkey value: 0xffff. The same goes for entries without

    # section specified (before any pkey section has started).

    #

    # Samples:

    #

    # luna3                   fe80::8:f104:39a:169

    # fe80::208:f104:39a:169  fe80::8:f104:39a:169

    # 192.168.1.3             fe80::8:f104:39a:169  0xaabbcc

    #

    # [pkey = 6FFF]

    # 192.168.1.4             fe80::8:f104:39a:169  0xaabbcc 0x80

    #

    # []

    # 192.168.1.5             fe80::8:f104:39a:169  0xaabbcc

    #

    # [pkey = 0x7FFF]

    # 192.168.1.6             fe80::8:f104:39a:169  0xaabbcc 0x80

     

    Note that host names are not currently verified in 0.0.9 releases (only IPv4 and IPv6 addresses).


    Note also that records are currently never removed from the IP and name tables in the ACM caches; they are only added or replaced.

     

     

     

    Admin Support

     

    Admin support consists of the ssadmin utility and admin support in the SSA nodes. The ssadmin utility is used to monitor, debug, and configure the SSA layers: core, distribution, access, and ACM.

     

    See the ssadmin man page for more details.

     

     

     

     

    Known Limitations/Issues

     

    • Only x86_64 processor architecture has been tested/qualified.
    • Only single P_Key (full default partition - 0xFFFF) currently supported.
    • Virtualization (alias GUIDs) and QoS (including QoS-based routing, e.g. torus-2QoS) are not currently supported.
    • Only rudimentary testing with qib (verified keepalives).
    • mlx4_core HW2SW_MPT -16 error requires update to recent firmware (internal build 2.33.1230 or later, GA build 2.33.5000 for ConnectX-3 Pro).
    • If running with OpenSM PerfMgr, need OpenSM 3.3.19 or later. Possible seg fault in PerfMgr was fixed there.
    • ACM is only tested in SSA acm_mode and not ACM acm_mode.
    • IP and name tables in ACM caches do not have records removed currently; only added or replaced.
    • mlx5 has been tested with Connect-IB but not ConnectX-4 (applies to the 0.0.8 and 0.0.8.1 releases only).