Designing an HPC Cluster with Mellanox InfiniBand Solutions

Version 6

    High-performance computing (HPC) uses parallel processing to enable faster execution of highly compute-intensive tasks such as climate research, molecular modeling, physical simulations, cryptanalysis, geophysical modeling, automotive and aerospace design, financial modeling, data mining and more. High-performance simulations require the most efficient compute platforms. The execution time of a given simulation depends upon many factors, such as the number of CPU/GPU cores and their utilization factor, and the interconnect performance, efficiency, and scalability. Efficient high-performance computing systems require high-bandwidth, low-latency connections between thousands of multi-processor nodes, as well as high-speed storage systems.

     

    This reference design describes how to design an HPC cluster using Mellanox InfiniBand interconnect solutions.

     

     


    Network Topologies

    There are several common topologies for an InfiniBand fabric. The following lists some of those topologies:

    • Fat tree: A multi-root tree. This is the most popular topology.
    • 2D mesh: Each node is connected to four other nodes, in the positive and negative directions of the X and Y axes
    • 3D mesh: Each node is connected to six other nodes, in the positive and negative directions of the X, Y and Z axes
    • 2D/3D torus: The X, Y and Z ends of the 2D/3D meshes are “wrapped around” and connected to the first node

     

    Other topologies can also be used.

     

    Fat Tree Topology

    The most widely used topology in HPC clusters is the fat-tree topology. This topology typically enables the best performance at a large scale when configured as a non-blocking network. Where over-subscription of the network is tolerable, it is possible to configure the cluster in a blocking configuration as well. A fat-tree cluster typically uses the same bandwidth for all links and, in most cases, the same number of ports in all of the switches.

     

    Here is an example of a fat-tree topology that consists of 324 nodes built from 36-port switches.

    Note the following (the port arithmetic is also sketched in code after the figure):

    • The spine level consists of 9 switches marked as L2 (level 2), and the leaf level consists of 18 switches marked as L1 (level 1), for 27 switches in total.
    • 18 ports of each L1 switch can be connected to hosts, which allows 324 (18*18) nodes to be connected in a non-blocking manner.
    • Non-blocking means that half of the ports of each L1 switch (18 ports in this example) are connected to hosts, while the other half are connected to the spine (L2) switches.
    • Each L1 switch is connected to each L2 switch using two ports (9*2=18).

     

    [Figure: 324-node fat-tree built from 36-port switches]
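
    To make the port arithmetic concrete, below is a minimal Python sketch (a hypothetical helper for illustration, not a Mellanox tool; the function name and return values are assumptions) that sizes a non-blocking two-level fat tree built from fixed-port switches. It reproduces the 324-node example above, as well as the 648-node and 72-node configurations discussed later in this document.

        from math import ceil

        def two_level_fat_tree(nodes, radix=36):
            """Size a non-blocking two-level fat tree built from 'radix'-port switches.

            Each L1 (leaf) switch dedicates half of its ports to hosts and half to
            uplinks, so the largest non-blocking two-level tree has radix**2 / 2 nodes.
            """
            hosts_per_leaf = radix // 2                      # 18 host ports per L1 switch for radix 36
            max_nodes = radix * hosts_per_leaf               # 648 nodes for 36-port switches
            if nodes > max_nodes:
                raise ValueError(f"more than {max_nodes} nodes requires a 3-level (CLOS-5) topology")
            leaves = ceil(nodes / hosts_per_leaf)            # number of L1 switches
            spines = ceil(leaves / 2)                        # number of L2 switches
            links_per_l1_l2_pair = hosts_per_leaf // spines  # parallel links between each L1/L2 pair
            return leaves, spines, links_per_l1_l2_pair

        print(two_level_fat_tree(324))   # (18, 9, 2)  -> 27 switches in total
        print(two_level_fat_tree(648))   # (36, 18, 1) -> 54 switches in total
        print(two_level_fat_tree(72))    # (4, 2, 9)   -> the six-switch case discussed below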

    Rules for Designing the Fat-Tree Cluster

    The following rules must be adhered to when building a fat-tree cluster:

    • Non-blocking clusters must be balanced. The same number of links must connect a Level-2 (L2) switch to every Level-1 (L1) switch. Whether over-subscription is possible depends on the HPC application and the network requirements.
    • If the L2 switch is a director switch (that is, a switch with leaf and spine cards), all links from an L1 switch to that L2 switch must be evenly distributed among its leaf cards. For example, if six links run between an L1 and an L2 switch, they can be distributed to the leaf cards as 1:1:1:1:1:1, 2:2:2, 3:3, or 6, but never mixed, for example, 4:2 or 5:1 (a check of this rule is sketched after this list).
    • Do not create routes that must traverse up, back down, and then up the tree again. This creates a situation called a credit loop and can manifest itself as traffic deadlock in the cluster. In general, physical loops cannot be avoided: any fat-tree with multiple directors plus edge switches has physical loops, but credit loops are avoided by using a routing algorithm such as up-down.
    • Try to always use 36-port switches at L1 and director-class switches at L2. If this cannot be maintained, consult a Mellanox technical representative to ensure that the cluster being designed does not contain credit loops.
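
    The following is a minimal, illustrative Python check of the even-distribution rule above (an assumption for illustration, not part of any Mellanox tool): the links from an L1 switch to a director L2 switch must land on the director's leaf cards in equal counts.

        def is_evenly_distributed(links_per_leaf_card):
            """True if every leaf card that carries links carries the same number of them."""
            used = [n for n in links_per_leaf_card if n > 0]
            return len(used) > 0 and len(set(used)) == 1

        print(is_evenly_distributed([1, 1, 1, 1, 1, 1]))  # True  (1:1:1:1:1:1)
        print(is_evenly_distributed([2, 2, 2]))           # True  (2:2:2)
        print(is_evenly_distributed([6]))                 # True  (all six links on one leaf card)
        print(is_evenly_distributed([4, 2]))              # False (must not be mixed)
        print(is_evenly_distributed([5, 1]))              # False (must not be mixed)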

     

    For assistance in designing fat-tree clusters, use the Mellanox InfiniBand Cluster Configurator, an online cluster configuration tool that offers flexible cluster sizes and options.

     

    Here is an example of a balanced fat-tree topology.

    [Figure: balanced fat-tree topology]

     

    Blocking Scenarios for Small Scale Clusters

    In some cases, the cluster's end-port requirements may marginally exceed the maximum number of non-blocking ports possible in a “tiered” fat-tree.

    For example, if a cluster requires only 36 ports, it can be realized with a single 36-port switch building block. As soon as the requirement exceeds 36 ports, a 2-level (tier) fat-tree is needed. For example, if the requirement is for 72 ports, achieving a fully non-blocking topology requires six 36-port switches. In such configurations, the network cost does not scale linearly with the number of ports, but rises significantly. The same problem arises when one crosses the 648-port boundary for a 2-level fully non-blocking network. Designing a large cluster requires careful network planning. However, for small or mid-sized systems, one can consider a blocking network or even simple meshes.

     

    Consider a basic case of 48 ports as an example. This cluster can be realized with two 36-port switches with a blocking ratio of 1:2. This means that there are certain source-destination communication pairs that cause a single switch-to-switch link to carry the traffic of two communicating node pairs. On the other hand, the cluster can now be realized with only two switches instead of more.

     

    [Figure: 48-port cluster built from two 36-port switches (1:2 blocking)]
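
    The arithmetic behind this example can be sketched as follows (an illustrative calculation only, with assumed function names): splitting 48 end-ports across two 36-port switches leaves 12 ports on each switch for the inter-switch links, which gives the 1:2 blocking ratio mentioned above.

        def two_switch_blocking(hosts, radix=36):
            """Split 'hosts' end-ports across two 'radix'-port switches and report the result."""
            hosts_per_switch = hosts // 2                  # 24 hosts on each switch for 48 ports
            inter_switch_links = radix - hosts_per_switch  # 12 ports left on each switch for inter-switch links
            oversubscription = hosts_per_switch / inter_switch_links  # 2.0, i.e. the 1:2 blocking ratio
            return hosts_per_switch, inter_switch_links, oversubscription

        print(two_switch_blocking(48))   # (24, 12, 2.0)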

     

    Ring Topology and Credit Loops

    Note that a “ring” network of more than three switches is not a valid configuration and creates credit loops, resulting in network deadlock.

    [Figure: ring topology example]

     

    [Figure: credit loop example]
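
    To see why such a ring creates a credit loop, one can model each inter-switch link as a node in a channel-dependency graph and check that graph for cycles. The sketch below is a generic illustration (not a Mellanox routing tool; the graph encoding is an assumption made for this example): with all traffic forwarded around the ring, every link waits on credits from the next one and the dependency graph closes into a cycle.

        def has_cycle(deps):
            """Detect a cycle in a directed channel-dependency graph {link: set_of_links_it_waits_on}."""
            WHITE, GREY, BLACK = 0, 1, 2
            nodes = set(deps) | {m for targets in deps.values() for m in targets}
            color = {n: WHITE for n in nodes}

            def visit(n):
                color[n] = GREY
                for m in deps.get(n, ()):
                    if color[m] == GREY or (color[m] == WHITE and visit(m)):
                        return True
                color[n] = BLACK
                return False

            return any(color[n] == WHITE and visit(n) for n in nodes)

        # Ring of 4 switches with all traffic forwarded "clockwise": the buffer of each
        # inter-switch link waits on credits of the next link, closing a dependency cycle.
        n = 4
        ring = {i: {(i + 1) % n} for i in range(n)}
        print(has_cycle(ring))     # True  -> credit loop, potential deadlock

        # A fat tree routed strictly up-down produces no such cycle, for example:
        updown = {"leafA->spine": {"spine->leafB"}, "spine->leafB": set()}
        print(has_cycle(updown))   # False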

     

     

    Topology Examples

    CLOS-3 Topology (Non-Blocking)

     

    72 Node Fat-Tree

    Using 1U SX6036 switches.

    [Figure: 72-node fat-tree using 1U SX6036 switches]

    324 Node Fat-Tree

    Using 1U SX6036 switches or Director switch (SX6518).

     

    [Figure: 324-node fat-tree]

     

    648 Node Fat-Tree

    Using 1U SX6036 switches or Director switch (SX6536).

     

    [Figure: 648-node fat-tree]

     

    CLOS-5 Topology (Non-Blocking)

     

    1296 Node Fat-Tree

    Using 1U SX6036 switches and Director switch (SX6536).

     

    [Figure: 1296-node fat-tree (CLOS-5)]

     

     

    1944 Node Fat-Tree

    Using 1U SX6036 switches and Director switch (SX6536).

    [Figure: 1944-node fat-tree (CLOS-5)]

    3888 Node Fat-Tree

    Using 1U SX6036 switches and Director switch (SX6536).

     

    [Figure: 3888-node fat-tree (CLOS-5)]

    Performance Calculations

     

    The formula to calculate a node's performance in floating-point operations per second (FLOPS) is as follows:

    Node performance in FLOPS = (CPU speed in Hz) x (number of CPU cores) x (CPU instructions per cycle) x (number of CPUs per node)

     

    For example, for a dual-CPU Intel server based on Intel E5-2690 (2.9 GHz, 8-core) CPUs:

    2.9 (GHz) x 8 (cores) x 8 (instructions per cycle) x 2 (CPUs) = 371.2 GFLOPS (per server).

     

    Note: The number of instructions per cycle for E5-2600 series CPUs is equal to 8.

    To calculate the cluster performance, multiply this per-node number by the number of nodes in the HPC system to obtain the peak theoretical performance. A 72-node fat-tree cluster (using 6 switches) has:

     

    371.2GFLOPS x 72 (nodes) = 26,726GFLOPS = ~27TFLOPS

     

    A 648-node fat-tree (using 54 switches) cluster has:

     

    371.2GFLOPS x 648 (nodes) = 240,537GFLOPS = ~241TFLOPS
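
    The calculations above can be expressed as a small Python sketch (the values are those of the Intel E5-2690 example; substitute the parameters of the actual CPUs used, and note the function names are chosen here for illustration):

        def node_gflops(ghz, cores, instructions_per_cycle, cpus_per_node):
            """Peak node performance in GFLOPS, per the formula above."""
            return ghz * cores * instructions_per_cycle * cpus_per_node

        def cluster_tflops(nodes, gflops_per_node):
            """Peak theoretical cluster performance in TFLOPS."""
            return nodes * gflops_per_node / 1000.0

        per_node = node_gflops(2.9, 8, 8, 2)   # 371.2 GFLOPS for the dual E5-2690 server
        print(cluster_tflops(72, per_node))    # ~26.7 TFLOPS (~27 TFLOPS)
        print(cluster_tflops(648, per_node))   # ~240.5 TFLOPS (~241 TFLOPS)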

     

    For fat-trees larger than 648 nodes, the HPC cluster requires at least 3 levels of hierarchy. For advanced calculations that include GPU acceleration, refer to the following link:

    http://optimisationcpugpu-hpc.blogspot.com/2012/10/how-to-calculate-flops-of-gpu.html.

     

    The actual performance derived from the cluster depends on the cluster interconnect. On average, using 1 Gigabit Ethernet (1GbE) connectivity reduces cluster performance by 50%. Using 10GbE, one can expect a 30% performance reduction. The InfiniBand interconnect, however, yields 90% system efficiency; that is, only a 10% performance loss. Refer to www.top500.org for additional information.
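
    The table below applies these average efficiency factors to the peak theoretical numbers. A minimal sketch of that calculation (using the 50%/70%/90% factors quoted above; the dictionary layout is an illustrative assumption):

        # Average interconnect efficiency factors quoted in the text above.
        EFFICIENCY = {"1GbE": 0.50, "10GbE": 0.70, "FDR InfiniBand": 0.90}

        def delivered_tflops(theoretical_tflops):
            """Expected delivered performance per interconnect, given the peak value."""
            return {net: round(theoretical_tflops * eff, 1) for net, eff in EFFICIENCY.items()}

        print(delivered_tflops(27))    # {'1GbE': 13.5, '10GbE': 18.9, 'FDR InfiniBand': 24.3}
        print(delivered_tflops(241))   # {'1GbE': 120.5, '10GbE': 168.7, 'FDR InfiniBand': 216.9}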

     

     

    Cluster Size        Theoretical Performance (100%)   1GbE Network (50%)   10GbE Network (70%)   FDR InfiniBand Network (90%)   Units
    72-Node cluster     27                               13.5                 19                    24.3                           TFLOPS
    324-Node cluster    120                              60                   84                    108                            TFLOPS
    648-Node cluster    241                              120.5                169                   217                            TFLOPS
    1296-Node cluster   481                              240                  337                   433                            TFLOPS
    1944-Node cluster   722                              361                  505                   650                            TFLOPS
    3888-Node cluster   1444                             722                  1011                  1300                           TFLOPS

     

    Note: InfiniBand is the predominant interconnect technology in the HPC market. InfiniBand has many characteristics that make it ideal for HPC including:

    • Low latency and high throughput
    • Remote Direct Memory Access (RDMA)
    • Flat Layer 2 that scales out to thousands of endpoints
    • Centralized management
    • Multi-pathing
    • Support for multiple topologies

     

    Benchmarks

    Refer to the HPC Advisory Council website for various best-practice benchmark papers: HPC Advisory Council - Best Practices.

     

    See the following, for example:

     

    Maximizing Performance with HPC-X Software Toolkit

    The Mellanox HPC-X Toolkit is a comprehensive MPI, SHMEM and UPC software suite for high-performance computing environments. HPC-X also includes various acceleration packages to improve both the performance and scalability of applications running on top of these libraries, including MXM (Mellanox Messaging), which accelerates the underlying send/receive (or put/get) messages, and FCA (Fabric Collective Accelerator), which accelerates the underlying collective operations used by the MPI/PGAS languages. This full-featured, tested and packaged version of HPC software enables MPI, SHMEM and PGAS programming languages to scale to extremely large clusters by improving memory- and latency-related efficiencies, and ensures that the communication libraries are fully optimized for the Mellanox interconnect solutions.

     

    HPC-X main features are:

       

    • Complete MPI, SHMEM and UPC package, including the Mellanox MXM and FCA acceleration engines
    • Offloads collective communications from the MPI process onto the Mellanox interconnect hardware
    • Maximizes application performance with the underlying hardware architecture
    • Fully optimized for Mellanox InfiniBand and VPI interconnect solutions
    • Increases application scalability and resource efficiency
    • Multiple transport support, including RC, DC and UD
    • Intra-node shared-memory communication
    • Receive-side tag matching
    • Native support for MPI-3

     

    Communication Library Support in HPC-X

    To enable early and transparent adoption of the capabilities provided by Mellanox’s interconnects, Mellanox developed and supports two libraries:

    • Mellanox Messaging (MXM)
    • Fabric Collective Accelerator (FCA)

    These communication libraries provide full support for upper-layer protocols (ULPs; e.g. MPI) and PGAS libraries (e.g. OpenSHMEM and UPC). Both MXM and FCA are also provided as standalone libraries, with well-defined interfaces, and are used by several commercial and open-source ULPs to provide well-optimized point-to-point and collective communication that exploits the performance of the underlying interconnect. Below is a pictorial of these libraries and their core functions:

     

    [Figure: MXM and FCA communication libraries and their core functions]

     

    Fabric Collective Accelerator

    The Fabric Collective Accelerator (FCA) library provides support for MPI and PGAS collective operations. FCA is designed with a modular component architecture to facilitate the rapid deployment of new algorithms and topologies via component plugins. FCA can take advantage of the increasingly stratified memory and network hierarchies found in current and emerging HPC systems. Scalability and extensibility are two of the primary design objectives of FCA. As such, FCA topology plugins support hierarchies based on InfiniBand switch layout as well as shared-memory hierarchies, both within a socket and across NUMA domains. FCA's plugin-based implementation minimizes the time needed to support new types of topologies and hardware capabilities.

    At its core, FCA is an engine that enables hardware assisted non-blocking collectives. In particular, FCA exposes CORE-Direct capabilities. With CORE-Direct, the HCA manages and progresses a collective communication operation in an asynchronous manner without CPU involvement. FCA is endowed with a CORE-Direct plugin module that supports fully asynchronous, non-blocking collective operations whose capabilities are fully realized as implementations of MPI-3 non-blocking collective routines. In addition to providing full support for asynchronous, non-blocking collective communications, FCA exposes ULPs to the HCA’s ability to perform floating-point and integer reduction operations.

    Since the performance and scalability of collective communications often play a key role in the scalability and performance of many HPC scientific applications, Mellanox introduced CORE-Direct technology as one mechanism for addressing these issues. The offloaded capabilities can be leveraged to improve overall application performance by enabling the overlap of communication with computation. As system sizes continue to increase, the ability to overlap communication and computational operations becomes increasingly important to improve overall system utilization and time to solution, and to minimize energy consumption. This capability is also extremely important for reducing the negative effects of system noise. By using the HCA to manage and progress collective communications, the process skew attributed to kernel-level interrupts, and their tendency to “amplify” latency at large scale, can be minimized.
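
    As an illustration of the communication/computation overlap that CORE-Direct targets, here is a minimal MPI-3 non-blocking collective written with mpi4py and NumPy (both assumed to be installed; run with, for example, "mpirun -np 4 python overlap.py"). Whether the underlying MPI library actually offloads the collective via FCA/CORE-Direct depends on how it was built and configured; the code itself is plain, portable MPI rather than an FCA-specific API.

        from mpi4py import MPI
        import numpy as np

        comm = MPI.COMM_WORLD
        sendbuf = np.full(1_000_000, comm.Get_rank(), dtype='d')
        recvbuf = np.empty_like(sendbuf)

        req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)  # start the non-blocking collective

        local = np.sin(np.arange(1_000_000)).sum()           # independent computation overlaps here

        req.Wait()                                           # complete the collective
        if comm.Get_rank() == 0:
            print("allreduce result:", recvbuf[0], "local work:", local)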

     

    Mellanox Messaging

    The Mellanox Messaging (MXM) library provides point-to-point communication services, including send/receive, RDMA, atomic operations, and active messaging. In addition to these core services, MXM also supports important features such as the one-sided communication completion needed by ULPs that define one-sided communication operations, such as MPI, SHMEM, and UPC. MXM supports several transports through a thin, transport-agnostic interface. Supported transports include the scalable Dynamically Connected transport (DC), Reliable Connected transport (RC), Unreliable Datagram (UD), Ethernet RoCE, and a shared-memory transport for optimizing on-host, latency-sensitive communication. MXM provides a vital mechanism for Mellanox to rapidly introduce support for new hardware capabilities that deliver high-performance, scalable and fault-tolerant communication services to end users.

    MXM leverages communication offload capabilities to enable true asynchronous communication progress, freeing up the CPU for computation. This multi-rail, thread-safe support includes both asynchronous send/receive and RDMA modes of communication. In addition, there is support for leveraging the HCA’s extended capabilities to directly handle non-contiguous data transfers for the two aforementioned communication modes.

    MXM employs several methods to provide a scalable resource footprint. These include support for DCT, receive-side flow control, a long-message rendezvous protocol, and the so-called zero-copy send/receive protocol. Additionally, it provides support for a limited number of RC connections, which can be used when persistent communication channels are more appropriate than dynamically created ones.

    When ULPs, such as MPI, define their fault-tolerant support, MXM will fully support these features.

    As mentioned, MXM provides full support for both MPI and several PGAS protocols, including OpenSHMEM and UPC. MXM also provides offloaded hardware support for send/receive, RDMA, atomic, and synchronization operations.
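
    For completeness, here is an equally minimal point-to-point counterpart (again assuming mpi4py and NumPy, run with at least two ranks, e.g. "mpirun -np 2 python p2p.py"): a non-blocking send/receive pair of the kind whose progress MXM is designed to take off the CPU. As above, this is generic MPI code, not an MXM-specific API.

        from mpi4py import MPI
        import numpy as np

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        buf = np.empty(1 << 20, dtype='d')

        if rank == 0:
            buf[:] = 42.0
            req = comm.Isend(buf, dest=1, tag=7)     # post the send and return immediately
        elif rank == 1:
            req = comm.Irecv(buf, source=0, tag=7)   # post the receive and return immediately
        else:
            req = MPI.REQUEST_NULL                   # any extra ranks sit this exchange out

        # ... independent computation could overlap with the transfer here ...

        req.Wait()                                   # complete the transfer
        if rank == 1:
            print("received:", buf[0])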

     

    Please visit the Mellanox HPC-X Software Toolkit site page for the latest release download, release notes and user manual to get you up and running quickly.

     

     

     

     

    Quality of Service

    Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an InfiniBand network. As multiple applications may share the same fabric, a means is needed to control their use of network resources.

    The basic need is to differentiate the service levels provided to different traffic flows, such that a policy can be enforced to control each flow's utilization of fabric resources.

    The InfiniBand Architecture Specification defines several hardware features and management interfaces for supporting QoS:

    • Up to 15 Virtual Lanes (VLs) carry traffic in a non-blocking manner.
    • Arbitration between traffic of different VLs is performed by a two-priority-level weighted round-robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximum number of high-priority credits to be processed before low priority is served.
    • Packets carry a class-of-service marking in the range 0 to 15 in the SL field of their header.
    • Each switch can map an incoming packet by its SL to a particular output VL, based on a programmable table: VL = SL-to-VL-MAP(in-port, out-port, SL).
    • The Subnet Administrator controls the parameters of each communication flow by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) queries.
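
    The sketch below is a purely illustrative Python model (not an OpenSM or switch API; the data-structure names and values are assumptions) of how these building blocks fit together: an SL-to-VL mapping table per (in-port, out-port) pair, and a two-priority weighted-round-robin arbiter programmed as (VL, weight) pairs with a high-priority credit limit.

        # SL-to-VL mapping: one 16-entry table per (input port, output port) pair.
        sl2vl_map = {
            (1, 2): [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3],
        }

        # Two-priority weighted-round-robin arbiter: (VL, weight) pairs plus a limit on
        # the high-priority credits processed before the low-priority table is served.
        vl_arbitration = {
            "high": [(0, 64), (1, 32)],
            "low":  [(2, 16), (3, 8)],
            "high_priority_limit": 255,
        }

        def output_vl(in_port, out_port, sl):
            """Resolve the output VL for a packet: VL = SL-to-VL-MAP(in-port, out-port, SL)."""
            return sl2vl_map[(in_port, out_port)][sl]

        print(output_vl(1, 2, sl=5))   # -> 2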

     

     

    Subnet Manager

    InfiniBand uses a centralized resource, called a subnet manager (SM), to handle the management of the fabric. The SM discovers new endpoints that are attached to the fabric, configures the endpoints and switches with relevant networking information, and sets up the forwarding tables in the switches for all packet-forwarding operations.

    There are three options to select the best place to locate the SM:

    • Enabling the SM on one of the managed switches. This is a very convenient and quick operation: only one command is needed to turn the SM on, which helps make InfiniBand 'plug & play', with one less thing to install and configure.
      In a blade-switch environment this is common, due to the following advantages:
      • Blade servers are normally expensive to allocate as SM servers; and
      • Adding non-blade (standalone rack-mount) servers dilutes the value proposition of blades (easy upgrade, simplified cabling, etc.).
    • Running a server-based SM makes sense for large clusters, so that there is enough CPU power to cope with events such as major topology changes. Weaker CPUs can handle large fabrics, but it may take a long time for the fabric to come back up; in addition, there may be situations where the SM is too slow to ever catch up. Therefore, it is recommended to run the SM on a server when there are 648 nodes or more.
    • Using a Unified Fabric Manager (UFM) appliance or a dedicated UFM server. UFM offers much more than the SM. UFM needs more compute power than the existing switches have, but does not require an expensive server; it does, however, represent an additional cost for the dedicated server.