The Connect-IB is the latest InfiniBand adapter targeted at the HPC space, offering the best scalability, throughput and message rate. It also comes with a completely new driver, called mlx5_ib, that provides a new level of support for this device. I recently discovered a few modifications that allow the Connect-IB HCA to achieve the best performance with Intel MPI (using the latest version, 4.1.1.036) that I would like to share (partly because I haven't found them elsewhere, either on Intel's web site or on the web...)

 

1. DAPL Provider

 

By default, Intel MPI automatically selects the DAPL provider for InfiniBand communication. uDAPL stands for user Direct Access Programming Library, an implementation of the transport used for RDMA-capable devices such as InfiniBand. (The other method is the OFA provider, which uses IB Verbs for communication between MPI processes that reside on different systems; we will describe OFA in the section below.)
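
If you ever need to pick the fabric yourself instead of relying on the automatic selection, Intel MPI exposes the I_MPI_FABRICS variable. The lines below are only a sketch; the process counts and the ./my_app binary are placeholders:

# Shared memory within a node, DAPL (uDAPL) between nodes
mpirun -genv I_MPI_FABRICS shm:dapl -hosts jupiter001,jupiter002 -ppn 1 -np 2 ./my_app

# Shared memory within a node, OFA (IB Verbs) between nodes
mpirun -genv I_MPI_FABRICS shm:ofa -hosts jupiter001,jupiter002 -ppn 1 -np 2 ./my_app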

 

For a dual-port Connect-IB adapter, the HCA might be too new for the /etc/dat.conf file to contain entries that correspond to the Connect-IB driver (mlx5_ib), so you will need to manually add the lines below to tell Intel MPI to use uDAPL with the Connect-IB device. The entries in this file are processed "in order", so the ordering matters.

 

[root@jupiter000 ~]# vim /etc/dat.conf

ofa-v2-mlx5_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 1" ""

ofa-v2-mlx5_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 2" ""

ofa-v2-mlx5_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 1" ""

ofa-v2-mlx5_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 2" ""

 

Without these entries, you might run into error messages at startup similar to this:

[86] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
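
Once the entries are in place, you can confirm which DAPL provider Intel MPI actually picks up by raising the debug level, or point it at one of the entries above explicitly. Again, this is just a sketch with a placeholder ./my_app binary:

# Print fabric/provider selection details at startup
mpirun -genv I_MPI_DEBUG 2 -hosts jupiter001,jupiter002 -ppn 1 -np 2 ./my_app

# Force a specific uDAPL provider from /etc/dat.conf
mpirun -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_PROVIDER ofa-v2-mlx5_0-1u \
    -hosts jupiter001,jupiter002 -ppn 1 -np 2 ./my_app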

 

2. OFA Provider

 

One of the reasons to use a dual-port Connect-IB HCA is to achieve the full bandwidth of a PCIe Gen3 x16 slot. The OFA provider has options to enable multi-rail communication, which allows Intel MPI to run at the line rate of the IB card, whereas the default DAPL provider in Intel MPI can only make use of a single rail for communication.

 

If you intend to use dual rail or multiple HCAs to maximize communication throughput for your application, you will want to switch from the default DAPL provider to the OFA provider for your Intel MPI job.
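
The switch is done with the I_MPI_FABRICS variable and the OFA-specific settings, passed either in the environment or via -genv. A sketch, assuming the single dual-port mlx5_0 HCA from above (the process counts and ./my_app binary are placeholders):

# Shared memory within a node, OFA (IB Verbs) between nodes, both ports of mlx5_0
mpirun -genv I_MPI_FABRICS shm:ofa \
    -genv I_MPI_OFA_ADAPTER_NAME mlx5_0 \
    -genv I_MPI_OFA_NUM_ADAPTERS 1 \
    -genv I_MPI_OFA_NUM_PORTS 2 \
    -hosts jupiter001,jupiter002 -ppn 20 -np 40 ./my_app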

 

To run with the OFA provider, make sure your command line contains this -genv flag to the Intel MPI mpiexec/mpirun:

-genv MV2_USE_APM 0

 

This flag disables the Automatic Path Migration (APM) feature; presumably the Connect-IB HCA is still too new for Intel MPI to support this feature on it yet.

 

For example, a run that uses the OFA provider without this flag may abort with an error similar to this:

mpiexec -perhost 20 -IB -genv I_MPI_OFA_ADAPTER_NAME mlx5_0 -genv I_MPI_OFA_NUM_PORTS 2 -np 640 ~/imb_3.2.3/src/IMB-MPI1

[180] Abort: Failed to modify QP

at line 1242 in file ../../ofa_utility.c
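
Adding the flag to the same command line should let the run proceed; a sketch that simply combines the pieces above (per-host count, rank count and IMB path are unchanged from the example):

mpiexec -perhost 20 -IB -genv MV2_USE_APM 0 \
    -genv I_MPI_OFA_ADAPTER_NAME mlx5_0 -genv I_MPI_OFA_NUM_PORTS 2 \
    -np 640 ~/imb_3.2.3/src/IMB-MPI1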

 

3. Alternatives

 

When in doubt, you can verify by running the osu_bw test between 2 nodes using either Open MPI or MVAPICH2 as a sanity test; both are included/built with MLNX_OFED at the location below. If properly configured, you should expect somewhere around 12.5-13 GB/s of bandwidth. Below is an example of Open MPI using the 2 ports, each running at the FDR 56Gb/s rate, between 2 nodes. As the command line below shows, Open MPI automatically detects the fastest adapters and enables multi-rail by default.

 

[pak@jupiter000 ~]$ /usr/mpi/gcc/openmpi-1.6.5/bin/mpirun \

-host jupiter001,jupiter002 \

/usr/mpi/gcc/openmpi-1.6.5/tests/osu-micro-benchmarks-4.0.1/osu_bw

# OSU MPI Bandwidth Test v4.0.1

# Size      Bandwidth (MB/s)

<snip...>

1048576             12876.08

2097152             12928.06

4194304             12955.38

[pak@jupiter000 ~]$
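
The same sanity test can be run with the MVAPICH2 that ships with MLNX_OFED. The sketch below is only illustrative: the install path and OSU benchmark version depend on your MLNX_OFED build, and MV2_NUM_PORTS=2 asks MVAPICH2 to use both ports of the HCA:

# Path and version are illustrative; adjust to your MLNX_OFED's MVAPICH2 build
/usr/mpi/gcc/mvapich2-1.9/bin/mpirun_rsh -np 2 jupiter001 jupiter002 \
    MV2_NUM_PORTS=2 \
    /usr/mpi/gcc/mvapich2-1.9/tests/osu-micro-benchmarks-4.0.1/osu_bw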


Lastly, be sure to use the latest MLNX_OFED (the current latest is 2.0-3.0.0), which contains our latest improvements in Connect-IB performance.
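
If you are not sure what a node is running, the standard MLNX_OFED utilities below report the installed stack version and the HCAs with their firmware and port state (output omitted here):

# Short version string of the installed MLNX_OFED
ofed_info -s

# List HCAs, firmware and port state; mlx5_0 should show up for Connect-IB
ibv_devinfo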