
The deadline to submit a LINPACK run for the November 2013 Top500 list is this upcoming Friday.

 

More info at this link:

 

http://top500.org/project/call_for_participation/

 

[Image: Top500 logo]

This is one of the things to watch out for when doing a new installation for running HPC jobs with a job scheduler like the TORQUE Resource Manager. You might run into this kind of error message in Open MPI, and similar errors in other MPI implementations.

 

In this case, Open MPI complains that the openib BTL was unable to allocate some locked memory, and advises that the memlock limit be set to unlimited.

 

[ddn@jupiter032 ~]$ mpirun -v -np 8 -machinefile $PBS_NODEFILE --bynode /home/ddn/IOR/src/C/IOR -a POSIX -i3 -g -e -w -r -b 16g -t 4m -o /mnt/ddn_mlx/home/ddn/iortest
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    jupiter032
  OMPI source:   btl_openib_component.c:1216
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx5_0
  Memlock limit: 65536

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   jupiter032
  Local device: mlx5_0
--------------------------------------------------------------------------

 

The fix is to set the locked memory limit to unlimited in the startup script for pbs_mom on each node. I also set the stack size to unlimited at the same time. Then restart the pbs_mom daemon on all the nodes.

 

[root@jupiter000 ~]# vim /etc/rc.d/init.d/pbs_mom
...
# how were we called
case "$1" in
        start)
                echo -n "Starting TORQUE Mom: "
                ulimit -l unlimited
                ulimit -s unlimited
                # check if pbs_mom is already running
                status pbs_mom 2>&1 > /dev/null
                RET=$?
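
After restarting pbs_mom on the compute nodes, it is worth confirming that jobs actually inherit the new limits. A minimal check, assuming a standard TORQUE setup (pdsh and the jupiter node names are just what this example cluster happens to use):

# restart the MOM daemon everywhere, e.g. with pdsh or any parallel shell
pdsh -w jupiter[001-032] "service pbs_mom restart"

# submit a trivial job that prints the limits its processes see;
# both should come back as "unlimited" in the job output file
echo -e 'ulimit -l\nulimit -s' | qsub -l nodes=1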


Using Connect-IB with Intel MPI

Posted by paklui Oct 28, 2013

Connect-IB is the latest InfiniBand adapter targeted at the HPC space, offering the best scalability, throughput, and message rate. It also comes with a completely new driver, called mlx5_ib, which provides support for this new device. I recently discovered a few modifications that allow the Connect-IB HCA to achieve the best performance with Intel MPI (using the latest version, 4.1.1.036), and I would like to share them here, partly because I haven't found them documented elsewhere, either on Intel's web site or on the web.

 

1. DAPL Provider

 

By default, Intel MPI automatically selects the DAPL provider for InfiniBand communication. uDAPL stands for user Direct Access Programming Library, an implementation of a transport layer for RDMA-capable devices such as InfiniBand. (The other method is the OFA provider, which uses IB Verbs for communication between MPI processes residing on different systems; OFA is described in the section below.)

 

For a dual-port Connect-IB adapter, the HCA may be too new for the /etc/dat.conf file to contain the device entries that correspond to the Connect-IB driver (mlx5_ib), so you will need to add the lines below to /etc/dat.conf manually to tell Intel MPI to use uDAPL with the Connect-IB devices. The entries in this file are processed in order, so the ordering matters.

 

[root@jupiter000 ~]# vim /etc/dat.conf
ofa-v2-mlx5_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 2" ""

 

Without these entries, you might run into error messages at startup similar to this:

[86] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
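
Once the dat.conf entries are in place, you can also pin Intel MPI to a specific provider rather than relying on the first matching entry. A minimal sketch, assuming Intel MPI 4.1.x and the provider names added above (the benchmark path and process counts are simply the ones used elsewhere in this post):

export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx5_0-1u   # first port of the Connect-IB HCA, as named in /etc/dat.conf
mpiexec -perhost 20 -np 640 ~/imb_3.2.3/src/IMB-MPI1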

 

2. OFA Provider

 

One of the reasons to use a dual-port Connect-IB HCA is to achieve the full bandwidth of a PCIe Gen3 x16 slot. The OFA provider has options that allow multi-rail communication, which lets Intel MPI run at the line rate of the IB cards. The default DAPL provider in Intel MPI can only make use of a single rail for communication.

 

If you intend to use dual rails or multiple HCAs to maximize communication throughput for your application, you will want to switch from the default DAPL provider to the OFA provider for your Intel MPI job.

 

To run with the OFA provider, make sure your command line contains this -genv flag when invoking the Intel MPI mpiexec/mpirun:

-genv MV2_USE_APM 0

 

This flag disables the Automatic Path Migration (APM) feature; presumably the Connect-IB HCA is still too new for this feature to be supported by Intel MPI.

 

Without this flag, a dual-port OFA run like the one below may abort with a "Failed to modify QP" error:

mpiexec -perhost 20 -IB -genv I_MPI_OFA_ADAPTER_NAME mlx5_0 -genv I_MPI_OFA_NUM_PORTS 2 -np 640 ~/imb_3.2.3/src/IMB-MPI1
[180] Abort: Failed to modify QP
at line 1242 in file ../../ofa_utility.c
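
For reference, this is what the same run looks like with the APM workaround applied (a sketch that simply combines the flag above with the command shown; the per-host count, rank count, and IMB path are specific to this example setup):

mpiexec -perhost 20 -IB -genv MV2_USE_APM 0 -genv I_MPI_OFA_ADAPTER_NAME mlx5_0 -genv I_MPI_OFA_NUM_PORTS 2 -np 640 ~/imb_3.2.3/src/IMB-MPI1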

 

3. Alternatives

 

When in doubt, you can run the osu_bw test between two nodes using either Open MPI or MVAPICH2 as a sanity check; both are included and built with MLNX_OFED at the location shown below. If things are properly configured, you should expect somewhere around 12.5-13 GB/s of bandwidth. Below is an example using Open MPI with the two ports each running at the FDR 56 Gb/s rate between two nodes. Note that Open MPI automatically detects the fastest adapters and enables multi-rail by default.

 

[pak@jupiter000 ~]$ /usr/mpi/gcc/openmpi-1.6.5/bin/mpirun \
-host jupiter001,jupiter002 \
/usr/mpi/gcc/openmpi-1.6.5/tests/osu-micro-benchmarks-4.0.1/osu_bw
# OSU MPI Bandwidth Test v4.0.1
# Size      Bandwidth (MB/s)
<snip...>
1048576             12876.08
2097152             12928.06
4194304             12955.38
[pak@jupiter000 ~]$
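
The same sanity test can be run with the MVAPICH2 that ships in MLNX_OFED; a sketch, assuming the mvapich2-1.9 build (the exact version directory under /usr/mpi/gcc depends on which MLNX_OFED release is installed):

[pak@jupiter000 ~]$ /usr/mpi/gcc/mvapich2-1.9/bin/mpirun_rsh -np 2 jupiter001 jupiter002 \
/usr/mpi/gcc/mvapich2-1.9/tests/osu-micro-benchmarks-4.0.1/osu_bw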


Lastly, be sure to use the latest MLNX_OFED (currently 2.0-3.0.0), which contains our latest Connect-IB performance improvements.
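
To check which MLNX_OFED release is installed on a node, you can use the ofed_info utility that ships with the stack; the output should report the 2.0-3.0.0 (or newer) version string:

ofed_info -s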

[Photo: Colonial One at George Washington University]

This week I am here for a few days at George Washington University for the Dell XL user conference. GWU graciously hosted the fall event and provided a tour of their facilities.

 

The Colonial One HPC initiative is a joint venture between GW’s Division of Information Technology, Columbian College of Arts and Sciences and the School of Medicine and Health Sciences.

  

The cluster showcases performance, density, and efficiency. Its modular architecture will support growth, allowing enhancement and expansion. Using common hardware, the cluster accommodates a variety of hardware configurations and accelerators.

 

The cluster features 1,408 CPU cores and 159,744 CUDA cores in Dell C8220 and C8220x nodes, interconnected with a Mellanox 56 Gb/s FDR ConnectX-3 network. It is a very impressive collection of technology and one of the few clusters with this many GPU-enabled nodes (about 33% of the system).