

9 Posts authored by: paklui

A reminder for folks who intend to submit a LINPACK run for their HPC clusters or supercomputers: the submission deadline for the upcoming June 2015 Top500 list is tomorrow, Friday, June 12, 2015 at 23:59 US Pacific Time.


More info at this link:

A reminder for folks who intend to submit a LINPACK run for their HPC clusters or supercomputers: the submission deadline for the upcoming November 2014 Top500 list is this Saturday, October 25, 2014.


More info at this link:

Just a reminder for folks who intend to submit a LINPACK run with their shiny new HPC clusters or supercomputers: the deadline to submit for the June 2014 Top500 list is this upcoming Sunday, May 18, 2014.


More info at this link:



Last week Mellanox participated in the HPC Advisory Council Switzerland Conference 2014 which was held at the Palazzo dei Congressi (Lugano Convention Centre) in Lugano, Switzerland. The conference was a 4-day event that ran from March 31 to April 3 this year. It was hosted jointly by the HPC Advisory Council and the Swiss National Supercomputing Centre (CSCS).


The event began with the workshop on the first day, which included presentations from Mellanox, among them "InfiniBand Principles Every HPC Expert MUST Know," presented by Oded Paz. He walked through InfiniBand principles, InfiniBand fabrics, and protocols, and gave an introduction to Mellanox products. The YouTube videos for this presentation are featured at insideHPC:



In another presentation, Dror Goldenberg, VP of Software Architecture at Mellanox, presented "The Future of Interconnect," in which he described the HPC challenges that lie ahead and Mellanox's plans to enhance scalability and performance toward Exascale.



You may also find it interesting to read about the second keynote, by Dr. Dhabaleswar K. Panda of the Ohio State University. Dr. Panda presented "Programming Models for Exascale Systems," in which he described the challenges in designing runtime environments for the MPI and PGAS (UPC and OpenSHMEM) programming models. In particular, he showed the latest improvements in MVAPICH2-GDR 2.0b, whose enhancements shaved another 31% off the latency (see page 47 of the presentation).


The OpenPOWER Initiative, of which Mellanox is a member, was also presented by Gilad Shainer of the HPC Advisory Council on behalf of OpenPOWER.





All of the materials and presentations are posted on the HPC Advisory Council website. In particular, the Video Gallery: HPCAC Swiss Conference 2014 from insideHPC is also up on the web.




The next HPC Advisory Council conference will be held at the University of São Paulo in São Paulo, Brazil on May 26, 2014. Rich Graham will represent Mellanox and the MPI Forum, presenting "Interconnecting The Exascale Machine". A tentative agenda is posted here:

HPC Advisory Council Brazil Conference and Exascale Workshop 2014


As a follow-up to the earlier post on Test Driving GPUDirect RDMA with MVAPICH2-GDR and Open MPI, we have an HPC molecular dynamics (MD) simulation application, HOOMD-blue, that demonstrates the benefits of using the GPUDirect RDMA technology.


HOOMD-blue stands for Highly Optimized Object-oriented Many-particle Dynamics -- Blue Edition. It performs general-purpose particle dynamics simulations, taking advantage of NVIDIA GPUs to attain, on a single workstation, a level of performance equivalent to many processor cores on a fast cluster. It is free and open source, and its development effort is led by the Glotzer group at the University of Michigan.


The HPC Advisory Council has performed benchmarking studies with HOOMD-blue that highlight the benefits of GPUDirect RDMA, available with the Connect-IB FDR InfiniBand adapter. GPUDirect RDMA also works with the ConnectX-3 FDR InfiniBand adapter.


The performance improvement can be seen on a small 4-node cluster at the HPC Advisory Council, as well as on 96 nodes of the Wilkes cluster at the University of Cambridge. The graph below shows a 20% improvement with GPUDirect RDMA at 4 nodes for the 16K-particle case:



It was also demonstrated that, by deploying GPUDirect RDMA with HOOMD-blue, the Wilkes cluster at the University of Cambridge is able to surpass the scalability performance of the ORNL Titan cluster by up to 114% at 32 nodes on the same input dataset, despite the Wilkes cluster running slightly slower NVIDIA K20 GPUs.



The complete performance benchmarking study of HOOMD-blue can be found at this HPC Advisory Council presentation:


Mellanox at SC13 in Denver

Posted by paklui Nov 14, 2013

Attending the SC13 conference in Denver next week?

Yes? Be sure to stop by the Mellanox booth at booth #2722 and check out the latest products, technology demonstrations, and FDR InfiniBand performance with Connect-IB!


We have a long list of theater presentations with our partners at the Mellanox booth. We will have giveaways at every presentation, and at the end of each day a lucky attendee will go home with a new Apple iPad mini!


Don’t forget to sign up for the Mellanox Special Evening Event during SC13 on Wednesday night. Visit:

Sheraton Denver Downtown Hotel
Plaza Ballroom
1550 Court Place
Denver, Colorado 80202
Phone: (303) 893-3333
  Map It  

Wednesday, November 20th
7:00PM - 10:00PM


Long flight to Denver? Make sure you get the Print ‘n Fly guide from insideHPC to read on your flight to Denver!


Finally, if you are joining us for the technical sessions, come hear from our experts in these SC13 sessions:


Speaking: Gilad Shainer, VP Marketing; Richard Graham, Sr. Solutions Architect

Title: "OpenSHMEM BoF"

Date: Wednesday, November 20, 2013

Time: 5:30PM - 7:00PM

Room: 201/203


Speaking: Richard Graham, Sr. Solutions Architect

Title: "Technical Paper Session Chair: Inter-Node Communication"

Date: Thursday, November 21, 2013

Time: 10:30AM - 12:00PM

Room: 405/406/407


Speaking: Richard Graham, Sr. Solutions Architect

Title: "MPI Forum BoF"

Date: Thursday, November 21, 2013

Time: 12:15PM-1:15PM

Room: 705/707/709/711

See you all in Denver!

The deadline to submit a LINPACK run for the November 2013 entry to the Top500 is this upcoming Friday.


More info at this link:



This is one of the things to watch out for when doing a new installation for running HPC jobs with a job scheduler like the TORQUE Resource Manager. You might run into this kind of error message in Open MPI, and similar errors in other MPI implementations.


In this case, Open MPI complains that its openib BTL was unable to allocate some locked memory, and advises that the memlock limit be set to unlimited.


[ddn@jupiter032 ~]$ mpirun -v -np 8 -machinefile $PBS_NODEFILE --bynode /home/ddn/IOR/src/C/IOR -a POSIX -i3 -g -e -w -r -b 16g -t 4m -o /mnt/ddn_mlx/home/ddn/iortest

The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured

  Local host:    jupiter032
  OMPI source:   btl_openib_component.c:1216
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx5_0
  Memlock limit: 65536

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be

WARNING: There was an error initializing an OpenFabrics device.

  Local host:   jupiter032
  Local device: mlx5_0

The fix is to set the memory lock limit to unlimited in the startup script for pbs_mom on each node. I also set the stack size to unlimited at the same time. Then restart the pbs_mom daemon on all the nodes.


[root@jupiter000 ~]# vim /etc/rc.d/init.d/pbs_mom

# how were we called
case "$1" in
        start)
                echo -n "Starting TORQUE Mom: "
                ulimit -l unlimited
                ulimit -s unlimited
                # check if pbs_mom is already running
                status pbs_mom 2>&1 > /dev/null
                RET=$?
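Before (and after) touching the init script, it helps to confirm what limits the job processes actually inherit. The sketch below is a minimal pre-flight check, not part of the TORQUE setup itself; run it both interactively and from inside a TORQUE job, since pbs_mom launches job processes with its own (often lower) limits.

```shell
#!/bin/bash
# Print the memlock and stack limits that child processes will inherit.
# A memlock value of 65536 (64 MB... actually KB here) is the telltale
# low default that triggers the openib BTL error above.
MEMLOCK=$(ulimit -l)
STACK=$(ulimit -s)
if [ "$MEMLOCK" = "unlimited" ]; then
  echo "memlock OK: unlimited"
else
  echo "memlock is ${MEMLOCK} KB; the openib BTL wants unlimited"
fi
echo "stack limit: ${STACK}"
```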


Using Connect-IB with Intel MPI

Posted by paklui Oct 28, 2013

Connect-IB is the latest InfiniBand adapter targeted at the HPC space, offering the best scalability, throughput, and message rate. It also comes with a completely new driver, mlx5_ib, which provides support for this new device. I recently discovered a few modifications that allow the Connect-IB HCA to achieve the best performance with Intel MPI (using the latest version), and I would like to share them (partly because I haven't found them elsewhere, either on Intel's web site or on the web...).


1. DAPL Provider


By default, Intel MPI automatically selects the DAPL provider for InfiniBand communications. uDAPL stands for user Direct Access Programming Library, an implementation of the transport used for RDMA-capable devices such as InfiniBand. (The other method is the OFA provider, which uses IB verbs for communications between MPI processes residing on different systems. We will describe OFA in the section below.)


For a dual-port Connect-IB adapter, the HCA may be too new for the /etc/dat.conf file to contain entries corresponding to the Connect-IB driver (mlx5_ib), so you will need to manually add these lines to /etc/dat.conf to tell Intel MPI to use uDAPL for the Connect-IB devices. The entries in this file are processed in order, so the ordering matters.


[root@jupiter000 ~]# vim /etc/dat.conf

ofa-v2-mlx5_0-1u u2.0 nonthreadsafe default dapl.2.0 "mlx5_0 1" ""

ofa-v2-mlx5_0-2u u2.0 nonthreadsafe default dapl.2.0 "mlx5_0 2" ""

ofa-v2-mlx5_0-1 u2.0 nonthreadsafe default dapl.2.0 "mlx5_0 1" ""

ofa-v2-mlx5_0-2 u2.0 nonthreadsafe default dapl.2.0 "mlx5_0 2" ""


Without these entries, you might run into error messages at startup similar to this:

[86] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
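A quick way to sanity-check the entries above is to grep dat.conf for the mlx5 providers. The sketch below builds a small sample file so it is self-contained; on a real system you would point DAT_CONF at /etc/dat.conf instead.

```shell
#!/bin/bash
# Count the Connect-IB (mlx5) uDAPL provider entries in a dat.conf.
# The sample file here is for illustration only; use /etc/dat.conf on
# an actual system.
DAT_CONF=$(mktemp)
cat > "$DAT_CONF" <<'EOF'
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx5_0-1u u2.0 nonthreadsafe default dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2u u2.0 nonthreadsafe default dapl.2.0 "mlx5_0 2" ""
EOF
MLX5_COUNT=$(grep -c '^ofa-v2-mlx5' "$DAT_CONF")
echo "found ${MLX5_COUNT} mlx5 uDAPL providers"
rm -f "$DAT_CONF"
```

If the count is zero on your system, Intel MPI will fall back (or fail) as shown in the startup error above.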


2. OFA Provider


One of the reasons to use a dual-port Connect-IB HCA is to achieve full bandwidth on a PCIe Gen3 x16 slot. The OFA provider has options for multi-rail communication, which allows Intel MPI to run at line rate on the IB cards. The default DAPL provider in Intel MPI can only make use of a single rail for communication.


If you intend to use dual rail or multiple HCAs to maximize communication throughput for your application, you will want to switch from the default DAPL provider to the OFA provider for your Intel MPI job.


To run with the OFA provider, make sure your command line contains this -genv flag to the Intel MPI mpiexec/mpirun:

-genv MV2_USE_APM 0


This flag disables the Automatic Path Migration feature; presumably the Connect-IB HCA is new enough that this feature is not yet supported with Intel MPI.


For example, here is a run using the OFA provider over both ports of the Connect-IB HCA; without the flag above, a run like this may abort with the error shown:

mpiexec -perhost 20 -IB -genv I_MPI_OFA_ADAPTER_NAME mlx5_0 -genv I_MPI_OFA_NUM_PORTS 2 -np 640 ~/imb_3.2.3/src/IMB-MPI1

[180] Abort: Failed to modify QP
at line 1242 in file ../../ofa_utility.c


3. Alternatives


When in doubt, you can verify by running the osu_bw test between two nodes using either Open MPI or MVAPICH2 as a sanity test; both are included and built with MLNX_OFED at the location below. If properly configured, you should expect somewhere around 12.5-13 GB/s of bandwidth. Below is an example with Open MPI using the two ports, each running at the FDR 56Gb/s rate, between two nodes. The run below shows that Open MPI automatically detects the fastest adapters and enables multi-rail by default.


[pak@jupiter000 ~]$ /usr/mpi/gcc/openmpi-1.6.5/bin/mpirun \
    -host jupiter001,jupiter002 \

# OSU MPI Bandwidth Test v4.0.1
# Size      Bandwidth (MB/s)
1048576             12876.08
2097152             12928.06
4194304             12955.38
[pak@jupiter000 ~]$
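As a back-of-the-envelope check of the numbers above (assumed figures: FDR runs 14 Gb/s per lane over 4 lanes, i.e. 56 Gb/s per port, with 64b/66b encoding), two ports should peak near 13.6 GB/s, so the ~12.9 GB/s that osu_bw measures is in the right ballpark once protocol overheads are counted:

```shell
#!/bin/bash
# Usable dual-rail FDR bandwidth estimate: 56 Gb/s raw per port,
# scaled by the 64b/66b encoding efficiency, times two ports,
# converted from Gb/s to GB/s.
PER_PORT=$(awk 'BEGIN { printf "%.1f", 56 * 64 / 66 }')          # usable Gb/s per port
DUAL_RAIL=$(awk 'BEGIN { printf "%.1f", 2 * 56 * 64 / 66 / 8 }') # GB/s for two ports
echo "per-port usable: ${PER_PORT} Gb/s; dual-rail peak: ${DUAL_RAIL} GB/s"
```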

Lastly, be sure to use the latest MLNX_OFED (currently 2.0-3.0.0), which contains our latest Connect-IB performance improvements.