This is one of the things to watch out for when setting up a new installation that runs HPC jobs under a job scheduler such as the TORQUE Resource Manager. You might run into this kind of error message in Open MPI, and similar errors in other MPI implementations.

In this case, Open MPI complains that its OpenIB BTL was unable to allocate locked memory, and advises setting the memlock limit to unlimited.

 

[ddn@jupiter032 ~]$ mpirun -v -np 8 -machinefile $PBS_NODEFILE --bynode /home/ddn/IOR/src/C/IOR -a POSIX -i3 -g -e -w -r -b 16g -t 4m -o /mnt/ddn_mlx/home/ddn/iortest
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    jupiter032
  OMPI source:   btl_openib_component.c:1216
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx5_0
  Memlock limit: 65536

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   jupiter032
  Local device: mlx5_0
--------------------------------------------------------------------------
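
The "Memlock limit: 65536" value in the message is in bytes, while the shell's `ulimit -l` reports KiB, so the two are easy to misread side by side. The following is a minimal sketch for comparing them. The function names `kib_to_bytes` and `report_memlock` are made up for illustration and are not part of Open MPI or TORQUE.

```shell
# Convert a `ulimit -l` value (KiB, or the word "unlimited") to bytes,
# the unit used in Open MPI's "Memlock limit:" line.
kib_to_bytes() {
    if [ "$1" = "unlimited" ]; then
        echo unlimited
    else
        echo $(( $1 * 1024 ))
    fi
}

# Print the locked-memory limit of the calling shell in both forms.
report_memlock() {
    limit=$(ulimit -l)
    echo "ulimit -l: $limit; in bytes: $(kib_to_bytes "$limit")"
}
```

Calling `report_memlock` from inside a job script, rather than from an interactive login shell, shows the limit that the job inherited from pbs_mom, which is the one that matters here.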

 

The fix is to set the locked-memory (memlock) limit to unlimited in the startup script for pbs_mom on each node, because jobs launched by pbs_mom inherit its resource limits. I also set the stack size to unlimited at the same time. Then restart the PBS MOM daemon on all the nodes.

 

[root@jupiter000 ~]# vim /etc/rc.d/init.d/pbs_mom
...
# how were we called
case "$1" in
        start)
                echo -n "Starting TORQUE Mom: "
                # raise the locked-memory and stack limits before pbs_mom starts
                ulimit -l unlimited
                ulimit -s unlimited
                # check if pbs_mom is already running
                status pbs_mom 2>&1 > /dev/null
                RET=$?
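
After restarting the daemon, it can be worth confirming that the running pbs_mom really picked up the new limit. The sketch below reads the soft limit from `/proc/<pid>/limits`, assuming a Linux node with `pgrep` available. The helper names `soft_memlock` and `check_mom_memlock` are hypothetical, chosen only for this example.

```shell
# Print the soft "Max locked memory" value from a file in
# /proc/<pid>/limits format (fields: name words, soft, hard, units).
soft_memlock() {
    awk '/^Max locked memory/ { print $4 }' "$1"
}

# Look up the oldest process named exactly pbs_mom and print its soft
# memlock limit; returns non-zero if the daemon is not running.
check_mom_memlock() {
    pid=$(pgrep -o -x pbs_mom) || {
        echo "pbs_mom is not running" >&2
        return 1
    }
    soft_memlock "/proc/$pid/limits"
}
```

Running `check_mom_memlock` as root on a node after the restart should print `unlimited` if the edited init script took effect. A number such as 65536 suggests the daemon is still running with the old limit.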