0 Replies Latest reply on Jun 11, 2013 8:23 AM by jescudero

    OpenMPI 1.6.4   MLNX_OFED 2.0

      I have a 16-node cluster built with Mellanox ConnectX-3 cards.
      I recently updated MLNX_OFED to version 2.0.5. I am writing to the
      OpenMPI users list because I am not able to run MPI applications
      using the service levels (SLs) feature of the OpenMPI openib
      driver.

      Currently, the nodes run Red Hat 6.4 with kernel
      2.6.32-358.el6.x86_64. I have compiled OpenMPI 1.6.4 with:

        ./configure --with-sge --with-openib=/usr --enable-openib-connectx-xrc \
          --enable-mpi-thread-multiple --with-threads --with-hwloc \
          --enable-heterogeneous --with-fca=/opt/mellanox/fca \
          --with-mxm-libdir=/opt/mellanox/mxm/lib --with-mxm=/opt/mellanox/mxm \
          --prefix=/home/jescudero/opt/openmpi
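
      A quick way to see which path-record parameters a build like this
      exposes is to filter the output of ompi_info. This is only a sketch:
      it assumes the ompi_info binary from the prefix above is first in
      PATH, and list_path_record_params is a hypothetical helper name.

```shell
# List the openib BTL parameters whose names mention path records.
# Assumption: ompi_info belongs to the same Open MPI install as mpirun.
list_path_record_params() {
    if ! command -v ompi_info >/dev/null 2>&1; then
        echo "ompi_info not found in PATH" >&2
        return 1
    fi
    # --param <type> <component> prints the MCA parameters of that component.
    ompi_info --param btl openib | grep -i 'path_record'
}

# Usage: list_path_record_params
```

      If nothing is printed, the build most likely does not expose a
      path-record parameter under that name.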

      I have modified the OpenSM code (based on version 3.3.15) to
      include a special routing algorithm based on "ftree". OpenSM appears
      to work correctly, since it returns the SLs when I execute the
      command "saquery --src-to-dst slid:dlid". I have also tried
      running OpenSM with the DFSSSP algorithm.

      However, when I try to run MPI applications (e.g. HPCC, OSU or even
      alltoall.c, included in the OpenMPI sources), I get errors whenever
      "btl_openib_ib_path_record_service_level" is set to "1". If the
      parameter is not enabled, the application runs to completion
      correctly. I run the MPI application with the following command:

      mpirun -display-allocation -display-map -np 8 -machinefile maquinas.aux \
        --mca btl openib,self,sm --mca mtl mxm \
        --mca btl_openib_ib_path_record_service_level 1 \
        --mca btl_openib_cpc_include oob hpcc

      I obtain the following trace:

      [nodo20.XXXXX][[31227,1],6][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
      error posting receive on QP [0x16db] errno says: Success [0]
      [nodo15.XXXXX][[31227,1],4][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
      error posting receive on QP [0x1749] errno says: Success [0]
      [nodo17.XXXXX][[31227,1],5][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
      error posting receive on QP [0x1783] errno says: Success [0]
      [nodo21.XXXXX][[31227,1],7][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
      error posting receive on QP [0x1838] errno says: Success [0]
      [nodo21.XXXXX][[31227,1],7][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
      endpoint connect error: -1
      [nodo17.XXXXX][[31227,1],5][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
      endpoint connect error: -1
      [nodo15.XXXXX][[31227,1],4][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
      endpoint connect error: -1
      [nodo20.XXXXX][[31227,1],6][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
      endpoint connect error: -1
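
      With several ranks reporting at once, it can help to collapse a
      trace like the one above by source location. The sketch below
      assumes the trace has been saved to a file with each record on a
      single line (not wrapped as in this message); summarize_trace and
      the file name in the usage comment are hypothetical.

```shell
# Count Open MPI error records per "file:line:function" source location.
# Assumption: each record is one line of the form
#   [host][[jobid,n],rank][file:line:function] message
summarize_trace() {
    # Skip the host and process-name fields, keep the third bracketed field,
    # then count how often each source location appears.
    sed -n 's/^[[:space:]]*\[[^]]*\]\[\[[0-9]*,[0-9]*\],[0-9]*\]\[\([^]]*\)\].*/\1/p' "$1" |
        sort | uniq -c
}

# Usage: summarize_trace mpirun.log
```

      For the eight records above, this would report four occurrences
      each of the get_pathrecord_info and rml_recv_cb locations, which
      makes it easier to see that every failing rank hits the same two
      code paths.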

      Does anyone know what I am doing wrong?

      All the best,