10 Replies Latest reply on Oct 5, 2016 12:42 AM by alinaskl

    OpenMPI with MXM 32bit issue

    meltho

      After about 2.147 billion messages (2^31 - 1 = 2,147,483,647, the maximum value of an int32_t), we cannot receive any more messages.

       

      Compiling OpenMPI with the flag "--with-mxm=/path/to/mxm" causes this problem; without the flag everything works fine. The problem is reproducible with the attached example code, compiled and run with the following commands:

      $ /path/to/openmpi/bin/mpic++ openmpi_mxm_freeze.cxx -o openmpi_mxm_freeze

      $ /path/to/openmpi/bin/mpirun -np 2 openmpi_mxm_freeze

       

      Maybe the issue is connected with the following lines from "mxm_def.h":

      typedef uint32_t             mxm_tag_t; /* MXM tag type */
      typedef uint32_t             mxm_imm_t; /* MXM immediate data type */
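
      To make the arithmetic behind this suspicion concrete, here is a minimal sketch of how 32-bit counters behave at their limits. It does not use MXM or MPI and is not taken from the MXM sources; the helper names `int32_counter_exhausted` and `next_seq` are hypothetical and exist only for illustration.

      ```cpp
      #include <cstdint>
      #include <limits>

      // Hypothetical helper: true once a message count has reached the largest
      // value a signed 32-bit counter can hold (2^31 - 1).
      constexpr bool int32_counter_exhausted(std::int64_t messages_sent) {
          return messages_sent >= std::numeric_limits<std::int32_t>::max();
      }

      // Hypothetical helper: next value of an unsigned 32-bit sequence number.
      // Unsigned arithmetic is modular, so the value after 2^32 - 1 is 0.
      constexpr std::uint32_t next_seq(std::uint32_t seq) {
          return static_cast<std::uint32_t>(seq + 1u);
      }

      // The signed limit matches the "2.147 billion" figure from the report.
      static_assert(std::numeric_limits<std::int32_t>::max() == 2147483647,
                    "int32_t maximum is 2^31 - 1");
      // The unsigned limit is roughly twice as large.
      static_assert(std::numeric_limits<std::uint32_t>::max() == 4294967295u,
                    "uint32_t maximum is 2^32 - 1");
      ```

      Since the freeze appears near 2^31 messages rather than 2^32, a signed 32-bit quantity (or an unsigned one compared as signed) looks like a more likely candidate than the uint32_t typedefs themselves, but that is only a guess.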

       

      The problem occurs with the newest Mellanox firmware, OFED package and OpenMPI version.

        • Re: OpenMPI with MXM 32bit issue
          alinaskl

          Hello Thomas,

           

            • Re: OpenMPI with MXM 32bit issue
              meltho

              Hello Alina,

               

              Thank you for your response. I meant the single-host case, but I will check the two-host case anyway.

              Part of the problem is that, although the InfiniBand network is not involved in the single-host case, the example does not run properly if OpenMPI is compiled with the "--with-mxm" option.

               

              Thomas

              • Re: OpenMPI with MXM 32bit issue
                meltho

                Well, that's interesting.

                The case on two hosts works fine:

                $ /opt/openmpi-2.0.1-jessie-mxm-mt/bin/mpirun -np 2 -hostfile hostfile --map-by node --display-map -mca pml yalla openmpi_mxm_freeze

                Data for JOB [31717,1] offset 0

                ========================  JOB MAP  ========================

                Data for node: intel1  Num slots: 1    Max slots: 0    Num procs: 1

                        Process OMPI jobid: [31717,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../../../../../../../../../..][../../../../../../../../../../../..]

                Data for node: intel2  Num slots: 1    Max slots: 0    Num procs: 1

                        Process OMPI jobid: [31717,1] App: 0 Process rank: 1 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../../../../../../../../../..][../../../../../../../../../../../..]

                =============================================================

                [1474616276.871628] [intel1:7883 :0]        sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 2906.98

                [1474616276.903256] [intel2:3181 :0]        sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3043.73

                0: ready to run

                1: ready to run

                ...

                0: finished

                1: finished

                 

                whereas the single-host case does not:

                $ /opt/openmpi-2.0.1-jessie-mxm-mt/bin/mpirun -np 2 --map-by node --display-map -mca pml yalla openmpi_mxm_freeze

                Data for JOB [31494,1] offset 0

                ========================  JOB MAP  ========================

                Data for node: intel1  Num slots: 24  Max slots: 0    Num procs: 2

                        Process OMPI jobid: [31494,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../../../../../../../../../..][../../../../../../../../../../../..]

                        Process OMPI jobid: [31494,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0-1]]:[../BB/../../../../../../../../../..][../../../../../../../../../../../..]

                =============================================================

                [1474615276.877829] [intel1:7723 :0]        sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 2971.04

                [1474615276.877833] [intel1:7724 :0]        sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 2971.04

                0: ready to run

                1: ready to run

                ...

                freeze

                 

                Since we normally use a single host, and two or more hosts only in extreme cases, a fix for the single-host case would be appreciated.