
    MLNX_OFED_LINUX-3.4.2.1.4.1 CentOS 7.2

    rasmusdotlind

      When I run on 2 nodes I get this error. Any help would be appreciated.

       

      The InfiniBand retry count between two MPI processes has been
      exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
      (section 12.7.38):

          The total number of times that the sender wishes the receiver to
          retry timeout, packet sequence, etc. errors before posting a
          completion error.

      This error typically means that there is something awry within the
      InfiniBand fabric itself.  You should note the hosts on which this
      error has occurred; it has been observed that rebooting or removing a
      particular host from the job can sometimes resolve this issue.

      Two MCA parameters can be used to control Open MPI's behavior with
      respect to the retry count:

      * btl_openib_ib_retry_count - The number of times the sender will
        attempt to retry (defaulted to 7, the maximum value).
      * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
        to 20).  The actual timeout value used is calculated as:

           4.096 microseconds * (2^btl_openib_ib_timeout)

        See the InfiniBand spec 1.2 (section 12.7.34) for more details.

      Below is some information about the host that raised the error and the
      peer to which it was connected:

        Local host:   node1
        Local device: mlx5_0
        Peer host:    node2ib

      You may need to consult with your system administrator to get this
      problem fixed.
      --------------------------------------------------------------------------
      -------------------------------------------------------
      Primary job  terminated normally, but 1 process returned
      a non-zero exit code.. Per user-direction, the job has been aborted.
      -------------------------------------------------------
      forrtl: error (78): process killed (SIGTERM)
      forrtl: error (78): process killed (SIGTERM)
      forrtl: error (78): process killed (SIGTERM)
      forrtl: error (78): process killed (SIGTERM)
      forrtl: error (78): process killed (SIGTERM)
      forrtl: error (78): process killed (SIGTERM)
      --------------------------------------------------------------------------
      mpirun detected that one or more processes exited with non-zero status, thus causing
      the job to be terminated. The first process to do so was:

        Process name: [[1590,1],4]
        Exit code:    255
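
      For reference, both MCA parameters mentioned in the error text can be
      raised on the mpirun command line; a minimal sketch, with the
      application name and process count as placeholders:

        # With the default timeout of 20, each retry waits
        # 4.096 us * 2^20 ~= 4.3 s; a value of 24 gives ~= 68.7 s.
        mpirun --mca btl_openib_ib_timeout 24 \
               --mca btl_openib_ib_retry_count 7 \
               -np 2 -host node1,node2ib ./my_mpi_app

      Note that raising the timeout only masks the symptom if the fabric
      itself is unhealthy, as the error text points out.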

        • Re: MLNX_OFED_LINUX-3.4.2.1.4.1 CentOS 7.2
          karen

          Hi Rasmus,

           

          In general, a retry count error when running MPI jobs may indicate a fabric health issue.

          You should check and confirm that the firmware and driver levels on the nodes are the same. I also recommend running the ibdiagnet diagnostic tool and sending the output to support@mellanox.com to confirm the fabric health, by invoking the following:

            ibdiagnet -r -pc -P all=1 --pm_pause_time 600
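
          For the firmware and driver check, something like the following can
          be run on each node to compare versions (the device name mlx5_0 is
          taken from the error output above):

            ofed_info -s                             # installed MLNX_OFED version
            ibstat mlx5_0 | grep -i firmware         # HCA firmware version
            cat /sys/class/infiniband/mlx5_0/fw_ver  # firmware version via sysfs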

           

          Regards,

          Karen.