0 Replies Latest reply on Mar 11, 2018 7:00 AM by pasokan

    Mellanox Open MPI in SR-IOV environment not scaling

    pasokan


       

      It runs on two nodes (-np 2) but fails when run on three nodes (-np 3).

       

      [root@vcn01 C]# /usr/mpi/gcc/openmpi-3.1.0rc2/bin/mpirun -x MXM_IB_USE_GRH=y --allow-run-as-root -np 2 --hostfile ./hostlist /mnt/lustre_client/pasokan/IOR-July12/src/C/IOR -a POSIX -w -r -t 1m -b 1m -k -o /tmp/pasokan_fuse/test3

      --------------------------------------------------------------------------

      WARNING: No preset parameters were found for the device that Open MPI

      detected:

       

       

        Local host:            vcn02

        Device name:           mlx5_0

        Device vendor ID:      0x02c9

        Device vendor part ID: 4114

       

       

      Default device parameters will be used, which may result in lower

      performance.  You can edit any of the files specified by the

      btl_openib_device_param_files MCA parameter to set values for your

      device.

       

       

      NOTE: You can turn off this warning by setting the MCA parameter

            btl_openib_warn_no_device_params_found to 0.

      --------------------------------------------------------------------------

      [1520776512.564151] [vcn02:27634:0]            cpu.c:52   UCX  WARN  CPU does not support invariant TSC, time may be unstable

      [1520776512.564376] [vcn02:27635:0]            cpu.c:52   UCX  WARN  CPU does not support invariant TSC, time may be unstable

      IOR-2.10.3: MPI Coordinated Test of Parallel I/O

       

       

      Run began: Sun Mar 11 09:55:12 2018

      Command line used: /mnt/lustre_client/pasokan/IOR-July12/src/C/IOR -a POSIX -w -r -t 1m -b 1m -k -o /tmp/pasokan_fuse/test3

      Machine: Linux vcn02

       

       

      Summary:

              api                = POSIX

              test filename      = /tmp/pasokan_fuse/test3

              access             = single-shared-file

              ordering in a file = sequential offsets

              ordering inter file= no tasks offsets

              clients            = 2 (2 per node)

              repetitions        = 1

              xfersize           = 1 MiB

              blocksize          = 1 MiB

              aggregate filesize = 2 MiB

       

       

      Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)

      ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------

      write          61.60      61.60       61.60      0.00      61.60      61.60       61.60      0.00   0.03247   EXCEL

      read           37.01      37.01       37.01      0.00      37.01      37.01       37.01      0.00   0.05404   EXCEL

       

       

      Max Write: 61.60 MiB/sec (64.59 MB/sec)

      Max Read:  37.01 MiB/sec (38.81 MB/sec)

       

       

      Run finished: Sun Mar 11 09:55:13 2018

      [vcn01:29341] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found

      [vcn01:29341] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

      [root@vcn01 C]#

       

      With 3 nodes (-np 3):

       

      [root@vcn01 C]# /usr/mpi/gcc/openmpi-3.1.0rc2/bin/mpirun -x MXM_IB_USE_GRH=y --allow-run-as-root -np 3 --hostfile ./hostlist /mnt/lustre_client/pasokan/IOR-July12/src/C/IOR -a POSIX -w -r -t 1m -b 1m -k -o /tmp/pasokan_fuse/test3

      --------------------------------------------------------------------------

      WARNING: No preset parameters were found for the device that Open MPI

      detected:

       

       

        Local host:            vcn03

        Device name:           mlx5_0

        Device vendor ID:      0x02c9

        Device vendor part ID: 4114

       

       

      Default device parameters will be used, which may result in lower

      performance.  You can edit any of the files specified by the

      btl_openib_device_param_files MCA parameter to set values for your

      device.

       

       

      NOTE: You can turn off this warning by setting the MCA parameter

            btl_openib_warn_no_device_params_found to 0.

      --------------------------------------------------------------------------

      [1520776629.886041] [vcn03:26896:0]            cpu.c:52   UCX  WARN  CPU does not support invariant TSC, time may be unstable

      [1520776629.911893] [vcn02:27731:0]            cpu.c:52   UCX  WARN  CPU does not support invariant TSC, time may be unstable

      [1520776629.912087] [vcn02:27730:0]            cpu.c:52   UCX  WARN  CPU does not support invariant TSC, time may be unstable

      mlx5: vcn03: got completion with error:

      00000000 00000000 00000000 00000000

      00000000 00000000 00000000 00000000

      00000000 00000000 00000000 00000000

      00000000 78006802 0a000133 000007d2

      [vcn03:26896:0:26896]    ud_verbs.c:305  Fatal: Send completion (wr_id=0xFAAFFAAF with error: local QP operation error

      ==== backtrace ====

      0 0x000000000004e587 uct_ud_verbs_ep_t_init()  ???:0

      1 0x000000000003d96a ucs_callbackq_put_id_noflag()  ???:0

      2 0x00000000000163e2 ucp_worker_progress()  ???:0

      3 0x0000000000003237 mca_pml_ucx_progress()  /var/tmp/OFED_topdir/BUILD/openmpi-3.1.0rc2/ompi/mca/pml/ucx/pml_ucx.c:454

      4 0x000000000003282c opal_progress()  /var/tmp/OFED_topdir/BUILD/openmpi-3.1.0rc2/opal/runtime/opal_progress.c:228

      5 0x00000000000c4f39 wait_completion()  hcoll_collectives.c:0

      6 0x00000000000376ad comm_allreduce_hcolrte_generic()  common_allreduce.c:0

      7 0x0000000000037dcb comm_allreduce_hcolrte()  ???:0

      8 0x00000000001ee03e hmca_bcol_ucx_p2p_init_query.part.5()  bcol_ucx_p2p_component.c:0

      9 0x00000000000cdf8c hmca_bcol_base_init()  ???:0

      10 0x00000000000653b8 hmca_coll_ml_init_query()  ???:0

      11 0x00000000000c5d62 hcoll_init_with_opts()  ???:0

      12 0x0000000000005201 mca_coll_hcoll_comm_query()  /var/tmp/OFED_topdir/BUILD/openmpi-3.1.0rc2/ompi/mca/coll/hcoll/coll_hcoll_module.c:301

      13 0x00000000000756e5 query_2_0_0()  /var/tmp/OFED_topdir/BUILD/openmpi-3.1.0rc2/ompi/mca/coll/base/coll_base_comm_select.c:407

      14 0x00000000000756e5 query()  /var/tmp/OFED_topdir/BUILD/openmpi-3.1.0rc2/ompi/mca/coll/base/coll_base_comm_select.c:390

      15 0x00000000000756e5 check_one_component()  /var/tmp/OFED_topdir/BUILD/openmpi-3.1.0rc2/ompi/mca/coll/base/coll_base_comm_select.c:352

      16 0x00000000000756e5 check_components()  /var/tmp/OFED_topdir/BUILD/openmpi-3.1.0rc2/ompi/mca/coll/base/coll_base_comm_select.c:302

      17 0x00000000000756e5 mca_coll_base_comm_select()  /var/tmp/OFED_topdir/BUILD/openmpi-3.1.0rc2/ompi/mca/coll/base/coll_base_comm_select.c:125

      18 0x000000000004a853 ompi_mpi_init()  /var/tmp/OFED_topdir/BUILD/openmpi-3.1.0rc2/ompi/runtime/ompi_mpi_init.c:918

      19 0x00000000000675db PMPI_Init()  /var/tmp/OFED_topdir/BUILD/openmpi-3.1.0rc2/ompi/mpi/c/profile/pinit.c:66

      20 0x0000000000402cf7 main()  /mnt/lustre_client/pasokan/IOR-July12/src/C/IOR.c:125

      21 0x0000000000021b35 __libc_start_main()  ???:0

      22 0x0000000000402ba9 _start()  ???:0

      ===================

      [vcn03:26896] *** Process received signal ***

      [vcn03:26896] Signal: Aborted (6)

      [vcn03:26896] Signal code:  (-6)

      [vcn03:26896] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7fc43aade370]

      [vcn03:26896] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7fc43a7431d7]

      [vcn03:26896] [ 2] /lib64/libc.so.6(abort+0x148)[0x7fc43a7448c8]

      [vcn03:26896] [ 3] /usr/lib64/libucs.so.0(+0x401fa)[0x7fc42743c1fa]

      [vcn03:26896] [ 4] /usr/lib64/libuct.so.0(+0x4e587)[0x7fc42798d587]

      [vcn03:26896] [ 5] /usr/lib64/libucs.so.0(+0x3d96a)[0x7fc42743996a]

      [vcn03:26896] [ 6] /lib64/libucp.so.0(ucp_worker_progress+0x22)[0x7fc427bcf3e2]

      [vcn03:26896] [ 7] /usr/mpi/gcc/openmpi-3.1.0rc2/lib64/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x7fc427df9237]

      [vcn03:26896] [ 8] /usr/mpi/gcc/openmpi-3.1.0rc2/lib64/libopen-pal.so.40(opal_progress+0x2c)[0x7fc43a18282c]

      [vcn03:26896] [ 9] /opt/mellanox/hcoll/lib/libhcoll.so.1(+0xc4f39)[0x7fc4259c1f39]

      [vcn03:26896] [10] /opt/mellanox/hcoll/lib/libhcoll.so.1(+0x376ad)[0x7fc4259346ad]

      [vcn03:26896] [11] /opt/mellanox/hcoll/lib/libhcoll.so.1(comm_allreduce_hcolrte+0x4b)[0x7fc425934dcb]

      [vcn03:26896] [12] /opt/mellanox/hcoll/lib/libhcoll.so.1(+0x1ee03e)[0x7fc425aeb03e]

      [vcn03:26896] [13] /opt/mellanox/hcoll/lib/libhcoll.so.1(hmca_bcol_base_init+0x4c)[0x7fc4259caf8c]

      [vcn03:26896] [14] /opt/mellanox/hcoll/lib/libhcoll.so.1(hmca_coll_ml_init_query+0x68)[0x7fc4259623b8]

      [vcn03:26896] [15] /opt/mellanox/hcoll/lib/libhcoll.so.1(hcoll_init_with_opts+0x242)[0x7fc4259c2d62]

      [vcn03:26896] [16] /usr/mpi/gcc/openmpi-3.1.0rc2/lib64/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x3d1)[0x7fc425e38201]

      [vcn03:26896] [17] /usr/mpi/gcc/openmpi-3.1.0rc2/lib64/libmpi.so.40(mca_coll_base_comm_select+0x2d5)[0x7fc43ad606e5]

      [vcn03:26896] [18] /usr/mpi/gcc/openmpi-3.1.0rc2/lib64/libmpi.so.40(ompi_mpi_init+0xc43)[0x7fc43ad35853]

      [vcn03:26896] [19] /usr/mpi/gcc/openmpi-3.1.0rc2/lib64/libmpi.so.40(MPI_Init+0x9b)[0x7fc43ad525db]

      [vcn03:26896] [20] /mnt/lustre_client/pasokan/IOR-July12/src/C/IOR[0x402cf7]

      [vcn03:26896] [21] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc43a72fb35]

      [vcn03:26896] [22] /mnt/lustre_client/pasokan/IOR-July12/src/C/IOR[0x402ba9]

      [vcn03:26896] *** End of error message ***

      -------------------------------------------------------

      Primary job  terminated normally, but 1 process returned

      a non-zero exit code. Per user-direction, the job has been aborted.

      -------------------------------------------------------

      --------------------------------------------------------------------------

      mpirun noticed that process rank 2 with PID 26896 on node vcn03 exited on signal 6 (Aborted).

      --------------------------------------------------------------------------

      [vcn01:29350] 2 more processes have sent help message help-mpi-btl-openib.txt / no device params found

      [vcn01:29350] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

      [root@vcn01 C]#

       

      Please help. I tried to compile Open MPI, but I get a different issue at runtime.