0 Replies Latest reply on Mar 13, 2018 9:06 AM by pasokan

    Problem running an application on multiple nodes in an SR-IOV environment using Open MPI [built from source]

    pasokan


       

      I'm using Mellanox 56G FDR with SR-IOV on KVM virtualization, and I want to use RDMA to communicate between VMs over the FDR Virtual Function.

       

      • Operating system/version: CentOS 7.3
      • Computer hardware: KVM Virtualization
      • Network type: 56G FDR -- Virtual Function
      • Open MPI version: 3.0.0 (built from source)

      Building Open MPI

       

      wget https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz

      tar -zxf openmpi-3.0.0.tar.gz

      mv openmpi-3.0.0 openmpi-3.0.0-src

      mkdir openmpi-3.0.0

      cd openmpi-3.0.0-src

      ./configure --prefix=/mnt/lustre_client/pasokan/openmpi-3.0.0/openmpi-3.0.0

      make all install

       

      On one node, ./IOR runs with Open MPI, but on two nodes it fails with "[connect/btl_openib_connect_udcm.c:1575:udcm_wait_for_send_completion] send failed with verbs status 2".
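
      To make the "verbs status 2" in that message easier to read, here is a minimal sketch that maps the number to a name. It assumes the number is the completion status from libibverbs (enum ibv_wc_status in verbs.h). The function name wc_status_name is hypothetical, and only a subset of the enum values is listed.

```shell
#!/bin/sh
# Hypothetical helper: map a numeric verbs completion status to its
# enum ibv_wc_status name (assumption: Open MPI prints wc.status here).
# Only a subset of values is listed; anything else prints "UNKNOWN".
wc_status_name() {
    case "$1" in
        0)  echo "IBV_WC_SUCCESS" ;;
        1)  echo "IBV_WC_LOC_LEN_ERR" ;;
        2)  echo "IBV_WC_LOC_QP_OP_ERR" ;;
        4)  echo "IBV_WC_LOC_PROT_ERR" ;;
        5)  echo "IBV_WC_WR_FLUSH_ERR" ;;
        12) echo "IBV_WC_RETRY_EXC_ERR" ;;
        *)  echo "UNKNOWN" ;;
    esac
}

# Decode the status reported in the two-node run.
wc_status_name 2
```

      Under that assumption, status 2 is a local QP operation error. That would fit the earlier "error modifing QP to RTR" line, since the send is posted on a queue pair that never reached the ready state.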

       

        One-node run

       

       

        [root@vcn03 C]# mpirun --allow-run-as-root -np 1 -host vcn03 ./IOR
        --------------------------------------------------------------------------

        WARNING: No preset parameters were found for the device that Open MPI

        detected:

       

        Local host: vcn03

        Device name: mlx5_0

        Device vendor ID: 0x02c9

        Device vendor part ID: 4114

       

        Default device parameters will be used, which may result in lower

        performance. You can edit any of the files specified by the

        btl_openib_device_param_files MCA parameter to set values for your

        device.

       

        NOTE: You can turn off this warning by setting the MCA parameter

        btl_openib_warn_no_device_params_found to 0.

        --------------------------------------------------------------------------

        [vcn03][[33605,1],0][connect/btl_openib_connect_udcm.c:1235:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument

        IOR-2.10.3: MPI Coordinated Test of Parallel I/O

       

        Run began: Tue Mar 13 11:50:15 2018

        Command line used: ./IOR

        Machine: Linux vcn03

       

        Summary:

        api = POSIX

        test filename = testFile

        access = single-shared-file

        ordering in a file = sequential offsets

        ordering inter file= no tasks offsets

        clients = 1 (1 per node)

        repetitions = 1

        xfersize = 262144 bytes

        blocksize = 1 MiB

        aggregate filesize = 1 MiB

       

        Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)

        --------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------

        write 312.36 312.36 312.36 0.00 1249.44 1249.44 1249.44 0.00 0.00320 EXCEL

        read 996.42 996.42 996.42 0.00 3985.69 3985.69 3985.69 0.00 0.00100 EXCEL

       

        Max Write: 312.36 MiB/sec (327.53 MB/sec)

        Max Read: 996.42 MiB/sec (1044.82 MB/sec)

       

        Run finished: Tue Mar 13 11:50:15 2018

       

       

        Two-node run

       

        [root@vcn03 C]# mpirun --allow-run-as-root -np 2 -host vcn03,vcn04 ./IOR
        --------------------------------------------------------------------------

        WARNING: No preset parameters were found for the device that Open MPI

        detected:

       

        Local host: vcn04

        Device name: mlx5_0

        Device vendor ID: 0x02c9

        Device vendor part ID: 4114

       

        Default device parameters will be used, which may result in lower

        performance. You can edit any of the files specified by the

        btl_openib_device_param_files MCA parameter to set values for your

        device.

       

        NOTE: You can turn off this warning by setting the MCA parameter

        btl_openib_warn_no_device_params_found to 0.

        --------------------------------------------------------------------------

        [vcn03][[33640,1],0][connect/btl_openib_connect_udcm.c:1235:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument

        [vcn04][[33640,1],1][connect/btl_openib_connect_udcm.c:1235:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument

        mlx5: vcn04: got completion with error:

        00000000 00000000 00000000 00000000

        00000000 00000000 00000000 00000000

        00000000 00000000 00000000 00000000

        00000000 78006802 0a00016f 00005bd2

        [vcn04][[33640,1],1][connect/btl_openib_connect_udcm.c:1575:udcm_wait_for_send_completion] send failed with verbs status 2

        [vcn04:28705] *** An error occurred in MPI_Send

        [vcn04:28705] *** reported by process [2204631041,1]

        [vcn04:28705] *** on communicator MPI_COMM_WORLD

        [vcn04:28705] *** MPI_ERR_OTHER: known error not in list

        [vcn04:28705] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,

        [vcn04:28705] *** and potentially your MPI job)

        [vcn03:05349] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found

        [vcn03:05349] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

        [root@vcn03 C]#
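
      One way to check whether the failure is specific to the openib BTL would be to repeat the two-node run with that component excluded, so that Open MPI falls back to its other transports such as TCP. The sketch below only prints the command line and does not run it. The helper name no_openib_cmd is hypothetical, and it assumes the "--mca btl ^openib" exclusion syntax applies to this 3.0.0 build.

```shell
#!/bin/sh
# Hypothetical helper: print (dry run) an mpirun command line that
# excludes the openib BTL. Arguments: process count, host list, program.
no_openib_cmd() {
    np="$1"
    hosts="$2"
    shift 2
    echo "mpirun --allow-run-as-root --mca btl ^openib -np $np -host $hosts $*"
}

# Same two-node run as above, with openib excluded.
no_openib_cmd 2 vcn03,vcn04 ./IOR
```

      If the run completes this way, that would suggest the problem is in the RDMA connection setup on the Virtual Function and not in IOR or the Open MPI build itself.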