
    ibv_reg_mr got file exists error when using nv_peer_mem

    haizhushao

      Hi, everyone

       

      I would like to test GPUDirect RDMA, so I am using a ConnectX-3 NIC and an Nvidia K80 GPU for the experiment. The environment is listed below:

      kernel-4.8.7

      nvidia-driver: 384.66

      cuda-toolkit: 8.0.61 (375.26)

      nv_peer_mem: 1.0.5

       

       

      I use the perftest tool to run the experiment.

      server1: ./ib_write_bw -a -F -n10000 --use_cuda

      server2: ./ib_write_bw -a -F -n10000 server1

       

      but server1 prints this error:

      Couldn't allocate MR
      failed to create mr
      Failed to create MR
      

       

      Finally, I printed out the error and errno: the error number is 14, and errno reads "Bad address".
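      For reference, the failing path is essentially the one below: allocate device memory with cudaMalloc and hand the pointer to ibv_reg_mr. This is only a minimal sketch of the registration step, not the real perftest code; the device index (first HCA, default GPU), the buffer size and the access flags are placeholders I chose for the example.

      /* Sketch: register cudaMalloc'ed memory with ibv_reg_mr and report errno.
       * Needs libibverbs and the CUDA runtime; nv_peer_mem must be loaded for
       * registration of GPU memory to succeed. */
      #include <stdio.h>
      #include <string.h>
      #include <errno.h>
      #include <infiniband/verbs.h>
      #include <cuda_runtime.h>

      int main(void)
      {
          int num = 0;
          struct ibv_device **dev_list = ibv_get_device_list(&num);
          if (!dev_list || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

          struct ibv_context *ctx = ibv_open_device(dev_list[0]);    /* first HCA (assumption) */
          if (!ctx) { fprintf(stderr, "failed to open device\n"); return 1; }
          struct ibv_pd *pd = ibv_alloc_pd(ctx);
          if (!pd) { fprintf(stderr, "failed to alloc PD\n"); return 1; }

          void *gpu_buf = NULL;
          size_t len = 64 * 1024;                                    /* arbitrary test size */
          if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
              fprintf(stderr, "cudaMalloc failed\n");
              return 1;
          }

          struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                         IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
          if (!mr)
              /* this is where the failure shows up for me (errno 14, "Bad address") */
              fprintf(stderr, "ibv_reg_mr failed: errno=%d (%s)\n", errno, strerror(errno));
          else
              printf("GPU memory registered, lkey=0x%x\n", mr->lkey);

          if (mr) ibv_dereg_mr(mr);
          cudaFree(gpu_buf);
          ibv_dealloc_pd(pd);
          ibv_close_device(ctx);
          ibv_free_device_list(dev_list);
          return 0;
      }

      I build it with something like: gcc reg_gpu_mr.c -o reg_gpu_mr -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -libverbs -lcudart (the file name and CUDA paths are just examples).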

       

      Can anyone help me and tell me whether there is any problem with my setup? Thank you very much.

        • Re: ibv_reg_mr got file exists error when using nv_peer_mem
          martijn@mellanox.com

          Hi Haizhu,

           

          Thank you for contacting the Mellanox Community.

           

          For your test, please install the latest Mellanox OFED version and redo the test with ib_send_bw WITHOUT CUDA to check whether RDMA is working properly. Include the option to specify the device you want to use.

          Example without CUDA

          Server:

          # ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits

          Client:

          # ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits <ip-address-server>

           

           

          Example with CUDA

          Server:

          # ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits --use_cuda

          Client:

          # ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits --use_cuda <ip-address-server>

           

          We also recommend following the benchmark test from the GPUDirect User Manual ( http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf ), Section 3.

           

          For further support, we recommend opening a support case with Mellanox Support.

           

          Thanks.

           

          Cheers,

          ~Martijn

            • Re: ibv_reg_mr got file exists error when using nv_peer_mem
              haizhushao

              Hi Martijn,

              Thank you for your reply about the issue.

               

              I didn't describe the problem clearly; the hardware and software environment is listed below:

              1. Hardware:

              ConnectX-3 (Mellanox Technologies MT27500 Family [ConnectX-3])

              Nvidia K80

              2. Software:

              ubuntu-16.04, kernel 4.8.7

              nvidia-driver: nvidia-diag-driver-local-repo-ubuntu1604-384.66_1.0-1_amd64.deb (download site: NVIDIA DRIVERS Tesla Driver for Ubuntu 16.04)

              cuda-toolkit: cuda_8.0.61_375.26_linux.run (CUDA Toolkit Download | NVIDIA Developer )

              MLNX_OFED: MLNX_OFED_SRC-debian-4.1-1.0.2.0.tgz  http://www.mellanox.com/downloads/ofed/MLNX_OFED-4.1-1.0.2.0/MLNX_OFED_SRC-debian-4.1-1.0.2.0.tgz

              nv_peer_mem: 1.0.5

               

              I have two servers, one of which has a K80 GPU. I want to use perftest to test RDMA and GPUDirect. Referring to this, I installed nv_peer_mem on the server with the K80 GPU.

              When I don't use --use_cuda, ib_write_bw works well, but when I use --use_cuda it fails. I printed the error message: ib_write_bw runs into ibv_reg_mr and then gets the error "File exists". If I don't insmod nv_peer_mem, ibv_reg_mr gets the error "Bad address" instead.
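              Just to be sure I read those two strings correctly, I checked which symbolic errno values they correspond to; this is plain libc, nothing RDMA-specific:

              /* Print the errno numbers and strings for the two failures I see. */
              #include <errno.h>
              #include <stdio.h>
              #include <string.h>

              int main(void)
              {
                  printf("EEXIST = %d (%s)\n", EEXIST, strerror(EEXIST));  /* "File exists"  */
                  printf("EFAULT = %d (%s)\n", EFAULT, strerror(EFAULT));  /* "Bad address" */
                  return 0;
              }

              On Linux/glibc this prints EEXIST = 17 (File exists) and EFAULT = 14 (Bad address), which matches the "File exists" case with nv_peer_mem loaded and the errno 14 / "Bad address" case without it.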

               

              The background is that I had run the same experiment successfully before, using kernel 4.4.0 and MLNX_OFED 4.0-2.0.0.1, without NVMe over Fabrics installed. Then my workmate installed kernel 4.8.7 and NVMe over Fabrics. Since then, ib_write_bw with --use_cuda has never run correctly.

               

              Is there any problem with my experiment or my environment? And another question: can a single ConnectX-3 support NVMe over Fabrics and GPUDirect RDMA at the same time?

               

               

               

              Thanks again for your reply; I look forward to hearing from you.

               

              Yours

              Haizhu Shao