2 Replies Latest reply on Sep 8, 2017 8:21 PM by haizhushao

    ibv_reg_mr got file exists error when used nv_peer_mem


      Hi, everyone


      I would like to test GPUDirect RDMA, so I am using a ConnectX-3 NIC and an NVIDIA K80 GPU for the experiment. The environment is listed below:


      cuda-drivers: 384.66

      cuda-toolkit: 375.26

      nv_peer_mem: 1.0.5



      I use the perftest tool to run the experiment:

      server1: ./ib_write_bw -a -F -n10000 --use_cuda

      server2: ./ib_write_bw -a -F -n10000 server1


      But server1 outputs the following errors:

      Couldn't allocate MR
      failed to create mr
      Failed to create MR


      Finally, I printed the error code and errno: the error code is 14, and the errno message is "Bad address".


      Can anyone help me figure out what is wrong? Thank you very much.

        • Re: ibv_reg_mr got file exists error when used nv_peer_mem

          Hi Haizhu,


          Thank you for contacting the Mellanox Community.


          For your test, please install the latest Mellanox OFED version and redo the test with ib_send_bw WITHOUT CUDA first, to check that plain RDMA is working properly. Also use the -d option to specify the device you want to use.

          Example without CUDA:


          # ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits


          # ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits <ip-address-server>



          Example with CUDA:


          # ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits --use_cuda


          # ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits --use_cuda <ip-address-server>


          We also recommend following the benchmark test from the GPUDirect User Manual ( http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf ), Section 3.


          For further support, we recommend opening a support case with Mellanox Support.






            • Re: ibv_reg_mr got file exists error when used nv_peer_mem

              Hi Martijn,

              Thank you for your reply about the issue.


              I didn't describe the problem clearly; the hardware and software environment is listed below:

              1. Hardware:

              ConnectX-3 (Mellanox Technologies MT27500 Family [ConnectX-3])

              Nvidia K80

              2. Software:

              ubuntu-16.04, kernel 4.8.7

              nvidia-driver: nvidia-diag-driver-local-repo-ubuntu1604-384.66_1.0-1_amd64.deb (download site: NVIDIA Tesla driver for Ubuntu 16.04)

              cuda-toolkit: cuda_8.0.61_375.26_linux.run (download site: CUDA Toolkit Download | NVIDIA Developer)

              MLNX_OFED: MLNX_OFED_SRC-debian-4.1-  http://www.mellanox.com/downloads/ofed/MLNX_OFED-4.1-

              nv_peer_mem: 1.0.5


              I have two servers, one of which has a K80 GPU. I want to use perftest to test RDMA and GPUDirect. Following that guide, I installed nv_peer_mem on the server with the K80 GPU.

              When I don't use --use_cuda, ib_write_bw works well, but with --use_cuda it fails. I printed the error message: ib_write_bw fails inside ibv_reg_mr with the error "File exists". If I don't insmod nv_peer_mem, ibv_reg_mr instead fails with "Bad address".


              Some background: I had run the same experiment successfully before, on kernel 4.4.0 with MLNX_OFED 4.0- and without NVMe over Fabrics installed. Then my workmate installed kernel 4.8.7 and NVMe over Fabrics, and since then ib_write_bw with --use_cuda has never run correctly.


              Is there anything wrong with my experiment or my environment? And a further question: can a single ConnectX-3 support NVMe over Fabrics and GPUDirect RDMA at the same time?




              Thanks again for your help, and I look forward to your reply.



              Haizhu Shao