    ibv_reg_mr fails with "File exists" error when using nv_peer_mem


      Hi everyone,


      I would like to test GPUDirect RDMA, so I am using a ConnectX-3 and an Nvidia K80 for the experiment. The environment is listed below:


      cuda-drivers: 384.66

      cuda-toolkit: 375.26

      nv_peer_mem: 1.0.5



      I use the perftest tools to run the experiment.

      server1: ./ib_write_bw -a -F -n10000 --use_cuda

      server2: ./ib_write_bw -a -F -n10000 server1


      But server1 prints the following errors:

      Couldn't allocate MR
      failed to create mr
      Failed to create MR


      Finally, I printed the error code and errno: the error code is 14, and the errno message is "Bad address".
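For reference, the errno values that come up in this thread can be decoded with a small lookup. This is only a minimal sketch and not part of perftest: the `describe_errno` helper is hypothetical, and the numbers assume the standard Linux errno values (14 is EFAULT, 17 is EEXIST).

```shell
#!/bin/sh
# Hypothetical helper: map the errno numbers discussed in this thread
# to their Linux names and messages. Only two values are covered.
describe_errno() {
    case "$1" in
        14) echo "EFAULT: Bad address" ;;
        17) echo "EEXIST: File exists" ;;
        *)  echo "errno $1: not covered by this sketch" ;;
    esac
}

describe_errno 14   # prints "EFAULT: Bad address"
describe_errno 17   # prints "EEXIST: File exists"
```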


      Can anyone help me and tell me whether something is wrong with my setup? Thank you very much.

          Hi Haizhu,


          Thank you for contacting the Mellanox Community.


          For your test, please install the latest Mellanox OFED version and redo the test with ib_send_bw WITHOUT CUDA to check whether RDMA is working properly, including the -d option to select the device you want to use.

          Example without CUDA


          # ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits


          # ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits <ip-address-server>



          Example with CUDA


          # ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits --use_cuda


          # ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits --use_cuda <ip-address-server>


          Also we recommend following the benchmark test from the GPUDirect UM ( http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf ), Section 3.


          For further support, we recommend opening a support case with Mellanox Support.






              Hi Martijin

              Thank you for your reply about the issue.


              I didn't describe the problem clearly. The hardware and software environment is listed below:

              1. Hardware:

              ConnectX-3 (Mellanox Technologies MT27500 Family [ConnectX-3])

              Nvidia K80

              2. Software:

              ubuntu-16.04, kernel 4.8.7

              nvidia-driver: nvidia-diag-driver-local-repo-ubuntu1604-384.66_1.0-1_amd64.deb (download site: NVIDIA DRIVERS Tesla Driver for Ubuntu 16.04)

              cuda-toolkit: cuda_8.0.61_375.26_linux.run (CUDA Toolkit Download | NVIDIA Developer)

              MLNX_OFED: MLNX_OFED_SRC-debian-4.1-  http://www.mellanox.com/downloads/ofed/MLNX_OFED-4.1-

              nv_peer_mem: 1.0.5


              I have two servers, one of which has a K80 GPU. I want to use perftest to test RDMA and GPUDirect. With reference to this, I installed nv_peer_mem on the server with the K80 GPU.

              Without --use_cuda, ib_write_bw works well. With --use_cuda it fails: I printed the error message and found that ib_write_bw reaches ibv_reg_mr, which then fails with the error "File exists". If I do not insmod nv_peer_mem, ibv_reg_mr fails with "Bad address" instead.
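Since the error changes depending on whether nv_peer_mem is loaded, it can help to confirm which kernel modules are actually present on the GPU server. The following is a minimal sketch under stated assumptions: the `module_loaded` helper is hypothetical, the module list is a guess at what matters for this setup (mlx4_core and mlx4_ib are the ConnectX-3 drivers), and /proc/modules is the standard Linux list of loaded modules.

```shell
#!/bin/sh
# Hypothetical helper: succeed if module $1 appears in the module list.
# $2 is the list file; it defaults to /proc/modules on Linux.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

# Assumed set of modules relevant to GPUDirect RDMA on a ConnectX-3.
for mod in nvidia nv_peer_mem ib_core mlx4_core mlx4_ib; do
    if module_loaded "$mod"; then
        echo "$mod: loaded"
    else
        echo "$mod: NOT loaded"
    fi
done
```

A module showing as loaded does not by itself prove that it matches the running kernel and OFED installation; this check only rules out the simplest case of a missing module.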


              For background, I had run the same experiment successfully before, using kernel 4.4.0 and MLNX_OFED 4.0-, without NVMe over Fabrics installed. Then my colleague installed kernel 4.8.7 and NVMe over Fabrics. Since then, ib_write_bw with --use_cuda has never run correctly.


              Is there anything wrong with my experiment or its environment? Another question: can I use one ConnectX-3 to support NVMe over Fabrics and GPUDirect RDMA at the same time?




              Thanks again for your reply. I look forward to hearing from you.



              Haizhu Shao