1 Reply Latest reply on Jul 20, 2018 4:08 AM by jk.yang

    MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA

    ratanb

## Problem: Segmentation fault: invalid permissions for mapped object when running MPI with CUDA

       

## Configurations

      OS:

      ******************************

CentOS 7.5 (3.10.0-862.el7.x86_64)

       

Connectivity:

      ******************************

      Back to Back

       

Software:

      ******************************

      cuda-repo-rhel7-9-2-local-9.2.88-1.x86_64

      nccl_2.2.13-1+cuda9.2_x86_64.tar

      MLNX_OFED_LINUX-4.3-3.0.2.1-rhel7.5-x86_64.tgz

      nvidia-peer-memory_1.0-7.tar.gz

      openmpi-3.1.1.tar.bz2

      osu-micro-benchmarks-5.4.2.tar.gz

       

      [root@LOCALNODE ~]# lsmod | grep nv_peer_mem

      nv_peer_mem            13163  0

      ib_core               283851  11 rdma_cm,ib_cm,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib

      nvidia              14019833  9 nv_peer_mem,nvidia_modeset,nvidia_uvm

      [root@LOCALNODE ~]#

       

      ## Steps Followed

Followed this document: http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf

       

Open MPI command: mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -mca btl_openib_want_cuda_gdr 1 -np 2 -mca btl_openib_if_include mlx5_0:1 -mca -bind-to core -cpu-set 23 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D
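A side note on the command line above: `-mca -bind-to core` looks like a typo, since `-mca` expects a `<name> <value>` pair and `--bind-to core` is a standalone option. Also, since the backtrace further down shows UCX frames (`mca_pml_ucx_send`), the `btl_openib_*` parameters are likely not in effect. A hedged rework we could try (assuming the installed UCX has CUDA support; `UCX_NET_DEVICES=mlx5_0:1` is our guess for steering traffic to the intended port):

```shell
# Hedged variant of the command above, not verified on this exact setup:
#  - "--bind-to core" as a standalone option (the original "-mca -bind-to core"
#    passes "-bind-to" as an MCA parameter name, which is probably not intended)
#  - select the UCX PML explicitly, matching what the backtrace shows in use
#  - steer UCX to mlx5_0 port 1 (btl_openib_if_include only affects the openib BTL)
mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -np 2 \
    -mca pml ucx \
    -x UCX_NET_DEVICES=mlx5_0:1 \
    --bind-to core -cpu-set 23 \
    -x CUDA_VISIBLE_DEVICES=0 \
    /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D
```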

       

## Two issues where we need help from Mellanox

1. While running the OSU micro-benchmarks Device to Device (i.e. D D), we get a segmentation fault.

2. Normal RDMA traffic (ib_send_*) runs fine between both nodes and on both ports, but while running the OSU micro-benchmarks, traffic only goes through mlx5_1.

       

Note: The NVIDIA GPU and the Mellanox adapter are on different NUMA nodes.

      [root@LOCALNODE ~]# cat /sys/module/mlx5_core/drivers/pci\:mlx5_core/0000\:*/numa_node

      1

      1

      [root@LOCALNODE ~]# cat /sys/module/nvidia/drivers/pci\:nvidia/0000\:*/numa_node

      0

      [root@LOCALNODE ~]# lspci -tv | grep -i nvidia

      |           +-02.0-[19]----00.0  NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]

      [root@LOCALNODE ~]# lspci -tv | grep -i mellanox

      -+-[0000:d7]-+-02.0-[d8]--+-00.0  Mellanox Technologies MT27800 Family [ConnectX-5]

      |           |            \-00.1  Mellanox Technologies MT27800 Family [ConnectX-5]
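Since the GPU (NUMA node 0) and the HCAs (NUMA node 1) sit on different sockets, GPUDirect RDMA peer-to-peer traffic has to cross the CPU interconnect, which is known to hurt or break GPUDirect on some platforms. A quick way to see the PCIe relationship (assuming a recent NVIDIA driver is installed; the matrix labels are the driver's):

```shell
# Show the PCIe/NUMA relationship between the GPU and the Mellanox HCAs.
# "SYS" between the GPU and an mlx5 device would confirm they are across
# the QPI/UPI link, consistent with the numa_node readings above.
nvidia-smi topo -m
```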

       

      ## Issue Details:

      ******************************

      Issue 1:

       

      [root@LOCALNODE nccl-tests]# mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -mca btl_openib_want_cuda_gdr 1 -np 2 -mca btl_openib_if_include mlx5_0 -mca -bind-to core -cpu-set 23 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D

      --------------------------------------------------------------------------

      No OpenFabrics connection schemes reported that they were able to be

      used on a specific port.  As such, the openib BTL (OpenFabrics

      support) will be disabled for this port.

       

        Local host:           LOCALNODE

        Local device:         mlx5_0

        Local port:           1

        CPCs attempted:       rdmacm, udcm

      --------------------------------------------------------------------------

      # OSU MPI-CUDA Latency Test v5.4.1

      # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)

      # Size          Latency (us)

      0                       1.20

      [LOCALNODE:5297 :0:5297] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd69ea00000)

      ==== backtrace ====

      0 0x0000000000045e92 ucs_debug_cleanup()  ???:0

      1 0x000000000000f6d0 _L_unlock_13()  funlockfile.c:0

      2 0x0000000000156e50 __memcpy_ssse3_back()  :0

      3 0x00000000000318e1 uct_rc_mlx5_ep_am_short()  ???:0

      4 0x0000000000027a5a ucp_tag_send_nbr()  ???:0

      5 0x0000000000004c71 mca_pml_ucx_send()  ???:0

      6 0x0000000000080202 MPI_Send()  ???:0

      7 0x0000000000401d42 main()  /home/NVIDIA/osu-micro-benchmarks-5.4.2/mpi/pt2pt/osu_latency.c:116

      8 0x0000000000022445 __libc_start_main()  ???:0

      9 0x000000000040205b _start()  ???:0

      ===================

      -------------------------------------------------------

      Primary job  terminated normally, but 1 process returned

      a non-zero exit code. Per user-direction, the job has been aborted.

      -------------------------------------------------------

      --------------------------------------------------------------------------

      mpirun noticed that process rank 0 with PID 0 on node LOCALNODE exited on signal 11 (Segmentation fault).

      --------------------------------------------------------------------------

      [LOCALNODE:05291] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port

      [LOCALNODE:05291] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

      [root@LOCALNODE nccl-tests]#
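The backtrace shows the crash inside `uct_rc_mlx5_ep_am_short()` reached via `mca_pml_ucx_send()`, i.e. the UCX PML is handling the send, and `__memcpy_ssse3_back` faulting on the device pointer suggests UCX tried a plain host memcpy on CUDA memory. A hedged first check is whether the installed UCX was built with CUDA awareness (the `ucx_info` tool ships with UCX; the grep strings are our assumption about its output):

```shell
# If "cuda" does not appear in either listing, UCX has no cuda_copy/gdr_copy
# support and will memcpy from a device pointer, segfaulting exactly as in
# the trace above.
ucx_info -v | grep -i cuda        # configure-time flags of the installed UCX
ucx_info -d | grep -i cuda        # cuda_copy / gdr_copy transports, if built
```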

       

      Issue 2:

      [root@LOCALNODE ~]#  cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_*

      0

      0

      0

      0

      0

      0

      0

      0

      0

      0

      0

      [root@LOCALNODE ~]#  cat /sys/class/infiniband/mlx5_1/ports/1/counters/port_*

      0

      18919889

      0

      1011812

      0

      0

      0

      9549739941

      0

      35318041

      0

      [root@LOCALNODE ~]#
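To make the counter dumps above easier to read: assuming the `port_*` files expand in alphabetical (shell glob) order, the eleven values map onto the standard sysfs counter names, and the non-zero entries confirm that all traffic went through mlx5_1 while mlx5_0 carried nothing. A small sketch of that bookkeeping (values copied from the output above; the name ordering is our assumption, verify with `ls` on the counters directory):

```python
# Hedged bookkeeping for the counter dumps above. The names assume the
# port_* files expand in alphabetical (shell glob) order; verify with
# "ls /sys/class/infiniband/mlx5_1/ports/1/counters/port_*".
PORT_COUNTERS = [
    "port_rcv_constraint_errors", "port_rcv_data", "port_rcv_errors",
    "port_rcv_packets", "port_rcv_remote_physical_errors",
    "port_rcv_switch_relay_errors", "port_xmit_constraint_errors",
    "port_xmit_data", "port_xmit_discards", "port_xmit_packets",
    "port_xmit_wait",
]

mlx5_0 = [0] * 11  # all zero: no traffic at all on the requested device
mlx5_1 = [0, 18919889, 0, 1011812, 0, 0, 0, 9549739941, 0, 35318041, 0]

def nonzero(values):
    """Map counter names to their non-zero values."""
    return {name: v for name, v in zip(PORT_COUNTERS, values) if v}

print(nonzero(mlx5_0))  # empty: mlx5_0 carried nothing
print(nonzero(mlx5_1))  # rcv/xmit data and packets: all traffic went here
```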

       

      Thanks & Regards

      Ratan B