
    MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA

    ratanb

## Problem: Segmentation fault: invalid permissions for mapped object running MPI with CUDA

       

## Configuration

      OS:

      ******************************

CentOS 7.5 (3.10.0-862.el7.x86_64)

       

Connectivity:

      ******************************

      Back to Back

       

Software:

      ******************************

      cuda-repo-rhel7-9-2-local-9.2.88-1.x86_64

      nccl_2.2.13-1+cuda9.2_x86_64.tar

      MLNX_OFED_LINUX-4.3-3.0.2.1-rhel7.5-x86_64.tgz

      nvidia-peer-memory_1.0-7.tar.gz

      openmpi-3.1.1.tar.bz2

      osu-micro-benchmarks-5.4.2.tar.gz

       

      [root@LOCALNODE ~]# lsmod | grep nv_peer_mem

      nv_peer_mem            13163  0

      ib_core               283851  11 rdma_cm,ib_cm,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib

      nvidia              14019833  9 nv_peer_mem,nvidia_modeset,nvidia_uvm

      [root@LOCALNODE ~]#

       

      ## Steps Followed

Followed this document: http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf

       

Open MPI command: mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -mca btl_openib_want_cuda_gdr 1 -np 2 -mca btl_openib_if_include mlx5_0:1 -mca -bind-to core -cpu-set 23 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D

       

## Two issues where we need help from Mellanox

1. While running the OSU micro-benchmarks Device to Device (i.e., D D), we get a segmentation fault.

2. Normal RDMA traffic (ib_send_*) runs fine between both nodes and on both ports, but while running the OSU micro-benchmarks, traffic only goes through port 1 (mlx5_1).

       

Note: the NVIDIA GPU and the Mellanox adapter are on different NUMA nodes.

      [root@LOCALNODE ~]# cat /sys/module/mlx5_core/drivers/pci\:mlx5_core/0000\:*/numa_node

      1

      1

      [root@LOCALNODE ~]# cat /sys/module/nvidia/drivers/pci\:nvidia/0000\:*/numa_node

      0

      [root@LOCALNODE ~]# lspci -tv | grep -i nvidia

      |           +-02.0-[19]----00.0  NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]

      [root@LOCALNODE ~]# lspci -tv | grep -i mellanox

      -+-[0000:d7]-+-02.0-[d8]--+-00.0  Mellanox Technologies MT27800 Family [ConnectX-5]

      |           |            \-00.1  Mellanox Technologies MT27800 Family [ConnectX-5]
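Since the GPU reports NUMA node 0 while the ConnectX-5 ports report NUMA node 1, a quick way to see the PCIe relationship a GPUDirect transfer would traverse is nvidia-smi topo -m (it is also used later in this thread). As a general rule, PIX/PXB (same PCIe switch) is preferred for GPUDirect RDMA, while PHB (through the host bridge) or SYS (across sockets) adds latency and can limit the benefit.

# show the PCIe relationship between GPU0 and the mlx5 devices
# legend (from nvidia-smi): PIX/PXB = same PCIe switch, PHB = PCIe host bridge, SYS = across QPI/sockets
nvidia-smi topo -m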

       

      ## Issue Details:

      ******************************

      Issue 1:

       

      [root@LOCALNODE nccl-tests]# mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -mca btl_openib_want_cuda_gdr 1 -np 2 -mca btl_openib_if_include mlx5_0 -mca -bind-to core -cpu-set 23 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D

      --------------------------------------------------------------------------

      No OpenFabrics connection schemes reported that they were able to be

      used on a specific port.  As such, the openib BTL (OpenFabrics

      support) will be disabled for this port.

       

        Local host:           LOCALNODE

        Local device:         mlx5_0

        Local port:           1

        CPCs attempted:       rdmacm, udcm

      --------------------------------------------------------------------------

      # OSU MPI-CUDA Latency Test v5.4.1

      # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)

      # Size          Latency (us)

      0                       1.20

      [LOCALNODE:5297 :0:5297] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd69ea00000)

      ==== backtrace ====

      0 0x0000000000045e92 ucs_debug_cleanup()  ???:0

      1 0x000000000000f6d0 _L_unlock_13()  funlockfile.c:0

      2 0x0000000000156e50 __memcpy_ssse3_back()  :0

      3 0x00000000000318e1 uct_rc_mlx5_ep_am_short()  ???:0

      4 0x0000000000027a5a ucp_tag_send_nbr()  ???:0

      5 0x0000000000004c71 mca_pml_ucx_send()  ???:0

      6 0x0000000000080202 MPI_Send()  ???:0

      7 0x0000000000401d42 main()  /home/NVIDIA/osu-micro-benchmarks-5.4.2/mpi/pt2pt/osu_latency.c:116

      8 0x0000000000022445 __libc_start_main()  ???:0

      9 0x000000000040205b _start()  ???:0

      ===================

      -------------------------------------------------------

      Primary job  terminated normally, but 1 process returned

      a non-zero exit code. Per user-direction, the job has been aborted.

      -------------------------------------------------------

      --------------------------------------------------------------------------

      mpirun noticed that process rank 0 with PID 0 on node LOCALNODE exited on signal 11 (Segmentation fault).

      --------------------------------------------------------------------------

      [LOCALNODE:05291] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port

      [LOCALNODE:05291] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

      [root@LOCALNODE nccl-tests]#

       

      Issue 2:

      [root@LOCALNODE ~]#  cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_*

      0

      0

      0

      0

      0

      0

      0

      0

      0

      0

      0

      [root@LOCALNODE ~]#  cat /sys/class/infiniband/mlx5_1/ports/1/counters/port_*

      0

      18919889

      0

      1011812

      0

      0

      0

      9549739941

      0

      35318041

      0

      [root@LOCALNODE ~]#
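One hedged observation on Issue 2: the backtrace in Issue 1 shows mca_pml_ucx_send, i.e. the job is actually running over the UCX PML rather than the openib BTL, and the UCX PML does not honor btl_openib_if_include. If that is the case, the HCA and port have to be selected through UCX itself; a sketch (UCX_NET_DEVICES and the device:port syntax are standard UCX, the rest of the command mirrors the one above):

# pin the HCA/port through UCX instead of the openib BTL parameters
mpirun --allow-run-as-root -np 2 -host LOCALNODE,REMOTENODE \
    -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x CUDA_VISIBLE_DEVICES=0 \
    /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D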

       

      Thanks & Regards

      Ratan B

        • Re: MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA
          jk.yang

I have encountered this problem too.

It was because UCX was not compiled with CUDA support (MLNX_OFED installs the default UCX).

When I recompiled UCX with CUDA and reinstalled it, it worked.
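For reference, a minimal sketch of such a rebuild, assuming UCX is taken from GitHub and CUDA is installed under /usr/local/cuda (the /opt/ucx-cuda prefix matches the one used further down this thread):

# build and install a CUDA-enabled UCX (sketch; paths are examples)
git clone https://github.com/openucx/ucx.git
cd ucx
./autogen.sh
./configure --prefix=/opt/ucx-cuda --with-cuda=/usr/local/cuda
# optionally add --with-gdrcopy=<path> if the gdrcopy library is installed
make -j$(nproc)
make install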

            • Re: MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA
              ratanb

Thanks a lot for the reply. It solved the above issue, but after running mpirun I do not see any latency difference with and without GDR.

               

My questions:

1. Why do I not see any latency difference with and without GDR?
2. Is the below sequence of steps correct? Does it matter for question 1?

               

Note: I have a single GPU on both the host and the peer. The IOMMU is disabled.

              ## nvidia-smi topo -m

                         GPU0    mlx5_0  mlx5_1  CPU Affinity

              GPU0     X      PHB     PHB     18-35

              mlx5_0  PHB      X      PIX

              mlx5_1  PHB     PIX      X

               

              Steps followed are:

1. Install CUDA 9.2 and add the library and bin paths in .bashrc

2. Install the latest MLNX_OFED

3. Compile and install the nv_peer_mem driver

4. Get UCX from git, configure UCX with CUDA, and install it

5. Configure Open MPI 3.1.1 and install it:

./configure --prefix=/usr/local --with-wrapper-ldflags=-Wl,-rpath,/lib --enable-orterun-prefix-by-default --disable-io-romio --enable-picky --with-cuda=/usr/local/cuda-9.2

6. Configure OSU Micro-Benchmarks 5.4.2 with CUDA and install it:

              ./configure prefix=/root/osu_benchmarks CC=mpicc --enable-cuda --with-cuda=/usr/local/cuda-9.2

               

              Run mpirun. I do not see any latency difference with and without GDR.

               

              Thanks for your Help.

                • Re: MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA
                  jk.yang

I'm not sure whether you resolved the signal 11 problem with my approach.

For what it's worth, I compile Open MPI against my own UCX:

                  ./configure --prefix=/usr/local/openmpi-3.1.1 --with-wrapper-ldflags=-Wl,-rpath,/lib --disable-vt --enable-orterun-prefix-by-default -disable-io-romio --enable-picky --with-cuda=/usr/local/cuda  --with-ucx=/opt/ucx-cuda --enable-mem-debug --enable-debug --enable-timing

Actually, latency should be lower with GDR. What kind of NIC are you using, ConnectX-4 or ConnectX-3?

It would be great if you could share some test data and your test environment configuration.
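A quick sanity check of that chain (a sketch, assuming the prefixes above): verify that Open MPI was built against UCX and that the rebuilt UCX actually exposes the CUDA memory transports.

# confirm Open MPI was configured with UCX support
ompi_info | grep -i ucx
# confirm the CUDA-enabled UCX lists the cuda transports (cuda_copy, and gdr_copy if gdrcopy was built in)
/opt/ucx-cuda/bin/ucx_info -d | grep -i cuda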

                    • Re: MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA
                      ratanb

Yes, using your approach the segmentation fault got resolved.

I am using a Mellanox ConnectX-5 adapter.

OS: CentOS 7.4

                       

Does the below topology look good to you?

                      ## nvidia-smi topo -m

                                 GPU0    mlx5_0  mlx5_1  CPU Affinity

                      GPU0     X      PHB     PHB     18-35

                      mlx5_0  PHB      X      PIX

                      mlx5_1  PHB     PIX      X

                       

I am running the below command to check the latency:

                      mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -mca btl_openib_want_cuda_gdr 1 -np 2 -mca btl_openib_if_include mlx5_1 -mca -bind-to core -cpu-set 23 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D

                        • Re: MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA
                          jk.yang

PHB: connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU).

 

Based on the topology you give, mlx5_0 and mlx5_1 are connected to GPU0 through a PCIe Host Bridge.

That means that even with GDR, the flow goes from GPU0 to the local host, then to the NIC (mlx5_1) on the local node.

On the remote node, the flow goes from the NIC (mlx5_1) to the host, then to GPU0.

Without GDR, the GPU is simply replaced with system memory (DDR); the flow still goes through the host. Maybe that is why the results look the same.

How much is your test latency?
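One hedged way to quantify the GDR effect, assuming the job runs over the UCX PML and the UCX build honors UCX_IB_GPU_DIRECT_RDMA (check with ucx_info -c), is to run the same D D test with GPUDirect RDMA explicitly allowed and then disabled, and compare the small-message latency:

# same test with GPUDirect RDMA allowed, then disabled (sketch)
mpirun --allow-run-as-root -np 2 -host LOCALNODE,REMOTENODE -mca pml ucx \
    -x UCX_IB_GPU_DIRECT_RDMA=yes -x CUDA_VISIBLE_DEVICES=0 \
    /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D
mpirun --allow-run-as-root -np 2 -host LOCALNODE,REMOTENODE -mca pml ucx \
    -x UCX_IB_GPU_DIRECT_RDMA=no -x CUDA_VISIBLE_DEVICES=0 \
    /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D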

                           

                            • Re: MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA
                              ratanb

                              Hi Jainkun yang,

Sorry for the very late reply.

I am getting 7 microseconds of latency for the smallest message sizes.

                               

When I run the osu_bw test, I see that system memory is also being used along with GPU memory. This seems strange: with GPUDirect RDMA we should not see any system memory usage, right? Am I missing something?
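A possible explanation (an assumption, not something confirmed in this thread): for messages below the rendezvous threshold, UCX typically stages GPU buffers through pinned host memory via its cuda_copy path, while GPUDirect RDMA mainly applies to larger, rendezvous-protocol transfers, so some system-memory traffic during osu_bw is not necessarily a misconfiguration. Lowering the threshold is one way to probe this; UCX_RNDV_THRESH is a standard UCX variable, the value below is only an example.

# force the rendezvous protocol (and hence GPUDirect RDMA, where available) above 8 KB (sketch)
mpirun --allow-run-as-root -np 2 -host LOCALNODE,REMOTENODE -mca pml ucx \
    -x UCX_RNDV_THRESH=8192 -x CUDA_VISIBLE_DEVICES=0 \
    /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw -d cuda D D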

lspci -tv output for both systems:

                              +-[0000:80]-+-00.0-[81]--

                              |           +-01.0-[82]--

                              |           +-01.1-[83]--

                              |           +-02.0-[84]--

                              |           +-02.2-[85]----00.0  Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

                              |           +-03.0-[86]----00.0  NVIDIA Corporation Device 15f8

                               

                               

                              On Host Systems:

                              80:02.2 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 2 (rev 02) (prog-if 00 [Normal decode])

                              80:03.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 3 (rev 02) (prog-if 00 [Normal decode])

                               

                              On Peer System:

                              80:02.2 PCI bridge: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 2 (rev 01) (prog-if 00 [Normal decode])

                              80:03.0 PCI bridge: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3 (rev 01) (prog-if 00 [Normal decode])

                               

                              Host CPU:

                              # lscpu

                              Architecture:          x86_64

                              CPU op-mode(s):        32-bit, 64-bit

                              Byte Order:            Little Endian

                              CPU(s):                72

                              On-line CPU(s) list:   0-71

                              Thread(s) per core:    2

                              Core(s) per socket:    18

                              Socket(s):             2

                              NUMA node(s):          1

                              Vendor ID:             GenuineIntel

                              CPU family:            6

                              Model:                 63

                              Model name:            Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

                              Stepping:              2

                              CPU MHz:               1202.199

                              CPU max MHz:           3600.0000

                              CPU min MHz:           1200.0000

                              BogoMIPS:              4590.86

                              Virtualization:        VT-x

                              L1d cache:             32K

                              L1i cache:             32K

                              L2 cache:              256K

                              L3 cache:              46080K

                              NUMA node0 CPU(s):     0-71

                              Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single retpoline kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts

                               

                              Peer CPU:

                               

                              # lscpu

                              Architecture:          x86_64

                              CPU op-mode(s):        32-bit, 64-bit

                              Byte Order:            Little Endian

                              CPU(s):                32

                              On-line CPU(s) list:   0-31

                              Thread(s) per core:    2

                              Core(s) per socket:    8

                              Socket(s):             2

                              NUMA node(s):          1

                              Vendor ID:             GenuineIntel

                              CPU family:            6

                              Model:                 79

                              Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

                              Stepping:              1

                              CPU MHz:               1201.019

                              CPU max MHz:           3000.0000

                              CPU min MHz:           1200.0000

                              BogoMIPS:              4191.23

                              Virtualization:        VT-x

                              L1d cache:             32K

                              L1i cache:             32K

                              L2 cache:              256K

                              L3 cache:              20480K

                              NUMA node0 CPU(s):     0-31

                              Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt retpoline kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts