As mentioned in another HPC blog post earlier this month, the GPUDirect RDMA Beta is now publicly available. Here, I want to share some quick examples of running GPUDirect RDMA on Kepler-class NVIDIA GPUs and the latest Mellanox Connect-IB InfiniBand HCA with a couple of free, open-source MPI implementations.

 

GPUDirect RDMA is part of the Mellanox PeerDirect(TM) offering, which gives the HCA direct read/write access to peer device memory buffers. This enables RDMA-based applications to use the computing power of the peer device over the RDMA interconnect without copying data back and forth through host memory.

 

1. Prerequisites

 

The following components are required to run GPUDirect RDMA:

 

Hardware:

- NVIDIA Kepler-class GPUs
- Mellanox Connect-IB InfiniBand HCA

Software:

- Mellanox OFED (MLNX_OFED) driver stack
- NVIDIA CUDA toolkit and driver (CUDA 5.5 is used in the examples below)
- The GPUDirect RDMA kernel module (nv_peer_mem)
- A GPUDirect-RDMA-capable MPI, such as MVAPICH2-GDR 2.0b or Open MPI 1.7.4
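
Once these are installed, a quick way to confirm that the software stack is in place is to query each component from the command line. This is only an illustrative check, and the exact versions reported will depend on your installation:

ofed_info -s                  # Mellanox OFED version
nvidia-smi                    # GPU driver and visible GPUs
nvcc --version                # CUDA toolkit version
lsmod | grep nv_peer_mem      # GPUDirect RDMA kernel module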

 

2. Some Important Notes

 

Once the hardware and software components are installed, it is important to check that the GPUDirect RDMA kernel module is properly loaded on each of the compute nodes where you plan to run jobs that require the GPUDirect RDMA feature. To check:

service nv_peer_mem status

Or for some other flavors of Linux:

lsmod | grep nv_peer_mem

 

Usually this kernel module is set to load by default by the system startup service. If it is not loaded, GPUDirect RDMA will not work, which shows up as much higher latency for message communications.

 

You can start the module with either:

service nv_peer_mem start

Or for some other flavors of Linux:

modprobe nv_peer_mem
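
To make sure the module is loaded again after a reboot, you can also enable it at boot time. The line below is a sketch assuming a RHEL-style init system where nv_peer_mem is installed as an init script; adjust for your distribution:

chkconfig nv_peer_mem on      # load nv_peer_mem automatically at startup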

 

It is also important to note that to achieve the best performance with GPUDirect RDMA, both the HCA and the GPU must physically sit on the same PCIe I/O root complex. To find out about the system architecture, either consult the system manual or use a command such as "lspci -tv" to confirm that this is the case.
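
For example, one quick way to inspect the PCIe topology from the command line is sketched below; the device names are simply what to search for, and the bus layout will differ on your system:

# Locate the GPU and the HCA, then walk the PCIe tree to see whether
# they hang off the same root complex:
lspci | grep -i nvidia
lspci | grep -i mellanox
lspci -tv | less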

 

3. CUDA-Enabled Tests

 

One way to try GPUDirect RDMA is to run the micro-benchmarks from Ohio State University (OSU). OSU Micro-Benchmarks 4.2 is a CUDA-enabled benchmark suite that can be downloaded from the OSU benchmarks page:

Benchmarks | Network-Based Computing Laboratory

 

When building the OSU benchmarks, be sure the proper flags are set to enable the CUDA part of the tests; otherwise the tests will run using host memory only, which is the default.

 

./configure CC=/path/to/mpicc \
    --enable-cuda \
    --with-cuda-include=/path/to/cuda/include \
    --with-cuda-libpath=/path/to/cuda/lib
make
make install
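
After the build, a quick run over host memory (the default when no buffer type is given) provides a baseline to compare against the device-to-device numbers shown later. The hostnames and paths below are placeholders:

# Hypothetical host-memory baseline; adjust hosts, launcher, and install path:
/path/to/mpirun -np 2 -host node1,node2 \
    /path/to/osu-micro-benchmarks-4.2/mpi/pt2pt/osu_bw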

 

4. Running with MVAPICH2-GDR 2.0b

 

Earlier last week, the MVAPICH team at OSU released a version of MVAPICH2 that takes advantage of the new GPUDirect RDMA technology for inter-node data movement on NVIDIA GPU clusters with Mellanox InfiniBand interconnects. The new version is called MVAPICH2-GDR 2.0b and can be downloaded from this URL:

MVAPICH2-GDR | Download | Network-Based Computing Laboratory

 

Below is an example of running one of the OSU benchmarks with GPUDirect RDMA enabled.

 

[gdr@ops001 ~]$ mpirun_rsh -np 2 ops001 ops002 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=1 /home/gdr/osu-micro-benchmarks-4.2-mvapich2/mpi/pt2pt/osu_bw -d cuda D D

# OSU MPI-CUDA Bandwidth Test v4.2

# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)

# Size      Bandwidth (MB/s)

...

2097152              6372.60

4194304              6388.63

 

MV2_GPUDIRECT_LIMIT is a tunable parameter that controls the message size up to which the GPUDirect RDMA path is used; larger messages are staged through host memory.
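
For example, the limit can be set per run on the mpirun_rsh command line, just like the other MV2_* variables. The value below (16 KB) is only illustrative; consult the MVAPICH2-GDR README for the default and recommended settings:

mpirun_rsh -np 2 ops001 ops002 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=1 \
    MV2_GPUDIRECT_LIMIT=16384 \
    /home/gdr/osu-micro-benchmarks-4.2-mvapich2/mpi/pt2pt/osu_bw -d cuda D D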

 

Here is a list of runtime parameters that can be used for process-to-rail binding in case the system has a multi-rail configuration (a combined launch example follows the list):

 

export MV2_USE_CUDA=1

export MV2_USE_GPUDIRECT=1

export MV2_RAIL_SHARING_POLICY=FIXED_MAPPING

export MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1

export MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=1G

export MV2_CPU_BINDING_LEVEL=SOCKET

export MV2_CPU_BINDING_POLICY=SCATTER
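
As a sketch of how these settings come together on a dual-rail node, here is a hypothetical two-processes-per-node launch that passes them on the mpirun_rsh command line instead of exporting them; the hostnames reuse the earlier example and the application path is a placeholder:

# Local rank 0 is mapped to mlx5_0 and local rank 1 to mlx5_1 on each node:
mpirun_rsh -np 4 ops001 ops001 ops002 ops002 \
    MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=1 \
    MV2_RAIL_SHARING_POLICY=FIXED_MAPPING \
    MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1 \
    MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=1G \
    MV2_CPU_BINDING_LEVEL=SOCKET MV2_CPU_BINDING_POLICY=SCATTER \
    /path/to/your_cuda_app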

 

Additional tuning parameters related to CUDA and GPUDirect RDMA (such as MV2_CUDA_BLOCK_SIZE) can be found in the README installed on the node:

 

/opt/mvapich2/gdr/2.0/gnu/share/doc/mvapich2-gdr-gnu-2.0/README-GDR

 

5. Running with Open MPI 1.7.4

 

GPUDirect RDMA support is available in Open MPI starting with 1.7.4rc1. Unlike MVAPICH2-GDR, which is distributed as an RPM, Open MPI is downloaded as source code and compiled with the flags below to enable GPUDirect RDMA support:

 

[co-mell1@login-sand8 ~]$ ../configure --prefix=/path/to/openmpi-1.7.4rc1/install \
    --with-wrapper-ldflags=-Wl,-rpath,/lib --disable-vt --enable-orterun-prefix-by-default --disable-io-romio --enable-picky \
    --with-cuda=/usr/local/cuda-5.5 \
    --with-cuda-include=/usr/local/cuda-5.5/include \
    --with-cuda-libpath=/usr/local/cuda-5.5/lib64
[co-mell1@login-sand8 ~]$ make; make install
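
After the installation, one way to confirm that the build picked up CUDA support is to query ompi_info, as described in the Open MPI FAQ; the parameter name below is the one documented for the 1.7 series:

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value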

 

To run an Open MPI job with the flag that enables GPUDirect RDMA:

 

[gdr@jupiter001 ~]$ mpirun -host jupiter001,jupiter002 -mca btl_openib_want_cuda_gdr 1 -np 2 -npernode 1 -x LD_LIBRARY_PATH -mca btl_openib_if_include mlx5_0:1 -bind-to-core -report-bindings -mca coll_fca_enable 0 -x CUDA_VISIBLE_DEVICES=0 /home/co-mell1/scratch/osu-micro-benchmarks-4.2/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D

# OSU MPI-CUDA Latency Test v4.2

# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)

# Size          Latency (us)

0                       1.08

1                       3.83

2                       3.83

4                       3.84

8                       3.83

16                      3.83

32                      3.82

64                      3.80

...

 

Note that if the GPUDirect RDMA flag is not enabled, the small-message latencies above would be much higher.

 

Also note that an additional tunable for adjusting the message size limit is described in the Open MPI FAQ: Running MPI jobs. Look for the section regarding MPI CUDA support.

 

By default in Open MPI 1.7.4, GPUDirect RDMA is used for message sizes between 0 and 30 KB. For messages above that limit, the transfer switches to asynchronous copies through host memory instead. Adjusting that limit can sometimes improve application performance. Here is an example of raising the switchover point so that messages up to 64 KB use GPUDirect RDMA:

 

-mca btl_openib_cuda_rdma_limit 65537
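
Putting it together, the limit can simply be added to the earlier command line. The sketch below reuses the hosts and benchmark path from the osu_latency example above:

mpirun -host jupiter001,jupiter002 -np 2 -npernode 1 -x LD_LIBRARY_PATH \
    -mca btl_openib_want_cuda_gdr 1 \
    -mca btl_openib_cuda_rdma_limit 65537 \
    -mca btl_openib_if_include mlx5_0:1 \
    /home/co-mell1/scratch/osu-micro-benchmarks-4.2/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D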