As a result of a co-development effort between NVIDIA and Mellanox Technologies, Mellanox supports GPUDirect technology, which eliminates CPU bandwidth and latency bottlenecks by using direct memory access (DMA) between GPUs and Mellanox HCAs, significantly improving RDMA-based applications such as MPI.
Note: This post is outdated. Refer to http://www.mellanox.com/page/products_dyn?product_family=116&mtag=gpudirect for updated info about GPUDirect.
The GPUDirect project - announced Nov 2009
- “NVIDIA Tesla GPUs To Communicate Faster Over Mellanox InfiniBand Networks”, http://www.nvidia.com/object/io_1258539409179.html
- GPUDirect - developed by Mellanox and NVIDIA
- New interface (API) within the Tesla GPU driver
- New interface within the Mellanox InfiniBand drivers
- Linux kernel modification to allow direct communication between drivers
GPUDirect 1.0 - announced Q2’10
- Accelerated Communication With Network And Storage Devices
- Avoid unnecessary system memory copies and CPU overhead by copying data directly to/from pinned CUDA host memory
- “Mellanox Scalable HPC Solutions with NVIDIA GPUDirect Technology Enhance GPU-Based HPC Performance and Efficiency”
- In its first stages, GPUDirect 1.0 was available as a separate Mellanox OFED GPUDirect package; with recent Linux kernels, the regular MLNX_OFED package is sufficient.
GPUDirect RDMA - Today
Allows the HCA to read and write GPU memory directly (zero-copy), bypassing host memory entirely. This feature requires CUDA 5.0 or later, as well as an MLNX_OFED package with the suitable hooks.
Frequently Asked Questions:
- Where can I have access to CUDA GPUDirect Peer-to-Peer (P2P) API?
You must use CUDA 5.0 or later; check the NVIDIA developer guide for more details.
- What cards are supported for GPUDirect RDMA?
HCA: ConnectX family; GPU: Kepler class
- What software components are GPUDirect-RDMA aware?
- OS: supported on Linux only; no changes required in the kernel.
- HCA Driver: you must use a compatible MLNX_OFED driver
- GPU Driver: use CUDA 5.0 or later
- RDMA Application:
- If you're using the RDMA verbs directly, then yes: the application must be aware of CUDA GPU allocations (for example, the MPI layer should be GPUDirect-RDMA-aware, as MVAPICH is). The changes basically involve allocating memory on the GPU (using cudaMalloc) and passing the resulting device virtual address to the HCA (using ibv_reg_mr)
- However, if you're using RDMA indirectly (for example, an MPI application running on top of MVAPICH2), then no changes are needed in the application itself; the user can simply use the normal MPI_Send/MPI_Recv functions (see this presentation for the MVAPICH2 changes).
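The direct-verbs path above can be sketched as follows. This is an illustrative fragment, not a complete program: it assumes a system with the GPUDirect RDMA kernel support loaded and an already-opened protection domain, and error handling is trimmed for brevity. The function names (cudaMalloc, ibv_reg_mr, ibv_dereg_mr, cudaFree) are the real CUDA runtime and libibverbs APIs; the helper name register_gpu_buffer is hypothetical.

```c
/* Sketch only: assumes GPUDirect RDMA support in the kernel and a
 * CUDA-capable device; will not run on a machine without both. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

int register_gpu_buffer(struct ibv_pd *pd, size_t len)
{
    void *gpu_buf = NULL;

    /* Allocate the buffer in GPU memory instead of host memory. */
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess)
        return -1;

    /* Pass the device virtual address to the HCA exactly as if it were
     * host memory; with GPUDirect RDMA the driver pins the GPU pages. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        cudaFree(gpu_buf);
        return -1;
    }

    /* mr->lkey / mr->rkey can now be placed in work requests, so the HCA
     * DMAs directly to/from GPU memory, bypassing the host entirely. */
    printf("registered %zu bytes of GPU memory, lkey=0x%x\n", len, mr->lkey);

    ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    return 0;
}
```

Apart from the allocation coming from cudaMalloc rather than malloc, the registration and work-request flow is identical to ordinary host-memory RDMA, which is why only the allocation layer of an application needs to change.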
- How should I install the CPU/GPU on my system to enable GPUDirect RDMA?
- From CUDA Toolkit Documentation webpage:
- We can distinguish between three situations:
- PCIe switches only
- single CPU/IOH
- CPU/IOH <-> QPI/HT <-> CPU/IOH
- The first situation, where there are only PCIe switches on the path, is optimal and yields the best performance. The second one, where a single CPU/IOH is involved, works, but yields worse performance. Finally, the third situation, where the path traverses a QPI/HT link, doesn't work reliably.
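One way to check which of these three situations applies on a given system is to inspect the PCIe topology. As a sketch, assuming a reasonably recent NVIDIA driver (the topo subcommand and its legend vary by driver version) and standard Linux tools:

```shell
# Show the PCIe/NUMA topology between GPUs and other devices.
# In the legend: PIX/PXB = connected through PCIe switches (best case),
# PHB = through a single CPU/host bridge (works, slower),
# SYS = path crosses the QPI/HT link between sockets (unreliable for
# GPUDirect RDMA).
nvidia-smi topo -m

# Alternatively, inspect the raw PCIe tree to see where the HCA and GPU sit.
lspci -tv | grep -i -E 'nvidia|mellanox'
```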
- Is there any documentation for GPUDirect RDMA?
Check NVIDIA CUDA Toolkit Documentation