Reference Deployment Guide for RDMA over Converged Ethernet (RoCE) Accelerated HPC or ML applications with GPUDirect RDMA on vSphere 6.7 (DRAFT)

Version 26

    This post provides a guide to installing and configuring ML environments with GPUDirect RDMA, Mellanox ConnectX®-4/5/6 VPI PCIe adapter cards, and Mellanox Spectrum switches running the Mellanox Onyx OS, with RoCE over a lossless network in DSCP-based QoS mode.

    This guide assumes VMware ESXi 6.7 Update 1 (native drivers) and Mellanox Onyx™ version 3.6.8190 or above.






    Mellanox’s Machine Learning

    Mellanox solutions accelerate many of the world’s leading artificial intelligence and machine learning platforms and a wide range of applications, ranging from security, finance, and image and voice recognition to self-driving cars and smart cities. Mellanox solutions enable companies and organizations such as Baidu, NVIDIA, Facebook, PayPal and more to leverage machine learning platforms to enhance their competitive advantage.

    In this post we will show how to build the most efficient machine learning cluster enhanced by RoCE over a 100GbE network.


    Device Partitioning (SR-IOV)

    The PCI standard includes a specification for Single Root I/O Virtualization (SR-IOV).

    A single PCI device can present as multiple logical devices (Virtual Functions or VFs) to ESX and to VMs.

    An ESXi driver and a guest driver are required for SR-IOV.

    Mellanox Technologies supports ESXi SR-IOV for both InfiniBand and RoCE interconnects.

    Please see How To Configure SR-IOV for Mellanox ConnectX® 4/5 adapter cards family on ESXi 6.5/6.7 Server (Native Ethernet) for more information.

    Downsides: no support for vMotion or snapshots.
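    As a host-side configuration sketch, VFs can be exposed through the Mellanox native ESXi driver; the module name `nmlx5_core` and the value of 8 VFs below are assumptions — follow the linked post for the procedure matching your adapter and driver version.

```shell
# Expose 8 virtual functions via the Mellanox native driver (hypothetical value).
esxcli system module parameters set -m nmlx5_core -p "max_vfs=8"

# Reboot the host, then confirm the NICs and VFs are visible.
esxcli network nic list
```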


    VM Direct Path I/O

    Allows PCI devices to be accessed directly by the guest OS.

    • Examples: GPUs for computation (GPGPU), ultra-low latency interconnects like InfiniBand and RoCE.

    The full device is made available to a single VM; there is no sharing.

    No ESXi driver required – just the standard vendor device driver.

    Please see How To Configure Nvidia GPU device into and from VMDirectPath I/O passthrough mode on VMware ESXi 6.x server for more information.

    Downsides: no support for vMotion or snapshots.


    Mellanox OFED GPUDirect RDMA

    GPUDirect RDMA is an API between IB CORE and peer memory clients, such as NVIDIA Tesla (Volta, Pascal) class GPUs. It gives the HCA access to read/write peer memory data buffers, allowing RDMA-based applications to use the peer device's computing power over the RDMA interconnect without copying data to host memory. It works seamlessly over RoCE with Mellanox ConnectX®-4 and later VPI adapters.

    The latest advancement in GPU-GPU communications is GPUDirect RDMA. This new technology provides a direct Peer-to-Peer (P2P) data path between the GPU memory and Mellanox HCA devices. This provides a significant decrease in GPU-GPU communication latency and completely offloads the CPU, removing it from all GPU-GPU communications across the network.
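    Inside the guest, a quick verification sketch can confirm the GPUDirect RDMA path; the module name `nv_peer_mem` and the CUDA-enabled `ib_write_bw` from the perftest package are assumptions that depend on your OFED and perftest builds.

```shell
# Confirm the peer-memory module from Mellanox OFED is loaded
# (module name may vary across OFED releases).
lsmod | grep nv_peer_mem

# Exercise GPU memory over RDMA with perftest (requires a CUDA-enabled build).
# Server side:
ib_write_bw -d mlx5_0 --use_cuda
# Client side, pointing at the server's address (placeholder shown):
ib_write_bw -d mlx5_0 --use_cuda <server-ip>
```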


    Hardware and Software Requirements

    1. A server platform with an adapter card based on one of Mellanox Technologies’ ConnectX®-4/5/6 HCA devices.

    2. A switch from the Mellanox Scale-Out SN2000 Ethernet Switch Series.

    3. VMware vSphere 6.7 u1 Cluster installed and configured.

    4. VMware vCenter 6.7 u1.

    5. For GPUs based on the Pascal or Volta architectures used in pass-through mode: the NVIDIA® driver.

    6. Installer privileges: the installation requires administrator privileges on the target machine.



    Setup Overview

    Before you start, make sure you are familiar with VMware vSphere and vCenter deployment and management procedures.

    This guide does not contain step-by-step instructions for performing all of the required standard vSphere and vCenter installation and configuration tasks because they often depend on customer requirements.

    Make sure you are aware of the Uber Horovod distributed training framework; see GitHub - uber/horovod: Distributed training framework for TensorFlow, Keras, and PyTorch for more information.

    In the distributed TensorFlow/Horovod configuration described in this guide, we are using the following hardware specification.



    Logical Design

    Bill of Materials (BOM)

    In the distributed TensorFlow/Horovod configuration described in this guide, we are using the following hardware specifications.



    Note: This document does not cover the servers' storage aspect. You should configure the servers with the storage components appropriate to your use case (data set size).



    Physical Network Connections

    vSphere Cluster Design


    Network Configuration

    In our reference deployment we use a single port per server. With a single-port NIC, we wire the available port; with a dual-port NIC, we wire the first port to the Ethernet switch and leave the second port unused.

    We will cover the procedure later in the Installing Mellanox OFED section.

    Each server is connected to the SN2700 switch by a 100GbE copper cable.

    The switch port connectivity in our case is as follows:


    • Ports 1-8 – connected to the ESXi servers


    Server names and network configuration are provided in the following table.


    Server Type | Server Name | Internal Network -- 100 GigE | Management Network -- 1 GigE
    Node 01     |             |                              | eno0: From DHCP (reserved)
    Node 02     |             |                              | eno0: From DHCP (reserved)
    Node 03     |             |                              | eno0: From DHCP (reserved)
    Node 04     |             |                              | eno0: From DHCP (reserved)
    Node 05     |             |                              | eno0: From DHCP (reserved)
    Node 06     | clx-mld-46  | enp1f0: From DHCP (reserved) |
    Node 07     | clx-mld-47  | enp1f0: From DHCP (reserved) |
    Node 08     | clx-mld-48  | enp1f0: From DHCP (reserved) |


    Network Switch Configuration


    Note: If you are not familiar with Mellanox switch software, please review the HowTo Get Started with Mellanox Switches guide beforehand and upgrade your switch OS to the latest available version. For more information, refer to the Mellanox Onyx User Manual, available on the Mellanox website under Products -> Switch Software -> Mellanox Onyx.


    We will accelerate the distributed training workload by using RDMA transport.
    There are several industry standard network configurations for RoCE deployment.

    You are welcome to follow the Recommended Network Configuration Examples for RoCE Deployment guide for our recommendations and instructions.

    In our deployment, we will configure our network to be lossless and will use DSCP on the host-side and the switch-side:
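    On the host side, trusting DSCP markings can be sketched with the `mlnx_qos` utility shipped with Mellanox OFED; the interface name `enp1f0` below is an assumption — substitute your 100GbE interface.

```shell
# Trust L3 (DSCP) markings on the Mellanox interface instead of L2 (PCP).
mlnx_qos -i enp1f0 --trust dscp
```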


    Below is the switch configuration you can use as a reference. You can copy and paste it to your switch, but be aware that this is a clean switch configuration and may corrupt your existing configuration if one exists.


    switch [standalone: master] > enable

    switch [standalone: master] # configure terminal

    switch [standalone: master] (config) # show running-config


    ## Running database "initial"

    ## Generated at 2018/03/10 09:38:38 +0000

    ## Hostname: swx-mld-1-2




    ## Running-config temporary prefix mode setting


    no cli default prefix-modes enable



    ## License keys


    license install LK2-RESTRICTED_CMDS_GEN2-44T1-4H83-RWA5-G423-GY7U-8A60-E0AH-ABCD



    ## Interface Ethernet buffer configuration


    traffic pool roce type lossless

    traffic pool roce memory percent 50.00

    traffic pool roce map switch-priority 3



    ## LLDP configuration





    ## QoS switch configuration


    interface ethernet 1/1-1/32 qos trust L3

    interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500



    ## DCBX ETS configuration


    interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict




    ## Other IP configuration


    hostname swx-mld-1-2



    ## AAA remote server configuration


    # ldap bind-password ********

    # radius-server key ********

    # tacacs-server key ********



    ## Network management configuration


    # web proxy auth basic password ********



    ## X.509 certificates configuration



    # Certificate name system-self-signed, ID 108bb9eb3e99edff47fc86e71cba530b6a6b8991

    # (public-cert config omitted since private-key config is hidden)



    ## Persistent prefix mode setting


    cli default prefix-modes enable


    Environment Preparation


    1. Host BIOS Configuration

    • Enable “above 4G decoding” (also called “memory mapped I/O above 4GB” or “PCI 64-bit resource handling above 4G”) in the host's BIOS
    • Make sure that SR-IOV is enabled
    • Make sure that "Intel Virtualization Technology" is enabled

    2. ESXi Host Software Configuration



    The ConnectX driver installation procedure on an ESXi host is explained here.


    3. VM Template Preparation



    3.1. Configuring EFI Boot Mode

    Before installing the guest OS onto the VM, ensure that “EFI” is selected in the Firmware area.

    For correct GPU use, a guest OS within the virtual machine must boot in "EFI" mode.

    To access the setting for this:

    1. Right-click the Virtual Machine and click Edit Settings.
    2. Click VM Options.
    3. Click Boot Options.
    4. Select EFI in the Firmware area.
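    Equivalently, the same setting can be applied directly in the VM's VMX file with a one-line configuration fragment (edit the file only while the VM is powered off):

```
firmware = "efi"
```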

    3.2. Installing the Guest Operating System in the VM

    Install Ubuntu 16.04 as the guest OS in the virtual machine.

    3.3. Install the Nvidia Driver in the VM

    The standard vendor GPU driver must also be installed within the guest OS.

    3.4. Configure SR-IOV for Mellanox ConnectX® 5 adapter card and Add a Network Adapter to the VM in SR-IOV Mode.

    This post describes how to configure the Mellanox ConnectX driver with an SR-IOV (Ethernet) for ESXi 6.7 Native driver and add the network adapter to the VM in SR-IOV mode.


    3.5. Configure Nvidia GPU device into VMDirectPath I/O passthrough mode and Assign a GPU Device to the VM.

    This post describes how to configure the Nvidia GPU device into and from VMDirectPath I/O pass-through mode on VMware ESXi 6.x server and assign the GPU device to the VM.


    3.6. Adjusting the Memory Mapped I/O Settings for the VM.

    With the above requirements satisfied, two entries must be added to the VM’s VMX file, either by modifying the file directly or by using the vSphere client to add these capabilities. The first entry is:



    The second entry requires a simple calculation. Sum the GPU memory sizes of all GPU devices you intend to pass into the VM, then round up to the next power of two. For example, to use pass-through with two 16GB P100 devices: 16+16=32, which rounds up to the next power of two to yield 64. Use this value in the second entry:
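    As a sketch, the calculation and the resulting two VMX entries can be expressed as below; the `pciPassthru.*` key names follow VMware's pass-through guidance and should be verified against your vSphere documentation before use.

```shell
# Derive the 64-bit MMIO size for two 16 GB P100 GPUs and print the two
# VMX entries described above.
total=$((16 + 16))             # sum of all passed-through GPU memory, in GB
size=1
while [ "$size" -le "$total" ]; do
    size=$((size * 2))         # round up to the next power of two (32 -> 64)
done
echo 'pciPassthru.use64bitMMIO = "TRUE"'
echo "pciPassthru.64bitMMIOSizeGB = \"$size\""
```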


    With these two changes to the VMX file, follow the vSphere instructions for enabling pass-through devices at the host-level and for specifying which devices should be passed into your VM. The VM should now boot correctly with your device(s) in pass-through mode.


    3.7. Install Mellanox OFED into Virtual Machine Template.


    This post describes how to install Mellanox OFED on Linux.


    Done!


    Deploy and Run a Horovod Framework (Optional)



    Docker Installation and Configuration into VM Template

    Uninstall old versions

    To uninstall old versions, run the following command:

    $ sudo apt-get remove docker docker-engine


    It’s OK if apt-get reports that none of these packages are installed.

    The contents of /var/lib/docker/, including images, containers, volumes, and networks, are preserved.


    Install Docker CE

    For Ubuntu 16.04 and higher, the Linux kernel includes support for OverlayFS, and Docker CE will use the overlay2 storage driver by default.


    Install using the repository

    Before you install Docker CE for the first time on a new host machine, you need to set up the Docker repository. Afterward, you can install and update Docker from the repository.


    Set Up the Repository

    Update the apt package index:

    $ sudo apt-get update


    Install packages to allow apt to use a repository over HTTPS:

    $ sudo apt-get install apt-transport-https ca-certificates curl software-properties-common


    Add Docker’s official GPG key:

    $ sudo curl -fsSL | sudo apt-key add -

    Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88.

    $ sudo apt-key fingerprint 0EBFCD88
    pub 4096R/0EBFCD88 2017-02-22
    Key fingerprint = 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88

    uid Docker Release (CE deb) <>
    sub 4096R/F273FCD8 2017-02-22


    Install Docker CE

    Install the latest version of Docker CE, or go to the next step to install a specific version. Any existing installation of Docker is replaced.

    $ sudo add-apt-repository "deb [arch=amd64] $(lsb_release -cs) stable"
    $ sudo apt-get update
    $ sudo apt-get install docker-ce


    Customize the docker0 Bridge

    The recommended way to configure the Docker daemon is to use the daemon.json file, which is located in /etc/docker/ on Linux. If the file does not exist, create it.

    You can specify one or more of the following settings to configure the default bridge network.

    "bip": "",
    "fixed-cidr": "",
    "mtu": 1500,
    "dns": ["",""]

    The same options are presented as flags to dockerd, with an explanation for each:

    • --bip=CIDR: Supply a specific IP address and netmask for the docker0 bridge, using standard CIDR notation. For example:

    • --fixed-cidr=CIDR: Restrict the IP range from the docker0 subnet, using standard CIDR notation. For example:

    • --mtu=BYTES: Override the maximum packet length on docker0. For example: 1500.

    • --dns=[]: The DNS servers to use. For example: --dns=,
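    For illustration, a complete daemon.json might look like the fragment below; every address shown is a hypothetical placeholder, not a recommendation for your network.

```
{
  "bip": "192.168.100.1/24",
  "fixed-cidr": "192.168.100.0/25",
  "mtu": 1500,
  "dns": ["8.8.8.8", "8.8.4.4"]
}
```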


    Restart Docker after making changes to the daemon.json file.

    $ sudo /etc/init.d/docker restart


    Set Up Communication with the Outside World

    Check that IP forwarding is enabled in the kernel:

    $ sysctl net.ipv4.conf.all.forwarding

    net.ipv4.conf.all.forwarding = 1

    If disabled:

    net.ipv4.conf.all.forwarding = 0

    If it is disabled, enable it and check again:

    $ sysctl net.ipv4.conf.all.forwarding=1


    For security reasons, Docker configures the iptables rules to prevent traffic forwarding to containers from outside the host machine. Docker sets the default policy of the FORWARD chain to DROP.

    To override this default behavior you can manually change the default policy:

    $ sudo iptables -P FORWARD ACCEPT


    Add IP Route with Specific Subnet

    On each host, you must add routes to the container subnets on the other hosts. See this example of adding routes on host-41:

    host-41$ sudo ip route add via
    host-41$ sudo ip route add via
    host-41$ sudo ip route add via

    A Quick Check of Each Host

    Give your environment a quick test by spawning a simple container:

    $ docker run hello-world


    Nvidia-docker Deploy into VM Template

    To deploy nvidia-docker on Ubuntu 16.04 please follow these steps:

    1. If you have nvidia-docker 1.0 installed, you need to remove it and all existing GPU containers.

    host-41$ docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
    host-41$ sudo apt-get purge -y nvidia-docker


    2. Add the package repositories.

    host-41$ curl -s -L | sudo apt-key add -

    host-41$ distribution=$(. /etc/os-release; echo $ID$VERSION_ID)

    host-41$ curl -s -L$distribution/nvidia-docker.list | \
      sudo tee /etc/apt/sources.list.d/nvidia-docker.list

    host-41$ sudo apt-get update


    3. Install nvidia-docker2 and reload the Docker daemon configuration.

    host-41$ sudo apt-get install -y nvidia-docker2

    host-41$ sudo pkill -SIGHUP dockerd


    4. Test nvidia-smi with the latest official CUDA image.

    host-41$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi



    Horovod Deploy into VM Template


    This procedure explains how to build and run the Horovod framework in a Docker container.

    1. Install additional packages:
      host-41$ sudo apt install libibverbs-dev
      host-41$ sudo apt install libmlx5-dev
    2. Install Mellanox OFED according to this post.
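    As an illustration, a launch across the eight nodes might look like the sketch below; the host list, slots per host, NCCL setting, and benchmark script are all assumptions for illustration, not part of the original procedure.

```shell
# Hypothetical launch: one training process per host across eight hosts.
# Host names mirror the clx-mld-4x convention used elsewhere in this guide.
mpirun -np 8 \
    -H clx-mld-41:1,clx-mld-42:1,clx-mld-43:1,clx-mld-44:1,clx-mld-45:1,clx-mld-46:1,clx-mld-47:1,clx-mld-48:1 \
    -bind-to none -map-by slot \
    -x NCCL_SOCKET_IFNAME=enp1f0 \
    python tf_cnn_benchmarks.py --model=vgg16 --variable_update=horovod
```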

    Horovod VGG 16 Benchmark Results

    The Horovod benchmark was run according to this guide.