Reference Deployment Guide for RDMA over Converged Ethernet (RoCE) Accelerated HPC or ML applications with GPUDirect RDMA on vSphere 6.7 (DRAFT)

Version 26

    This post provides a guide on how to install and configure an ML environment with GPUDirect RDMA, Mellanox ConnectX®-4/5/6 VPI PCIe adapter cards, and Mellanox Spectrum switches running the Mellanox Onyx OS, with RoCE running over a lossless network in DSCP-based QoS mode.

    This guide assumes VMware ESXi 6.7 Update 1 with the native driver and Mellanox Onyx™ version 3.6.8190 or above.

     

     


     

    Overview

    Mellanox’s Machine Learning

    Mellanox solutions accelerate many of the world’s leading artificial intelligence and machine learning platforms and a wide range of applications, ranging from security, finance, and image and voice recognition to self-driving cars and smart cities. Mellanox solutions enable companies and organizations such as Baidu, NVIDIA, JD.com, Facebook, PayPal and more to leverage machine learning platforms to enhance their competitive advantage.

    In this post we will show how to build the most efficient machine learning cluster enhanced by RoCE over a 100GbE network.

     

    Device Partitioning (SR-IOV)

    The PCI standard includes a specification for Single Root I/O Virtualization (SR-IOV).

    A single PCI device can present as multiple logical devices (Virtual Functions or VFs) to ESX and to VMs.

    An ESXi driver and a guest driver are required for SR-IOV.

    Mellanox Technologies supports ESXi SR-IOV for both InfiniBand and RoCE interconnects.

    Please see How To Configure SR-IOV for Mellanox ConnectX® 4/5 adapter cards family on ESXi 6.5/6.7 Server (Native Ethernet) for more information.

    Downsides: No vMotion and Snapshots.
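     

    As a quick sanity check once SR-IOV is configured (the full procedure is in the post linked above), the VF appears inside the guest as a regular Mellanox PCI function. This is a minimal sketch; the exact device string in the output varies by adapter model:

    $ lspci | grep -i mellanox

    You should see an entry marked as a Virtual Function.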

     

    VM Direct Path I/O

    Allows PCI devices to be accessed directly by the guest OS.

    • Examples: GPUs for computation (GPGPU), ultra-low latency interconnects like InfiniBand and RoCE.

    Full device is made available to a single VM – no sharing.

    No ESXi driver required – just the standard vendor device driver.

    Please see How To Configure Nvidia GPU device into and from VMDirectPath I/O passthrough mode on VMware ESXi 6.x server for more information.

    Downsides: No vMotion and Snapshots.
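     

    To see which devices are candidates for pass-through, you can list the PCI devices from the ESXi shell. This is a minimal sketch; the grep patterns assume an NVIDIA GPU and a Mellanox HCA are installed in the host:

    $ esxcli hardware pci list | grep -i nvidia
    $ esxcli hardware pci list | grep -i mellanox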

     

    Mellanox OFED GPUDirect RDMA

    GPUDirect RDMA is an API between IB CORE and peer memory clients, such as NVIDIA Tesla (Volta, Pascal) class GPUs. It gives the HCA access to read/write peer memory data buffers; as a result, it allows RDMA-based applications to use the peer device's computing power over the RDMA interconnect without the need to copy data to host memory. It works seamlessly using RoCE technology with the Mellanox ConnectX®-4 and later VPI adapters.

    The latest advancement in GPU-GPU communications is GPUDirect RDMA. This new technology provides a direct Peer-to-Peer (P2P) data path between the GPU memory and Mellanox HCA devices. This provides a significant decrease in GPU-GPU communication latency and completely offloads the CPU, removing it from all GPU-GPU communications across the network.
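     

    Inside the VM, once MLNX_OFED and the NVIDIA driver are installed (see the VM template preparation steps below), GPUDirect RDMA is typically enabled by the nvidia-peer-memory (nv_peer_mem) kernel module. A minimal check, assuming that package is installed:

    $ lsmod | grep nv_peer_mem
    $ sudo service nv_peer_mem status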

     

    Hardware and Software Requirements

    1. A server platform with an adapter card based on one of the Mellanox Technologies ConnectX®-4/5/6 HCA devices.

    2. A switch from the Mellanox Scale-Out SN2000 Ethernet Switch Series.

    3. VMware vSphere 6.7 u1 Cluster installed and configured.

    4. VMware vCenter 6.7 u1.

    5. For GPU use in pass-through mode: GPUs based on the Pascal or Volta architecture.

    6. NVIDIA® Driver.

    7. Installer Privileges: The installation requires administrator privileges on the target machine.

     

     

    Setup Overview

    Before you start, make sure you are familiar with VMware vSphere and vCenter deployment and management procedures.

    This guide does not contain step-by-step instructions for performing all of the required standard vSphere and vCenter installation and configuration tasks because they often depend on customer requirements.

    Make sure you are familiar with the Uber Horovod distributed training framework; see GitHub - uber/horovod: Distributed training framework for TensorFlow, Keras, and PyTorch for more information.

    In the distributed TensorFlow/Horovod configuration described in this guide, we are using the following hardware specification.

     

    Equipment

    Logical Design

    Bill of Materials (BOM)

    In the distributed TensorFlow/Horovod configuration described in this guide, we use the following hardware specifications.

     

     

    Note: This document does not cover the servers' storage aspects. Configure the servers with storage components appropriate to your use case (data set size).

     

     

    Physical Network Connections

    vSphere Cluster Design

     

    Network Configuration

    In our reference deployment we use a single port per server. With a single-port NIC, we wire that port. With a dual-port NIC, we wire the first port to the Ethernet switch and leave the second port unused.

    We will cover the procedure later in the Installing Mellanox OFED section.

    Each server is connected to the SN2700 switch by a 100GbE copper cable.

    The switch port connectivity in our case is as follows:

     

    • Ports 1-8 – connected to the ESXi servers

     

    Server names and network configuration are provided in the following table.

     

    Server Type | Server Name | Internal Network (100 GigE) | Management Network (1 GigE)
    Node 01     | clx-mld-41  | enp1f0: 31.31.31.41         | eno0: From DHCP (reserved)
    Node 02     | clx-mld-42  | enp1f0: 31.31.31.42         | eno0: From DHCP (reserved)
    Node 03     | clx-mld-43  | enp1f0: 31.31.31.43         | eno0: From DHCP (reserved)
    Node 04     | clx-mld-44  | enp1f0: 31.31.31.44         | eno0: From DHCP (reserved)
    Node 05     | clx-mld-45  | enp1f0: 31.31.31.45         | eno0: From DHCP (reserved)
    Node 06     | clx-mld-46  | enp1f0: 31.31.31.46         | eno0: From DHCP (reserved)
    Node 07     | clx-mld-47  | enp1f0: 31.31.31.47         | eno0: From DHCP (reserved)
    Node 08     | clx-mld-48  | enp1f0: 31.31.31.48         | eno0: From DHCP (reserved)

     

    Network Switch Configuration

     

    Note: If you are not familiar with Mellanox switch software, please review the HowTo Get Started with Mellanox Switches guide beforehand in order to upgrade your switch OS to the latest version available. For more information please refer to the Mellanox Onyx User Manual located at support.mellanox.com or www.mellanox.com -> Products -> Switch Software -> Mellanox Onyx.

     

    We will accelerate the distributed training workload by using RDMA transport.
    There are several industry-standard network configurations for RoCE deployment.

    You are welcome to follow the Recommended Network Configuration Examples for RoCE Deployment guide for our recommendations and instructions.

    In our deployment, we will configure the network to be lossless and will use DSCP on the host side and the switch side.

     

    Below is the switch configuration you can use as a reference. You can copy/paste it to your switch, but be aware that this is a clean switch configuration and it may corrupt your existing configuration, if any exists.

     

    switch [standalone: master] > enable

    switch [standalone: master] # configure terminal

    switch [standalone: master] (config) # show running-config

    ##

    ## Running database "initial"

    ## Generated at 2018/03/10 09:38:38 +0000

    ## Hostname: swx-mld-1-2

    ##

     

    ##

    ## Running-config temporary prefix mode setting

    ##

    no cli default prefix-modes enable

     

    ##

    ## License keys

    ##

    license install LK2-RESTRICTED_CMDS_GEN2-44T1-4H83-RWA5-G423-GY7U-8A60-E0AH-ABCD

     

    ##

    ## Interface Ethernet buffer configuration

    ##

    traffic pool roce type lossless

    traffic pool roce memory percent 50.00

    traffic pool roce map switch-priority 3

     

    ##

    ## LLDP configuration

    ##

    lldp

     

    ##

    ## QoS switch configuration

    ##

    interface ethernet 1/1-1/32 qos trust L3

    interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

     

    ##

    ## DCBX ETS configuration

    ##

    interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict

     

     

    ##

    ## Other IP configuration

    ##

    hostname swx-mld-1-2

     

    ##

    ## AAA remote server configuration

    ##

    # ldap bind-password ********

    # radius-server key ********

    # tacacs-server key ********

     

    ##

    ## Network management configuration

    ##

    # web proxy auth basic password ********

     

    ##

    ## X.509 certificates configuration

    ##

    #

    # Certificate name system-self-signed, ID 108bb9eb3e99edff47fc86e71cba530b6a6b8991

    # (public-cert config omitted since private-key config is hidden)

     

    ##

    ## Persistent prefix mode setting

    ##

    cli default prefix-modes enable

     

    Environment Preparation

     

    1. Host BIOS Configuration

    • Enable “above 4G decoding”, “memory mapped I/O above 4GB”, or “PCI 64-bit resource handling above 4G” (the option name varies by vendor) in the host BIOS
    • Make sure that SR-IOV is enabled
    • Make sure that "Intel Virtualization Technology" is enabled

    2. ESXi Host Software Configuration

     

     

    The ConnectX driver installation procedure on the ESXi host is explained here.
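     

    As a minimal sketch (the offline bundle file name below is a placeholder for the version you download from Mellanox), the native driver is installed from the ESXi shell with esxcli and requires a host reboot:

    $ esxcli software vib install -d /tmp/MLNX-NATIVE-ESX-ConnectX-4-5_<version>.zip
    $ reboot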

     

    3. VM Template Preparation

     

     

    3.1. Configuring EFI Boot Mode

    Before installing the guest OS in the VM, ensure that “EFI” is selected in the Firmware area.

    For correct GPU use, a guest OS within the virtual machine must boot in "EFI" mode.

    To access the setting for this:

    1. Right-click the Virtual Machine and click Edit Settings.
    2. Click VM Options.
    3. Click Boot Options.
    4. Select EFI in the Firmware area.
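
    For reference, selecting EFI in the UI corresponds to the following entry in the VM's VMX file (shown here only as a way to verify the setting):

    firmware = "efi"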

    3.2. Installing the Guest Operating System in the VM

    Install Ubuntu 16.04 as the guest OS in the virtual machine.

    3.3. Install Nvidia driver in the VM

    The standard vendor GPU driver must also be installed within the guest OS.
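
    A minimal sketch of installing the driver from the NVIDIA .run installer inside the Ubuntu 16.04 guest (the installer file name is a placeholder; you may prefer your distribution's packaged driver instead):

    $ sudo apt-get install -y gcc make
    $ chmod +x NVIDIA-Linux-x86_64-<version>.run
    $ sudo ./NVIDIA-Linux-x86_64-<version>.run --silent
    $ nvidia-smi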

    3.4. Configure SR-IOV for Mellanox ConnectX® 5 adapter card and Add a Network Adapter to the VM in SR-IOV Mode.

    This post describes how to configure SR-IOV (Ethernet) with the Mellanox ConnectX native driver on ESXi 6.7 and how to add the network adapter to the VM in SR-IOV mode.
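
    For reference, a minimal host-side sketch based on the linked post (the module name and VF count are examples; adjust them to your adapter and needs), followed by a host reboot:

    $ esxcli system module parameters set -m nmlx5_core -p "max_vfs=8"
    $ reboot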

     

    3.5. Configure Nvidia GPU device into VMDirectPath I/O passthrough mode and Assign a GPU Device to the VM.

    This post describes how to configure the Nvidia GPU device into and from VMDirectPath I/O pass-through mode on VMware ESXi 6.x server and assign the GPU device to the VM.
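
    After the GPU is placed in pass-through mode on the host and assigned to the VM, a quick check inside the guest confirms the device is visible (the output varies by GPU model):

    $ lspci | grep -i nvidia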

     

    3.6. Adjusting the Memory Mapped I/O Settings for the VM.

    With the above requirements satisfied, two entries must be added to the VM’s VMX file, either by modifying the file directly or by using the vSphere client to add these capabilities. The first entry is:

    pciPassthru.use64bitMMIO="TRUE"

     

    The second entry requires a simple calculation. Sum the GPU memory sizes of all GPU devices you intend to pass into the VM and then round up to the next power of two. For example, to use pass-through with two 16 GB P100 devices, the value would be 16 + 16 = 32, rounded up to the next power of two to yield 64. Use this value in the second entry:

    pciPassthru.64bitMMIOSizeGB="64"

    With these two changes to the VMX file, follow the vSphere instructions for enabling pass-through devices at the host-level and for specifying which devices should be passed into your VM. The VM should now boot correctly with your device(s) in pass-through mode.
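
    As another worked example of the same calculation (a hypothetical configuration, not the one used in this guide): three 16 GB GPUs give 3 x 16 = 48 GB, which rounds up to the next power of two, 64, so the entries would be:

    pciPassthru.use64bitMMIO="TRUE"
    pciPassthru.64bitMMIOSizeGB="64"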

     

    3.7. Install Mellanox OFED into Virtual Machine Template.

     

    This post describes how to install Mellanox OFED on Linux.
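
    A minimal sketch of the installation inside the Ubuntu 16.04 guest, assuming the MLNX_OFED tarball for Ubuntu 16.04 has already been downloaded from the Mellanox site (the version string is a placeholder):

    $ tar -xzf MLNX_OFED_LINUX-<version>-ubuntu16.04-x86_64.tgz
    $ cd MLNX_OFED_LINUX-<version>-ubuntu16.04-x86_64
    $ sudo ./mlnxofedinstall
    $ sudo /etc/init.d/openibd restart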

     

    Done!

     

    Deploy and Run a Horovod Framework (Optional)

     

     

    Docker Installation and Configuration into VM Template

    Uninstall old versions

    To uninstall old versions, we recommend running the following command:

    $ sudo apt-get remove docker docker-engine docker.io

     

    It’s OK if apt-get reports that none of these packages are installed.

    The contents of /var/lib/docker/, including images, containers, volumes, and networks, are preserved.

     

    Install Docker CE

    For Ubuntu 16.04 and higher, the Linux kernel includes support for OverlayFS, and Docker CE will use the overlay2 storage driver by default.

     

    Install using the repository

    Before you install Docker CE for the first time on a new host machine, you need to set up the Docker repository. Afterward, you can install and update Docker from the repository.

     

    Set Up the Repository

    Update the apt package index:

    $ sudo apt-get update

     

    Install packages to allow apt to use a repository over HTTPS:

    $ sudo apt-get install apt-transport-https ca-certificates curl software-properties-common

     

    Add Docker’s official GPG key:

    $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

    Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88.

    $ sudo apt-key fingerprint 0EBFCD88

    pub   4096R/0EBFCD88 2017-02-22
          Key fingerprint = 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
    uid                  Docker Release (CE deb) <docker@docker.com>
    sub   4096R/F273FCD8 2017-02-22

     

    Install Docker CE

    Install the latest version of Docker CE, or go to the next step to install a specific version. Any existing installation of Docker is replaced.

    $ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
    $ sudo apt-get update
    $ sudo apt-get install docker-ce
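
    Once Docker CE is installed, you can confirm that the overlay2 storage driver mentioned above is in use:

    $ sudo docker info | grep -i "storage driver"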

     

    Customize the docker0 Bridge

    The recommended way to configure the Docker daemon is to use the daemon.json file, which is located in /etc/docker/ on Linux. If the file does not exist, create it.

    You can specify one or more of the following settings to configure the default bridge network.

    {
    "bip": "172.16.41.1/24",
    "fixed-cidr": "172.16.41.0/24",
    "mtu": 1500,
    "dns": ["8.8.8.8","8.8.4.4"]
    }
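
    The example above is for clx-mld-41. In this multi-host setup each node needs its own container subnet; we assume the third octet follows the host numbering (matching the routes added later in this guide), so a daemon.json for clx-mld-42 would look like this:

    {
    "bip": "172.16.42.1/24",
    "fixed-cidr": "172.16.42.0/24",
    "mtu": 1500,
    "dns": ["8.8.8.8","8.8.4.4"]
    }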

    The same options are presented as flags to dockerd, with an explanation for each:

    • --bip=CIDR: Supply a specific IP address and netmask for the docker0 bridge, using standard CIDR notation. For example: 172.16.41.1/24.

    • --fixed-cidr=CIDR: Restrict the IP range from the docker0 subnet, using standard CIDR notation. For example: 172.16.41.0/24.

    • --mtu=BYTES: Override the maximum packet length on docker0. For example: 1500.

    • --dns=[]: The DNS servers to use. For example: --dns=8.8.8.8,8.8.4.4.

     

    Restart Docker after making changes to the daemon.json file.

    $ sudo /etc/init.d/docker restart

     

    Set up communication with the outside world

    Check that IP forwarding is enabled in the kernel:

    $ sysctl net.ipv4.conf.all.forwarding

    net.ipv4.conf.all.forwarding = 1

    If disabled:

    net.ipv4.conf.all.forwarding = 0

    Enable it and check again:

    $ sysctl net.ipv4.conf.all.forwarding=1
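
    To make IP forwarding persistent across reboots (a minimal sketch using the default Ubuntu sysctl.conf location):

    $ echo "net.ipv4.conf.all.forwarding=1" | sudo tee -a /etc/sysctl.conf
    $ sudo sysctl -p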

     

    For security reasons, Docker configures the iptables rules to prevent traffic forwarding to containers from outside the host machine. Docker sets the default policy of the FORWARD chain to DROP.

    To override this default behavior you can manually change the default policy:

    $ sudo iptables -P FORWARD ACCEPT

     

    Add IP Route with Specific Subnet

    On each host, you must add routes to the container subnets of the other hosts. The example below adds routes on host-41 toward hosts 42-44; repeat the pattern for the remaining hosts:

    host-41$ sudo ip route add 172.16.42.0/24 via 31.31.31.42
    host-41$ sudo ip route add 172.16.43.0/24 via 31.31.31.43
    host-41$ sudo ip route add 172.16.44.0/24 via 31.31.31.44

    A Quick Check of Each Host

    Give your environment a quick test by spawning a simple container:

    $ docker run hello-world

     

    Nvidia-docker Deploy into VM Template

    To deploy nvidia-docker on Ubuntu 16.04 please follow these steps:

    1. If you have nvidia-docker 1.0 installed, you need to remove it and all existing GPU containers.

    host-41$ docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
    host-41$ sudo apt-get purge -y nvidia-docker

     

    2. Add the package repositories:

    host-41$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    host-41$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    host-41$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    host-41$ sudo apt-get update

     

    3. Install nvidia-docker2 and reload the Docker daemon configuration:

    host-41$ sudo apt-get install -y nvidia-docker2

    host-41$ sudo pkill -SIGHUP dockerd

     

    4. Test nvidia-smi with the latest official CUDA image:

    host-41$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

     

     

    Horovod Deploy into VM Template

     

    1. This procedure explains how to build and run the Horovod framework in a Docker container (an example launch command is sketched after this list).
    2. Install additional packages:
      host-41$ sudo apt install libibverbs-dev
      host-41$ sudo apt install libmlx5-dev
    3. Install Mellanox OFED according to this post.
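
    A minimal launch sketch for step 1, assuming a Horovod Docker image built or pulled per the linked procedure (the image tag and SSH volume path are placeholders). Host networking lets the container use the VM's RoCE interface directly; on the other nodes, the pattern from the Horovod documentation is to start an SSH daemon so mpirun can reach them:

    host-41$ sudo nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod/horovod:latest
    host-42$ sudo nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod/horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"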


    Horovod VGG 16 Benchmark Results

    The Horovod benchmark was run according to https://github.com/uber/horovod/blob/master/docs/running.md.
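
    For reference, a sketch of the kind of command used to run the VGG16 benchmark across the eight nodes. It assumes one GPU per VM, the tensorflow/benchmarks tf_cnn_benchmarks.py script available inside the container, and the mpirun flags from the Horovod running instructions; the NCCL variables shown for RoCE are examples and may need tuning for your environment:

    host-41$ mpirun -np 8 \
        -H clx-mld-41:1,clx-mld-42:1,clx-mld-43:1,clx-mld-44:1,clx-mld-45:1,clx-mld-46:1,clx-mld-47:1,clx-mld-48:1 \
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_0 -x NCCL_IB_GID_INDEX=3 \
        -x LD_LIBRARY_PATH -x PATH \
        -mca pml ob1 -mca btl ^openib \
        python tf_cnn_benchmarks.py --model vgg16 --batch_size 64 --variable_update horovod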