How to Deploy and Run a Horovod framework with Mellanox Network (DRAFT)

Version 5

    This guide describes how to deploy and run a Horovod framework with GPUDirect RDMA, Mellanox ConnectX®-4/5 VPI PCI Express adapter cards, and Mellanox Spectrum switches with ONYX OS, running RoCE over a lossless network in DSCP-based QoS mode.

     

    References

    Docker Installation and Configuration in the VM Template.

     

    Uninstall old versions

    To uninstall old versions, we recommend running the following command:

    $ sudo apt-get remove docker docker-engine docker.io

     

    It’s OK if apt-get reports that none of these packages are installed.

    The contents of /var/lib/docker/, including images, containers, volumes, and networks, are preserved.

     

    Install Docker CE

    For Ubuntu 16.04 and higher, the Linux kernel includes support for OverlayFS, and Docker CE will use the overlay2 storage driver by default.

     

    Install using the repository

    Before you install Docker CE for the first time on a new host machine, you need to set up the Docker repository. Afterward, you can install and update Docker from the repository.

     

    Set Up the repository

    Update the apt package index:

    $ sudo apt-get update

     

    Install packages to allow apt to use a repository over HTTPS:

    $ sudo apt-get install apt-transport-https ca-certificates curl software-properties-common

     

    Add Docker’s official GPG key:

    $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

     

    Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88.

    $ sudo apt-key fingerprint 0EBFCD88
    pub 4096R/0EBFCD88 2017-02-22
    Key fingerprint = 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
    uid Docker Release (CE deb) <docker@docker.com>
    sub 4096R/F273FCD8 2017-02-22
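
    The manual comparison above can also be scripted. The following is a minimal sketch, not part of the original procedure: the function name fingerprint_matches and the variable EXPECTED_FPR are hypothetical, and the function only inspects text passed on standard input, so it makes no assumption about how apt-key is invoked.

```shell
# Expected Docker CE key fingerprint, copied from the verification step above.
EXPECTED_FPR="9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88"

# fingerprint_matches (hypothetical helper): read `apt-key fingerprint` output
# on stdin and succeed only if one line carries exactly the expected fingerprint.
# Spaces and tabs are stripped on both sides so grouping differences do not matter.
fingerprint_matches() {
    expected=$(printf '%s' "$EXPECTED_FPR" | tr -d ' ')
    tr -d ' \t' \
        | sed 's/^Keyfingerprint=//' \
        | grep -qxF "$expected"
}
```

    A possible use is: sudo apt-key fingerprint 0EBFCD88 | fingerprint_matches && echo "key OK"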

     

     

    Install Docker CE

    Install the latest version of Docker CE. Any existing installation of Docker is replaced.

    $ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
    $ sudo apt-get update
    $ sudo apt-get install docker-ce

     

    Customize the docker0 bridge

    The recommended way to configure the Docker daemon is to use the daemon.json file, which is located in /etc/docker/ on Linux. If the file does not exist, create it. You can specify one or more of the following settings to configure the default bridge network:

    {
    "bip": "172.16.41.1/24",
    "fixed-cidr": "172.16.41.0/24",
    "mtu": 1500,
    "dns": ["8.8.8.8","8.8.4.4"]
    }

    The same options are presented as flags to dockerd, with an explanation for each:

    • --bip=CIDR: supply a specific IP address and netmask for the docker0 bridge, using standard CIDR notation. For example: 172.16.41.1/24.
    • --fixed-cidr=CIDR: restrict the IP range from the docker0 subnet, using standard CIDR notation. For example: 172.16.41.0/24.
    • --mtu=BYTES: override the maximum packet length on docker0. For example: 1500.
    • --dns=[]: the DNS servers to use. Repeat the flag for each server. For example: --dns=8.8.8.8 --dns=8.8.4.4.

    Restart Docker after making changes to the daemon.json file.

    $ sudo /etc/init.d/docker restart
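
    Because every host needs its own bridge subnet, the daemon.json file can be generated per host instead of edited by hand. The sketch below assumes that host-N owns the container subnet 172.16.N.0/24, which is how the host-41 example in this guide is laid out; the function name daemon_json_for_host is hypothetical, and the MTU and DNS values are simply copied from the example above.

```shell
# daemon_json_for_host N (hypothetical helper): print a daemon.json for host
# number N, e.g. 41.
# Assumption: host-N uses the container subnet 172.16.N.0/24, as in the
# host-41 example; adjust the template if your addressing differs.
daemon_json_for_host() {
    n=$1
    # Accept only a plain decimal number that fits in one IPv4 octet.
    case $n in
        ''|*[!0-9]*) echo "host number must be numeric" >&2; return 1 ;;
    esac
    if [ "$n" -gt 255 ]; then
        echo "host number must be between 0 and 255" >&2
        return 1
    fi
    cat <<EOF
{
    "bip": "172.16.${n}.1/24",
    "fixed-cidr": "172.16.${n}.0/24",
    "mtu": 1500,
    "dns": ["8.8.8.8", "8.8.4.4"]
}
EOF
}
```

    On host-41 this could be used as: daemon_json_for_host 41 | sudo tee /etc/docker/daemon.json, followed by the Docker restart shown above.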

     

    Enable communication with the outside world

    Check that IP forwarding is enabled in the kernel:

    $ sysctl net.ipv4.conf.all.forwarding
    net.ipv4.conf.all.forwarding = 1

    If it is disabled, the output is:

    net.ipv4.conf.all.forwarding = 0

    In that case, enable it and check again:

    $ sudo sysctl net.ipv4.conf.all.forwarding=1
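
    The check-then-enable logic above can be wrapped in a small script so that it is safe to run repeatedly on every host. This is a sketch only: the function names forwarding_enabled and ensure_forwarding are hypothetical, and the parser assumes the "key = value" output format shown above.

```shell
# forwarding_enabled (hypothetical helper): read the output of
# `sysctl net.ipv4.conf.all.forwarding` on stdin and succeed only if the
# reported value is 1.
forwarding_enabled() {
    # Output looks like "net.ipv4.conf.all.forwarding = 1"; keep the value only.
    value=$(sed -n 's/^net\.ipv4\.conf\.all\.forwarding[[:space:]]*=[[:space:]]*//p' | head -n 1)
    [ "$value" = "1" ]
}

# ensure_forwarding (hypothetical helper): enable forwarding only when it is
# currently off, so repeated runs do not change anything.
ensure_forwarding() {
    if sysctl net.ipv4.conf.all.forwarding | forwarding_enabled; then
        echo "IP forwarding already enabled"
    else
        sudo sysctl net.ipv4.conf.all.forwarding=1
    fi
}
```

    Note that a value set with sysctl on the command line does not survive a reboot; adding the same key to /etc/sysctl.conf is the usual way to make it persistent.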

    For security reasons, Docker configures the iptables rules to prevent traffic forwarding to containers from outside the host machine. Docker sets the default policy of the FORWARD chain to DROP. To override this default behavior, you can manually change the default policy:

    $ sudo iptables -P FORWARD ACCEPT
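
    Before changing the policy, it can be useful to see what it currently is. The sketch below parses the output of iptables -S FORWARD, whose policy line has the form "-P FORWARD <policy>"; the function name forward_policy is hypothetical.

```shell
# forward_policy (hypothetical helper): read `iptables -S FORWARD` output on
# stdin and print the chain's default policy (for example ACCEPT or DROP),
# taken from the "-P FORWARD <policy>" line.
forward_policy() {
    awk '$1 == "-P" && $2 == "FORWARD" { print $3; exit }'
}
```

    For example, sudo iptables -S FORWARD | forward_policy prints DROP on a host where Docker has applied its default, and ACCEPT after the command above has been run.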

     

    Add IP route with specific subnet

    On each host, add routes to the container subnets of the other hosts. For example, add the following routes on host-41:

    host-41$ sudo ip route add 172.16.42.0/24 via 13.13.13.42
    host-41$ sudo ip route add 172.16.43.0/24 via 13.13.13.43
    host-41$ sudo ip route add 172.16.44.0/24 via 13.13.13.44
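
    With more than a few hosts, the route list is easier to generate than to type. The following sketch prints the commands rather than running them, so the result can be reviewed first. It assumes the addressing pattern of the example, where host-N owns the container subnet 172.16.N.0/24 and is reachable at 13.13.13.N; the function name print_peer_routes is hypothetical.

```shell
# print_peer_routes LOCAL PEER... (hypothetical helper): print one
# `ip route add` command for every host number in the list except LOCAL.
# Assumptions (from the host-41 example): host-N owns the container subnet
# 172.16.N.0/24 and is reachable at 13.13.13.N.
print_peer_routes() {
    local_id=$1
    shift
    for peer in "$@"; do
        # A host does not need a route to its own container subnet.
        if [ "$peer" != "$local_id" ]; then
            echo "sudo ip route add 172.16.${peer}.0/24 via 13.13.13.${peer}"
        fi
    done
}
```

    Running print_peer_routes 41 41 42 43 44 on host-41 prints the three commands for hosts 42, 43 and 44; piping the output to sh applies them.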

    A quick check on each host

    Give your environment a quick test by spawning a simple container:

    $ docker run hello-world

     

    Nvidia-docker Deployment in the VM Template.

    To deploy nvidia-docker on Ubuntu 16.04, follow these steps:

    1. If you have nvidia-docker 1.0 installed, remove it and all existing GPU containers:
      host-41$ docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
      host-41$ sudo apt-get purge -y nvidia-docker
    2. Add the package repositories
      host-41$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
        sudo apt-key add -
      host-41$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
      host-41$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
        sudo tee /etc/apt/sources.list.d/nvidia-docker.list
      host-41$ sudo apt-get update
    3. Install nvidia-docker2 and reload the Docker daemon configuration
      host-41$ sudo apt-get install -y nvidia-docker2
      host-41$ sudo pkill -SIGHUP dockerd
    4. Test nvidia-smi with the latest official CUDA image
      host-41$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
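
    Step 2 builds the repository URL from the ID and VERSION_ID fields of /etc/os-release. The sketch below isolates that derivation so the resulting URL can be inspected before anything is downloaded. The function name nvidia_docker_list_url is hypothetical, and the optional file argument exists only so the logic can be tried against a copy of an os-release file.

```shell
# nvidia_docker_list_url [OS_RELEASE_FILE] (hypothetical helper): print the
# nvidia-docker.list URL for the distribution described by an os-release file.
# Defaults to /etc/os-release, the file used in step 2.
nvidia_docker_list_url() {
    os_release=${1:-/etc/os-release}
    # Source the file inside the command substitution (a subshell) so that
    # ID and VERSION_ID do not leak into the calling shell.
    distribution=$(. "$os_release" && printf '%s%s' "$ID" "$VERSION_ID") || return 1
    printf 'https://nvidia.github.io/nvidia-docker/%s/nvidia-docker.list\n' "$distribution"
}
```

    On an Ubuntu 16.04 host, the distribution string is ubuntu16.04, so the function prints the URL that step 2 passes to curl.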

    Horovod Deployment in the VM Template.

     

    1. This procedure explains how to build and run the Horovod framework in a Docker container.
    2. Install additional packages:
      host-41$ sudo apt install libibverbs-dev
      host-41$ sudo apt install libmlx5-dev
    3. Install Mellanox OFED by following the linked guide, How to Install Mellanox OFED on Linux.


    Horovod VGG 16 Benchmark Results

    The Horovod benchmark was run as described at the link.