Reference Deployment Guide for RDMA-Accelerated TensorFlow 1.6 with an NVIDIA GPU Card over a 100Gb InfiniBand Network Running on Docker Containers

Version 9

    This document demonstrates a distributed deployment of RDMA-accelerated TensorFlow running in Docker containers over a Mellanox end-to-end 100 Gb/s InfiniBand (IB) solution.

    It describes the process of building TensorFlow 1.6.0 GA from source for Ubuntu 16.04.2 LTS and Docker 17.12 on four physical servers.






    What is TensorFlow?

    TensorFlow is an open source software library developed by the Google Brain team for conducting machine learning and deep neural network research. The library performs numerical computation using data flow graphs, where the nodes in the graph represent mathematical operations and the edges represent the multidimensional data arrays (tensors) that flow between the nodes. TensorFlow supports CUDA 9.1 and cuDNN 7.0 (download requires registration). In this guide we follow the install-from-sources procedure from their website, which makes installation much easier. To use TensorFlow with GPU support, you must have an NVIDIA GPU with a minimum compute capability of 3.0.


    What is Docker?

    Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package. (From "What is Docker?")


    Mellanox’s Machine Learning

    Mellanox solutions accelerate many of the world’s leading artificial intelligence and machine learning platforms and a wide range of applications, from security, finance, and image and voice recognition to self-driving cars and smart cities. Mellanox solutions enable companies and organizations such as Baidu, NVIDIA, Facebook, PayPal and more to leverage machine learning platforms to enhance their competitive advantage.

    In this post we will show how to build an efficient machine learning cluster enhanced by native RDMA over a 100 Gb/s IB network.


    Setup Overview

    Before you start, make sure you are aware of the distributed TensorFlow architecture, see Glossary in Distributed TensorFlow for more info.
    In the distributed TensorFlow configuration described in this guide, we are using the following hardware specification.



    This document does not cover the servers’ storage aspect. You should configure the servers with the storage components appropriate to your use case (data set size).

    Setup Logical Design

    Server Logical Design



    Server Wiring

    In our reference setup, only the first port of each server is wired to the IB switch.

    We'll cover the procedure later, in the Installing Mellanox OFED section.


    Server Block Diagram


    Docker Network Diagram

    Network Configuration

    Each server is connected to the SB7700 switch by a 100Gb IB copper cable. The switch port connectivity in our case is as follows:

    • Ports 1-8: connected to Worker servers

    Server names and network configuration are provided below.

    Server type        Server name    IP and NICs
                                      Internal network    External network
    Worker Server 01   clx-mld-41     ib0:                From DHCP (reserved)
    Worker Server 02   clx-mld-42     ib0:                From DHCP (reserved)
    Worker Server 03   clx-mld-43     ib0:                From DHCP (reserved)
    Worker Server 04   clx-mld-44     ib0:                From DHCP (reserved)
    Worker Server 05   clx-mld-45     ib0:                From DHCP (reserved)
    Worker Server 06   clx-mld-46     ib0:                From DHCP (reserved)
    Worker Server 07   clx-mld-47     ib0:                enp129s0f0: From DHCP (reserved)
    Worker Server 08   clx-mld-48     ib0:                From DHCP (reserved)
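
    As a rough illustration of how the worker servers listed above could map onto a distributed TensorFlow cluster definition, the sketch below builds a plain dictionary of job names to host:port endpoints. Only the host names come from this guide. The ports (2222 and 2223), the job names, and the choice of one parameter server task and one worker task per host are assumptions made for illustration.

```python
# Host names come from the server table; everything else is an assumption.
HOSTS = ["clx-mld-%d" % n for n in range(41, 49)]  # clx-mld-41 .. clx-mld-48


def build_cluster(hosts, ps_port=2222, worker_port=2223):
    """Return a dict in the {job_name: [endpoints]} shape used by cluster specs."""
    return {
        "ps": ["%s:%d" % (host, ps_port) for host in hosts],
        "worker": ["%s:%d" % (host, worker_port) for host in hosts],
    }


cluster = build_cluster(HOSTS)
print(cluster["worker"][0])
```

    In TensorFlow 1.x a dictionary of this shape can be passed to tf.train.ClusterSpec, and tf.train.Server accepts protocol="grpc+verbs" to select the RDMA verbs transport when TensorFlow was built with verbs support.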

    Deployment Guide


    Required Host Software

    Before installing TensorFlow, the following software must be installed.


    Disable the Nouveau Kernel Driver on the Host


    Before installing the latest NVIDIA drivers and CUDA on Ubuntu 16.04, the Nouveau kernel driver must be disabled. To disable it, follow the procedure below.


    1. Check that the Nouveau kernel driver is loaded.
      $ lsmod |grep nouv
    2. Remove all NVIDIA packages.

      Skip this step if your system is freshly installed.
      $ sudo apt-get remove 'nvidia*' && sudo apt autoremove
    3. Install the packages below, which are needed to build kernel modules.

      $ sudo apt-get install dkms build-essential linux-headers-generic -y
    4. Block and disable the Nouveau kernel driver.
      $ sudo vim /etc/modprobe.d/blacklist.conf
    5. Insert the following lines into the blacklist.conf file.
      blacklist nouveau
      blacklist lbm-nouveau
      options nouveau modeset=0
      alias nouveau off
      alias lbm-nouveau off
    6. Disable the Nouveau kernel module and update the initramfs image.  (Although the nouveau-kms.conf file may not exist, it will not affect this step).
      $ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
      $ sudo update-initramfs -u
    7. Reboot
      $ sudo reboot
    8. Check that the Nouveau kernel driver is not loaded.
      $ lsmod |grep nouveau
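
    The final check above can also be scripted. The sketch below parses lsmod-style text and reports whether a module name is present. The sample text is made up for illustration and was not captured from a real host; on a real system the text could be obtained with subprocess.check_output(["lsmod"]).

```python
def loaded_modules(lsmod_output):
    """Return the set of module names found in `lsmod` output."""
    names = set()
    for line in lsmod_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields:
            names.add(fields[0])  # first column is the module name
    return names


# Illustrative sample only, not real output.
SAMPLE = """Module                  Size  Used by
nvidia              14000000  0
ib_core               200000  1 mlx5_ib
"""

print("nouveau" in loaded_modules(SAMPLE))
```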


    Install General Dependencies

    1. To install general dependencies, run the commands below or paste each line.
      $ sudo apt-get install openjdk-8-jdk git build-essential python-virtualenv swig python-wheel libcupti-dev -y
    2. To install TensorFlow, you must install the following packages:
      • numpy: A numerical processing package that TensorFlow requires
      • python-dev: Enables adding extensions to Python
      • pip: Enables installing and managing of certain Python packages
      • wheel: Enables management of Python compressed packages in the wheel (.whl) format

    To install these packages for Python 2.7:

    $ sudo apt-get install python-numpy python-dev python-pip python-wheel -y



    Update Ubuntu Software Packages

    To update/upgrade Ubuntu software packages, run the commands below.

    $ sudo apt-get update            # Fetches the list of available updates
    $ sudo apt-get upgrade -y        # Strictly upgrades the current packages


    Install the NVIDIA Drivers on a Host


    NVIDIA driver version 367 or later must be installed. You can install it with the Ubuntu built-in tool (Additional Drivers) after updating the driver packages, or manually as described below.

    1. Go to NVIDIA’s website.
    2. Download the latest version of the driver. The example below uses a Linux 64-bit driver (NVIDIA-Linux-x86-x86_64-390.12_1).
    3. Set the RunLevel to 3 with the program init.
      $ sudo init 3
    4. Once you accept the download please follow the steps listed below.
      $ sudo dpkg -i nvidia-driver-local-repo-ubuntu1604-390.12_1.0-1_amd64.deb
      $ sudo apt-get update
      $ sudo apt-get install cuda-drivers -y
      During the installation, you will be asked to confirm several prompts, such as a pre-install script failure, missing 32-bit libraries and more.
    5. Once the drivers are installed, restart your computer.
      $ sudo reboot


    Verify the Installation

    Make sure the NVIDIA driver can work correctly with the installed GPU card.

    $ lsmod |grep nvidia



    Run the nvidia-debugdump utility to collect internal GPU information.

    $ nvidia-debugdump -l

    Run the nvidia-smi utility to check the NVIDIA System Management Interface.

    $ nvidia-smi
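
    To check the installed driver against the minimum version stated earlier (367), a small comparison like the following sketch may help. It assumes the version string has already been obtained, for example from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The function names are illustrative.

```python
def driver_major(version):
    """Extract the major number from a driver version such as '390.12'."""
    return int(version.strip().split(".")[0])


def driver_is_supported(version, minimum_major=367):
    """True when the driver major version meets the guide's minimum."""
    return driver_major(version) >= minimum_major


print(driver_is_supported("390.12"))
```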

    Enable the Subnet Manager (SM) on the IB Switch


    Refer to the MLNX-OS User Manual to become familiar with the switch software.
    Before you start using the Mellanox switch, we recommend that you upgrade the switch to the latest MLNX-OS version.

    There are three options for where to run the SM:

    1. Enable the SM on one of the managed switches. This is a quick and convenient operation that makes InfiniBand ‘plug & play’.
    2. Run /etc/init.d/opensmd on one or more servers. It is recommended to run the SM on a server when there are 648 nodes or more.
    3. Use a dedicated Unified Fabric Management (UFM®) Appliance server. UFM offers much more than the SM. UFM needs more compute power than the existing switches have, but does not require an expensive server. It does add the cost of the dedicated server.

    We'll explain options 1 and 2 only.

    Option 1: Configuring the SM on a Switch (MLNX-OS®, all Mellanox switch systems)
    To enable the SM on one of the managed switches, follow these steps.

    1. Log in to the switch and enter config mode:
      Mellanox MLNX-OS Switch Management

      switch login: admin
      Last login: Wed Aug 12 23:39:01 on ttyS0

      Mellanox Switch

      switch [standalone: master] > enable
      switch [standalone: master] # conf t
      switch [standalone: master] (config)#
    2. Run the command:
      switch [standalone: master] (config)#ib sm
      switch [standalone: master] (config)#
    3. Check if the SM is running. Run:

      switch [standalone: master] (config)#show ib sm
      switch [standalone: master] (config)#

    To save the configuration (permanently), run:

    switch (config) # configuration write



    Option 2: Configuring the SM on a Server (Skip this procedure if you enable SM on switch)

    To start up OpenSM on a server, simply run opensm from the command line on your management node by typing:

    # opensm


    To start OpenSM automatically on the head node, edit the /etc/opensm/opensm.conf file.

    Create a configuration file by running:

    # opensm --create-config /etc/opensm/opensm.conf

    Edit /etc/opensm/opensm.conf file with the following line:


    Upon initial installation, OpenSM is configured and running with a default routing algorithm. When running a multi-tier fat-tree cluster, it is recommended to change the following options to create the most efficient routing algorithm delivering the highest performance:


    For full details on other configurable attributes of OpenSM, see the “OpenSM – Subnet Manager” chapter of the Mellanox OFED for Linux User Manual.


    Installing Mellanox OFED for Ubuntu on a Host

    This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with a Mellanox ConnectX®-5 adapter card installed. For more information, see the Mellanox OFED for Linux User Manual.


    Downloading Mellanox OFED

    1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed.
      # lspci -v | grep Mellanox
      The following example shows a system with an installed Mellanox HCA:
    2. Download the ISO image for your OS to your host.
      The image’s name has the format
      MLNX_OFED_LINUX-<ver>-<OS label><CPUarch>.iso. You can download it by navigating to Products > Software > InfiniBand/VPI Drivers > Mellanox OFED Linux (MLNX_OFED) > Download.

    3. Use the md5sum utility to confirm the downloaded file’s integrity. Run the following command and compare the result to the value provided on the download page.


      # md5sum MLNX_OFED_LINUX-<ver>-<OS label><CPUarch>.iso
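
    The integrity check can also be done programmatically. The sketch below uses the standard hashlib module to compute an MD5 digest in chunks and compare it with a published value. The function names are illustrative, and the path and expected digest must be supplied by the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it all at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_md5):
    """Compare the file digest with the value shown on the download page."""
    return md5_of_file(path) == published_md5.strip().lower()
```

    Reading in chunks keeps memory use flat, which matters for multi-hundred-megabyte installer images.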


    Installing Mellanox OFED

    MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:

    • Discovers the currently installed kernel
    • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
    • Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
    • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

    The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

    1. Log into the installation machine as root.
    2. Copy the downloaded ISO to /root
    3. Mount the ISO image on your machine.
      # mkdir /mnt/iso
      # mount -o loop /root/MLNX_OFED_LINUX-4.2- /mnt/iso
      # cd /mnt/iso

    4. Run the installation script.
      # ./mlnxofedinstall
    5. After the installation finishes successfully, restart the driver or reboot.
      # /etc/init.d/openibd restart
      # reboot
      By default, both ConnectX®-5 VPI ports are initialized as InfiniBand ports.
    6. Check that the ports are in InfiniBand mode.
      # ibv_devinfo
    7. If you see the following, you need to change the interface port type to InfiniBand.

      ConnectX®-5 ports can be individually configured to work as InfiniBand or Ethernet ports.
      To change the mode to InfiniBand, use the mlxconfig script after the driver is loaded.
      * LINK_TYPE_P1=1 is InfiniBand mode
      a. Start mst and see the port names.
      # mst start
      # mst status

      b. Change the port mode to InfiniBand (port 1 in this example):

      # mlxconfig -d /dev/mst/mt4121_pciconf0 s LINK_TYPE_P1=1
      #Port 1 set to IB mode
      # reboot

      After each reboot you need to disable the 2nd port.
      c. Query the InfiniBand devices and print the information about them that is available for use from userspace.


      # ibv_devinfo


    8. Run the ibdev2netdev utility to see all the associations between the Ethernet devices and the IB devices/ports.

      # ibdev2netdev

      # ifconfig ib0 netmask

    9. In the /etc/network/interfaces file, add the new lines below after the following existing lines:

      # vim /etc/network/interfaces

      auto enp129s0f0
      iface enp129s0f0 inet dhcp

      The new lines:
      auto ib0
      iface ib0 inet static

      The resulting file:
      auto enp129s0f0
      iface enp129s0f0 inet dhcp

      auto ib0
      iface ib0 inet static
    10. Check that the network configuration is set correctly.
      # ifconfig -a
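
    A static stanza for ib0 normally also carries an address and a netmask. The sketch below (Python 3) renders such a stanza after validating the pair with the standard ipaddress module. The address 192.168.100.41 and netmask 255.255.255.0 are placeholders chosen for illustration, not values from this setup.

```python
import ipaddress


def ib_stanza(name, address, netmask):
    """Render an /etc/network/interfaces stanza for a static interface."""
    # Raises ValueError if the address/netmask pair is not valid.
    ipaddress.ip_interface("%s/%s" % (address, netmask))
    return "\n".join([
        "auto %s" % name,
        "iface %s inet static" % name,
        "    address %s" % address,
        "    netmask %s" % netmask,
    ])


# Placeholder addressing, to be replaced with the real IB subnet plan.
print(ib_stanza("ib0", "192.168.100.41", "255.255.255.0"))
```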


    Docker Installation and Configuration

    Uninstall old versions

    To uninstall old versions, we recommend running the following command:

    $ sudo apt-get remove docker docker-engine

    It’s OK if apt-get reports that none of these packages are installed.

    The contents of /var/lib/docker/, including images, containers, volumes, and networks, are preserved.


    Install Docker CE

    For Ubuntu 16.04 and higher, the Linux kernel includes support for OverlayFS, and Docker CE will use the overlay2 storage driver by default.


    Install using the repository

    Before you install Docker CE for the first time on a new host machine, you need to set up the Docker repository. Afterward, you can install and update Docker from the repository.


    Set Up the repository

    1. Update the apt package index:
      $ sudo apt-get update
    2. Install packages to allow apt to use a repository over HTTPS:
      $ sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
    3. Add Docker’s official GPG key:
      $ sudo curl -fsSL | sudo apt-key add -

      Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88.
      $ sudo apt-key fingerprint 0EBFCD88
      pub   4096R/0EBFCD88 2017-02-22
      Key fingerprint = 9DC8 5822 9FC7 DD38 854A  E2D8 8D81 803C 0EBF CD88
      uid                  Docker Release (CE deb) <>
      sub   4096R/F273FCD8 2017-02-22

    Install Docker CE

    Install the latest version of Docker CE, or go to the next step to install a specific version. Any existing installation of Docker is replaced.

    $ sudo add-apt-repository "deb [arch=amd64]  $(lsb_release -cs) stable"
    $ sudo apt-get update
    $ sudo apt-get install docker-ce


    Customize the docker0 bridge

    The recommended way to configure the Docker daemon is to use the daemon.json file, which is located in /etc/docker/ on Linux. If the file does not exist, create it. You can specify one or more of the following settings to configure the default bridge network:

         "bip": "",
         "fixed-cidr": "",
         "mtu": 1500,
         "dns": ["",""]

    The same options are presented as flags to dockerd, with an explanation for each:

    • --bip=CIDR: supply a specific IP address and netmask for the docker0 bridge, using standard CIDR notation. For example:
    • --fixed-cidr=CIDR: restrict the IP range from the docker0 subnet, using standard CIDR notation. For example:
    • --mtu=BYTES: override the maximum packet length on docker0. For example: 1500.
    • --dns=[]: The DNS servers to use. For example: --dns=,
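
    As a minimal sketch of how these settings fit together, the following Python 3 snippet assembles a daemon.json body and checks that the fixed-cidr range lies inside the bip subnet. All addresses shown are placeholders chosen for illustration, and the helper name bridge_config is hypothetical.

```python
import ipaddress
import json


def bridge_config(bip, fixed_cidr, mtu=1500, dns=()):
    """Return a dict suitable for dumping into /etc/docker/daemon.json."""
    bridge = ipaddress.ip_interface(bip)       # bridge address plus netmask
    subset = ipaddress.ip_network(fixed_cidr)  # range handed out to containers
    inside = (subset.network_address in bridge.network
              and subset.broadcast_address in bridge.network)
    if not inside:
        raise ValueError("fixed-cidr must lie inside the bip subnet")
    return {"bip": bip, "fixed-cidr": fixed_cidr, "mtu": mtu, "dns": list(dns)}


# Placeholder values chosen for illustration only.
config = bridge_config("172.16.41.1/24", "172.16.41.0/25", dns=["192.168.100.1"])
print(json.dumps(config, indent=4, sort_keys=True))
```

    Giving each host its own non-overlapping bridge subnet is what makes the per-host routes added later in this guide unambiguous.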


    Restart Docker after making changes to the daemon.json file.

    $ sudo /etc/init.d/docker restart

    Set up communication with the outside world

    Check that IP forwarding is enabled in the kernel:

    $ sysctl net.ipv4.conf.all.forwarding

    net.ipv4.conf.all.forwarding = 1

    If it is disabled:

    net.ipv4.conf.all.forwarding = 0

    enable it and check again:

    $ sudo sysctl net.ipv4.conf.all.forwarding=1
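
    When this check is run across several hosts, it can be convenient to parse the sysctl output instead of reading it by eye. The sketch below interprets a single output line. The function name is illustrative, and the input is assumed to be in the `key = value` form shown above.

```python
def forwarding_enabled(sysctl_line):
    """Parse a line such as 'net.ipv4.conf.all.forwarding = 1'."""
    key, _, value = sysctl_line.partition("=")
    if key.strip() != "net.ipv4.conf.all.forwarding":
        raise ValueError("unexpected sysctl key: %r" % key.strip())
    return value.strip() == "1"


print(forwarding_enabled("net.ipv4.conf.all.forwarding = 1"))
```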


    For security reasons, Docker configures the iptables rules to prevent traffic forwarding to containers from outside the host machine. Docker sets the default policy of the FORWARD chain to DROP.

    To override this default behavior you can manually change the default policy:

    $ sudo iptables -P FORWARD ACCEPT


    Add IP route with specific subnet

    On each host, add a route to the container subnet of every other host. The example below shows the routes to be added on host-41:

    host-41$ sudo ip route add via
    host-41$ sudo ip route add via
    host-41$ sudo ip route add via
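
    The per-host route list follows a simple pattern: one route per remote host, pointing that host's container subnet at that host's address. The sketch below generates the commands for a given host. The subnets and host addresses are hypothetical placeholders, since the actual addressing plan depends on your environment.

```python
def route_commands(local_host, container_subnets, host_addresses):
    """Build one `ip route add` command per remote host."""
    commands = []
    for host in sorted(container_subnets):
        if host == local_host:
            continue  # the local docker0 subnet is already directly connected
        commands.append("sudo ip route add %s via %s"
                        % (container_subnets[host], host_addresses[host]))
    return commands


# Hypothetical addressing plan: one /24 container subnet per host.
SUBNETS = {"host-41": "172.16.41.0/24", "host-42": "172.16.42.0/24",
           "host-43": "172.16.43.0/24", "host-44": "172.16.44.0/24"}
ADDRESSES = {"host-41": "192.168.100.41", "host-42": "192.168.100.42",
             "host-43": "192.168.100.43", "host-44": "192.168.100.44"}

for command in route_commands("host-41", SUBNETS, ADDRESSES):
    print(command)
```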

    A quick check on each host

    Give your environment a quick test by spawning a simple container:

    $ docker run hello-world

    Create or pull a base image and run Container


    Option 1 - Privileged mode

    Pull the image from Docker Hub and run a Docker container in privileged mode:

    $ sudo docker run -it --privileged --name=mnlx-tf-1-6-prvlg mellanox/docker-tf-1-6 bash


    Option 2 - Non-privileged mode

    Pull the image from Docker Hub and run a Docker container in non-privileged mode:

    $ sudo docker run -it --cap-add=IPC_LOCK --device=/dev/infiniband/uverbs1 --name=mnlx-tf-1-6-nonprvlg mellanox/docker-tf-1-6 bash

    Option 3 - Build the image from a Dockerfile

    Docker can build images automatically by reading the instructions from a Dockerfile.

    A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image.



    1. Create an empty directory.
    2. Enter the new directory, create a file called Dockerfile, copy-and-paste the following content into that file, and save it.
      Take note of the comments that explain each statement in your new Dockerfile.

    FROM nvidia/cuda:9.1-cudnn7-devel-ubuntu16.04

    MAINTAINER Boris Kovalev <>

    # Pick up some MOFED and TF dependencies
    RUN apt-get update && apt-get install -y --no-install-recommends \
            net-tools \
            ethtool \
            perl \
            lsb-release \
            iproute2 \
            pciutils \
            libnl-route-3-200 \
            kmod \
            libnuma1 \
            lsof \
            linux-headers-4.4.0-92-generic \
            build-essential \
            curl \
            git \
            wget \
            libcurl3-dev \
            libfreetype6-dev \
            libpng12-dev \
            libzmq3-dev \
            pkg-config \
            python-dev \
            python-virtualenv \
            swig \
            python-wheel \
            libcupti-dev \
            python-numpy \
            python-pip \
            python-libxml2 \
            rsync \
            software-properties-common \
            unzip \
            zip \
            zlib1g-dev \
            openjdk-8-jdk \
            openjdk-8-jre-headless \
            && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/*

    # Download and install Mellanox OFED 4.2.1
    RUN wget && \
            tar -xzvf MLNX_OFED_LINUX-4.1- && \
            MLNX_OFED_LINUX-4.2- --user-space-only --without-fw-update --all -q && \
            cd .. && \
            rm -rf MLNX_OFED_LINUX-4.2- && \
            rm -rf *.tgz

    # Download and install pip and pip packages
    RUN curl -fSsL -O && \
        python && \

    RUN pip --no-cache-dir install \
            ipykernel \
            matplotlib \
            numpy \
            scipy \
            sklearn \
            pandas \
            && \
       python -m ipykernel.kernelspec

    # Download and pip TensorFlow v1.6.0 GA package with verbs support
    RUN git clone && \
    # Install TensorFlow v1.6.0 GA with verbs support
        pip --no-cache-dir install --upgrade /Tensorflow-mlnx/tensorflow-1.6.0-cp27-cp27mu-linux_x86_64.whl && \
        rm -rf /Tensorflow-mlnx/tensorflow-1.6.0-cp27-cp27mu-linux_x86_64.whl && \
        rm -rf /pip && \
        rm -rf /root/.cache

    # For CUDA profiling, TensorFlow requires CUPTI.
    ENV LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH

    # TensorBoard
    EXPOSE 6006


    # Start an interactive shell by default when the container runs
    CMD ["/bin/bash"]


    Install Nvidia-docker


    If you have a custom /etc/docker/daemon.json, the nvidia-docker2 package might override it.

    If you have nvidia-docker 1.0 installed, remove it and all existing GPU containers first:

    $ docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
    $ sudo apt-get purge -y nvidia-docker


    Add the package repositories

    $ curl -s -L | \
    sudo apt-key add -
    $ curl -s -L | \
      sudo tee /etc/apt/sources.list.d/nvidia-docker.list

    $ sudo apt-get update


    Install nvidia-docker2 and reload the Docker daemon configuration

    $ sudo apt-get install -y nvidia-docker2

    $ sudo pkill -SIGHUP dockerd


    Test nvidia-smi with the latest official CUDA image

    $ docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi


    Build Docker Image and run the container

    1. Now run the build command. This creates a Docker image, which we’re going to tag using -t so it has a friendly name.
      $ docker build -t tf15image .
    2. Where is your built image? It’s in your machine’s local Docker image registry:
      $ docker images
    3. Run a Docker container in non-privileged mode from the locally built image:
      $ docker run --runtime=nvidia -it --cap-add=IPC_LOCK --device=/dev/infiniband/uverbs1 --name=my-tf-1-5-nonprvlg tf15image bash
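
    The difference between the privileged and non-privileged launch modes used in this guide comes down to a few flags. The sketch below assembles the argument list for either mode, which can be handy when the same container must be started on many hosts. The helper name is hypothetical; the flags are the ones used in this guide. In non-privileged mode, IPC_LOCK is added because RDMA needs to pin (lock) memory, and the uverbs device must be passed through explicitly.

```python
def docker_run_args(image, name, uverbs_device, privileged=False):
    """Build the `docker run` argument list for an RDMA-capable GPU container."""
    args = ["docker", "run", "--runtime=nvidia", "-it"]
    if privileged:
        args.append("--privileged")
    else:
        # Grant only memory locking and the single verbs device.
        args += ["--cap-add=IPC_LOCK", "--device=%s" % uverbs_device]
    args += ["--name=%s" % name, image, "bash"]
    return args


print(" ".join(docker_run_args("tf15image", "my-tf-1-5-nonprvlg",
                               "/dev/infiniband/uverbs1")))
```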

    Verify CUDA in the container


    Ensure you are in the container. Run the nvidia-smi utility to check the NVIDIA System Management Interface.

    containerID$ nvidia-smi

    Validate TensorFlow in the container

    To validate the TensorFlow installation:

    1. Change directory (cd) to any directory on your system other than the tensorflow subdirectory from which you invoked the configure command.
    2. Invoke python:

      containerID$ cd /
      containerID$ python

      >>> import tensorflow as tf
      >>> hello = tf.constant('Hello, TensorFlow!')
      >>> sess = tf.Session()
      >>> print(sess.run(hello))
      Hello, TensorFlow!
      >>> a = tf.constant(10)
      >>> b = tf.constant(32)
      >>> print(sess.run(a + b))
      42

      Press CTRL-D to exit.


    Validate MOFED


    Check the MOFED version and the uverbs device:

    containerID$ ofed_info -s
    MLNX_OFED_LINUX-4.2-
    containerID$ ls /dev/infiniband/uverbs1


    Run a bandwidth stress test over IB in the container.

    On the server side:

    ib_write_bw -a -d mlx5_1 &

    On the client side:

    ib_write_bw -a -F $server_IP -d mlx5_1 --report_gbits

    In this way you can run Bandwidth stress over IB between containers.




    Appendix A: TensorFlow Benchmarks and TCP vs. RDMA comparison

    Google published a collection of performance benchmarks that highlight TensorFlow's speed and scalability when training image classification models like InceptionV3, ResNet and VGG16.

    Here we will provide our performance benchmark results for InceptionV3 and ResNet-50 over TCP and RDMA.

    Benchmarks were run using both real and synthetic data. We believe it is important to include real data (ImageNet 2012 data set) measurements when benchmarking a platform.

    Testing with synthetic data was done by using a tf.Variable set to the same shape as the data expected by each model for ImageNet.

    This load tests both the underlying hardware and the framework at preparing data for actual training.

    We start with synthetic data to remove disk I/O as a variable and to set a baseline. Real data is then used to verify that the TensorFlow input pipeline and the underlying disk I/O are saturating the compute units.

    The server hardware and configuration used for the TCP and IB RDMA benchmarks are identical.


    Details for our benchmarks


    • Instance type: See setup overview
    • GPU: 32x NVIDIA® Tesla® P100
    • OS: Ubuntu 16.04.3 LTS with tests run via Docker
    • CUDA / cuDNN: 9.1 / 7.0
    • TensorFlow GitHub : v1.6.0 GA
    • Benchmark GitHub hash: 2e70767
    • Build Command: bazel build -c opt --copt=-march="broadwell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
    • Disk: Local NVMe
    • DataSet: ImageNet 2012
    • Test Date: March 2018


    The batch size and optimizer used for the tests are listed in the table.


    Options               VGG 16
    Batch size per GPU    64



    Configuration used for each model.




    The setup for the runs included 8 parameter and worker servers and is described in the Setup Overview part of this document.








    This script was run to generate the above results.


    In order to create results that are as repeatable as possible, each test was run 3 times and then the times were averaged together. GPUs are run in their default state on the given platform. For each test, 10 warm-up steps are done and then the next 100 steps are averaged.
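
    The averaging procedure described above can be sketched as follows: for each run, discard the first 10 warm-up steps, average the next 100 steps, then average the per-run results across the 3 runs. The function names and the synthetic numbers are illustrative only and are not benchmark data.

```python
def steady_state_mean(step_values, warmup=10, measured=100):
    """Average the `measured` steps that follow the warm-up steps."""
    window = step_values[warmup:warmup + measured]
    if len(window) < measured:
        raise ValueError("need at least %d steps" % (warmup + measured))
    return sum(window) / float(len(window))


def averaged_result(runs, warmup=10, measured=100):
    """Average the steady-state means of several repeated runs."""
    per_run = [steady_state_mean(run, warmup, measured) for run in runs]
    return sum(per_run) / float(len(per_run))


# Synthetic per-step values for three runs (made-up numbers).
runs = [[0.0] * 10 + [100.0] * 100,
        [0.0] * 10 + [110.0] * 100,
        [0.0] * 10 + [120.0] * 100]
print(averaged_result(runs))
```

    Dropping the warm-up steps keeps one-time costs such as graph construction and memory allocation out of the reported figure.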