Reference Deployment Guide for RDMA accelerated Caffe2 with an NVIDIA GPU Card over 100Gb Infiniband Network running on Docker Containers

Version 3

    This document demonstrates a distributed deployment of RDMA-accelerated Caffe2 running in Docker containers over a Mellanox end-to-end 100 Gb/s Infiniband (IB) solution.

    It describes the process of building Caffe2 from source for Ubuntu 16.04.2 LTS and Docker 17.06 on four physical servers.







    What is Caffe2?

    Caffe2 is a deep learning framework that provides an easy and straightforward way to experiment with deep learning and leverage community contributions of new models and algorithms. You can bring your creations to scale using the power of GPUs in the cloud, or to the masses on mobile with Caffe2’s cross-platform libraries. Caffe2 supports CUDA 8.0 and cuDNN 6.0 (downloading cuDNN requires registration). In this guide we install Caffe2 from source, following the instructions on the Caffe2 website. To use Caffe2 with GPU support, you must have an NVIDIA GPU with a minimum compute capability of 3.0.


    What is Docker?

    Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package (from "What is Docker?").


    Mellanox’s Machine Learning

    Mellanox solutions accelerate many of the world’s leading artificial intelligence and machine learning platforms and a wide range of applications, from security, finance, and image and voice recognition to self-driving cars and smart cities. Mellanox solutions enable companies and organizations such as Baidu, NVIDIA, Facebook, PayPal and more to leverage machine learning platforms to enhance their competitive advantage.

    This post shows how to build an efficient machine learning cluster enhanced by native RDMA over a 100 Gb/s IB network.


    Setup Overview

    Before you start, make sure you are familiar with the concepts of distributed training.

    The distributed Caffe2 configuration described in this guide uses the following hardware specification.




    This document does not cover the servers’ storage aspect. You should configure the servers with the storage components appropriate to your use case (data set size).

    Setup Logical Design

    Server Logical Design



    Server Wiring

    If you have a dual-port NIC, you should disable one port.
    Due to certain limitations in the current Caffe2 version, you may face issues if both ports are enabled.

    In our reference setup we wire the 1st port to the IB switch and disable the 2nd port.

    The procedure is covered later, in the Installing Mellanox OFED section.


    Server Block Diagram


    Docker Network Diagram

    Network Configuration

    Each server is connected to the SB7700 switch by a 100Gb IB copper cable. The switch port connectivity in our case is as follows:

    • Ports 1-4: connected to the Worker servers

    Server names and network configuration are provided below.

    Server type       Server name    IP and NICs
                                     Internal network    External network
    Node Server 01    clx-mld-41     ib0:                From DHCP (reserved)
    Node Server 02    clx-mld-42     ib0:                From DHCP (reserved)
    Node Server 03    clx-mld-43     ib0:                From DHCP (reserved)
    Node Server 04    clx-mld-44     ib0:                From DHCP (reserved)

    Deployment Guide


    Required Host Software

    Before installing Caffe2, the following software must be installed on the host.


    Disable the Nouveau Kernel Driver on the Host


    Before installing the latest NVIDIA drivers and CUDA on Ubuntu 16.04, the Nouveau kernel driver must be disabled. To disable it, follow the procedure below.


    1. Check that the Nouveau kernel driver is loaded.
      $ lsmod |grep nouv
    2. Remove all NVIDIA packages.

      Skip this step if your system is freshly installed.
      $ sudo apt-get remove 'nvidia*' && sudo apt autoremove
    3. Install the packages required for building kernel modules.

      $ sudo apt-get install dkms build-essential linux-headers-generic -y
    4. Block and disable the Nouveau kernel driver.
      $ sudo vim /etc/modprobe.d/blacklist.conf
    5. Insert the following lines into the blacklist.conf file.
      blacklist nouveau
      blacklist lbm-nouveau
      options nouveau modeset=0
      alias nouveau off
      alias lbm-nouveau off
    6. Disable the Nouveau kernel module and update the initramfs image.  (Although the nouveau-kms.conf file may not exist, it will not affect this step).
      $ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
      $ sudo update-initramfs -u
    7. Reboot
      $ sudo reboot
    8. Check that the Nouveau kernel driver is not loaded.
      $ lsmod |grep nouveau
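
    The final check above can also be scripted. The following is a minimal sketch, not part of the original procedure: nouveau_loaded is a hypothetical helper name, and it assumes the usual lsmod layout in which the module name is the first column.

```shell
# Sketch only: nouveau_loaded is a hypothetical helper name.
# It reads lsmod-style text on stdin and succeeds (exit status 0)
# when a module named exactly "nouveau" is listed in the first column.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit(found ? 0 : 1) }'
}

# Possible use on a host after the reboot:
#   if lsmod | nouveau_loaded; then
#       echo "nouveau is still loaded, recheck the blacklist"
#   fi
```

    Matching on the first column avoids a false positive from other modules that merely list nouveau in their "Used by" column.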


    Install General Dependencies

    1. To install the general dependencies, first update the apt package index.
      $ sudo apt-get update
    2. To install Caffe2, you must install the following packages, among others:
      • python-dev: Enables adding extensions to Python
      • python-pip: Enables installing and managing Python packages

    To install the required packages for Python 2.7:

    $ sudo apt-get install -y --no-install-recommends build-essential cmake git libgoogle-glog-dev libprotobuf-dev protobuf-compiler python-dev python-pip
    $ sudo pip install numpy protobuf

    Install Optional Dependencies

    1. To install optional dependencies, run the commands below or paste each line.
      $ sudo apt-get install -y --no-install-recommends libgflags-dev
      $ sudo apt-get install -y --no-install-recommends libgtest-dev libiomp-dev libleveldb-dev liblmdb-dev libopencv-dev libopenmpi-dev libsnappy-dev openmpi-bin openmpi-doc python-pydot
      $ sudo pip install flask future graphviz hypothesis jupyter matplotlib pydot python-nvd3 pyyaml requests scikit-image scipy setuptools six tornado


    Update Ubuntu Software Packages

    To update/upgrade Ubuntu software packages, run the commands below.

    $ sudo apt-get update            # Fetches the list of available updates
    $ sudo apt-get upgrade -y        # Upgrades the currently installed packages


    Install the NVIDIA Drivers


    NVIDIA driver version 367 or later must be installed. You can install it with the Ubuntu built-in Additional Drivers tool after updating the driver packages, or manually as described below.

    1. Go to NVIDIA’s website.
    2. Download the latest version of the driver. The example below uses a Linux 64-bit driver (NVIDIA-Linux-x86_64-375.51).
    3. Exit the GUI (the drivers for graphics devices run at a low level).
      $ sudo service lightdm stop
    4. Set the runlevel to 3 with the init program.
      $ sudo init 3
    5. Once the download has completed, follow the steps listed below.
      $ sudo dpkg -i nvidia-driver-local-repo-ubuntu1604_375.51-1_amd64.deb
      $ sudo apt-get update
      $ sudo apt-get install cuda-drivers
      During the installation you may be asked to confirm several prompts, such as a pre-install script failure or missing 32-bit libraries.
    6. Once the installation is complete, restart the server.
      $ sudo reboot


    Verify the Installation

    Make sure the NVIDIA driver can work correctly with the installed GPU card.

    $ lsmod |grep nvidia



    Run the nvidia-debugdump utility to collect internal GPU information.

    $ nvidia-debugdump -l

    Run the nvidia-smi utility to check the NVIDIA System Management Interface.

    $ nvidia-smi
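
    The driver requirement stated earlier (version 367 or later) can be checked in a script as well. This is a small sketch under assumptions: driver_at_least is a hypothetical helper name, and only the major part of the version string is compared.

```shell
# Sketch only: driver_at_least is a hypothetical helper name.
# Usage: driver_at_least <version string> [required major version]
# Succeeds when the major part of a version such as "375.51" is at
# least the required major version (367 by default, as in this guide).
driver_at_least() {
    version=$1
    required=${2:-367}
    major=${version%%.*}
    case $major in
        ''|*[!0-9]*) return 2 ;;    # not a plain number
    esac
    [ "$major" -ge "$required" ]
}

# Possible use on a host with the driver installed:
#   ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_at_least "$ver" || echo "driver $ver is too old"
```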

    Enable the Subnet Manager (SM) on the IB Switch

    There are three options for where to run the SM:

    1. Enable the SM on one of the managed switches. This is a convenient and quick operation that makes Infiniband ‘plug & play’.
    2. Run /etc/init.d/opensmd on one or more servers. Running the SM on a server is recommended when there are 648 nodes or more.
    3. Use a dedicated Unified Fabric Management (UFM®) Appliance server. UFM offers much more than the SM. It needs more compute power than the existing switches have, but does not require an expensive server. The dedicated server does represent an additional cost.

    Only options 1 and 2 are explained below.


    Option 1: Configuring the SM on a Switch (MLNX-OS®, all Mellanox switch systems)
    To enable the SM on one of the managed switches, follow the steps below.

    1. Log in to the switch and enter config mode:
      Mellanox MLNX-OS Switch Management

      switch login: admin
      Last login: Wed Aug 12 23:39:01 on ttyS0

      Mellanox Switch

      switch [standalone: master] > enable
      switch [standalone: master] # conf t
      switch [standalone: master] (config) #
    2. Run the command:
      switch [standalone: master] (config) # ib sm
      switch [standalone: master] (config) #
    3. Check if the SM is running. Run:
      switch [standalone: master] (config) # show ib sm
      switch [standalone: master] (config) #

    To save the configuration (permanently), run:

    switch (config) # configuration write



    Option 2: Configuring the SM on a Server (Skip this procedure if you enable SM on switch)

    To start up OpenSM on a server, simply run opensm from the command line on your management node by typing:

    # opensm


    To start OpenSM automatically on the head node, edit the /etc/opensm/opensm.conf file.

    Create a configuration file by running:

    # opensm --create-config /etc/opensm/opensm.conf

    Edit /etc/opensm/opensm.conf file with the following line:


    Upon initial installation, OpenSM is configured and running with a default routing algorithm. When running a multi-tier fat-tree cluster, it is recommended to change the following options to create the most efficient routing algorithm delivering the highest performance:


    For full details on other configurable attributes of OpenSM, see the “OpenSM – Subnet Manager” chapter of the Mellanox OFED for Linux User Manual.


    Installing Mellanox OFED for Ubuntu

    This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with a Mellanox ConnectX®-5 adapter card installed. For more information, see the Mellanox OFED for Linux User Manual.


    Downloading Mellanox OFED

    1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed.
      # lspci -v | grep Mellanox
      The following example shows a system with an installed Mellanox HCA:
    2. Download the installation package (tgz) for your OS to your host.
      The package name has the format
      MLNX_OFED_LINUX-<ver>-<OS label><CPUarch>.tgz. You can download it from the Mellanox website: Products > Software > InfiniBand/VPI Drivers > Mellanox OFED Linux (MLNX_OFED) > Download.
    3. Use the md5sum utility to confirm the integrity of the downloaded file. Run the following command and compare the result to the value provided on the download page.
      # md5sum MLNX_OFED_LINUX-<ver>-<OS label>.tgz
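
    The comparison in step 3 can be automated instead of being done by eye. A minimal sketch follows, assuming the digest has been copied from the download page; verify_md5 is a hypothetical helper name and not part of MLNX_OFED.

```shell
# Sketch only: verify_md5 is a hypothetical helper name.
# Usage: verify_md5 <file> <expected digest from the download page>
# Succeeds when the digests match (the comparison ignores letter case).
verify_md5() {
    file=$1
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -r "$file" ] || return 2          # missing or unreadable file
    actual=$(md5sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}

# Possible use (file name and digest are placeholders to fill in):
#   verify_md5 "MLNX_OFED_LINUX-<ver>-<OS label>.tgz" "<digest>" \
#       || echo "checksum mismatch, download the package again"
```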

    Installing Mellanox OFED

    MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:

    • Discovers the currently installed kernel
    • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
    • Installs the MLNX_OFED_LINUX binary packages (if they are available for the current kernel)
    • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

    The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

    1. Log into the installation machine as root.
    2. Copy the downloaded tgz to /tmp
    3. Extract the package and enter the extracted directory.
      # cd /tmp
      # tar -xzvf MLNX_OFED_LINUX-4.2-
      # cd MLNX_OFED_LINUX-4.2-
    4. Run the installation script.
      # ./mlnxofedinstall --all --force
    5. Restart openibd.
      # /etc/init.d/openibd restart
    6. Reboot after the installation has finished successfully.
      # reboot
      By default both ConnectX®-5 VPI ports are initialized as Infiniband ports.
    7. Disable the unused 2nd port on the device.
      Identify the PCI IDs of your NIC ports:
      # lspci | grep Mellanox
      05:00.0 Infiniband controller: Mellanox Technologies Device 1019
      05:00.1 Infiniband controller: Mellanox Technologies Device 1019
      Disable the 2nd port:
      # echo 0000:05:00.1 > /sys/bus/pci/drivers/mlx5_core/unbind
    8. Check that the ports’ mode is Infiniband.
      # ibv_devinfo

    9. If the ports are not in Infiniband mode, change the port type to Infiniband.
      ConnectX®-5 ports can be individually configured to work as Infiniband or Ethernet ports.
      Use the mlxconfig tool after the driver is loaded.
      * LINK_TYPE_P1=1 is Infiniband mode
      a. Start mst and see ports names
      # mst start
      # mst status
      b. Change the mode of port 1 to Infiniband:
      # mlxconfig -d /dev/mst/mt4121_pciconf0 s LINK_TYPE_P1=1
      #Port 1 set to IB mode
      # reboot
      After each reboot you need to disable the 2nd port again.
      c. Query the Infiniband devices and print the information about them that is available for use from userspace.
      # ibv_devinfo

    10. Run the ibdev2netdev utility to see all the associations between the Ethernet devices and the IB devices/ports.
      # ibdev2netdev
      # ifconfig ib0 netmask
    11. Add the ib0 lines below to the /etc/network/interfaces file, after the existing eno1 lines:
      # vim /etc/network/interfaces
      auto eno1
      iface eno1 inet dhcp
      The new lines:
      auto ib0
      iface ib0 inet static
      Example of the resulting file:
      # vim /etc/network/interfaces

      auto eno1
      iface eno1 inet dhcp

      auto ib0
      iface ib0 inet static
    12. Check that the network configuration is set correctly.
      # ifconfig -a
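
    Because the 2nd port has to be disabled again after each reboot (step 7), it can help to derive its PCI address from the lspci output rather than typing it. The following is a sketch under assumptions: second_port_bdf is a hypothetical helper name, the PCI domain is assumed to be 0000 as in the example above, and the second port is assumed to be PCI function .1.

```shell
# Sketch only: second_port_bdf is a hypothetical helper name.
# It reads `lspci | grep Mellanox` style lines on stdin and prints the
# full PCI address (domain 0000 assumed) of every function numbered .1,
# which is the second port of each adapter in the example above.
second_port_bdf() {
    awk '$1 ~ /\.1$/ { print "0000:" $1 }'
}

# Possible use as root after each reboot:
#   for bdf in $(lspci | grep Mellanox | second_port_bdf); do
#       echo "$bdf" > /sys/bus/pci/drivers/mlx5_core/unbind
#   done
```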


    Installing and Configuring Docker

    Uninstall old versions

    To uninstall old versions, we recommend running the following command:

    $ sudo apt-get remove docker docker-engine

    It’s OK if apt-get reports that none of these packages are installed.

    The contents of /var/lib/docker/, including images, containers, volumes, and networks, are preserved.


    Install Docker CE

    For Ubuntu 16.04 and higher, the Linux kernel includes support for OverlayFS, and Docker CE will use the overlay2 storage driver by default.


    Install using the repository

    Before you install Docker CE for the first time on a new host machine, you need to set up the Docker repository. Afterward, you can install and update Docker from the repository.


    Set Up the repository

    1. Update the apt package index:
      $ sudo apt-get update
    2. Install packages to allow apt to use a repository over HTTPS:
      $ sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
    3. Add Docker’s official GPG key:
      $ sudo curl -fsSL | sudo apt-key add -

      Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88.
      $ sudo apt-key fingerprint 0EBFCD88
      pub   4096R/0EBFCD88 2017-02-22
      Key fingerprint = 9DC8 5822 9FC7 DD38 854A  E2D8 8D81 803C 0EBF CD88
      uid                  Docker Release (CE deb) <>
      sub   4096R/F273FCD8 2017-02-22


    Install Docker CE

    Add the Docker stable repository and install the latest version of Docker CE. Any existing installation of Docker is replaced.

    $ sudo add-apt-repository "deb [arch=amd64]  $(lsb_release -cs) stable"
    $ sudo apt-get update
    $ sudo apt-get install docker-ce


    Customize the docker0 bridge

    The recommended way to configure the Docker daemon is to use the daemon.json file, which is located in /etc/docker/ on Linux. If the file does not exist, create it. You can specify one or more of the following settings to configure the default bridge network:

    {
         "bip": "",
         "fixed-cidr": "",
         "mtu": 1500,
         "dns": ["",""]
    }

    The same options are presented as flags to dockerd, with an explanation for each:

    • --bip=CIDR: supply a specific IP address and netmask for the docker0 bridge, using standard CIDR notation. For example:
    • --fixed-cidr=CIDR: restrict the IP range from the docker0 subnet, using standard CIDR notation. For example:
    • --mtu=BYTES: override the maximum packet length on docker0. For example: 1500.
    • --dns=[]: The DNS servers to use. For example: --dns=,


    Restart Docker after making changes to the daemon.json file.

    $ sudo /etc/init.d/docker restart
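
    Since each host needs its own bridge subnet, the daemon.json content described above can be generated per host. This is a minimal sketch, not the original author's method: write_daemon_json is a hypothetical helper name, and every address in the usage comment is an illustrative placeholder rather than a value from this setup.

```shell
# Sketch only: write_daemon_json is a hypothetical helper name.
# Usage: write_daemon_json <bip> <fixed-cidr> <mtu> <dns1> <dns2>
# Prints a daemon.json body for the default bridge to stdout.
write_daemon_json() {
    cat <<EOF
{
    "bip": "$1",
    "fixed-cidr": "$2",
    "mtu": $3,
    "dns": ["$4", "$5"]
}
EOF
}

# Possible use on one host (all addresses are hypothetical examples):
#   write_daemon_json 172.16.41.1/24 172.16.41.0/24 1500 8.8.8.8 8.8.4.4 \
#       | sudo tee /etc/docker/daemon.json
```

    Giving each host a distinct, non-overlapping bridge subnet is what makes the static routes in the later routing step possible.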


    Enable communication with the outside world

    Check that IP forwarding is enabled in the kernel:

    $ sysctl net.ipv4.conf.all.forwarding

    net.ipv4.conf.all.forwarding = 1

    If it is disabled:

    net.ipv4.conf.all.forwarding = 0

    enable it and check again:

    $ sudo sysctl net.ipv4.conf.all.forwarding=1


    For security reasons, Docker configures the iptables rules to prevent traffic forwarding to containers from outside the host machine. Docker sets the default policy of the FORWARD chain to DROP.

    To override this default behavior you can manually change the default policy:

    $ sudo iptables -P FORWARD ACCEPT


    Add IP route with specific subnet

    On each host, add routes to the container subnets on the other hosts. The example below shows the routes to be added on host-41:

    host-41$ sudo ip route add via
    host-41$ sudo ip route add via
    host-41$ sudo ip route add via
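
    With four hosts, each host needs three such routes, which is easy to get wrong by hand. The sketch below prints the commands for one host from a list of hosts. print_routes is a hypothetical helper name, and the subnets and gateway addresses in the usage comment are illustrative placeholders, not the values used in this setup.

```shell
# Sketch only: print_routes is a hypothetical helper name.
# Usage: print_routes <local host> <name:container_subnet:host_ip>...
# Prints one "ip route add" command for every host except the local one.
print_routes() {
    self=$1
    shift
    for entry in "$@"; do
        name=${entry%%:*}
        rest=${entry#*:}
        subnet=${rest%%:*}
        gateway=${rest#*:}
        if [ "$name" != "$self" ]; then
            echo "sudo ip route add $subnet via $gateway"
        fi
    done
}

# Possible use on host-41 (all addresses are hypothetical examples):
#   print_routes host-41 \
#       host-41:172.16.41.0/24:10.0.0.41 \
#       host-42:172.16.42.0/24:10.0.0.42 \
#       host-43:172.16.43.0/24:10.0.0.43 \
#       host-44:172.16.44.0/24:10.0.0.44
```

    The helper only prints the commands, so they can be reviewed before being run.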

    A quick check on each host

    Give your environment a quick test by spawning a simple container:

    $ docker run hello-world


    Create or pull a base image and run a container


    Docker can build images automatically by reading the instructions from a Dockerfile.

    A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image.



    1. Create an empty directory.
    2. Enter the new directory, create a file called Dockerfile, copy-and-paste the following content into that file, and save it.
      Take note of the comments that explain each statement in your new Dockerfile.

    FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04

    MAINTAINER Boris Kovalev <>

    # caffe2 install with gpu support

    RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cmake \
        git \
        libgflags-dev \
        libgoogle-glog-dev \
        libgtest-dev \
        libiomp-dev \
        libleveldb-dev \
        liblmdb-dev \
        libopencv-dev \
        libopenmpi-dev \
        libprotobuf-dev \
        libsnappy-dev \
        openmpi-bin \
        openmpi-doc \
        protobuf-compiler \
        python-dev \
        python-numpy \
        python-pip \
        python-pydot \
        python-setuptools \
        python-scipy \
        wget \
        && rm -rf /var/lib/apt/lists/*

    RUN pip install --no-cache-dir --upgrade pip setuptools wheel && \
        pip install --no-cache-dir \
        flask \
        future \
        graphviz \
        hypothesis \
        jupyter \
        matplotlib \
        numpy \
        protobuf \
        pydot \
        python-nvd3 \
        pyyaml \
        requests \
        scikit-image \
        scipy \
        setuptools \
        six \
        tornado

    # Download and install Mellanox OFED 4.2

    RUN wget && \
            tar -xzvf MLNX_OFED_LINUX-4.2- && \
            MLNX_OFED_LINUX-4.2- --user-space-only --without-fw-update --all -q && \
            cd .. && \
            rm -rf MLNX_OFED_LINUX-4.2- && \
            rm -rf *.tgz

    # Download and install Caffe2

    RUN git clone --recursive

    # Run the following instructions inside the cloned source tree
    # (a "RUN cd" does not persist between instructions)
    WORKDIR caffe2

    # 91f63a2361fb8671e103a8d5601adec8354299b5 is the commit of the desired (stable) version
    RUN git checkout 91f63a2361fb8671e103a8d5601adec8354299b5

    RUN git submodule update
    RUN git submodule sync --recursive
    RUN git submodule update --init --recursive

    RUN mkdir build && cd build \
        && cmake .. \
        -DCUDA_ARCH_NAME=Manual \
        -DCUDA_ARCH_BIN="35 52 60 61" \
        -DCUDA_ARCH_PTX="61" \
        -DUSE_NNPACK=OFF \
        -DUSE_IBVERBS=ON \
        && make -j"$(nproc)" install \
        && ldconfig \
        && make clean \
        && cd .. \
        && rm -rf build

    ENV PYTHONPATH /usr/local

    CMD ["/bin/bash"]


    Build Docker Image and run the container

    1. Now run the build command. This creates a Docker image, which we’re going to tag using -t so it has a friendly name.
      $ nvidia-docker build -t caffe2image .
    2. Where is your built image? It’s in your machine’s local Docker image registry:
      $ nvidia-docker images
    3. Run a Docker container in non-privileged mode from the built image:
      $ nvidia-docker run -it --cap-add=IPC_LOCK --device=/dev/infiniband/uverbs1 --name=my-caffe2-nonprvlg caffe2image bash
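
    The run command above passes a single uverbs device to the container. If a host exposes more than one, the --device options can be built from a list of paths. This is only a sketch: device_flags is a hypothetical helper name, and whether more than one device is needed depends on your adapter configuration.

```shell
# Sketch only: device_flags is a hypothetical helper name.
# Usage: device_flags <device path>...
# Prints one --device=<path> option per argument, space separated.
device_flags() {
    flags=""
    for dev in "$@"; do
        flags="$flags --device=$dev"
    done
    printf '%s\n' "${flags# }"
}

# Possible use (the word splitting of the substitution is intentional):
#   nvidia-docker run -it --cap-add=IPC_LOCK \
#       $(device_flags /dev/infiniband/uverbs*) \
#       --name=my-caffe2-nonprvlg caffe2image bash
```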

    Verify CUDA in the container

    Ensure you are in the container. Run the nvidia-smi utility to check the NVIDIA System Management Interface.

    containerID$ nvidia-smi

    Validate Caffe2 in the container

    To validate the Caffe2 installation run the following commands:

    containerID$ python -m caffe2.python.operator_test.relu_op_test

    Validate MOFED

    Check the mofed version and uverbs:

    containerID$ ofed_info -s
    containerID$ ls /dev/infiniband/uverbs1

    Run a bandwidth stress test over IB in the container.

    On the server side:

    ib_write_bw -a -d mlx5_1 &

    On the client side:

    ib_write_bw -a -F $server_IP -d mlx5_1 --report_gbits

    In this way you can run Bandwidth stress over IB between containers.


    Distributed Caffe2 run - sample

    To run distributed Caffe2, I use HPC-X or Open MPI, which must be installed beforehand.

    I use a custom imagenet_cars_boats dataset in my runs.

    mpirun is used here only to launch the processes on the cluster nodes. You could also start the process on each node by hand.

    Then run the MPI command line. It is long but easy to understand.


    $ bs=64 if=mlx5_1 tr=ibverbs traindata="/cfdata/imagenet_cars_boats_train" testdata="/cfdata/imagenet_cars_boats_val" filestore="/tmp"
    $ mpirun \
        -x PYTHONPATH=/caffe2/build -host -n 1 python --train_data $traindata --test_data $testdata --num_gpus 4 --batch_size $bs --num_shards=4 --shard_id=0 --run_id=1234 --file_store_path $filestore --distributed_transport=$tr --distributed_interface=$if : \
        -x PYTHONPATH=/caffe2/build -host -n 1 python --train_data $traindata --test_data $testdata --num_gpus 4 --batch_size $bs --num_shards=4 --shard_id=1 --run_id=1234 --file_store_path $filestore --distributed_transport=$tr --distributed_interface=$if : \
        -x PYTHONPATH=/caffe2/build -host -n 1 python --train_data $traindata --test_data $testdata --num_gpus 4 --batch_size $bs --num_shards=4 --shard_id=2 --run_id=1234 --file_store_path $filestore --distributed_transport=$tr --distributed_interface=$if : \
        -x PYTHONPATH=/caffe2/build -host -n 1 python --train_data $traindata --test_data $testdata --num_gpus 4 --batch_size $bs --num_shards=4 --shard_id=3 --run_id=1234 --file_store_path $filestore --distributed_transport=$tr --distributed_interface=$if
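
    The command line repeats the same arguments for every node and changes only the host and the shard ID, so it can be generated. The sketch below is one possible way to do that, not the original author's method. build_mpirun is a hypothetical helper name, the trainer script path is an argument you must supply, and the defaults are taken from the variables in the command above.

```shell
# Sketch only: build_mpirun is a hypothetical helper name.
# Usage: build_mpirun <trainer script> <host>...
# Prints one mpirun command with one application context per host,
# separated by ":"; shard IDs are numbered from 0 in argument order.
# Optional environment overrides: TRAIN_DATA, TEST_DATA, BATCH_SIZE,
# FILE_STORE, TRANSPORT, INTERFACE.
build_mpirun() {
    trainer=$1
    shift
    shards=$#
    id=0
    cmd="mpirun"
    for host in "$@"; do
        if [ "$id" -gt 0 ]; then
            cmd="$cmd :"
        fi
        cmd="$cmd -x PYTHONPATH=/caffe2/build -host $host -n 1 python $trainer"
        cmd="$cmd --train_data ${TRAIN_DATA:-/cfdata/imagenet_cars_boats_train}"
        cmd="$cmd --test_data ${TEST_DATA:-/cfdata/imagenet_cars_boats_val}"
        cmd="$cmd --num_gpus 4 --batch_size ${BATCH_SIZE:-64}"
        cmd="$cmd --num_shards=$shards --shard_id=$id --run_id=1234"
        cmd="$cmd --file_store_path ${FILE_STORE:-/tmp}"
        cmd="$cmd --distributed_transport=${TRANSPORT:-ibverbs}"
        cmd="$cmd --distributed_interface=${INTERFACE:-mlx5_1}"
        id=$((id + 1))
    done
    printf '%s\n' "$cmd"
}

# Possible use with the four servers of this guide:
#   build_mpirun <trainer script> clx-mld-41 clx-mld-42 clx-mld-43 clx-mld-44
```

    Deriving --num_shards from the number of hosts keeps the shard count and the shard IDs consistent when nodes are added or removed.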