Reference Deployment Guide for RDMA over Converged Ethernet (RoCE) accelerated Caffe2 with an NVIDIA GPU Card over 100Gb Ethernet Network running on Docker Containers

Version 15

    This document demonstrates a distributed deployment of RDMA-accelerated Caffe2 running in Docker containers over a Mellanox end-to-end 100 Gb/s Ethernet solution.

    It describes how to build Caffe2 from source for Ubuntu 16.04.3 LTS and Docker 18.03.0-ce on 8 physical servers.







    Setup Overview

    Before you start, make sure you are familiar with distributed training; see the following link for more information.

    In the distributed Caffe2 configuration described in this guide, we are using the following hardware specification.


    What is Caffe2?

    Caffe2 is a deep learning framework that provides an easy and straightforward way for you to experiment with deep learning and leverage community contributions of new models and algorithms.

    You can bring your creations to scale using the power of GPUs in the cloud or to the masses on mobile with Caffe2’s cross-platform libraries.

    Caffe2 supports CUDA 9.1 and cuDNN 7.1 (downloading cuDNN requires registration). In this guide we install Caffe2 from sources, following the instructions on the Caffe2 website, for an easier installation.

    In order to use Caffe2 with GPU support, you must have an NVIDIA GPU with a minimum compute capability of 3.0.


    What is Docker?

    Docker is a tool designed to make it easier to create, deploy, and run applications by using containers.

    Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package (from "What is Docker?").


    Mellanox’s Machine Learning

    Mellanox solutions accelerate many of the world’s leading artificial intelligence and machine learning platforms and a wide range of applications, from security, finance, and image and voice recognition to self-driving cars and smart cities.

    Mellanox solutions enable companies and organizations such as Baidu, NVIDIA, Facebook, PayPal and more to leverage machine learning platforms to enhance their competitive advantage.

    In this post we show how to build a highly efficient machine learning cluster enhanced by native RDMA over a 100 Gb/s Ethernet network.



    Solution Design

    Setup Logical Design

    Server Logical Design


























    HW configuration

    Bill of Materials - BOM

    In the distributed Caffe2 configuration described in this guide, we are using the following hardware specification.

    This document does not cover the servers’ storage aspect. You should configure the servers with the storage components appropriate to your use case (data set size).



    Physical Network Connections


    Docker Network Diagram

    Docker’s networking subsystem is pluggable, using drivers. Several drivers exist by default, and provide core networking functionality.

    In our setup we use the host network driver for the containers. With this driver, the container’s network stack is not isolated from the Docker host.

    For instance, if you run a container which binds to port 80 and you use host networking, the container’s application will be available on port 80 on the host’s IP address.


    Network Configuration

    In our reference we will use a single port per server. In case of a single port NIC we will wire the available port.

    In case of dual port NIC we will wire the 1st port to an Ethernet switch and will not use the 2nd port.

    We will cover the procedure later in the Installing Mellanox OFED section.

    Each server is connected to the SN2700 switch by a 100GbE copper cable.

    The switch port connectivity in our case is as follows:

    • Ports 1-8: connected to the Node Servers


    Server names and network configuration are provided below.

    Server type      Server name   IP and NICs (internal network: 100 GbE; management network: 1 GbE)
    Node Server 01   clx-mld-41    enp0f0: From DHCP (reserved)
    Node Server 02   clx-mld-42    enp0f0: From DHCP (reserved)
    Node Server 03   clx-mld-43    enp0f0: From DHCP (reserved)
    Node Server 04   clx-mld-44    enp0f0: From DHCP (reserved)
    Node Server 05   clx-mld-45    enp0f0: From DHCP (reserved)
    Node Server 06   clx-mld-46    enp0f0: From DHCP (reserved)
    Node Server 07   clx-mld-47    enp0f0: From DHCP (reserved)
    Node Server 08   clx-mld-48    enp0f0: From DHCP (reserved)


    Switch OS installation / configuration


    If you are not familiar with Mellanox switch software, please start with the HowTo Get Started with Mellanox switches guide.

    For more information please refer to the MLNX-OS User Manual located at or -> Switches


    As a first step, update your switch OS to the latest ONYX OS software. Please use the community guide HowTo Upgrade MLNX-OS Software on Mellanox switch systems.

    We will accelerate Caffe2 by using RDMA transport.
    There are several industry-standard network configurations for RoCE deployment.

    You are welcome to follow the Recommended Network Configuration Examples for RoCE Deployment for our recommendations and instructions.

    In our deployment we’ll configure our network to be lossless and will use DSCP on host and switch sides:


    Below is our switch configuration, which you can use as a reference. You can copy/paste it to your switch, but please be aware that this is a clean switch configuration and applying it may corrupt your existing configuration.


    swx-mld-1-2 [standalone: master] > enable

    swx-mld-1-2 [standalone: master] # configure terminal

    swx-mld-1-2 [standalone: master] (config) # show running-config


    ## Running database "initial"

    ## Generated at 2018/03/10 09:38:38 +0000

    ## Hostname: swx-mld-1-2




    ## Running-config temporary prefix mode setting


    no cli default prefix-modes enable



    ## License keys


    license install LK2-RESTRICTED_CMDS_GEN2-44T1-4H83-RWA5-G423-GY7U-8A60-E0AH-ABCD



    ## Interface Ethernet buffer configuration


    traffic pool roce type lossless

    traffic pool roce memory percent 50.00

    traffic pool roce map switch-priority 3



    ## LLDP configuration





    ## QoS switch configuration


    interface ethernet 1/1-1/32 qos trust L3

    interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500



    ## DCBX ETS configuration


    interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict




    ## Other IP configuration


    hostname swx-mld-1-2



    ## AAA remote server configuration


    # ldap bind-password ********

    # radius-server key ********

    # tacacs-server key ********



    ## Network management configuration


    # web proxy auth basic password ********



    ## X.509 certificates configuration



    # Certificate name system-self-signed, ID 108bb9eb3e99edff47fc86e71cba530b6a6b8991

    # (public-cert config omitted since private-key config is hidden)



    ## Persistent prefix mode setting


    cli default prefix-modes enable


    Nodes installation / configuration


    Required Host Software

    Prior to installing Caffe2, the following software must be installed.


    Disable the Nouveau Kernel Driver on the Host


    Prior to installing the latest NVIDIA drivers and CUDA on Ubuntu 16.04, the Nouveau kernel driver must be disabled. To disable it, follow the procedure below.


    1. Check that the Nouveau kernel driver is loaded.
      $ lsmod |grep nouv
    2. Remove all NVIDIA packages.

      Skip this step if your system is freshly installed.
      $ sudo apt-get remove 'nvidia*' && sudo apt autoremove
    3. Install the packages required to build kernel modules.

      $ sudo apt-get install dkms build-essential linux-headers-generic -y
    4. Block and disable the Nouveau kernel driver.
      $ sudo vim /etc/modprobe.d/blacklist.conf
    5. Insert the following lines into the blacklist.conf file.
      blacklist nouveau
      blacklist lbm-nouveau
      options nouveau modeset=0
      alias nouveau off
      alias lbm-nouveau off
    6. Disable the Nouveau kernel module and update the initramfs image. (Although the nouveau-kms.conf file may not exist, it will not affect this step).
      $ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
      $ sudo update-initramfs -u
    7. Reboot
      $ sudo reboot
    8. Check that the Nouveau kernel driver is not loaded.
      $ lsmod |grep nouveau
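
    The checks in steps 1 and 8 can also be scripted. The sketch below is a minimal illustration and not part of the original procedure: module_loaded is a hypothetical helper name. It compares the module name exactly against the first column of /proc/modules (the list that lsmod prints), whereas grep matches any line that merely contains the text.

```shell
#!/bin/sh
# Hypothetical helper (not from the original guide): succeed if the module
# named in $1 appears in the first column of the listing read from stdin.
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# On a Linux host, /proc/modules holds the same list that lsmod prints.
if [ -r /proc/modules ]; then
    if module_loaded nouveau < /proc/modules; then
        echo "nouveau is still loaded"
    else
        echo "nouveau is not loaded"
    fi
fi
```

    After the reboot in step 7, the script is expected to report that nouveau is not loaded; if it still reports the module as loaded, recheck the blacklist file and the initramfs update.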

    Update Ubuntu Software Packages

    To update/upgrade Ubuntu software packages, run the commands below.

    $ sudo apt-get update # Fetches the list of available updates
    $ sudo apt-get upgrade -y # Strictly upgrades the current packages


    Install the NVIDIA Drivers


    The 390 (or later) NVIDIA drivers must be installed. To install them, you can use the Ubuntu built-in Additional Drivers tool after updating the driver packages, or follow the manual procedure below.

    1. Go to the NVIDIA website.
    2. Download the latest version of the driver. The example below uses a Linux 64-bit driver (NVIDIA-Linux-x86_64-390.12_1).
    3. Exit the GUI (as the drivers for graphic devices are running at a low level). 
      $ sudo service lightdm stop
    4. Set the RunLevel to 3 with the program init.
      $ sudo init 3
    5. Once you accept the download please follow the steps listed below.
      $ sudo dpkg -i nvidia-driver-local-repo-ubuntu1604-390.12_1.0-1_amd64.deb
      $ sudo apt-get update
      $ sudo apt-get install cuda-drivers
      During the run, you will be asked to confirm several things, such as a pre-install script failure, missing 32-bit libraries, and more.
    6. Once the drivers are installed, restart your computer.
      $ sudo reboot


    Verify the Installation

    Make sure the NVIDIA driver can work correctly with the installed GPU card.

    $ lsmod |grep nvidia



    Run the nvidia-debugdump utility to collect internal GPU information.

    $ nvidia-debugdump -l

    Run the nvidia-smi utility to check the NVIDIA System Management Interface.

    $ nvidia-smi
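
    The driver requirement (390 or later) can be checked in a script as well. This is a minimal sketch under stated assumptions: driver_at_least is a hypothetical helper name, and only the major part of the version string is compared. The nvidia-smi query options used here print just the driver version, one line per GPU.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the major part of driver version $1
# (for example 390.12) is at least $2.
driver_at_least() {
    major=${1%%.*}
    [ "$major" -ge "$2" ] 2>/dev/null
}

# Only runs the query when nvidia-smi is installed on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=driver_version --format=csv,noheader |
    while read -r version; do
        if driver_at_least "$version" 390; then
            echo "driver $version: OK"
        else
            echo "driver $version: older than 390 or not readable"
        fi
    done
fi
```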


    Installing Mellanox OFED for Ubuntu

    This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with a Mellanox ConnectX®-5 adapter card installed.

    For more information, see the Mellanox OFED for Linux User Manual.



    Downloading Mellanox OFED

    1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed.
      # lspci -v | grep Mellanox
      The following example shows a system with an installed Mellanox HCA:
    2. Download the ISO image that matches your OS to your servers' shared folder.
      The image’s name has the format
      MLNX_OFED_LINUX-<ver>-<OS label><CPUarch>.iso. You can download it from: > Products > Software > InfiniBand/VPI Drivers > Mellanox OFED Linux (MLNX_OFED) > Download.
    3. Use the MD5SUM utility to confirm the downloaded file’s integrity. Run the following command and compare the result to the value provided on the download page.
      # md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso
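
    The comparison in step 3 can be automated instead of being done by eye. The following is a minimal sketch: md5_matches is a hypothetical helper name, and the file path and digest in the usage comment are placeholders, not values from this guide.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the MD5 digest of file $1 equals $2.
md5_matches() {
    actual=$(md5sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}

# Example call; both arguments are placeholders to replace with the real
# ISO path and the digest shown on the download page:
#   md5_matches /path/to/MLNX_OFED_LINUX.iso <digest from download page> \
#       && echo "checksum OK" || echo "checksum MISMATCH"
```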

    Installing Mellanox OFED

    MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:

    • Discovers the currently installed kernel
    • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
    • Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
    • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

    The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

    1. Log into the installation machine as root.
    2. Copy the downloaded ISO to /root
    3. Mount the ISO image on your machine.
      # mkdir /mnt/iso
      # mount -o loop /share/MLNX_OFED_LINUX-4.2- /mnt/iso
      # cd /mnt/iso
    4. Run the installation script.
      # ./mlnxofedinstall
    5. Reboot after the installation finishes successfully.
      # /etc/init.d/openibd restart
      # reboot
      By default both ConnectX®-5 VPI ports are initialized as InfiniBand ports.
      ConnectX®-5 ports can be individually configured to work as InfiniBand or Ethernet ports.
    6. Check that the ports’ mode is Ethernet.
      # ibv_devinfo
    7. If the ports are reported as InfiniBand, change the interface port type to Ethernet mode.
      Use the mlxconfig script after the driver is loaded.
      * LINK_TYPE_P1=2 is Ethernet mode.
      a. Start mst and see the port names.
      # mst start
      # mst status
      b. Change the mode of port 1 to Ethernet:
      # mlxconfig -d /dev/mst/mt4121_pciconf0 s LINK_TYPE_P1=2
      Port 1 set to Ethernet mode
      # reboot
      c. Query the Ethernet devices and print the information available to use it from the userspace.
      # ibv_devinfo
    8. Run the ibdev2netdev utility to see all the associations between the Ethernet devices and the IB devices/ports.
      # ibdev2netdev
      # ifconfig enp1f0 netmask
    9. Edit the /etc/network/interfaces file and add the new lines below after the following existing lines:
      # vim /etc/network/interfaces

      auto eno0
      iface eno0 inet dhcp
      The new lines:
      auto enp1f0
      iface enp1f0 inet static
      address 31.31.31.xx
      Example of the resulting file:
      auto eno0
      iface eno0 inet dhcp

      auto enp1f0
      iface enp1f0 inet static
      address 31.31.31.xx
    10. Check that the network configuration is set correctly.
      # ifconfig -a
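
    The port mode check from steps 6 and 7 can be scripted by reading the link_layer field that ibv_devinfo prints for each port. This is a minimal sketch, not part of the original procedure; all_ports_ethernet is a hypothetical helper name, and it treats output with no ports at all as a failure.

```shell
#!/bin/sh
# Hypothetical helper: read ibv_devinfo output on stdin and succeed only
# if at least one port is listed and every link_layer value is Ethernet.
all_ports_ethernet() {
    awk '
        $1 == "link_layer:" { ports++; if ($2 != "Ethernet") bad++ }
        END { exit !(ports > 0 && bad == 0) }
    '
}

# Only runs the check when the MLNX_OFED user-space tools are installed.
if command -v ibv_devinfo >/dev/null 2>&1; then
    if ibv_devinfo | all_ports_ethernet; then
        echo "all ports are in Ethernet mode"
    else
        echo "at least one port is not in Ethernet mode (or no ports were found)"
    fi
fi
```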

    Lossless fabric with L3 (DSCP) configuration

    RDMA was initially developed for InfiniBand networks, which are inherently lossless. They incorporate link-level flow control to ensure that packets are not dropped within the fabric.
    RoCE implements the RDMA protocol over a standard Ethernet/IP network, which can be lossy.
    Due to the performance implications of a lossy network when running RoCE, it is recommended to enable flow control within your fabric.

    1. Check the flow control settings of the Mellanox network adapters by running the command: ethtool -a <mlnx interface name>
      RX and TX should both be off. If the RX and TX settings are turned on, disable them:
      # ethtool -A enp1f0 rx off tx off
      # ethtool -a enp1f0
      Pause parameters for enp1f0:
      Autonegotiate: off
      RX: off
      TX: off
    2. Follow the procedure in this post: Lossless RoCE Configuration for Linux Drivers in DSCP-Based QoS Mode
      It provides a configuration example for Mellanox devices installed with MLNX_OFED running RoCE over a
      lossless network, in DSCP-based QoS mode.
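
    The pause-parameter check in step 1 can be scripted by parsing the ethtool -a output shown above. This is a minimal sketch; pause_is_off is a hypothetical helper name, and the interface name in the usage comment is the one used in this guide.

```shell
#!/bin/sh
# Hypothetical helper: read `ethtool -a <interface>` output on stdin and
# succeed only if both the RX and TX pause settings are "off".
pause_is_off() {
    awk '
        $1 == "RX:" { rx = $2 }
        $1 == "TX:" { tx = $2 }
        END { exit !(rx == "off" && tx == "off") }
    '
}

# Example use on a host (enp1f0 is the interface name from this guide):
#   ethtool -a enp1f0 | pause_is_off || echo "global pause is still enabled"
```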

    Docker installation and configuration

    Uninstall old versions

    To uninstall old versions, we recommend running the following command:

    $ sudo apt-get remove docker docker-engine

    It’s OK if apt-get reports that none of these packages are installed.

    The contents of /var/lib/docker/, including images, containers, volumes, and networks, are preserved.


    Install Docker CE

    For Ubuntu 16.04 and higher, the Linux kernel includes support for OverlayFS, and Docker CE will use the overlay2 storage driver by default.


    Install using the repository

    Before you install Docker CE for the first time on a new host machine, you need to set up the Docker repository. Afterward, you can install and update Docker from the repository.


    Set Up the repository

    1. Update the apt package index:
      $ sudo apt-get update
    2. Install packages to allow apt to use a repository over HTTPS:
      $ sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
    3. Add Docker’s official GPG key:
      $ sudo curl -fsSL | sudo apt-key add -

      Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88.
      $ sudo apt-key fingerprint 0EBFCD88
      pub 4096R/0EBFCD88 2017-02-22
      Key fingerprint = 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
      uid Docker Release (CE deb) <>
      sub 4096R/F273FCD8 2017-02-22


    Install Docker CE

    Install the latest version of Docker CE, or go to the next step to install a specific version. Any existing installation of Docker is replaced.

    $ sudo add-apt-repository "deb [arch=amd64] $(lsb_release -cs) stable"
    $ sudo apt-get update
    $ sudo apt-get install docker-ce


    Enable communication with the outside world

    Check that IP forwarding is enabled in the kernel:

    $ sysctl net.ipv4.conf.all.forwarding

    net.ipv4.conf.all.forwarding = 1

    If it is disabled

    net.ipv4.conf.all.forwarding = 0

    please enable it and check again:

    $ sudo sysctl net.ipv4.conf.all.forwarding=1
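
    The same check can be run from a script. The sketch below is a minimal illustration; forwarding_state is a hypothetical helper name. It relies on sysctl -n, which prints only the value of the key.

```shell
#!/bin/sh
# Hypothetical helper: map the sysctl value to a readable state.
forwarding_state() {
    case "$1" in
        1) echo enabled ;;
        0) echo disabled ;;
        *) echo unknown ;;
    esac
}

# sysctl -n prints only the value, which is easier to test in a script.
value=$(sysctl -n net.ipv4.conf.all.forwarding 2>/dev/null)
echo "IPv4 forwarding is $(forwarding_state "$value")"
```

    Note that a value set with the sysctl command lasts only until the next reboot; to keep it, the same key can be added to /etc/sysctl.conf.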


    For security reasons, Docker configures the iptables rules to prevent traffic forwarding to containers from outside the host machine. Docker sets the default policy of the FORWARD chain to DROP.

    To override this default behavior you can manually change the default policy:

    $ sudo iptables -P FORWARD ACCEPT


    Add IP route with specific subnet

    On each host, add routes to the container subnets of the other hosts. The example below shows the routes to be added on host-41:

    host-41$ sudo ip route add via
    host-41$ sudo ip route add via
    host-41$ sudo ip route add via
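
    With several hosts, the route commands can be generated from a list rather than typed by hand. The sketch below is only an illustration: print_route_cmds is a hypothetical helper name, and the subnets and gateways are placeholder values taken from the documentation address ranges, not the addresses of this reference setup. The helper only prints commands and changes nothing on the host.

```shell
#!/bin/sh
# Hypothetical helper: read "subnet gateway" pairs from stdin and print
# one "ip route add" command per pair. It only prints; nothing is changed.
print_route_cmds() {
    while read -r subnet gateway; do
        [ -n "$subnet" ] && [ -n "$gateway" ] || continue
        echo "sudo ip route add $subnet via $gateway"
    done
}

# Placeholder values from the documentation address ranges, NOT the
# subnets of the reference setup.
print_route_cmds <<EOF
192.0.2.0/26 198.51.100.42
192.0.2.64/26 198.51.100.43
EOF
```

    Review the printed commands before running them, and replace the placeholder pairs with the container subnet and host address of each remote node.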

    A quick check on each host

    Give your environment a quick test by spawning a simple container:

    $ docker run hello-world

    Create or pull a base image and run Container


    Docker can build images automatically by reading the instructions from a Dockerfile.

    A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image.


    1. Create an empty directory.
    2. Enter the new directory, create a file called Dockerfile, copy-and-paste the following content into that file, and save it.
      Take note of the comments that explain each statement in your new Dockerfile.

    FROM nvidia/cuda:9.2-cudnn7-devel-ubuntu16.04

    LABEL maintainer="Boris Kovalev"


    # caffe2 install with gpu support


    # Set MOFED directory, image and working directory








    WORKDIR /tmp/


    RUN apt-get update && apt-get install -y --no-install-recommends \

    iproute2 \

    lsb-release \

    redis-tools \

    libhiredis-dev \

    iputils-ping \

    net-tools \

    ethtool \

    perl \

    pciutils \

    libnl-route-3-200 \

    kmod \

    libnuma1 \

    lsof \

    linux-headers-4.4.0-92-generic \

    python-libxml2 \

    build-essential \

    cmake \

    git \

    libgflags-dev \

    libgoogle-glog-dev \

    libgtest-dev \

    libiomp-dev \

    libleveldb-dev \

    liblmdb-dev \

    libopencv-dev \

    libprotobuf-dev \

    libsnappy-dev \

    protobuf-compiler \

    python-dev \

    python-numpy \

    python-pip \

    python-pydot \

    python-setuptools \

    python-scipy \

    wget \

    && rm -rf /var/lib/apt/lists/*


    RUN pip install --no-cache-dir --upgrade pip setuptools wheel && \

    pip install --no-cache-dir \

    flask \

    future \

    graphviz \

    hypothesis \

    jupyter \

    matplotlib \

    numpy \

    protobuf \

    pydot \

    python-nvd3 \

    pyyaml \

    requests \

    scikit-image \

    scipy \

    setuptools \

    six \




    ########## Mellanox OFED INSTALLATION STEPS ###################


    tar -xzvf ${MOFED_IMAGE} && \

    ${MOFED_DIR}/mlnxofedinstall --user-space-only --without-fw-update --all -q && \

    cd .. && \

    rm -rf ${MOFED_DIR} && \

    rm -rf *.tgz


    ########## CAFFE 2 INSTALLATION STEPS ###################


    RUN git clone --branch master --recursive

    RUN cd caffe2 && mkdir build && cd build \

    && cmake .. \

    -DCUDA_ARCH_NAME=Manual \

    -DCUDA_ARCH_BIN="60 61" \

    -DCUDA_ARCH_PTX="61" \







    && make -j"$(nproc)" install \

    && ldconfig \

    && make clean \

    && cd .. \

    && rm -rf build


    ENV PYTHONPATH /usr/local

    Build Docker Image and run the container

    1. Now run the build command. This creates a Docker image, which we’re going to tag using -t so it has a friendly name.
      $ nvidia-docker build -t caffe2mofed421 .
    2. Where is your built image? It’s in your machine’s local Docker image registry:
      $ nvidia-docker images
    3. Run a Docker container in privileged mode from the image you just built:
      $ NV_GPU=0 nvidia-docker run --privileged -it -v /data:/data --network host --name=caffe2 caffe2mofed421 bash

    Verify CUDA in the container

    Ensure you are in the container. Run the nvidia-smi utility to check the NVIDIA System Management Interface.

    containerID$ nvidia-smi

    Validate Caffe2 in the container

    To validate the Caffe2 installation run the following commands:

    containerID$ python -m caffe2.python.operator_test.relu_op_test

    Validate MOFED

    Run two containers on two different nodes.

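    Check the MOFED version and the uverbs device in each container:

    containerID$ ofed_info -s
    containerID$ ls /dev/infiniband/uverbs1

    Now run an RDMA bandwidth test between the two containers. Start the server side in the first container:


    ib_write_bw -a -d mlx5_0 &


    Then run the client side in the second container, where $server_IP is the address of the first container's host:


    ib_write_bw -a -F $server_IP -d mlx5_0 --report_gbits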


    Distributed Caffe2 run - sample

    To run distributed Caffe2, I use Gloo (which is included in Caffe2) with Redis.

    I use a converted LMDB ImageNet dataset in my runs. See Appendix A for how to create the ImageNet ILSVRC2012 LMDB.

    In each container, run the sample command with that container's own --shard_id value:

    # python /caffe2/caffe2/python/examples/ --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb --batch_size 32 --run_id 1 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 1 --redis_host --redis_port 5555 --num_shards 32 --shard_id 0 --dtype float16 --float16_compute --distributed_transport ibverbs --distributed_interfaces mlx5_0

    # python /caffe2/caffe2/python/examples/ --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb --batch_size 32 --run_id 1 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 1 --redis_host --redis_port 5555 --num_shards 32 --shard_id 1 --dtype float16 --float16_compute --distributed_transport ibverbs --distributed_interfaces mlx5_0


    # python /caffe2/caffe2/python/examples/ --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb --batch_size 32 --run_id 1 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 1 --redis_host --redis_port 5555 --num_shards 32 --shard_id 31 --dtype float16 --float16_compute --distributed_transport ibverbs --distributed_interfaces mlx5_0
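
    The sample commands differ only in the --shard_id value, so they can be generated. The sketch below is a minimal illustration under stated assumptions: build_trainer_cmd is a hypothetical helper name, and the trainer script path and Redis host are placeholders because this guide does not spell them out. All other options are copied from the sample commands above.

```shell
#!/bin/sh
# Hypothetical helper: print the training command for one shard.
# $1 = trainer script path, $2 = Redis host, $3 = number of shards, $4 = shard id
build_trainer_cmd() {
    printf 'python %s --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb' "$1"
    printf ' --batch_size 32 --run_id 1 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 1'
    printf ' --redis_host %s --redis_port 5555 --num_shards %s --shard_id %s' "$2" "$3" "$4"
    printf ' --dtype float16 --float16_compute --distributed_transport ibverbs --distributed_interfaces mlx5_0\n'
}

# Placeholders: replace with the real script path and Redis host.
TRAINER=/path/to/trainer_script.py
REDIS_HOST=redis.example

# Print one command per shard, for shard ids 0 to 31.
shard=0
while [ "$shard" -lt 32 ]; do
    build_trainer_cmd "$TRAINER" "$REDIS_HOST" 32 "$shard"
    shard=$((shard + 1))
done
```

    Each printed line is meant to be run in a different container; every shard must use the same --run_id, --num_shards and Redis settings so that the processes can find each other.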




    Appendix A - Prepare LMDB format Dataset

    You need to have Caffe installed on bare metal or in Docker (preferred).



    How to Create Imagenet ILSVRC2012 LMDB · rioyokotalab/caffe Wiki · GitHub


    If you don't have it yet, please download the ImageNet ILSVRC2012 dataset.




    # Development kit (Task 1 & 2), 2.5MB


    md5sum ILSVRC2012_devkit_t12.tar.gz


    # Development kit (Task 3), 22MB


    md5sum ILSVRC2012_devkit_t3.tar.gz


    # Training images (Task 1 & 2), 138GB


    md5sum ILSVRC2012_img_train.tar


    # Training images (Task 3), 728MB


    md5sum ILSVRC2012_img_train_t3.tar


    # Validation images (all tasks), 6.3GB


    md5sum ILSVRC2012_img_val.tar


    # Test images (all tasks), 13GB


    md5sum ILSVRC2012_img_test.tar

    md5sum is a command used to check that each download is correct.


    Run the Caffe container with:

    # nvidia-docker run --privileged -it -v /data:/data --network host --name=caffe caffe bash # -v maps the /data folder where you saved your ILSVRC2012 dataset


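    After extracting ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar, use the following script to extract the nested .tar archives:

    files="./n*.tar" # assumed pattern for the per-class archives inside ILSVRC2012_img_train.tar
    for filepath in ${files}
    do
    filename=`basename ${filepath} .tar`
    mkdir ${filename}
    tar -xf ${filename}.tar -C ${filename}
    done


    After that, create the LMDB using the Caffe scripts.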

    1. get label data
      $ cd $CAFFE_HOME/data/ilsvrc12/
      $ ./
      det_synset_words.txt synsets.txt test.txt val.txt imagenet_mean.binaryproto synset_words.txt train.txt
    2. edit $CAFFE_HOME/examples/imagenet/

      #!/usr/bin/env sh

      # Create the imagenet lmdb inputs

      # N.B. set the path to the imagenet train + val data dirs

      set -e






      # Set RESIZE=true to resize the images to 256x256. Leave as false if images have

      # already been resized using another tool.


      if $RESIZE; then



    3. Then execute ./
    4. Edit $CAFFE_HOME/examples/imagenet/

      #!/usr/bin/env sh
      # Compute the mean image from the imagenet training lmdb
      # N.B. this is available in data/ilsvrc12


      $TOOLS/compute_image_mean $EXAMPLE/ilsvrc12_train_lmdb \

      echo "Done."

    5. And execute ./