Reference Deployment Guide for RDMA-Accelerated Caffe2 with an NVIDIA GPU Card over a 100Gb InfiniBand Network

Version 17

    In this document we demonstrate a distributed deployment procedure for RDMA-accelerated Caffe2 over a Mellanox end-to-end 100Gb/s InfiniBand (IB) solution.

    This document describes the process of building Caffe2 from source on Ubuntu 16.04.3 LTS on four physical servers.

    We will show how to update and install the NVIDIA drivers, NVIDIA CUDA Toolkit, NVIDIA CUDA® Deep Neural Network library (cuDNN) and Mellanox software and hardware components.


    Overview

     

    What is Caffe2?

    Caffe2 is a deep learning framework that provides an easy and straightforward way to experiment with deep learning and to leverage community contributions of new models and algorithms. You can bring your creations to scale using the power of GPUs in the cloud, or to the masses on mobile, with Caffe2’s cross-platform libraries. Caffe2 supports CUDA 9.1 and cuDNN 7.1 (registration required). In this guide we build Caffe2 from source, following the instructions on the Caffe2 website. In order to use Caffe2 with GPU support, you must have an NVIDIA GPU with a minimum compute capability of 3.0.

     

    Mellanox’s Machine Learning

    Mellanox solutions accelerate many of the world’s leading artificial intelligence and machine learning platforms and a wide range of applications, ranging from security, finance, and image and voice recognition, to self-driving cars and smart cities. Mellanox solutions enable companies and organizations such as Baidu, NVIDIA, JD.com, Facebook, PayPal and more to leverage machine learning platforms to enhance their competitive advantage.

    In this post we will show how to build an efficient machine learning cluster, enhanced by native RDMA over a 100Gb/s IB network.

     

    Setup Overview

    Before you start, make sure you are familiar with distributed training; see the following link for more information.
    In the distributed Caffe2 configuration described in this guide, we use the following hardware specification.

     

    Equipment

    This document does not cover the servers' storage aspects. You should configure the servers with the storage components appropriate to your use case (data set size).

    Setup Logical Design

    Server Wiring

    In our reference setup we wire only the 1st port to the IB switch.

    We will cover the procedure later, in the Installing Mellanox OFED section.

     

    Server Block Diagram

     


    Network Configuration

    Each server is connected to the SB7700 switch by a 100Gb IB copper cable. The switch port connectivity in our case is as follows:

    • 1st-4th ports – connected to the node servers

    The server names and network configuration are provided below.

    Server type       Server name   Internal network     External network
    Node Server 01    clx-mld-41    ib0: 12.12.12.41     eno1: from DHCP (reserved)
    Node Server 02    clx-mld-42    ib0: 12.12.12.42     eno1: from DHCP (reserved)
    Node Server 03    clx-mld-43    ib0: 12.12.12.43     eno1: from DHCP (reserved)
    Node Server 04    clx-mld-44    ib0: 12.12.12.44     eno1: from DHCP (reserved)
    Node Server 05    clx-mld-45    ib0: 12.12.12.45     eno1: from DHCP (reserved)
    Node Server 06    clx-mld-46    ib0: 12.12.12.46     eno1: from DHCP (reserved)
    Node Server 07    clx-mld-47    ib0: 12.12.12.47     eno1: from DHCP (reserved)
    Node Server 08    clx-mld-48    ib0: 12.12.12.48     eno1: from DHCP (reserved)
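
    Optionally, if you want to address the nodes by name over the IB network, you can map the host names to the ib0 addresses in /etc/hosts on every node. This is a convenience step that is not part of the original procedure; the names and addresses are taken from the table above.

    # Optional /etc/hosts entries for the IB network (add on every node)
    12.12.12.41   clx-mld-41
    12.12.12.42   clx-mld-42
    12.12.12.43   clx-mld-43
    12.12.12.44   clx-mld-44
    12.12.12.45   clx-mld-45
    12.12.12.46   clx-mld-46
    12.12.12.47   clx-mld-47
    12.12.12.48   clx-mld-48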


    Deployment Guide


    Prerequisites


    Required Software

    Prior to installing Caffe2, the software described in the following subsections must be installed.

     

    Disable the Nouveau Kernel Driver

     

    Prior to installing the latest NVIDIA drivers and CUDA on Ubuntu 16.04, the Nouveau kernel driver must be disabled. To disable it, follow the procedure below.

     

    1. Check whether the Nouveau kernel driver is loaded.
      $ lsmod | grep nouveau
    2. Remove all NVIDIA packages.

      Skip this step if your system is freshly installed.
      $ sudo apt-get remove nvidia* && sudo apt autoremove
    3. Install the packages below, which are required for building kernel modules.

      $ sudo apt-get install dkms build-essential linux-headers-generic
    4. Block and disable the Nouveau kernel driver.
      $ sudo vim /etc/modprobe.d/blacklist.conf
    5. Insert the following lines into the blacklist.conf file.
      blacklist nouveau
      blacklist lbm-nouveau
      options nouveau modeset=0
      alias nouveau off
      alias lbm-nouveau off
    6. Disable the Nouveau kernel module and update the initramfs image.  (Although the nouveau-kms.conf file may not exist, it will not affect this step).
      $ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
      $ sudo update-initramfs -u
    7. Reboot
      $ sudo reboot
    8. Check that the Nouveau kernel driver is not loaded.
      $ lsmod |grep nouveau

     

    Install General Dependencies

    1. To install general dependencies, run the commands below or paste each line.
      $ sudo apt-get update
    2. To install Caffe2, you must install the following packages:
      • python-dev: enables adding extensions to Python
      • python-pip: enables installing and managing Python packages

    To install these packages for Python 2.7, run:

    $ sudo apt-get install -y --no-install-recommends build-essential cmake git libhiredis-dev libgoogle-glog-dev libgtest-dev libiomp-dev libleveldb-dev liblmdb-dev libopencv-dev libsnappy-dev libprotobuf-dev protobuf-compiler python-dev python-pip libgflags-dev
    $ sudo pip install future numpy protobuf
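
    As a quick sanity check (not part of the original procedure), you can confirm that the Python packages import cleanly. The one-liner below assumes Python 2.7 is the default python on the system:

    $ python -c "import numpy, google.protobuf, future; print('Python dependencies OK')"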

    Install Optional Dependencies

    1. To install optional dependencies, run the commands below or paste each line.
      $ sudo apt-get install -y --no-install-recommends python-pydot
      $ sudo pip install flask graphviz hypothesis jupyter matplotlib pydot python-nvd3 pyyaml requests scikit-image scipy setuptools six tornado

    Update Ubuntu Software Packages

    To update/upgrade Ubuntu software packages, run the commands below.

    $ sudo apt-get update            # Fetches the list of available updates
    $ sudo apt-get upgrade -y        # Strictly upgrades the current packages

     

    Install the NVIDIA Drivers

     

    The NVIDIA 390 (or later) drivers must be installed. You can install them using Ubuntu's Additional Drivers tool after updating the driver packages, or manually as described below.

    1. Go to the NVIDIA website (http://www.nvidia.com/download/driverResults.aspx/117079/en-us).
    2. Download the latest version of the driver. The example below uses a Linux 64-bit driver (NVIDIA-Linux-x86_64-390.12_1).
    3. Exit the GUI (as the drivers for graphic devices are running at a low level).

       

      $ sudo service lightdm stop

       

    4. Set the RunLevel to 3 with the program init.
      $ sudo init 3
    5. Once the download completes, install the driver repository package and the drivers.
      $ sudo dpkg -i nvidia-driver-local-repo-ubuntu1604-390.12_1.0-1_amd64.deb
      $ sudo apt-get update
      $ sudo apt-get install cuda-drivers
      During the run, you may be asked to confirm several prompts, such as a pre-install script failure warning, the absence of 32-bit libraries, and more.
    6. Once the drivers are installed, restart the server.
      $ sudo reboot

     

    Verify the Installation

    Make sure the NVIDIA driver can work correctly with the installed GPU card.

    $ lsmod |grep nvidia


     

    Run the nvidia-debugdump utility to collect internal GPU information.

    $ nvidia-debugdump -l

    Run the nvidia-smi utility to check the NVIDIA System Management Interface.

    $ nvidia-smi
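
    Optionally, you can query just the GPU model and driver version. The query flags below are standard nvidia-smi options, but verify them against your driver version:

    $ nvidia-smi --query-gpu=name,driver_version --format=csv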


    Enable the Subnet Manager (SM) on the IB Switch

    There are three options to select the best place to locate the SM:

    1. Enable the SM on one of the managed switches. This is a very convenient and quick option that makes InfiniBand essentially ‘plug & play’.
    2. Run /etc/init.d/opensmd on one or more servers. It is recommended to run the SM on a server when there are 648 nodes or more.
    3. Use a dedicated Unified Fabric Management (UFM®) appliance server. UFM offers much more than the SM. UFM needs more compute power than the switches have, but it does not require an expensive server; it does, however, represent the additional cost of a dedicated server.

    We will explain options 1 and 2 only.

    Option 1: Configuring the SM on a Switch (MLNX-OS®, all Mellanox switch systems)
    To enable the SM on one of the managed switches, follow the steps below.

    1. Log in to the switch and enter configuration mode:
      Mellanox MLNX-OS Switch Management

      switch login: admin
      Password:
      Last login: Wed Aug 12 23:39:01 on ttyS0

      Mellanox Switch

      switch [standalone: master] > enable
      switch [standalone: master] # conf t
      switch [standalone: master] (config)#
    2. Run the command:
      switch [standalone: master] (config)#ib sm
      switch [standalone: master] (config)#
    3. Check if the SM is running. Run:

      switch [standalone: master] (config)#show ib sm
      enable
      switch [standalone: master] (config)#

    To save the configuration (permanently), run:

    switch (config) # configuration write

     

     

    Option 2: Configuring the SM on a Server (skip this procedure if you enabled the SM on the switch)

    To start OpenSM on a server, simply run opensm from the command line on your management node:

    # opensm

    Or:

    Start OpenSM automatically on the head node by editing the /etc/opensm/opensm.conf file.

    Create a configuration file by running:

    # opensm --config /etc/opensm/opensm.conf

    Edit the /etc/opensm/opensm.conf file and add the following line:

    onboot=yes

    Upon initial installation, OpenSM is configured and running with a default routing algorithm. When running a multi-tier fat-tree cluster, it is recommended to change the following option to create the most efficient routing algorithm and deliver the highest performance:

    --routing_engine=updn
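
    For illustration only (a minimal sketch; the option name is taken from the text above, but verify the exact configuration-file syntax for your OpenSM version in the MLNX_OFED User Manual), the routing engine can be set in the configuration file and OpenSM restarted:

    # echo "routing_engine updn" >> /etc/opensm/opensm.conf
    # /etc/init.d/opensmd restart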

    For full details on other configurable attributes of OpenSM, see the “OpenSM – Subnet Manager” chapter of the Mellanox OFED for Linux User Manual.

     

    Installing Mellanox OFED for Ubuntu

    This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with a Mellanox ConnectX®-5 adapter card installed. For more information, see the Mellanox OFED for Linux User Manual.

     

    Downloading Mellanox OFED

    1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed.
      # lspci -v | grep Mellanox
      The following example shows a system with an installed Mellanox HCA:
    2. Download the MLNX_OFED package for your OS to your host.
      The package name has the format
      MLNX_OFED_LINUX-<ver>-<OS label>-<CPUarch>.tgz (this guide uses the tgz package; an ISO image is also available). You can download it from:
      http://www.mellanox.com > Products > Software > InfiniBand/VPI Drivers > Mellanox OFED Linux (MLNX_OFED) > Download.

    3. Use the md5sum utility to confirm the downloaded file’s integrity. Run the following command and compare the result to the value provided on the download page.

       

      # md5sum MLNX_OFED_LINUX-<ver>-<OS label>.tgz

       

    Installing Mellanox OFED

    MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:

    • Discovers the currently installed kernel
    • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
    • Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
    • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

    The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

    1. Log into the installation machine as root.
    2. Copy the downloaded tgz file to /tmp.
    3. Extract the package and change into the extracted directory.

      # cd /tmp

      # tar -xzvf MLNX_OFED_LINUX-4.2-1.0.0.0-ubuntu16.04-x86_64.tgz

      # cd MLNX_OFED_LINUX-4.2-1.0.0.0-ubuntu16.04-x86_64/

    4. Run the installation script.
      # ./mlnxofedinstall --all --force
    5. Restart the openibd driver.
    6. Reboot after the installation finishes successfully.

      # /etc/init.d/openibd restart

      # reboot

      By default, both ConnectX®-5 VPI ports are initialized as InfiniBand ports.

    7. Check that the ports' mode is InfiniBand.
      # ibv_devinfo

    8. If ibv_devinfo shows the ports with link_layer: Ethernet, you need to change the interface port type to InfiniBand.
      ConnectX®-5 ports can be individually configured to work as InfiniBand or Ethernet ports. Change the mode to InfiniBand using the mlxconfig tool after the driver is loaded.
      * LINK_TYPE_P1=1 sets port 1 to InfiniBand mode.
      a. Start mst and list the device names:
      # mst start
      # mst status

      b. Change the port type to InfiniBand:

      # mlxconfig -d /dev/mst/mt4121_pciconf0 s LINK_TYPE_P1=1
      #Port 1 set to IB mode
      # reboot

      c. Query the InfiniBand devices and print the information about them that is available from userspace:
      # ibv_devinfo
    9. Run the ibdev2netdev utility to see all the associations between the network devices and the IB devices/ports, and assign an IP address to ib0:

      # ibdev2netdev

      # ifconfig ib0 12.12.12.41 netmask 255.255.255.0

    10. Add the lines below to the /etc/network/interfaces file, after the following existing lines:

      # vim /etc/network/interfaces

      auto eno1

      iface eno1 inet dhcp

      The new lines:
      auto ib0
      iface ib0 inet static
      address 12.12.12.41
      netmask 255.255.255.0
      Example:
      # vim /etc/network/interfaces

      auto eno1
      iface eno1 inet dhcp

      auto ib0
      iface ib0 inet static
      address 12.12.12.41
      netmask 255.255.255.0
    11. Check that the network configuration is set correctly.
      # ifconfig -a
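
    As an optional check (not part of the original procedure), verify IPoIB connectivity between nodes; the addresses below are taken from the network configuration table:

    # ping -c 3 12.12.12.42        # run from clx-mld-41; adjust for your nodes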

       

    Install the NVIDIA CUDA Toolkit 9.1 and cuDNN

     

    Pre-installation Actions

    The following actions must be taken before installing the CUDA Toolkit and Driver on Linux:

    • Verify the system has a CUDA-capable GPU
    • Verify the system is running a supported version of Linux
    • Verify the system has gcc installed
    • Verify the system has the correct kernel headers and development packages installed
    • Download the NVIDIA CUDA Toolkit
    • Handle conflicting installation methods

    You can override the install-time prerequisite checks by running the installer with the “--override” flag. Remember that the prerequisites will still be required to use the NVIDIA CUDA Toolkit.

     

    Verify You Have a CUDA-Capable GPU

    To verify that your GPU is CUDA-capable, go to your distribution's equivalent of System Properties, or, from the command line, enter:

    $ lspci | grep -i nvidia

    If you do not see any settings, update the PCI hardware database that Linux maintains by entering “update-pciids” (generally found in /sbin) at the command line and rerun the previous lspci command.
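
    For example (a minimal sketch; update-pciids typically requires root privileges):

    $ sudo update-pciids
    $ lspci | grep -i nvidia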

    If your graphics card is from NVIDIA, and it is listed in http://developer.nvidia.com/cuda-gpus, your GPU is CUDA-capable.

    The Release Notes for the CUDA Toolkit also contain a list of supported products.

     

    Verify You Have a Supported Linux Version

    The CUDA Development Tools are only supported on some specific distributions of Linux. These are listed in the CUDA Toolkit release notes.

    To determine which distribution and release number you are running, type the following at the command line:

    $ uname -m && cat /etc/*release

    You should see output similar to the following, modified for your particular system:

     

    x86_64

    Ubuntu 16.04.2 LTS

    The x86_64 line indicates you are running on a 64-bit system. The remainder gives information about your distribution.

     

    Verify the System Has a gcc Compiler Installed

    The gcc compiler is required for development using the CUDA Toolkit. It is not required for running CUDA applications. It is generally installed as part of the Linux installation, and in most cases the version of gcc installed with a supported version of Linux will work correctly.
    To verify the version of gcc installed on your system, type the following on the command line:

    $ gcc --version

    You should see output similar to the following, modified for your particular system:

    gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609

    If an error message is displayed, you need to install the “development tools” from your Linux distribution or obtain a version of gcc and its accompanying toolchain from the Web.

     

    Verify the System has the Correct Kernel Headers and Development Packages Installed

    The CUDA Driver requires that the kernel headers and development packages for the running version of the kernel be installed at the time of the driver installation, as well as whenever the driver is rebuilt. For example, if your system is running kernel version 3.17.4-301, the 3.17.4-301 kernel headers and development packages must also be installed.

    While the Runfile installation performs no package validation, the RPM and DEB installations of the driver will make an attempt to install the kernel header and development packages if no version of these packages is currently installed. However, it will install the latest version of these packages, which may or may not match the version of the kernel your system is using. Therefore, it is best to manually ensure the correct version of the kernel headers and development packages are installed prior to installing the CUDA Drivers, as well as whenever you change the kernel version.

    The version of the kernel your system is running can be found by running the following command:

    $ uname -r

    This is the version of the kernel headers and development packages that must be installed prior to installing the CUDA Drivers. This command will be used multiple times below to specify the version of the packages to install. Note that below are the common-case scenarios for kernel usage. More advanced cases, such as custom kernel branches, should ensure that their kernel headers and sources match the kernel build they are running.

    The kernel headers and development packages for the currently running kernel can be installed with:

    $ sudo apt-get install linux-headers-$(uname -r)

     

    Installation Process

    1. Download the base installation .run file from NVIDIA CUDA website.
    2. Create an account if you do not already have one, and log in (an account is also required to download cuDNN).
    3. Choose Linux > x86_64 > Ubuntu > 16.04 > runfile (local) and download the base installer and the patch.
      MAKE SURE YOU SAY NO TO INSTALLING NVIDIA DRIVERS!
      Make sure you select yes to creating a symbolic link to your CUDA directory.
      $ cd /root # or directory to where you downloaded file
      $ sudo sh cuda_9.1.85_387.26_linux.run --override # hold s to skip
    4. Install CUDA into: /usr/local/cuda.
      Do you accept the previously read EULA?
      accept/decline/quit: accept
      Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 387.26?
      (y)es/(n)o/(q)uit: N
      Install the CUDA 9.1 Toolkit?
      (y)es/(n)o/(q)uit: Y
      Enter Toolkit Location[ default is /usr/local/cuda-9.1 ]: Enter
      Do you want to install a symbolic link at /usr/local/cuda?
      (y)es/(n)o/(q)uit: Y
      Install the CUDA 9.1 Samples?
      (y)es/(n)o/(q)uit: Y
      Enter CUDA Samples Location[ default is /root ]: Enter
      Installing the CUDA Toolkit in /usr/local/cuda-9.1 ...

    To install cuDNN, download cuDNN v7 for CUDA 9.1 from the NVIDIA website and extract it into /usr/local/cuda:

    $ tar -xzvf cudnn-9.1-linux-x64-v7.solitairetheme8

    $ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
    $ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
    $ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
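
    To confirm the cuDNN installation, you can print the version macros from the header. This assumes the version macros are defined in cudnn.h, which is the case for cuDNN 7:

    $ grep "define CUDNN_MAJOR" -A 2 /usr/local/cuda/include/cudnn.h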


    Post-installation Actions

     

    Mandatory Actions

    Some actions must be taken after the installation before the CUDA Toolkit and Driver can be used.

     

       Environment Setup (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#environment-setup)
    • The “PATH” variable needs to include /usr/local/cuda-9.1/bin
    • In addition, when using the .run file installation method, the “LD_LIBRARY_PATH” variable needs to contain /usr/local/cuda-9.1/lib64.
    • Update your bash file.
      $ vim ~/.bashrc
      This opens your ~/.bashrc file in a text editor; scroll to the bottom and add these lines:
      export CUDA_HOME=/usr/local/cuda-9.1
      export PATH=/usr/local/cuda-9.1/bin${PATH:+:${PATH}}
      export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
    • Once you save and close the text file, you can return to your original terminal and type this command to reload your .bashrc file.
      $ source ~/.bashrc
    • Check that the paths have been properly modified.

      $ echo $CUDA_HOME
      $ echo $PATH
      $ echo $LD_LIBRARY_PATH

    • Set the “LD_LIBRARY_PATH” and “CUDA_HOME” environment variables. Consider adding the commands below to your ~/.bash_profile. These assume your CUDA installation is in /usr/local/cuda-9.1.
      $ vim ~/.bash_profile
      This opens the file in a text editor; scroll to the bottom and add these lines:
      export CUDA_HOME=/usr/local/cuda-9.1
      export PATH=/usr/local/cuda-9.1/bin${PATH:+:${PATH}}
      export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

     

    Other actions are recommended to verify the integrity of the installation.

    • Install Writable Samples (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#install-samples)
      In order to modify, compile, and run the samples, the samples must be installed with “write” permissions. A convenience installation script is provided:
      $ cuda-install-samples-9.1.sh ~
      This script is installed with the cuda-samples-9-1 package. The cuda-samples-9-1 package installs only a read-only copy in /usr/local/cuda-9.1/samples.
    • Verify the Installation (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#verify-installation)
      Before continuing, it is important to verify that the CUDA Toolkit can find and communicate correctly with the CUDA-capable hardware. To do this, you need to compile and run some of the included sample programs.
      Ensure the PATH and, if using the .run file installation method, the LD_LIBRARY_PATH variables are set correctly. See the Mandatory Actions section.
    • Verify the Driver Version (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#verify-driver)
      * If you installed the driver, verify that the correct version of it is loaded.
      * If you did not install the driver, or are using an operating system where the driver is not loaded via a kernel module, such as L4T, skip this step.
      When the driver is loaded, the driver version can be found by executing the following command.
      $ cat /proc/driver/nvidia/version
      You should see output similar to the following:
      NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.12  Wed Dec 20 07:19:16 PST 2017
      GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.6)
    • Compiling the Examples (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#compiling-examples)
      The version of the CUDA Toolkit can be checked by running “nvcc -V” in a terminal window. The “nvcc” command runs the compiler driver that compiles the CUDA programs. It calls the “gcc” compiler for C code and the NVIDIA PTX compiler for the CUDA code.
      $ nvcc -V
      You should see output similar to the following:
      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2016 NVIDIA Corporation
      Built on Fri_Nov__3_21:07:56_CDT_2017
      Cuda compilation tools, release 9.1, V9.1.85
      The NVIDIA CUDA Toolkit includes sample programs in source form. You should compile them by changing to ~/NVIDIA_CUDA-9.1_Samples and typing make. The resulting binaries will be placed under ~/NVIDIA_CUDA-9.1_Samples/bin.
      $ cd ~/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery/
      $ make
      $ cd ~/NVIDIA_CUDA-9.1_Samples

      $ ./bin/x86_64/linux/release/deviceQuery
    • Running the Binaries (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#running-binaries)
      After the compilation, find and run deviceQuery under ~/NVIDIA_CUDA-9.1_Samples. If the CUDA software is installed and configured correctly, the output should look similar to the below.

      The exact appearance and the output lines might be different on your system. The important outcomes are that a device was found, that the device matches the one installed in your system, and that the test passed.
      If a CUDA-capable device and the CUDA Driver are installed but the deviceQuery reports that no CUDA-capable devices are present, this likely means that the /dev/nvidia* files are missing or have the wrong permissions.
      Running the bandwidthTest program ensures that the system and the CUDA-capable device are able to communicate correctly. Its output is shown below.

      $ cd ~/NVIDIA_CUDA-9.1_Samples/1_Utilities/bandwidthTest/

      $ make

      $ cd ~/NVIDIA_CUDA-9.1_Samples

      $ ./bin/x86_64/linux/release/bandwidthTest

    Note that the measurements for your CUDA-capable device description will vary from system to system. The important point is that you obtain measurements, and that the second-to-last line confirms that all necessary tests passed.
    Should the tests not pass, make sure you have a CUDA-capable NVIDIA GPU on your system and make sure it is properly installed.
    If you run into difficulties with the link step (such as libraries not being found), consult the Linux Release Notes found in the doc folder in the CUDA Samples directory.

     

    Installing Caffe2

     

    Clone the Caffe2 Repository

    To clone the latest Caffe2 repository, issue the following commands:

    $ cd ~
    $ git clone --branch master --recursive https://github.com/caffe2/caffe2.git

    The preceding git clone command creates a subdirectory called “caffe2”. After cloning, you may optionally build a specific branch (such as a release branch) by invoking the following commands:

    $ cd caffe2

    $ git submodule sync --recursive

    $ git submodule update --init --recursive

     

    Build Caffe2

    $ mkdir build
    $ cd build
    $ cmake .. -DCUDA_ARCH_NAME=Manual -DCUDA_ARCH_BIN="60 61" -DCUDA_ARCH_PTX="61" -DUSE_NNPACK=OFF -DUSE_ROCKSDB=OFF -DUSE_GLOO=ON -DUSE_REDIS=ON -DUSE_IBVERBS=ON -DUSE_MPI=OFF
    $ make -j"$(nproc)" install
    $ ldconfig
    $ make clean

    Caffe2 has an open issue: building with CUDA 9 fails in Eigen with "fatal error: math_functions.hpp: No such file or directory".

    This was fixed in Eigen at https://bitbucket.org/eigen/eigen/commits/034b6c3e101792a3cc3ccabd9bfaddcabe85bb58?at=default

    Until Caffe2 updates its Eigen submodule, you can apply that change manually.

    Test the Caffe2 Installation

    Run this to see if your Caffe2 installation was successful.

    $ cd ~ && python -c 'from caffe2.python import core' 2>/dev/null && echo "Success" || echo "Failure"


    Validate Caffe2 Installation

    To validate the Caffe2 installation, run the following command:

    $ python -m caffe2.python.operator_test.relu_op_test
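
    Optionally, you can also check how many GPUs Caffe2 can see; this assumes the workspace.NumCudaDevices() helper is present in your Caffe2 build:

    $ python -c 'from caffe2.python import workspace; print(workspace.NumCudaDevices())'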

     

    Validate MOFED

    Check the MLNX_OFED version and the uverbs devices:

    $ ofed_info -s
    $ ls /dev/infiniband/uverbs1

    Run a bandwidth stress test over IB:

    Server

    ib_write_bw -a -d mlx5_0 &

    Client

    ib_write_bw -a -F $server_IP -d mlx5_0 --report_gbits

    In this way you can run a bandwidth stress test over IB between any pair of nodes.
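
    For example, using the IB addresses from the network configuration table (this assumes mlx5_0 is the HCA connected to the IB switch):

    # on clx-mld-41 (server side)
    ib_write_bw -a -d mlx5_0 &

    # on clx-mld-42 (client side)
    ib_write_bw -a -F 12.12.12.41 -d mlx5_0 --report_gbits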

     

    Distributed Caffe2 run - sample

    To run distributed Caffe2, we use Gloo (included in Caffe2) together with Redis.

    We use a converted LMDB ImageNet dataset in our runs. See Appendix A for how to create the ImageNet ILSVRC2012 LMDB.
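
    A Redis server must be reachable at the --redis_host/--redis_port used below before the trainers start. A minimal sketch, assuming redis-server is installed on that host (here 10.143.119.44, as in the sample commands):

    $ sudo apt-get install -y redis-server
    $ redis-server --port 5555 --bind 0.0.0.0 &    # extra instance on port 5555, reachable from all nodes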

    On each node, run the sample command, using that node's own --shard_id (0 through 7):

    # python /caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb --batch_size 32 --run_id 1 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 4 --redis_host 10.143.119.44 --redis_port 5555 --num_shards 8 --shard_id 0 --dtype float16 --float16_compute --distributed_transport ibverbs --distributed_interfaces mlx5_0

    # python /caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb --batch_size 32 --run_id 1 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 4 --redis_host 10.143.119.44 --redis_port 5555 --num_shards 8 --shard_id 1 --dtype float16 --float16_compute --distributed_transport ibverbs --distributed_interfaces mlx5_0

    ....

    # python /caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb --batch_size 32 --run_id 1 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 4 --redis_host 10.143.119.44 --redis_port 5555 --num_shards 8 --shard_id 7 --dtype float16 --float16_compute --distributed_transport ibverbs --distributed_interfaces mlx5_0

     

    Done!

     

    Appendix A

    You need to have Caffe installed, either on bare metal or in Docker (preferred).

     

     

    Source:

    How to Create Imagenet ILSVRC2012 LMDB · rioyokotalab/caffe Wiki · GitHub

     

    If you do not already have it, download the ImageNet ILSVRC2012 dataset.

     

     

    #!/bin/bash

     

    # Development kit (Task 1 & 2), 2.5MB

    wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_devkit_t12.tar.gz

    md5sum ILSVRC2012_devkit_t12.tar.gz

     

    # Development kit (Task 3), 22MB

    wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_devkit_t3.tar.gz

    md5sum ILSVRC2012_devkit_t3.tar.gz

     

    # Training images (Task 1 & 2), 138GB

    wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_img_train.tar

    md5sum ILSVRC2012_img_train.tar

     

    # Training images (Task 3), 728MB

    wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_img_train_t3.tar

    md5sum ILSVRC2012_img_train_t3.tar

     

    # Validation images (all tasks), 6.3GB

    wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_img_val.tar

    md5sum ILSVRC2012_img_val.tar

     

    # Test images (all tasks), 13GB

    wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_img_test.tar

    md5sum ILSVRC2012_img_test.tar

      md5sum verifies the integrity of each downloaded file.

     

    Run Caffe container by:

    # nvidia-docker run --privileged -it -v /data:/data --network host --name=caffe caffe bash         # /data is the folder where you saved the ILSVRC2012 dataset

     

    After extracting ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar, use the following script to extract the per-class tar files:

     

    #!/bin/sh

    files="./n*.tar"
    for filepath in ${files}
    do
      filename=`basename ${filepath} .tar`
      mkdir ${filename}
      tar -xf ${filename}.tar -C ${filename}
    done
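
    For example, if the per-class tar files from ILSVRC2012_img_train.tar were unpacked into /data/train (a hypothetical path) and the script above was saved there as untar_classes.sh (a hypothetical name), run:

    $ cd /data/train
    $ sh untar_classes.sh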

     

    After that, create the LMDB using the Caffe scripts.

    1. Get the label data.
      $ cd $CAFFE_HOME/data/ilsvrc12/
      $ ./get_ilsvrc_aux.sh
      det_synset_words.txt imagenet.bet.pickle synsets.txt test.txt val.txt
      get_ilsvrc_aux.sh imagenet_mean.binaryproto synset_words.txt train.txt
    2. edit $CAFFE_HOME/examples/imagenet/create_imagenet.sh
      #!/usr/bin/env sh
      # Create the imagenet lmdb inputs
      # N.B. set the path to the imagenet train + val data dirs
      set -e

      EXAMPLE=$CAFFE_HOME/examples/imagenet
      DATA=data/ilsvrc12
      TOOLS=$CAFFE_HOME/build/tools
      TRAIN_DATA_ROOT=/path/to/imagenet/train/
      VAL_DATA_ROOT=/path/to/imagenet/val/

      # Set RESIZE=true to resize the images to 256x256. Leave as false if images have
      # already been resized using another tool.
      RESIZE=true
      if $RESIZE; then
        RESIZE_HEIGHT=256
        RESIZE_WIDTH=256
      ........
    3. Then execute ./create_imagenet.sh.
    4. Edit $CAFFE_HOME/examples/imagenet/make_imagenet_mean.sh:

      #!/usr/bin/env sh

      # Compute the mean image from the imagenet training lmdb
      # N.B. this is available in data/ilsvrc12

      EXAMPLE=/data/lmdb
      DATA=$CAFFE_HOME/data/ilsvrc12
      TOOLS=$CAFFE_HOME/build/tools

      $TOOLS/compute_image_mean $EXAMPLE/ilsvrc12_train_lmdb \ 
      $DATA/ResNet_mean.binaryproto

      echo "Done."

    5. Execute ./make_imagenet_mean.sh.

     

    Finish.