Reference Deployment Guide for RDMA-Accelerated TensorFlow with NVIDIA GPU Cards over a 100Gb/s InfiniBand Network Running on Linux Containers

Version 9

    In this document we demonstrate a distributed deployment procedure for RDMA-accelerated TensorFlow running on Linux Containers (LXC) over a Mellanox end-to-end 100Gb/s InfiniBand (IB) solution.

    This document describes the process of building TensorFlow from sources on Ubuntu 16.04.2 LTS and LXD 2.16, across four physical servers.

    We will show how to install and update the NVIDIA drivers, the NVIDIA CUDA Toolkit, the NVIDIA CUDA® Deep Neural Network library (cuDNN), Bazel, TensorFlow, and the Mellanox software and hardware components, both on the host and in an LXD container.


    Overview

     

    What is TensorFlow?

    TensorFlow is an open source software library developed by the Google Brain team for the purpose of conducting machine learning and deep neural networks research. The library performs numerical computation using data flow graphs, where the nodes in the graph represent mathematical operations and the graph edges represent the multidimensional data arrays (tensors) communicated between them. TensorFlow supports CUDA 8.0 and cuDNN 6.0 (registration required for the cuDNN download). In this guide we follow the "Installing from Sources" instructions from the TensorFlow website. In order to use TensorFlow with GPU support, you must have an NVIDIA GPU with a minimum compute capability of 3.0.
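
    As a minimal illustration of the data flow graph model (a sketch using the TensorFlow 1.x Python API that this guide builds):

    import tensorflow as tf

    # Nodes are operations; edges carry tensors (multidimensional arrays).
    a = tf.constant([[1.0, 2.0]])      # 1x2 tensor
    b = tf.constant([[3.0], [4.0]])    # 2x1 tensor
    c = tf.matmul(a, b)                # a matmul node consuming a and b

    with tf.Session() as sess:
        print(sess.run(c))             # prints [[ 11.]]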

     

    What's LXC?

    LXC is a userspace interface for the Linux kernel containment features. Through a powerful API and simple tools, it lets Linux users easily create and manage system or application containers.

    LXC containers are often considered as something in the middle between a chroot and a full-fledged virtual machine. The goal of LXC is to create an environment as close as possible to a standard Linux installation but without the need for a separate kernel.

     

    What's LXD?

    At its simplest, LXD (Linux Container Daemon) is a daemon which provides a REST API to drive LXC containers. LXD is not a rewrite of LXC. Under the hood, LXD uses LXC through liblxc and its Go binding. Its main goal is to provide a user experience that’s similar to that of virtual machines but using Linux containers rather than hardware virtualization.

     

    Mellanox’s Machine Learning

    Mellanox solutions accelerate many of the world's leading artificial intelligence and machine learning platforms and a wide range of applications, ranging from security, finance, and image and voice recognition to self-driving cars and smart cities. Mellanox solutions enable companies and organizations such as Baidu, NVIDIA, JD.com, Facebook, PayPal and more to leverage machine learning platforms to enhance their competitive advantage.

    In this post we will show how to build an efficient machine learning cluster accelerated by native RDMA over a 100Gb/s IB network.

     

    Setup Overview

    Before you start, make sure you are aware of the distributed TensorFlow architecture, see Glossary in Distributed TensorFlow for more info.
    In the distributed TensorFlow configuration described in this guide, we are using the following hardware specification.
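
    For orientation, here is a minimal sketch of these pieces in the TensorFlow 1.x Python API (the hostnames and ports below are illustrative; protocol="grpc+verbs" selects the RDMA transport built later in this guide, while the default "grpc" runs over TCP):

    import tensorflow as tf

    # Illustrative cluster made of the four workers listed below plus one
    # parameter server; the port numbers are arbitrary examples.
    cluster = tf.train.ClusterSpec({
        "ps":     ["clx-mld-41:50000"],
        "worker": ["clx-mld-41:50001", "clx-mld-42:50001",
                   "clx-mld-43:50001", "clx-mld-44:50001"]})

    # Each process starts one server for its job/task. "grpc+verbs" requires
    # TensorFlow compiled with VERBS support (enabled in the configure step).
    server = tf.train.Server(cluster, job_name="worker", task_index=0,
                             protocol="grpc+verbs")
    server.join()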

     

    Equipment

     

    This document does not cover the servers' storage aspect. You should configure the servers with the storage components appropriate to your use case (data set size).

    Setup Logical Design

    Server Logical Design

     

     

    Server Wiring

    If you have a dual-port NIC, you should disable one port.
    Due to certain limitations in the current TensorFlow version, you may face issues if both ports are enabled.

    In our reference we wire the 1st port to the IB switch and disable the 2nd port.

    We'll cover the procedure later, in the Installing Mellanox OFED section.

     

    Server Block Diagram

     


    Network Configuration

    Each server is connected to the SB7700 switch by a 100Gb/s IB copper cable. The switch port connectivity in our case is as follows:

    • 1st-4th ports – connected to the Worker servers

    Server names and network configuration are provided below.

    Server type       Server name   Internal network        External network
    Worker Server 01  clx-mld-41    ib0: 12.12.12.41        eno1: from DHCP (reserved)
    Worker Server 02  clx-mld-42    ib0: 12.12.12.42        eno1: from DHCP (reserved)
    Worker Server 03  clx-mld-43    ib0: 12.12.12.43        eno1: from DHCP (reserved)
    Worker Server 04  clx-mld-44    ib0: 12.12.12.44        eno1: from DHCP (reserved)


    Deployment Guide


    Prerequisites


    Required Host Software

    Prior to installing TensorFlow, the following software must be installed.

     

    Disable the Nouveau Kernel Driver on the Host

     

    Prior to installing the latest NVIDIA drivers and CUDA on Ubuntu 16.04, the Nouveau kernel driver must be disabled. To disable it, follow the procedure below.

     

    1. Check that the Nouveau kernel driver is loaded.
      $ lsmod |grep nouv
    2. Remove all NVIDIA packages.

      Skip this step if your system is freshly installed.
      $ sudo apt-get remove nvidia* && sudo apt autoremove
    3. Install the packages below, which are required for building kernel modules.

      $ sudo apt-get install dkms build-essential linux-headers-generic -y
    4. Block and disable the Nouveau kernel driver.
      $ sudo vim /etc/modprobe.d/blacklist.conf
    5. Insert the following lines into the blacklist.conf file.
      blacklist nouveau
      blacklist lbm-nouveau
      options nouveau modeset=0
      alias nouveau off
      alias lbm-nouveau off
    6. Disable the Nouveau kernel module and update the initramfs image.  (Although the nouveau-kms.conf file may not exist, it will not affect this step).
      $ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
      $ sudo update-initramfs -u
    7. Reboot
      $ sudo reboot
    8. Check that the Nouveau kernel driver is not loaded (the command should produce no output).
      $ lsmod |grep nouveau

     

    Install General Dependencies

    1. To install general dependencies, run the commands below (or copy and paste each line).
      $ sudo apt-get install openjdk-8-jdk git build-essential python-virtualenv swig python-wheel libcupti-dev -y
    2. To install TensorFlow, you must install the following packages:
      • Numpy: A numerical processing package that TensorFlow requires
      • dev: Enables adding extensions to Python
      • pip: Enables installing and managing of certain Python packages
      • wheel: Enables management of Python compressed packages in the wheel (.whl) format

    To install these packages for Python 2.7, run:

    $ sudo apt-get install python-numpy python-dev python-pip python-wheel -y

     

     

    Update Ubuntu Software Packages

    To update/upgrade Ubuntu software packages, run the commands below.

    $ sudo apt-get update            # Fetches the list of available update
    $ sudo apt-get upgrade -y        # Strictly upgrades the current packages

     

    Install the NVIDIA Drivers on a Host

     

    The 367 (or later) NVIDIA drivers must be installed. To install them, you can use the Ubuntu built-in Additional Drivers tool after updating the driver packages, or follow the steps below.

    1. Go to the NVIDIA’s website (http://www.nvidia.com/download/driverResults.aspx/117079/en-us).
    2. Download the latest version of the driver. The example below uses a Linux 64-bit driver (NVIDIA-Linux-x86_64-375.51).
    3. Set the RunLevel to 3 with the program init.
      $ sudo init 3
    4. Once the package is downloaded, follow the steps listed below.
      $ sudo dpkg -i nvidia-driver-local-repo-ubuntu1604_375.51-1_amd64.deb
      $ sudo apt-get update
      $ sudo apt-get install cuda-drivers -y
      During the run, you will be asked to confirm several prompts, such as a pre-install script failure warning, missing 32-bit libraries, and more.
    5. Once installed using additional drivers, restart your computer.
      $ sudo reboot

     

    Verify the Installation

    Make sure the NVIDIA driver can work correctly with the installed GPU card.

    $ lsmod |grep nvidia


     

    Run the nvidia-debugdump utility to collect internal GPU information.

    $ nvidia-debugdump -l

    Run the nvidia-smi utility to check the NVIDIA System Management Interface.

    $ nvidia-smi


    Enable the Subnet Manager (SM) on the IB Switch

     

    Refer to the MLNX-OS User Manual to become familiar with the switch software (located at support.mellanox.com).
    Before starting to use the Mellanox switch, we recommend that you upgrade the switch to the latest MLNX-OS version.

    There are three options to select the best place to locate the SM:

    1. Enabling the SM on one of the managed switches. This is a very convenient and quick operation, and it makes InfiniBand easily 'plug & play'.
    2. Run /etc/init.d/opensmd on one or more servers. It is recommended to run the SM on a server in case there are 648 nodes or more.
    3. Use a dedicated Unified Fabric Management (UFM®) Appliance server. UFM offers much more than the SM. UFM needs more compute power than the existing switches have, but does not require an expensive server; it does, however, represent the additional cost of a dedicated server.

    We'll explain options 1 and 2 only.

    Option 1: Configuring the SM on a switch running MLNX-OS® (applies to all Mellanox switch systems).
    To enable the SM on one of the managed switches, follow these steps.

    1. Log in to the switch and enter config mode:
      Mellanox MLNX-OS Switch Management

      switch login: admin
      Password:
      Last login: Wed Aug 12 23:39:01 on ttyS0

      Mellanox Switch

      switch [standalone: master] > enable
      switch [standalone: master] # conf t
      switch [standalone: master] (config)#
    2. Run the command:
      switch [standalone: master] (config)#ib sm
      switch [standalone: master] (config)#
    3. Check if the SM is running. Run:

      switch [standalone: master] (config)#show ib sm
      enable
      switch [standalone: master] (config)#

    To save the configuration (permanently), run:

    switch (config) # configuration write

     

     

    Option 2: Configuring the SM on a Server (skip this procedure if you enabled the SM on the switch)

    To start up OpenSM on a server, simply run opensm from the command line on your management node by typing:

    # opensm

    Or:

    Start OpenSM automatically on the head node by editing the /etc/opensm/opensm.conf file.

    Create a configuration file by running:

    # opensm --config /etc/opensm/opensm.conf

    Edit the /etc/opensm/opensm.conf file to include the following line:

    onboot=yes

    Upon initial installation, OpenSM is configured and running with a default routing algorithm. When running a multi-tier fat-tree cluster, it is recommended to change the following options to create the most efficient routing algorithm delivering the highest performance:

    --routing_engine=updn

    For full details on other configurable attributes of OpenSM, see the “OpenSM – Subnet Manager” chapter of the Mellanox OFED for Linux User Manual.
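
    Whichever option you choose, you can verify from any host that an SM is active on the fabric with the sminfo utility from infiniband-diags (the output below is illustrative):

    # sminfo
    sminfo: sm lid 1 sm guid 0x248a0703004bf0f0, activity count 9527 priority 15 state 3 SMINFO_MASTER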

     

    Installing Mellanox OFED for Ubuntu on a Host

    This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox ConnectX®-5 adapter card installed. For more information click on Mellanox OFED for Linux User Manual.

     

    Downloading Mellanox OFED

    1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed.
      # lspci -v | grep Mellanox
      The following example shows a system with an installed Mellanox HCA.
    2. Download the tarball according to your OS to your host.
      The archive's name has the format
      MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.tgz. You can download it from:
      http://www.mellanox.com > Products > Software > InfiniBand/VPI Drivers > Mellanox OFED Linux (MLNX_OFED) > Download.

    3. Use the MD5SUM utility to confirm the downloaded file’s integrity. Run the following command and compare the result to the value provided on the download page.

       

      # md5sum MLNX_OFED_LINUX-<ver>-<OS label>.tgz

       

    Installing Mellanox OFED

    MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:

    • Discovers the currently installed kernel
    • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
    • Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
    • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

    The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

    1. Log into the installation machine as root.
    2. Copy the downloaded tgz to /tmp and extract it:
      # cd /tmp
      # tar -xzvf MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64.tgz
      # cd MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64/
    3. Run the installation script.
      # ./mlnxofedinstall
    4. Reboot after the installation finishes successfully.

      # /etc/init.d/openibd restart

      # reboot

      By default, both ConnectX®-5 VPI ports are initialized as InfiniBand ports.

    5. Disable the unused 2nd port on the device.
      Identify the PCI IDs of your NIC ports:

      # lspci | grep Mellanox

      05:00.0 Infiniband controller: Mellanox Technologies Device 1019

      05:00.1 Infiniband controller: Mellanox Technologies Device 1019

      Disable the 2nd port:
      # echo 0000:05:00.1 > /sys/bus/pci/drivers/mlx5_core/unbind
    6. Check that the port mode is InfiniBand:
      # ibv_devinfo

    7. If ibv_devinfo shows the link layer as Ethernet rather than InfiniBand, you need to change the interface port type.
      ConnectX®-5 ports can be individually configured to work as InfiniBand or Ethernet ports.
      Change the mode to InfiniBand. Use the mlxconfig script after the driver is loaded.
      * LINK_TYPE_P1=1 is InfiniBand mode
      a. Start mst and see the port names:
      # mst start
      # mst status

      b. Change the mode of port 1 to InfiniBand:

      # mlxconfig -d /dev/mst/mt4121_pciconf0 s LINK_TYPE_P1=1
      #Port 1 set to IB mode
      # reboot

      After each reboot you need to disable the 2nd port again (a persistence sketch follows this procedure).
      c. Query the InfiniBand devices and print the information about them that is available from userspace.

       

      # ibv_devinfo

       

    8. Run the ibdev2netdev utility to see all the associations between the network devices and the IB devices/ports, then assign an IP address to the ib0 interface.

      # ibdev2netdev

      # ifconfig ib0 12.12.12.41 netmask 255.255.255.0

    9. Insert the lines below into the /etc/network/interfaces file, after the following existing lines:

      # vim /etc/network/interfaces

      auto eno1

      iface eno1 inet dhcp

      The new lines:
      auto ib0
      iface ib0 inet static
      address 12.12.12.41
      netmask 255.255.255.0
      Example:
      # vim /etc/network/interfaces

      auto eno1
      iface eno1 inet dhcp

      auto ib0
      iface ib0 inet static
      address 12.12.12.41
      netmask 255.255.255.0
    10. Check that the network configuration is set correctly.
      # ifconfig -a
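
    Because the port unbind from step 5 does not survive a reboot, one convenient way to reapply it automatically (an illustrative sketch, not part of the official procedure) is to add the unbind command to /etc/rc.local, above the final exit 0 line:

    # vim /etc/rc.local

    # Unbind the unused 2nd port at boot (adjust the PCI ID to your system)
    echo 0000:05:00.1 > /sys/bus/pci/drivers/mlx5_core/unbind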

       

    Install NVIDIA CUDA Toolkit 8.0 and cuDNN on a Host

     

    Pre-installation Actions

    The following actions must be taken before installing the CUDA Toolkit and Driver on Linux:

    • Verify the system has a CUDA-capable GPU
    • Verify the system is running a supported version of Linux
    • Verify the system has gcc installed
    • Verify the system has the correct kernel headers and development packages installed
    • Download the NVIDIA CUDA Toolkit
    • Handle conflicting installation methods

    You can override the install-time prerequisite checks by running the installer with the “--override” flag. Remember that the prerequisites will still be required to use the NVIDIA CUDA Toolkit.

     

    Verify You Have a CUDA-Capable GPU

    To verify that your GPU is CUDA-capable, go to your distribution's equivalent of System Properties, or, from the command line, enter:

    $ lspci | grep -i nvidia

    If you do not see any settings, update the PCI hardware database that Linux maintains by entering “update-pciids” (generally found in /sbin) at the command line and rerun the previous lspci command.
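
    For example:

    $ sudo update-pciids
    $ lspci | grep -i nvidia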

    If your graphics card is from NVIDIA, and it is listed in http://developer.nvidia.com/cuda-gpus, your GPU is CUDA-capable.

    The Release Notes for the CUDA Toolkit also contain a list of supported products.

     

    Verify You Have a Supported Linux Version

    The CUDA Development Tools are only supported on some specific distributions of Linux. These are listed in the CUDA Toolkit release notes.

    To determine which distribution and release number you are running, type the following at the command line:

    $ uname -m && cat /etc/*release

    You should see output similar to the following, modified for your particular system:

     

    x86_64

    Ubuntu 16.04.2 LTS

    The x86_64 line indicates you are running on a 64-bit system. The remainder gives information about your distribution.

     

    Verify the System Has a gcc Compiler Installed

    The gcc compiler is required for development using the CUDA Toolkit. It is not required for running CUDA applications. It is generally installed as part of the Linux installation, and in most cases the version of gcc installed with a supported version of Linux will work correctly.
    To verify the version of gcc installed on your system, type the following on the command line:

    $ gcc --version

    You should see output similar to the following, modified for your particular system:

    gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609

    If an error message is displayed, you need to install the “development tools” from your Linux distribution or obtain a version of gcc and its accompanying toolchain from the Web.

     

    Verify the System has the Correct Kernel Headers and Development Packages Installed

    The CUDA Driver requires that the kernel headers and development packages for the running version of the kernel be installed at the time of the driver installation, as well as whenever the driver is rebuilt. For example, if your system is running kernel version 3.17.4-301, the 3.17.4-301 kernel headers and development packages must also be installed.

    While the Runfile installation performs no package validation, the RPM and DEB installations of the driver will make an attempt to install the kernel header and development packages if no version of these packages is currently installed. However, it will install the latest version of these packages, which may or may not match the version of the kernel your system is using. Therefore, it is best to manually ensure the correct version of the kernel headers and development packages are installed prior to installing the CUDA Drivers, as well as whenever you change the kernel version.

    The version of the kernel your system is running can be found by running the following command:

    $ uname -r

    This is the version of the kernel headers and development packages that must be installed prior to installing the CUDA Drivers. This command will be used multiple times below to specify the version of the packages to install. Note that below are the common-case scenarios for kernel usage. More advanced cases, such as custom kernel branches, should ensure that their kernel headers and sources match the kernel build they are running.

    The kernel headers and development packages for the currently running kernel can be installed with:

    $ sudo apt-get install linux-headers-$(uname -r)

     

    Installation Process

    1. Download the base installation .run file from NVIDIA CUDA website.
    2. Create an account if you do not already have one, and log in (an account is also required to download cuDNN).
    3. Choose Linux > x86_64 > Ubuntu > 16.04 > runfile (local) and download the base installer and the patch.
      MAKE SURE YOU SAY NO TO INSTALLING NVIDIA DRIVERS!
      Make sure you select yes to creating a symbolic link to your CUDA directory.
      $ cd /root # or directory to where you downloaded file
      $ sudo sh cuda_8.0.61_375.26_linux.run --override # hold s to skip
    4. Install CUDA into /usr/local/cuda, answering the installer prompts as follows:
      Do you accept the previously read EULA?
      accept/decline/quit: accept
      Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 375.26?
      (y)es/(n)o/(q)uit: N
      Install the CUDA 8.0 Toolkit?
      (y)es/(n)o/(q)uit: Y
      Enter Toolkit Location[ default is /usr/local/cuda-8.0 ]: Enter
      Do you want to install a symbolic link at /usr/local/cuda?
      (y)es/(n)o/(q)uit: Y
      Install the CUDA 8.0 Samples?
      (y)es/(n)o/(q)uit: Y
      Enter CUDA Samples Location[ default is /root ]: Enter
      Installing the CUDA Toolkit in /usr/local/cuda-8.0 ...

    To install cuDNN, download cuDNN v6.0.20-1 for CUDA 8.0 from the NVIDIA website and extract it into /usr/local/cuda via:

    $ tar -xzvf cudnn-8.0-linux-x64-v6.0.tgz

    $ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
    $ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
    $ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*


    Post-installation Actions

     

    Mandatory Actions

    Some actions must be taken after the installation before the CUDA Toolkit and Driver can be used.

     

       Environment Setup (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#environment-setup)
    • The “PATH” variable needs to include /usr/local/cuda-8.0/bin
    • In addition, when using the .run file installation method, the “LD_LIBRARY_PATH” variable needs to contain /usr/local/cuda-8.0/lib64.
    • Update your bash file.
      $ vim ~/.bashrc
      This will open your bash file in a text editor; scroll to the bottom and add these lines:
      export CUDA_HOME=/usr/local/cuda-8.0
      export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
      export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
    • Once you save and close the text file, you can return to your original terminal and type this command to reload your .bashrc file.
      $ source ~/.bashrc
    • Check that the paths have been properly modified.

      $ echo $CUDA_HOME
      $ echo $PATH
      $ echo $LD_LIBRARY_PATH

    • Set the “LD_LIBRARY_PATH” and “CUDA_HOME” environment variables. Consider adding the commands below to your ~/.bash_profile. These assume your CUDA installation is in /usr/local/cuda-8.0.
      $ vim ~/.bash_profile
      This will open your file in a text editor; scroll to the bottom and add these lines:
      export CUDA_HOME=/usr/local/cuda-8.0
      export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
      export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

     

    Other actions are recommended to verify the integrity of the installation.

    • Install Writable Samples (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#install-samples)
      In order to modify, compile, and run the samples, the samples must be installed with “write” permissions. A convenience installation script is provided:
      $ cuda-install-samples-8.0.sh ~
      This script is installed with the cuda-samples-8-0 package. The cuda-samples-8-0 package installs only a read-only copy in /usr/local/cuda-8.0/samples.
    • Verify the Installation (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#verify-installation)
      Before continuing, it is important to verify that the CUDA toolkit can find and communicate correctly with the CUDA-capable hardware. To do this, you need to compile and run some of the included sample programs.
      Ensure the PATH and, if using the .run file installation method, the LD_LIBRARY_PATH variables are set correctly. See section Mandatory Actions.
    • Verify the Driver Version (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#verify-driver)
      * If you installed the driver, verify that the correct version of it is loaded.
      * If you did not install the driver, or are using an operating system where the driver is not loaded via a kernel module, such as L4T, skip this step.
      When the driver is loaded, the driver version can be found by executing the following command.
      $ cat /proc/driver/nvidia/version
      You should see output similar to the following:
      NVRM version: NVIDIA UNIX x86_64 Kernel Module  375.20  Tue Nov 15
      16:49:10 PST 2016
      GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
    • Compiling the Examples (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#compiling-examples)
      The version of the CUDA Toolkit can be checked by running “nvcc -V” in a terminal window. The “nvcc” command runs the compiler driver that compiles the CUDA programs. It calls the “gcc” compiler for C code and the NVIDIA PTX compiler for the CUDA code.
      $ nvcc -V
      You should see output similar to the following:
      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2016 NVIDIA Corporation
      Built on Tue_Jan_10_13:22:03_CST_2017
      Cuda compilation tools, release 8.0, V8.0.61
      The NVIDIA CUDA Toolkit includes sample programs in the source form. You should compile them by changing to ~/NVIDIA_CUDA-8.0_Samples and typing make. The resulting binaries will be placed under ~/NVIDIA_CUDA-8.0_Samples/bin.
      $ cd ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery/
      $ make
      $ cd ~/NVIDIA_CUDA-8.0_Samples

      $ ./bin/x86_64/linux/release/deviceQuery
    • Running the Binaries (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#running-binaries)
      After the compilation, find and run deviceQuery under ~/NVIDIA_CUDA-8.0_Samples. If the CUDA software is installed and configured correctly, the output should look similar to the below.

      The exact appearance and the output lines might be different on your system. The important outcomes are that a device was found (the first highlighted line), that the device matches the one on your system (the second highlighted line), and that the test passed (the final highlighted line).
      If a CUDA-capable device and the CUDA Driver are installed but the deviceQuery reports that no CUDA-capable devices are present, this likely means that the /dev/nvidia* files are missing or have the wrong permissions.
      Running the bandwidthTest program ensures that the system and the CUDA-capable device are able to communicate correctly. Its output is shown below.

      $ cd ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/bandwidthTest/

      $ make

      $ cd ~/NVIDIA_CUDA-8.0_Samples

      $ ./bin/x86_64/linux/release/bandwidthTest

    Note that the measurements for your CUDA-capable device description will vary from system to system. The important point is that you obtain measurements, and that the second-to-last line confirms that all necessary tests passed.
    Should the tests not pass, make sure you have a CUDA-capable NVIDIA GPU on your system and make sure it is properly installed.
    If you run into difficulties with the link step (such as libraries not being found), consult the Linux Release Notes found in the doc folder in the CUDA Samples directory.

     

    Installing and Configuring LXD

    Installing LXD

    To install LXD (current version 2.16), we recommend using the official Ubuntu PPA (Personal Package Archive):

    $ sudo apt-add-repository ppa:ubuntu-lxc/stable

    $ sudo apt update

    $ sudo apt dist-upgrade

    $ sudo apt install lxd

    Configuring LXD

    To configure storage and networking, go through the whole LXD step-by-step setup with:

    $ sudo lxd init

    Here is an example execution of the “init” command. In the example we configure the installation with the default "dir" storage backend and with a “lxdbr0” bridge as a convenience.

    By default, this bridge comes unconfigured, offering only IPv6 link-local connectivity through an HTTP proxy.

    Do you want to configure a new storage pool (yes/no) [default=yes]? Enter

    Name of the new storage pool [default=default]: Enter

    Name of the storage backend to use (dir, btrfs, lvm) [default=dir]: Enter

    Would you like LXD to be available over the network (yes/no) [default=no]? Enter

    Would you like stale cached images to be updated automatically (yes/no) [default=yes]? Enter

    Would you like to create a new network bridge (yes/no) [default=yes]? Enter

    What should the new bridge be called [default=lxdbr0]? Enter

    What IPv4 address should be used (CIDR subnet notation, "auto" or "none") [default=auto]? Enter

    What IPv6 address should be used (CIDR subnet notation, "auto" or "none") [default=auto]? none

                   

    LXD has been successfully configured.

    You can then look at the “lxdbr0” bridge config with:

    $ lxc network show lxdbr0

    Its output is shown below.

    config:

      ipv4.address: 10.143.11.1/24

      ipv4.nat: "true"

      ipv6.address: none

    description: ""

    name: lxdbr0

    type: bridge

    Preparing Container's Network

    Create a /etc/dnsmasq.conf.lab file

    $ vim /etc/dnsmasq.conf.lab

    and add these lines:

    domain=lab-ml.cloudx.mlnx

    # verbose

    log-queries

    log-dhcp

    dhcp-option=6,8.8.8.8

    Run the following commands to change the IPv4 network and add the dnsmasq.conf.lab configuration:

    $ lxc network set lxdbr0 ipv4.address 10.10.43.1/24                                                         

    $ lxc network set lxdbr0 raw.dnsmasq "conf-file=/etc/dnsmasq.conf.lab" 

    and look at the “lxdbr0” bridge config again with:

    $ lxc network show lxdbr0

    Its output is shown below.

    config:

      ipv4.address: 10.10.43.1/24

      ipv4.nat: "true"

      ipv6.address: none

      raw.dnsmasq: conf-file=/etc/dnsmasq.conf.lab

    description: ""

    name: lxdbr0

    type: bridge

    Changing the LXD Service Configuration for Containers' Static MAC and IP Addresses (Optional)

    Run this procedure on each host.

    Edit the lxd service file:

    $ vim /lib/systemd/system/lxd.service

    and add the following ExecStartPost line (change c43 accordingly on the other hosts), so that the [Service] section reads:

    [Service]

     

    EnvironmentFile=-/etc/environment

    ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load

    ExecStart=/usr/bin/lxd --group lxd --logfile=/var/log/lxd/lxd.log

    ExecStartPost=/usr/bin/lxd waitready --timeout=600

    ExecStartPost=/bin/bash -c 'rm -f /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts && for i in {2..254}; do echo "00:16:3e:43:01:$(printf '%02x' $i),10.10.43.$i,c43$i" >> /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts ; done'

     

    Restart the lxd service:

    $ systemctl daemon-reload

    $ killall -SIGHUP dnsmasq

    $ service lxd restart

    $ service lxd status

     

    Check /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts file:

    $ cat /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts

    00:16:3e:43:01:02,10.10.43.2,c432
    00:16:3e:43:01:03,10.10.43.3,c433

    00:16:3e:43:01:04,10.10.43.4,c434

    00:16:3e:43:01:05,10.10.43.5,c435

    00:16:3e:43:01:06,10.10.43.6,c436  

    ...                                            

    If you don't see these entries, restart the lxd service and check again:

    $ service lxd restart

     

    Check LXD service status:

    $ service lxd status

    lxd.service - LXD - main daemon

       Loaded: loaded (/lib/systemd/system/lxd.service; indirect; vendor preset: enabled)

      Drop-In: /etc/systemd/system/lxd.service.d

               override.conf

       Active: active (running) since Thu 2017-08-10 14:57:33 IDT; 3min 38s ago

         Docs: man:lxd(1)

      Process: 6406 ExecStartPost=/bin/bash -c rm -f /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts && for i in {2..254};

      Process: 6326 ExecStartPost=/usr/bin/lxd waitready --timeout=600 (code=exited, status=0/SUCCESS)

      Process: 6314 ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load (code=exited, status=0/SUCCESS)

    Main PID: 6325 (lxd)

       

       Memory: 10.1M

          CPU: 324ms

       CGroup: /system.slice/lxd.service

               6325 /usr/bin/lxd --group lxd --logfile=/var/log/lxd/lxd.log

               6391 dnsmasq --strict-order --bind-interfaces --pid-file=/var/lib/lxd/networks/lxdbr0/dnsmasq.pid --e

     

    Aug 10 14:57:33 clx-mld-43 dnsmasq[6391]: using local addresses only for domain lxd

    Aug 10 14:57:33 clx-mld-43 dnsmasq[6391]: reading /etc/resolv.conf

    Aug 10 14:57:33 clx-mld-43 dnsmasq[6391]: using local addresses only for domain lxd

    Aug 10 14:57:33 clx-mld-43 dnsmasq[6391]: using nameserver 10.143.119.43#53

    Aug 10 14:57:33 clx-mld-43 dnsmasq[6391]: using nameserver 8.8.8.8#53

    Aug 10 14:57:33 clx-mld-43 dnsmasq[6391]: read /etc/hosts - 5 addresses

    Aug 10 14:57:33 clx-mld-43 dnsmasq-dhcp[6391]: read /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts

    Aug 10 14:57:33 clx-mld-43 dnsmasq[6391]: read /etc/hosts - 5 addresses

    Aug 10 14:57:33 clx-mld-43 dnsmasq-dhcp[6391]: read /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts

    Aug 10 14:57:33 clx-mld-43 systemd[1]: Started LXD - main daemon.

    Add static routes on each host (the sample below is from host 43):

    $ sudo route add -net 10.10.41.0/24 gw 12.12.12.41

    $ sudo route add -net 10.10.42.0/24 gw 12.12.12.42

    $ sudo route add -net 10.10.44.0/24 gw 12.12.12.44

    $ sudo route

    Kernel IP routing table

    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface

    10.10.41.0      12.12.12.41    255.255.255.0   UG   0      0        0 ib1

    10.10.42.0      12.12.12.42    255.255.255.0   UG   0      0        0 ib1

    10.10.43.0      *              255.255.255.0   U    0      0        0 lxdbr0

    10.10.44.0      12.12.12.44    255.255.255.0   UG   0      0        0 ib1

    10.143.119.0    *              255.255.255.0   U    0      0        0 enp129s0f0

    12.12.12.0      *              255.255.255.0   U    0      0        0 ib1
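
    Note that routes added with the route command are lost on reboot. One way to make them persistent (an illustrative sketch for host 43; adjust the networks and gateways on each host) is to add post-up lines to the ib0 stanza in /etc/network/interfaces:

    auto ib0
    iface ib0 inet static
        address 12.12.12.43
        netmask 255.255.255.0
        post-up route add -net 10.10.41.0/24 gw 12.12.12.41
        post-up route add -net 10.10.42.0/24 gw 12.12.12.42
        post-up route add -net 10.10.44.0/24 gw 12.12.12.44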

    Preparing LXC Container

    By default, LXD creates unprivileged containers. This means that root in the container is a non-root UID on the host. It is privileged against the resources owned by the container, but unprivileged with respect to the host, making root in a container roughly equivalent to an unprivileged user on the host. (The main exception is the increased attack surface exposed through the system call interface.)

    Briefly, in an unprivileged container, 65536 UIDs are 'shifted' into the container. For instance, UID 0 in the container may be 100000 on the host, UID 1 in the container is 100001, etc., up to 165535. The starting values for UIDs and GIDs are determined by the 'root' entries in the /etc/subuid and /etc/subgid files.
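
    For reference, the 'root' entries typically look like the following (the values here are an example; yours may differ):

    $ cat /etc/subuid /etc/subgid
    root:100000:65536
    root:100000:65536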

    We need to request that the container run without a UID mapping by setting the security.privileged flag to true (change it in the default profile):

    $ lxc profile set default security.privileged true

    Note however that in this case the root user in the container is the root user on the host.

     

    Running verbs and RDMA-based applications in a container requires access to the host OS's InfiniBand devices (the uverbs interface). This access can be granted to a container by running the following command (which changes the default profile):

    $ lxc profile device add default uverbs1 unix-char source=/dev/infiniband/uverbs1

    The host's InfiniBand devices can be seen by checking the contents of the /dev/infiniband/ folder.

    $ sudo ls /dev/infiniband

    issm0 issm1 rdma_cm ucm0 ucm1 umad0 umad1 uverbs0 uverbs1

    $ sudo ibdev2netdev

    mlx5_0 port 1 ==> enp5s0f0 (Down)

    mlx5_1 port 1 ==> ib0 (Up)

    In our example, there are two mlx5_ devices on the host, resulting in two ucm, umad, and uverbs interfaces in /dev/infiniband. At runtime, you choose which devices are exposed to which running containers. In our example, when running a single container, we choose to expose the second InfiniBand device (uverbs1) to the running container.

    To show default profile run:

    $ lxc profile show default

    You should see output similar to the following:

    config:

      environment.http_proxy: ""

      security.privileged: "true"

      user.network_mode: ""

    description: Default LXD profile

    devices:

      eth0:

        nictype: bridged

        parent: lxdbr0

        type: nic

      root:

        path: /

        pool: default

        type: disk

      uverbs1:

        source: /dev/infiniband/uverbs1

        type: unix-char

    name: default

    Creating a New Container

    The creation syntax is:

    lxc init images:{distro}/{version}/{arch} {container-name-here}

    To create an Ubuntu 16.04 container that will use all 8 GPUs, use the following commands:

    $ lxc init ubuntu:16.04 c432

     

    Set a static MAC address for the container:

    $ lxc config set c432 volatile.eth0.hwaddr "00:16:3e:43:01:02"

     

    That will create a new Ubuntu 16.04 container, which can be confirmed with:

    $ lxc list

    To push the installation files to the container, use:

    $ lxc file push MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64.tgz c432/tmp/

    $ lxc file push nvidia-driver-local-repo-ubuntu1604_375.51-1_amd64.deb c432/tmp/                            

    $ lxc file push cuda_8.0.61_375.26_linux.run c432/tmp/                                                      

    $ lxc file push cudnn-8.0-linux-x64-v6.0.tgz c432/tmp/

    Alternatively, set up file sharing: mount a shared directory into the container to access the installer and example files.

    $ lxc config device add c432 installs disk source=/root/installs path=/root/installs

    Start the container:

    $ lxc start c432

     

    LXD GPU passthrough

    LXD allows for specific GPU passthrough:

     

    Make sure that the following 11 files exist on your host system:

    $ ls /dev/nvidia* 

    /dev/nvidia0  /dev/nvidia2  /dev/nvidia4  /dev/nvidia6  /dev/nvidiactl   /dev/nvidia-uvm-tools

    /dev/nvidia1  /dev/nvidia3  /dev/nvidia5  /dev/nvidia7  /dev/nvidia-uvm

    Next, initialize the LXD container with the NVIDIA devices mounted into the container.

    $ CONTAINER=c432
    $ lxc config device add $CONTAINER nvidia0 unix-char path=/dev/nvidia0
    $ lxc config device add $CONTAINER nvidia1 unix-char path=/dev/nvidia1
    $ lxc config device add $CONTAINER nvidia2 unix-char path=/dev/nvidia2
    $ lxc config device add $CONTAINER nvidia3 unix-char path=/dev/nvidia3
    $ lxc config device add $CONTAINER nvidia4 unix-char path=/dev/nvidia4
    $ lxc config device add $CONTAINER nvidia5 unix-char path=/dev/nvidia5
    $ lxc config device add $CONTAINER nvidia6 unix-char path=/dev/nvidia6
    $ lxc config device add $CONTAINER nvidia7 unix-char path=/dev/nvidia7
    $ lxc config device add $CONTAINER nvidiactl unix-char path=/dev/nvidiactl
    $ lxc config device add $CONTAINER nvidia-uvm unix-char path=/dev/nvidia-uvm
    $ lxc config device add $CONTAINER nvidia-uvm-tools unix-char path=/dev/nvidia-uvm-tools
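
    Equivalently, the repetitive commands above can be scripted with a small loop (a convenience sketch):

    $ CONTAINER=c432
    $ for i in $(seq 0 7); do lxc config device add $CONTAINER nvidia$i unix-char path=/dev/nvidia$i; done
    $ for d in nvidiactl nvidia-uvm nvidia-uvm-tools; do lxc config device add $CONTAINER $d unix-char path=/dev/$d; done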

    To log in and gain shell access in the container c432, enter:

    $ lxc exec c432 -- bash

    Installing TensorFlow in Container

     

    Required Container Software

    Prior to installing TensorFlow, the following software must be installed.

    Install General Dependencies

    1. To install general dependencies, run the commands below (or copy and paste each line).
      $ sudo apt-get install openjdk-8-jdk git build-essential python-virtualenv swig python-wheel libcupti-dev -y
    2. To install TensorFlow, you must install the following packages:
      • Numpy: A numerical processing package that TensorFlow requires
      • dev: Enables adding extensions to Python
      • pip: Enables installing and managing of certain Python packages
      • wheel: Enables management of Python compressed packages in the wheel (.whl) format
        To install these packages for Python 2.7, run:
        $ sudo apt-get install python-numpy python-dev python-pip python-wheel -y

    Update Ubuntu Software Packages

    To update/upgrade Ubuntu software packages, run the commands below.

    $ sudo apt-get update            # Fetches the list of available update
    $ sudo apt-get upgrade -y        # Strictly upgrades the current packages

    Install the NVIDIA Drivers on a Container

    The 367 (or later) NVIDIA drivers must be installed. To install them, you can use the Ubuntu built-in Additional Drivers tool after updating the driver packages, or follow the steps below.

    1. Go to the NVIDIA’s website (http://www.nvidia.com/download/driverResults.aspx/117079/en-us).
    2. Download the latest version of the driver. The example below uses a Linux 64-bit driver (NVIDIA-Linux-x86_64-375.51).
    3. Set the RunLevel to 3 with the program init.
      $ sudo init 3
    4. Once the package is downloaded, follow the steps listed below.
      $ cd /root/installs
      or
      $ cd /tmp
      $ sudo dpkg -i nvidia-driver-local-repo-ubuntu1604_375.51-1_amd64.deb
      $ sudo apt-get update
      $ sudo apt-get install cuda-demo-suite-8-0 --no-install-recommends


    Verify the Installation

    Make sure the NVIDIA driver can work correctly with the installed GPU card.

    $ lsmod |grep nvidia


     

    Run the nvidia-debugdump utility to collect internal GPU information.

    $ nvidia-debugdump -l

    Run the nvidia-smi utility to check the NVIDIA System Management Interface.

    $ nvidia-smi

     

    Installing Mellanox OFED on a Container

    Verify that the system has a Mellanox network adapter (HCA/NIC) installed.

    # lspci -v | grep Mellanox

     

    Installing Mellanox OFED

    MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:

    • Discovers the currently installed kernel
    • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
    • Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
    • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

    The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

    1. Log into the installation machine as root.

      # cd /tmp

      # tar -xzvf MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64.tgz

      # cd MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64/

    2. Run the installation script.

      # ./mlnxofedinstall --user-space-only --without-fw-update -q

    3. Check the MLNX_OFED version and the uverbs device:

      # ofed_info -s

      MLNX_OFED_LINUX-4.1-1.0.2.0:

      # ls /dev/infiniband/

      uverbs1

    4. Run a bandwidth stress test over IB between containers.

      On the server side:

      # ib_write_bw -a -R -d mlx5_1 &

      On the client side ($server_IP is the IB IP address of the server side):

      # ib_write_bw -a -R -F $server_IP -d mlx5_1

      In this way you can run a bandwidth stress test over IB between containers.

    Install NVIDIA CUDA Toolkit 8.0 and cuDNN on a Container

     

    The installation process is the same as in Install NVIDIA CUDA Toolkit 8.0 and cuDNN on a Host, above.

     

    Install TensorFlow

     

    Clone the TensorFlow Repository

    To clone the latest TensorFlow repository, issue the following command:

    $ cd ~
    $ git clone https://github.com/tensorflow/tensorflow

    The preceding git clone command creates a subdirectory called “tensorflow”. After cloning, you may optionally build a specific branch (such as a release branch) by invoking the following commands:

    $ cd tensorflow
    $ git checkout r1.3         # check out the r1.3 release branch (master is the default)

     

    Install the Bazel Tool (latest version)

    Bazel is a build tool from Google. For further information, please see http://bazel.io/docs/install.html

    1. Download and install JDK 8, which is required by Bazel.

      $ sudo add-apt-repository ppa:webupd8team/java

      $ sudo apt-get update

      $ sudo apt-get install oracle-java8-installer

    2. Install the Bazel tool.

      $ echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list

      $ curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -

      $ sudo apt-get update

      $ sudo apt-get install bazel

      $ sudo apt-get upgrade bazel

     

     

    Install TensorFlow using the configure Script

    The root of the source tree contains a bash script named “configure”. This script asks you to identify the pathname of all relevant TensorFlow dependencies and specify other build configuration options such as compiler flags.

    Run the configure script prior to creating the pip package and installing TensorFlow.

    To build TensorFlow with GPU, the “configure” script needs to know the version numbers of CUDA and cuDNN. If several versions of CUDA or cuDNN are installed on your system, explicitly select the desired version instead of relying on the system default.

    $ cd ~/tensorflow            # cd to the top-level directory created
    $ ./configure

    If you receive the following error message:

    locale.Error: unsupported locale setting

     

    1. Run the following command:

      export LANGUAGE=en_US.UTF-8

      export LANG=en_US.UTF-8

      export LC_ALL=en_US.UTF-8
      For further information see: http://askubuntu.com/questions/205378/unsupported-locale-setting-fault-by-command-not-found

    2. Or, edit the locale file: /etc/default/locale to:

      LANGUAGE=en_US.UTF-8
      LANG=en_US.UTF-8
      LC_ALL=en_US.UTF-8
    3. Restart the computer.
      Here is an example execution of the “configure” script. In the example we configure the installation with GPU and CUDA libraries support.
      Please specify the location of python. [Default is /usr/bin/python]: Enter

      Please specify optimization flags to use during compilation when bazel
      option "--config=opt" is specified [Default is -march=native]: Enter

      Do you wish to use jemalloc as the malloc implementation? [Y/n] Y
      jemalloc enabled

      Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] N
      No Google Cloud Platform support will be enabled for TensorFlow

      Do you wish to build TensorFlow with Hadoop File System support? [y/N] N
      No Hadoop File System support will be enabled for TensorFlow

      Do you wish to build TensorFlow with the XLA just-in-time compiler (experimental)? [y/N] N
      No XLA JIT support will be enabled for TensorFlow

      Do you wish to build TensorFlow with VERBS support? [y/N] Y
      VERBS support will be enabled for TensorFlow

      Found possible Python library paths:  /usr/local/lib/python2.7/dist-packages  /usr/lib/python2.7/dist-packages
      Please input the desired Python library path to use.  Default is [/usr/local/lib/python2.7/dist-packages]  Enter
      Using python library path: /usr/local/lib/python2.7/dist-packages

      Do you wish to build TensorFlow with OpenCL support? [y/N] N
      No OpenCL support will be enabled for TensorFlow

      Do you wish to build TensorFlow with CUDA support? [y/N] Y   (Y for a Worker, N for a dedicated Parameter Server)
      CUDA support will be enabled for TensorFlow

      Do you want to use clang as CUDA compiler? [y/N] N
      nvcc will be used as CUDA compiler

      Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:  Enter

      Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 8.0

      Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:  Enter

      Please specify the cuDNN version you want to use. [Leave empty to use system default]: 6

      Please specify the location where the cuDNN 6 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: Enter

      Please specify a list of comma-separated Cuda compute capabilities you want to build with.
      You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
      Please note that each additional compute capability significantly increases
      your build time and binary size.[Default is: "3.5,5.2"]: 6.0 (Tesla P100 from https://developer.nvidia.com/cuda-gpus)
      Extracting Bazel installation.............
      INFO: Starting clean (this may take a while). Consider using --async if the
      clean takes more than several minutes.
      Configuration finished

     

    Pip Installation

    Pip installation installs TensorFlow on your machine and may upgrade previously installed Python packages. Note that this may impact existing Python programs on your machine.
    Pip is a package management system used to install and manage software packages written in Python.
    This installation requires the code from GitHub. You can either take the most recent master branch (lots of new commits) or the latest release branch (should be more stable, but still updated every few days). Here, we use the r1.3 release branch, as checked out above.

     

    Build the Pip Package

     

    To build a pip package for TensorFlow with GPU support, invoke the following command:

    $ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

    The bazel build command builds a script named build_pip_package. Running this script as follows will build a .whl file within the /tmp/tensorflow_pkg directory:

    $ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

     

    Known issue: bazel 0.5.3 currently has a problem with TensorFlow r1.3. Please install bazel 0.5.2 instead.

     

     

    Install Bazel Version 0.5.2

     

    Uninstall bazel:

    $ sudo apt-get purge bazel

    The binary installers are on Bazel's GitHub releases page.

    The installer contains the Bazel binary and the required JDK. Some additional libraries must also be installed for Bazel to work.

     

    Install required packages

    $ sudo apt-get install pkg-config zip g++ zlib1g-dev unzip

     

    Download Bazel

    Go to Bazel's GitHub releases page.

    Download the binary installer bazel-0.5.2-installer-linux-x86_64.sh. This installer contains the Bazel binary and the required JDK, and can be used even if JDK is already installed.

    $ sudo wget https://github.com/bazelbuild/bazel/releases/download/0.5.2/bazel-0.5.2-installer-linux-x86_64.sh


    Note that bazel-0.5.2-without-jdk-installer-linux-x86_64.sh also exists. It is a version without the embedded JDK 8. Only use this installer if you already have JDK 8 installed.

     

    Run the installer

    $ sudo chmod +x bazel-0.5.2-installer-linux-x86_64.sh

    $ sudo ./bazel-0.5.2-installer-linux-x86_64.sh --user

    The --user flag installs Bazel to the $HOME/bin directory on your system and sets the .bazelrc path to $HOME/.bazelrc. Use the --help flag to see additional installation options.

     

    Set up your environment

    If you ran the Bazel installer with the --user flag as above, the Bazel executable is installed in your $HOME/bin directory. It's a good idea to add this directory to your default paths, as follows:

    $ export PATH="$PATH:$HOME/bin"

    You can also add this command to your ~/.bashrc file.

     

     

    Install the Pip Package

    You will need pip version 8.1 or later for the following command to work on Linux.

    Invoke pip install to install the .whl file built above. The filename depends on your platform. For example, the following command installs the pip package for TensorFlow r1.3 on Linux:

    $ sudo pip install /tmp/tensorflow_pkg/tensorflow-1.3.0rc1-cp27-cp27mu-linux_x86_64.whl


    Validate TensorFlow Installation

    To validate the TensorFlow installation:

    1. Close all your terminals and open a new terminal to test.
    2. Change directory (cd) to any directory on your system other than the tensorflow subdirectory from which you invoked the configure command.
    3. Invoke python:

      $ cd /

      $ python

      ...

      >>> import tensorflow as tf

      >>> hello = tf.constant('Hello, TensorFlow!')

      >>> sess = tf.Session()

      >>> print(sess.run(hello))

      Hello, TensorFlow!

      >>> a = tf.constant(10)

      >>> b = tf.constant(32)

      >>> print(sess.run(a + b))

      42

      >>>

      Press CTRL-D to exit.
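
    Additionally, to confirm that the VERBS (RDMA) transport was compiled in, a quick check (a sketch assuming VERBS support was enabled during configure and an IB device is visible to the process) is to start a local server with the grpc+verbs protocol; this raises an error if the transport is missing:

      >>> import tensorflow as tf
      >>> cluster = tf.train.ClusterSpec({"local": ["localhost:50051"]})
      >>> server = tf.train.Server(cluster, job_name="local", task_index=0, protocol="grpc+verbs")
      >>> print(server.target)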

     

    Appendix A: TensorFlow Benchmarks and TCP vs. RDMA comparison

    Google published a collection of performance benchmarks that highlight TensorFlow's speed and scalability when training image classification models like InceptionV3, ResNet and VGG16.

    Here we will provide our performance benchmark results for InceptionV3 and ResNet-50 over TCP and RDMA.

    Benchmarks were run using both real and synthetic data. We believe it is important to include real data (the ImageNet 2012 dataset) measurements when benchmarking a platform.

    Testing with synthetic data was done by using a tf.Variable set to the same shape as the data expected by each model for ImageNet.

    This load tests both the underlying hardware and the framework at preparing data for actual training.

    We start with synthetic data to remove disk I/O as a variable and to set a baseline. Real data is then used to verify that the TensorFlow input pipeline and the underlying disk I/O are saturating the compute units.
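
    A minimal sketch of this synthetic-data technique (the shapes and batch size here are illustrative, e.g. 224x224 RGB inputs for ResNet-50 with the batch size from the table below):

    import tensorflow as tf

    # Random images/labels held in variables: no disk I/O involved, with the
    # same shapes the real ImageNet input pipeline would produce.
    images = tf.Variable(tf.random_normal([64, 224, 224, 3]), trainable=False)
    labels = tf.Variable(tf.random_uniform([64], maxval=1000, dtype=tf.int64),
                         trainable=False)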

    The server hardware and configuration used for the TCP and IB RDMA benchmarks are identical.

     

    Details for our benchmarks

    Environment

    • Instance type: See setup overview
    • GPU: 8x NVIDIA® Tesla® P100
    • OS: Ubuntu 16.04 LTS with tests run via Docker
    • CUDA / cuDNN: 8.0 / 6.0
    • TensorFlow GitHub: r1.3
    • Benchmark GitHub hash: b922111
    • Build Command: bazel build -c opt --copt=-march="broadwell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
    • Disk: Local NVMe
    • DataSet: ImageNet 2012
    • Test Date: Aug 2017

     

    The batch size and optimizer used for the tests are listed in the table.

     

    Options              InceptionV3   ResNet-50
    Batch size per GPU   64            64
    Optimizer            sgd           sgd

     

     

    Configuration used for each model.

     

    Model        variable_update          local_parameter_device   cross_replica_sync
    InceptionV3  distributed_replicated   n/a                      True
    ResNet-50    distributed_replicated   n/a                      True
    ResNet-152   distributed_replicated   n/a                      True

     

     

    The server setup for the runs included 4 worker servers, as explained in the Setup Overview part of the document.

     

    Results

    Inception V3

    Training synthetic data

     

    ResNet-50

    Training synthetic data

     

    ResNet-152

    Training synthetic data

    Methodology

     

    This script was run to generate the above results.

     

    In order to create results that are as repeatable as possible, each test was run 3 times and then the times were averaged together. GPUs are run in their default state on the given platform. For each test, 10 warmup steps are done and then the next 100 steps are averaged.
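
    As an illustrative example of such a run (the flag names follow the tf_cnn_benchmarks script from the hash above; verify them against your checkout), a single worker over RDMA might be launched as:

    $ python tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 \
        --optimizer=sgd --variable_update=distributed_replicated \
        --cross_replica_sync=true --server_protocol=grpc+verbs \
        --job_name=worker --task_index=0 \
        --worker_hosts=12.12.12.41:50000,12.12.12.42:50000,12.12.12.43:50000,12.12.12.44:50000 \
        --ps_hosts=12.12.12.41:50001,12.12.12.42:50001,12.12.12.43:50001,12.12.12.44:50001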

     


    Appendix B: Common Installation Issues

     

    The installation issues that might occur typically depend on the installed operating system. For further information, please see the "Common installation problems" section of the Installing TensorFlow on Linux guide.
    Beyond the errors documented in that guide, the following table lists additional errors specific to building TensorFlow. Note that we rely on Stack Overflow as the repository for build and installation problems. If you encounter an error message not listed in the guide or in the following table, search for it on Stack Overflow. If Stack Overflow does not show the error message, ask a new question on Stack Overflow and specify the tensorflow tag.

     

    Stack Overflow Link    Error Message

    42013316               ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory

    42013316               ImportError: libcudnn.6: cannot open shared object file: No such file or directory

    35953210               Invoking `python` or `ipython` generates the following error: ImportError: cannot import name pywrap_tensorflow