Reference Deployment Guide for RoCE accelerated TensorFlow with NVIDIA GPU running on top of RH OSP 11 over Mellanox Network

Version 4

    Before reading this post, make sure you are familiar with Red Hat OpenStack Platform 11 installation procedures.

     

    Before starting to use the Mellanox switches, we recommend that you upgrade the switches to the latest Mellanox OS version.

    Purpose

    This document helps you evaluate TensorFlow over Red Hat OpenStack Platform 11. It focuses on the benefits of using Mellanox components to increase the performance and scalability of the solution.

    Definitions/Abbreviations

    Table 1: Abbreviations

    RDMA: A communications protocol that provides transmission of data from the memory of one computer to the memory of another without involving the CPU.
    iSER: iSCSI Extensions for RDMA. An extension of the data transfer model of iSCSI, a storage networking standard for TCP/IP, that uses the iSCSI components while taking advantage of the RDMA protocol suite.
    ConnectX-4: ConnectX-4 adapter cards with Virtual Protocol Interconnect (VPI), supporting EDR 100Gb/s InfiniBand and 100Gb/s Ethernet connectivity, provide the highest-performance and most flexible solution for high-performance, Web 2.0, cloud, data analytics, database, and storage platforms.
    VLAN: Virtual LAN (VLAN) is a network virtualization technology that attempts to improve the scalability problems associated with large cloud computing deployments.
    PCI passthrough: With PCI passthrough, you can assign a PCI device directly to one guest operating system. When you use PCI passthrough, the PCI device becomes unavailable to the host and to all other guest operating systems.

    Related links

    Table 2: Related Documentation

     

    Introduction

    Mellanox components overview

    In this solution, Mellanox provides both hardware and software components.

      • Mellanox ConnectX-4 / ConnectX-4 Lx / ConnectX-5 NICs bring SR-IOV functionality, resulting in improved performance.
      • The Mellanox plugin simplifies the OpenStack deployment process.
      • Mellanox SN2700 / SN2100 / SN2410 switches provide a reliable and high-performance Ethernet solution.

     

    Main highlights of this example

    • Mellanox SN2700 switches (32 x 100Gb ports)
    • Mellanox ConnectX-4 dual-port NIC for 100Gb uplinks from hosts
    • Tenant and storage networks are placed on different physical networks to improve security and performance
    • TensorFlow itself does not require storage, but an RHOSP 11 deployment without storage needs deep and non-trivial customization. This solution therefore describes deployment of high-performance Cinder storage accelerated with iSER (iSCSI over RDMA) to obtain maximum performance from the disk subsystem and to reduce CPU overhead during data transfer.
    • All cloud nodes are connected to the Local, External (Public), Tenant, and Storage networks
    • The Undercloud runs on the Red Hat Director VM hosted on the Deployment server

     

     

    Setup diagram

    The following illustration shows the example configuration.

    Figure 1: Setup diagram

     

    Solution configuration

    Setup hardware requirements (example)

    Table 3: Setup hardware requirements

    Component: Deployment server
    Quantity: 1
    Requirements: a non-high-performance server to run the RedHat Director VM:

    • CPU: Intel E5-26xx or later model
    • HD: 100 GB or larger
    • RAM: 32 GB or more
    • NICs: 2 x 1Gb

    Note: Intel Virtualization technologies must be enabled in BIOS.

    Component: Cloud Controller servers
    Quantity: 3
    Requirements: strong servers to run the Cloud control workload that meet the following:

    • CPU: Intel E5-26xx or later model
    • HD: 450 GB or larger
    • RAM: 128 GB or more
    • NICs:
      • 2 x 1Gb
      • Mellanox ConnectX-4 dual-port (MCX456A-ECAT) NIC

    Note: Intel Virtualization technologies must be enabled in BIOS.

    Component: Cloud Compute servers
    Quantity: 3
    Requirements: strong servers to run the tenants' VM workload that meet the following:

    • CPU: Intel E5-26xx or later model
    • HD: 450 GB or larger
    • RAM: 128 GB or more
    • GPU: 2 x NVIDIA K40m or compatible
    • NICs:
      • 2 x 1Gb
      • Mellanox ConnectX-4 dual-port (MCX456A-ECAT) NIC

    Note: Intel Virtualization technologies must be enabled in BIOS.

    Component: Cloud Storage server (iSER)
    Quantity: 1
    Requirements: a strong server with high IO performance to act as the iSER backend, featuring:

    • CPU: Intel E5-26xx or later model
    • HD:
      • OS: 100 GB or larger
      • Cinder volume: SSD drives configured in RAID-10 for best performance
    • RAM: 64 GB or more
    • NICs:
      • 2 x 1Gb
      • Mellanox ConnectX-4 dual-port (MCX456A-ECAT) NIC

    Component: Private, Storage and Management networks switch
    Quantity: 2
    Requirements: Mellanox SN2700 100Gb/s switch

    Component: Local network switch
    Quantity: 1
    Requirements: 1Gb L2 switch

    Component: External network switch
    Quantity: 1
    Requirements: 1Gb L2 switch

    Component: Cables
    Quantity: 24 x 1Gb CAT-6e cables for the Local, External and IPMI networks; 14 x 100Gb EDR copper cables up to 2m (MCP1600-CXXX)

     

    Server configuration

    There are several prerequisites for the cloud hardware to work.

    Go to the BIOS of each node and make sure that:

    • CPU virtualization is enabled on all nodes, including the Deployment server (Intel VT or AMD-V, depending on your CPU type); a quick check is shown below
    • All nodes except the Deployment server are configured to boot from the NIC connected to the Local network
    • The path between the Local and IPMI networks is routable. This is required to allow the Undercloud to access the cloud servers' IPMI interfaces.
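
    Once an operating system is running on a node, a quick generic Linux check confirms that CPU virtualization is enabled (this is not specific to this deployment):

    # egrep -c '(vmx|svm)' /proc/cpuinfo

    A result of 0 means VT-x/AMD-V is disabled in BIOS; any non-zero count means it is exposed to the OS.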

     

    Physical network setup

    1. Connect all nodes to the Local 1GbE switch, preferably through the eth0 interface on board.
    2. Connect all nodes to the External (Public) 1GbE switch, preferably through the eth1 interface on board.
    3. Connect both ports of the ConnectX-4 NIC of all nodes (except the Deployment server) to both Mellanox SN2700 switches.
      Warning: The interface names (eth0, eth1, p2p1, etc.) should be identical on all cloud nodes.
      We recommend using the same server model for all nodes, as interface names can vary between models.

      Rack Setup Example


      Figure 2: Rack setup wiring example

    Deployment server

    Figure 3: Deployment server wiring example

     

    Compute and Controller nodes

    Figure 4: Compute / Controller node wiring example

     

    Storage node

    The Storage node is connected in the same way as the Compute and Controller nodes.

    Figure 5: Storage node wiring example

     

    Network Switch Configuration

    Note: Refer to the MLNX-OS User Manual to become familiar with switch software (located at support.mellanox.com).
    Note: Before starting to use the Mellanox switch, we recommend that you upgrade the switch to the latest MLNX-OS version.

    1. Log in to the switch via SSH using its IP address
      # ssh admin@switch_ip
      The default password is admin
    2. Enable configuration mode
      switch > enable
      switch # configure terminal
      switch (config) #
    3. Enable flow control support (required for RDMA).
      On the Storage switch, run the following:
      switch [standalone: master] (config) # dcb priority-flow-control enable force
    4. Configure the VLAN range for Cloud tenant networks on the Tenant switch
      switch (config) # vlan 1-4000
      switch (config vlan 1-4000) # exit
      switch (config) # interface ethernet 1/1 switchport mode hybrid
      switch (config) # interface ethernet 1/1 switchport hybrid allowed-vlan all
      switch (config) # interface ethernet 1/2 switchport mode hybrid
      switch (config) # interface ethernet 1/2 switchport hybrid allowed-vlan all
      ...

      switch (config) # interface ethernet 1/32 switchport mode hybrid
      switch (config) # interface ethernet 1/32 switchport hybrid allowed-vlan all
    5. Enable flow control on the Storage switch:
      Flow control is required when running RoCE (RDMA over Converged Ethernet).

      Run the following commands on the Storage switch to enable flow control (on all ports in this example):

      switch (config) # interface ethernet 1/1-1/32 flowcontrol receive on force
      switch (config) # interface ethernet 1/1-1/32 flowcontrol send on force

      To save settings, run on all Mellanox switches:
      switch (config) # write memory

     

    Networks Allocation (Example)

    The example in this post is based on the network allocation defined in this table:

    Table 4: Network allocation example

    Network: Local network
    Subnet/Mask: 172.21.1.0/24
    Gateway: N/A
    Notes: This network is used to provision and manage Cloud nodes by the Undercloud. The network is enclosed within a 1Gb switch and should have routing to the IPMI network.

    Network: Storage
    Subnet/Mask: 172.18.0.0/24
    Gateway: N/A
    Notes: This network is used to provide storage services.

    Network: External (Public)
    Subnet/Mask: 10.7.208.0/24
    Gateway: 10.7.208.1
    Notes: This network is used to connect Cloud nodes to an external network. Neutron L3 is used to provide Floating IPs for tenant VMs. Both the Public and Floating ranges are IP ranges within the same subnet, with routing to external networks.

    All Cloud nodes will have a Public IP address. In addition, you shall allocate two more Public IP addresses:
    • One IP is required for HA functionality
    • The virtual router requires an additional Public IP address
    We do not use a virtual router in our deployment, but we still need to reserve a Public IP address for it. The Public network range is therefore the number of cloud nodes + 2. In our example with 8 Cloud nodes, we need 10 IPs in the Public network range.

    Note: Consider a larger range if you plan to add more servers to the cloud later.

    In our build we use the 10.7.208.125 >> 10.7.208.148 IP range for both the Public and Floating ranges.

    IP allocation will be as follows:

    • Deployment server: 10.7.208.125
    • RedHat Director VM: 10.7.208.126
    • Public Range: 10.7.208.127 >> 10.7.208.136 (7 used for physical servers, one reserved for HA and one reserved for virtual router)
    • Floating Range: 10.7.208.137 >> 10.7.208.148 (used for the Floating IP pool)

     

    The scheme below illustrates our setup Public IP allocation.

    Figure 6: Public range IP allocation

     

    Environment preparation

    Install the Deployment server

    In our setup we install the CentOS 7.3 64-bit distribution. We use the CentOS-7-x86_64-Minimal-1511.iso image and install the minimal configuration. We will install any missing packages later.

    Two 1Gb interfaces are connected to the Local and External networks (a sample ifcfg sketch follows this list):

    • em1 (first interface of the integrated NIC) is connected to Local and configured statically. The configuration will not actually be used, but it will save time on bridge creation later.
        • IP: 172.21.1.1
        • Netmask: 255.255.255.0
        • Gateway: N/A
    • em2 (second interface of integrated NIC) is connected to External and configured statically:
        • IP: 10.7.208.125
        • Netmask: 255.255.255.0
        • Gateway: 10.7.208.1
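
    The static addressing above can be captured in standard CentOS 7 ifcfg files; a minimal sketch for em2 is shown below, assuming the interface names em1/em2 used in this example (em1 is configured the same way with the Local IP and no gateway):

    # cat /etc/sysconfig/network-scripts/ifcfg-em2
    DEVICE=em2
    BOOTPROTO=static
    ONBOOT=yes
    IPADDR=10.7.208.125
    NETMASK=255.255.255.0
    GATEWAY=10.7.208.1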

     

    Configure the Deployment server to run the RedHat Director VM

    Log in to the Deployment server via SSH or locally and perform the actions listed below:

    1. Disable the Network Manager.
      # sudo systemctl stop NetworkManager.service
      # sudo systemctl disable NetworkManager.service
    2. Install packages required for virtualization.
      # sudo yum install qemu-kvm libvirt libvirt-python libguestfs-tools virt-install virt-manager
    3. Install packages required for x-server.
      # sudo yum install xorg-x11-xauth xorg-x11-server-utils xclock xorg-x11-fonts-*
    4. Reboot the Deployment server.
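
    Depending on the package set, the libvirt daemon may not be started automatically; if virt-manager cannot connect after the reboot, start and enable it manually:

    # sudo systemctl start libvirtd
    # sudo systemctl enable libvirtd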

     

    Create and configure the new VM for RedHat Director

    1. Start the virt-manager.
      # virt-manager
    2. Create a new RedHat VM using the virt-manager wizard (or from the command line; see the virt-install sketch after this list).
    3. During the creation process, provide the VM with at least 8 cores, 16 GB of RAM and a 40 GB disk.
    4. Mount the RedHat installation disk to the VM's virtual CD-ROM device.
    5. Configure the network so the RedHat VM has two NICs connected to the Local and External networks.
      1. Use virt-manager to create bridges:
        1. br-em1, the Local network bridge used to connect RedHat VM's eth0 to Local network.
        2. br-em2, the External network bridge used to connect RedHat VM's eth1 to External network.
      2. Connect the RedHat VM eth0 interface to br-em1.
      3. Add the eth1 network interface to the RedHat VM and connect it to br-em2.

        Figure 7: Bridges configuration schema

        Note: You can use any other names for the bridges. In this example, the names were chosen to match the names of the physical interfaces of the Deployment server.
    6. Save settings
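
    If you prefer the command line to the virt-manager wizard, a similar VM can be created with virt-install. The sketch below is only an illustration of the settings above (8 cores, 16 GB RAM, 40 GB disk, bridges br-em1/br-em2); the VM name and the ISO path are placeholders you must replace:

    # virt-install --name rhosp-director \
        --vcpus 8 --ram 16384 \
        --disk size=40 \
        --cdrom /path/to/rhel-server-7.3-x86_64-dvd.iso \
        --network bridge=br-em1 \
        --network bridge=br-em2 \
        --os-variant rhel7 \
        --graphics vnc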

    RedHat VM installation

    1. Download the RHEL 7.3 installation ISO
    2. Install the OS using the installation wizard in the MINIMAL configuration
    3. During the OS installation, assign External (Public) and Local IPs to the VM interfaces:
      • eth0 assigned with Local IP 172.21.1.2 (mask 255.255.255.0)
      • eth1 assigned with External IP 10.7.208.126 (mask 255.255.255.0)

    Prepare VM for installing Undercloud

    1. Creating a stack user
      # useradd stack
      # passwd stack
      # echo "stack ALL=(root) NOPASSWD:ALL" | tee -a /etc/sudoers.d/stack
      # chmod 0440 /etc/sudoers.d/stack
      # su - stack
    2. Create stack user home dir to simplify undercloud installation:
      1. Creating new home directory
        $ sudo mkdir /deploy_directory
        $ sudo chown stack:stack /deploy_directory
      2. Changing the home user directory from /home/stack to new path:
        $ sudo sed -i 's/\/home\/stack/\/deploy_directory/' /etc/passwd
      3. Copy the bash profile scripts from the old home directory.
        $ cp /home/stack/.bash* /deploy_directory/
    3. Setting hostname for the system
      $ sudo hostnamectl set-hostname <hostname.hostdomain>
      $ sudo hostnamectl set-hostname --transient <hostname.hostdomain>
    4. Edit the “/etc/hosts” file:
      127.0.0.1   hostname.hostdomain hostname localhost localhost.localdomain localhost4 localhost4.localdomain4
    5. Registering the system with Red Hat using a valid Red Hat account
      $ subscription-manager register --username RHEL_USER --password RHEL_PASS --auto-attach
    6. Enabling required packages
      $ sudo subscription-manager repos --disable=*
      $ sudo subscription-manager repos --enable=rhel-7-server-rpms --enable=rhel-7-server-extras-rpms --enable=rhel-7-server-rh-common-rpms --enable=rhel-ha-for-rhel-7-server-rpms --enable=rhel-7-server-openstack-11-rpms
      $ sudo yum update -y
    7. Reboot the system

     

    Attention: after the reboot, all subsequent steps should be performed as the stack user.

     

    Cloud deployment

    Installing the Undercloud

    1. Log in to the RedHat VM as root and become the stack user
      # su - stack
    2. Creating directories for the images and templates
      $ mkdir ~/images
      $ mkdir ~/templates
    3. Installing the Director packages
      $ sudo yum install -y python-tripleoclient
    4. Create Undercloud configuration file from template

      $ cp /usr/share/instack-undercloud/undercloud.conf.sample ~/undercloud.conf
    5. Edit the parameters according to example below
      $ vi ~/undercloud.conf
      For the minimal configuration of undercloud we can leave all the commented lines as is and just add the following.
      See example below:

      [DEFAULT]
      undercloud_public_host=10.7.208.126
      enable_ui=true
      generate_service_certificate=true
      local_interface=eth0

    6. Installing the undercloud
      $ openstack undercloud install
      Import parameters and variables to use the Undercloud command line tools
      $ source ~/stackrc

    Preparing and configuring the Overcloud installation.

    1. Obtaining the overcloud images and uploading them
      $ sudo yum install rhosp-director-images rhosp-director-images-ipa
      $ cd ~/images
      $ tar -xvf /usr/share/rhosp-director-images/overcloud-full-latest-11.0.tar
      $ tar -xvf /usr/share/rhosp-director-images/ironic-python-agent-latest-11.0.tar
      $ openstack overcloud image upload --image-path ~/images/
    2. Setting a nameserver on the undercloud neutron subnet
      $ neutron subnet-update `neutron subnet-list | grep '|'| tail -1 |  cut -d'|' -f2` --dns-nameserver 8.8.8.8
    3. Obtaining  the basic TripleO heat template
      $ cp -r /usr/share/openstack-tripleo-heat-templates ~/templates
    4. Registering ironic nodes
      Prepare the JSON file ~/instackenv.json to look like the example below and include all cloud nodes.
      Assign the desired role to each server by setting the proper PROFILE value.
      The profiles used in this deployment are: control, compute and block-storage.

      {
          "nodes":[
              {
                  "mac":[
                  ],
                  "pm_type":"pxe_ipmitool",
                  "pm_user":"ADMIN",
                  "pm_password":"ADMIN",
                  "pm_addr":"192.0.2.205",
                  "capabilities": "profile:PROFILE,boot_option:local"
              },
              {
                  "mac":[
                  ],
                  "pm_type":"pxe_ipmitool",
                  "pm_user":"ADMIN",
                  "pm_password":"ADMIN",
                  "pm_addr":"192.0.2.206",
                  "capabilities": "profile:PROFILE,boot_option:local"
              }
          ]
      }
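      Because a missing comma or quotation mark in this file will make the import fail, it is worth validating the JSON syntax before importing it, for example with the json.tool module that ships with Python:

      $ python -m json.tool ~/instackenv.json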


    5. Import the JSON file into the Director and start the introspection process
      This process powers the servers on, inspects the hardware and powers them off again.
      It can take 10-30 minutes depending on the number of nodes and their performance.

      $ openstack baremetal import --json ~/instackenv.json

      $ openstack baremetal configure boot

      $ openstack baremetal introspection bulk start

    6. To see the list of provisioned nodes, use the command:

      $ openstack baremetal node list


      Figure 8: List of nodes, ready for deployment
      If the state of the nodes is the same as in the screenshot, they are ready to proceed

      • Power state should be Power off

      • Provisioning state should be available
      • Maintenance should be false
    7. Check which ID belongs to which server:
      1. Identify the nodes' UUIDs by their IPMI IP addresses. To do that, run the script:

        $ for id in `openstack baremetal node list|grep None|cut -d "|" -f 2`; do ip=`openstack baremetal node show $id |grep addr|cut -d ":" -f 3|cut -d "'" -f 2`; echo $id " " $ip; echo " "; done

        Figure 9: List of node's IDs and IPs
      2. Validate that the profile assignment completed normally
        $ openstack overcloud profiles list
      3. If you want to change the assignment manually, use the command below:
        $ openstack baremetal node set --property capabilities='profile:PROFILE,boot_option:local' 7dcf575e-b18c-4bab-b259-fb936b3fcfae
        Profiles to be used in our deployment: control, compute, block-storage
      4. Define the physical disk for OS installation
        $ sudo yum install crudini
        $ mkdir ~/swift-data
        $ cd ~/swift-data
        $ export SWIFT_PASSWORD=`sudo crudini --get /etc/ironic-inspector/inspector.conf swift password`
        $ for node in $(ironic node-list | grep -v UUID| awk '{print $2}'); do swift -U service:ironic -K $SWIFT_PASSWORD download ironic-inspector inspector_data-$node; done
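        The downloaded inspector_data-<UUID> files are JSON documents describing each node's hardware. One way to review the reported disks and, if necessary, pin the root disk is sketched below; this assumes the jq utility is installed, that the disk list sits under .inventory.disks (as in the OSP 11 inspector data we have seen), and uses the documented root_device hint property. Replace <UUID> and <DISK_SERIAL> with your own values:

        $ sudo yum install -y jq
        $ jq '.inventory.disks' inspector_data-<UUID>
        $ openstack baremetal node set --property root_device='{"serial": "<DISK_SERIAL>"}' <UUID>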
    8. Download and Extract Mellanox Plugin
      Download the plugin from http://www.mellanox.com/downloads/Software/mellanox-rhel-osp-3.1.0.tar.gz and save it to the home folder
      $ cd ~
      $ wget http://www.mellanox.com/downloads/Software/mellanox-rhel-osp-3.1.0.tar.gz
      Unpack the downloaded package into the stack user's home directory
      $ tar -zxvf mellanox-rhel-osp-3.1.0.tar.gz -C ~/
      The package will be extracted into the mellanox-rhel-osp folder under the home directory.
    9. Check the available interfaces and their drivers on all registered ironic nodes
      As a result, the ~/mellanox-rhel-osp/deployment_sources/environment_conf/interfaces_drivers.yaml file will be created with the network interface details
      $ cd ~/mellanox-rhel-osp/deployment_sources/
      $ ./summarize_interfaces.py --path_to_swift_data ~/swift-data/
      This step summarizes the interface names and their drivers. It is also required for interface renaming in case the nodes are not homogeneous.
    10. Edit the config.yaml file (~/mellanox-rhel-osp/deployment_sources/environment_conf/config.yaml) with the parameters and flavors required to create the environment template file. This is the main file to configure before deploying TripleO over Mellanox NICs. After this file is configured, the basic TripleO configuration file will be generated in the same directory.
      You can use this file or further customize it before starting the deployment.
      Note: Make sure to use the correct interface name for the iSER configuration. Consult the ~/mellanox-rhel-osp/deployment_sources/environment_conf/interfaces_drivers.yaml file.
      $ vi ~/mellanox-rhel-osp/deployment_sources/environment_conf/config.yaml
      Sriov:
          sriov_enable: true
          number_of_vfs:
          sriov_interface: 'ens2f1'             # The name of ConnectX-4 port for SR-IOV

      Neutron:
          physnet_vlan: 'default_physnet'
          physnet_vlan_range: '2:4000'          # VLAN range will be used for Tenant networks
          network_type: 'vlan'

      Storage:
          backend: cinder_lvm                   # Choose Cinder as storage backend
          rdma_enable: true                     # Enable usage of RDMA

          storage_network_port: 'ens2f0'        # The name of ConnectX-4 port for Storage network

      Bond:
          bond_enable: false

      Dpdk:
          dpdk_enable: false
          dpdk_interface: 'ens2f1'

      MofedInstallation: false

      interfacesValidation: true
    11. Generate environment template file to be used in the overcloud deployment:
      $ python ~/mellanox-rhel-osp/deployment_sources/create_conf_template.py

      This script uses the interfaces_drivers.yaml file (generated in step 9) and config.yaml to generate the following file:

      ~/mellanox-rhel-osp/deployment_sources/environment_conf/env.yaml

      which contains the flavors of the Overcloud deployment.

    12. Generate network file from template ~/mellanox-rhel-osp/deployment_sources/network/network_env-template.yaml
      $ python ~/mellanox-rhel-osp/deployment_sources/create_network_conf.py
    13. Modify the ~/mellanox-rhel-osp/deployment_sources/network/network_env.yaml file to match your environment's network settings and any new interface names from the interfaces_drivers yaml file, if added.
      It tells the deployment where to find all the configuration files required to proceed.

      Adjust the values in the example below to match your environment; you may need to change other parameters as well.

      $ vi ~/mellanox-rhel-osp/deployment_sources/network/network_env.yaml
      parameter_defaults:
        ControlPlaneDefaultRoute: 192.168.24.1
        ControlPlaneSubnetCidr: '24'
        DnsServers:
        - 8.8.8.8
        - 8.8.4.4
        EC2MetadataIp: 192.168.24.1
        ExternalAllocationPools:
        - end: 10.7.208.64
          start: 10.7.208.55
        ExternalInterface: eno2
        ExternalInterfaceDefaultRoute: 10.7.208.1
        ExternalNetCidr: 10.7.208.0/24
        NeutronExternalNetworkBridge: br-ex
        NtpServer: time.nist.gov
        ProvisioningInterface: eno1
        StorageAllocationPools:
        - end: 172.18.0.200
          start: 172.18.0.10
        StorageInterface: ens2f0
        StorageNetCidr: 172.18.0.0/24
        TenantAllocationPools:
        - end: 172.16.0.200
          start: 172.16.0.10
        TenantInterface: ens2f1
        TenantNetCidr: 172.16.0.0/24
    14. Enable node registration with Red Hat
      # vi ~/mellanox-rhel-osp/deployment_sources/environment_conf/environment-rhel-registration.yaml
      Fill in the registration fields (for example rhel_reg_activation_key and rhel_reg_org) with the registration data obtained from Red Hat
      parameter_defaults:
        rhel_reg_activation_key: ACTIVATION_KEY
        rhel_reg_auto_attach: ""
        rhel_reg_base_url: ""
        rhel_reg_environment: ""
        rhel_reg_force: ""
        rhel_reg_machine_name: ""
        rhel_reg_org: ORGANIZATION
        rhel_reg_password: ""
        rhel_reg_pool_id: ""
        rhel_reg_release: ""
        rhel_reg_repos: rhel-7-server-rpms rhel-7-server-extras-rpms rhel-7-server-rh-common-rpms rhel-ha-for-rhel-7-server-rpms rhel-7-server-openstack-11-rpms
        rhel_reg_sat_url: ""
        rhel_reg_server_url: ""
        rhel_reg_service_level: ""
        rhel_reg_user: ""
        rhel_reg_type: ""
        rhel_reg_method: "portal"
        rhel_reg_sat_repo: ""
    15. Prepare the overcloud image from the folder ~/mellanox-rhel-osp/deployment_sources
      [stack@vm deployment_sources]$ python prepare_overcloud_image.py
      --iso_ofed_path ISO_OFED_PATH
      --overcloud_images_path OVERCLOUD_IMAGES_PATH
      --mlnx_package_path MLNX_PACKAGE_PATH
      --root_password ROOT_PASSWORD

      Example of the command:

      $ ./prepare_overcloud_image.py  --overcloud_images_path ~/images/ --root_password <pass>
    16. After building the overcloud image, introspection should be performed again.
      Execute the commands listed below:
      $ openstack baremetal configure boot ; openstack baremetal introspection bulk start

    Deploying the Overcloud.

    1. Find and edit the deploy.sh file in the folder ~/mellanox-rhel-osp/deployment_sources
      $ cd ~/mellanox-rhel-osp/deployment_sources
      $ vi deploy.sh
    2. Adjust the control-scale, compute-scale and block-storage-scale parameters to match your number of nodes.
      Change the "-e ./environment_conf/env-template.yaml \" line to "-e ./environment_conf/env.yaml \"
      #!/bin/bash
      . ~/stackrc
      rm -fv ~/.ssh/known_hosts*

      # Please run this script inside the package folder
      openstack \
        overcloud deploy \
        --stack ovcloud \
        --control-scale         3 \
        --compute-scale         2 \
        --block-storage-scale   1 \
        --control-flavor control \
        --compute-flavor compute \
        --block-storage-flavor block-storage \
        --ntp-server time.nist.gov \
        --templates    \
        -e environment_conf/environment-rhel-registration.yaml \
        -e ./environment_conf/env.yaml \
        -e ./network/network_env.yaml \
        -e ./environment_conf/multiple_backends.yaml \
      $@

      Warning: Keep in mind that if you comment out any line in this file, all uncommented lines below it will be ignored. Keep any commented lines at the bottom of the file.

    3. Run the script:
      $ ./deploy.sh
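
      The overcloud deployment can run for a long time. From a second shell on the Director VM you can follow its progress with the standard Heat stack commands (the stack name ovcloud is set in deploy.sh):

      $ source ~/stackrc
      $ openstack stack list
      $ openstack stack resource list ovcloud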

     

    Post-deployment network configuration

    OpenStack is now deployed and fully operational, but no virtual networks have been created yet.

    You can create the networks manually from Horizon or use the CLI steps below:

     

    For the Horizon option, log in to Horizon as the admin user with the password stored in the OS_PASSWORD variable of the /deploy_directory/mellanox-rhel-osp/deployment_sources/ovcloudrc file on the Director VM:

    $ grep OS_PASSWORD  /deploy_directory/mellanox-rhel-osp/deployment_sources/ovcloudrc  | cut -d"=" -f 2

     

    For the CLI option, log in to the Director VM as the stack user and perform the steps below:

    1. Define the environment variables

      Note: Do not forget to update these variables with values appropriate for your network

      $ source ~/mellanox-rhel-osp/deployment_sources/ovcloudrc
      $ TENANT="admin"
      $ TENANT_ID=$(openstack project list | awk "/\ $TENANT\ / { print \$2 }")
      $ TENANT_NET_CIDR="192.168.1.0/24"
      $ TENANT_NET_GW="192.168.1.1"
      $ TENANT_NET_CIDR_EXT="10.7.208.0/24"
      $ IP_EXT_FIRST=10.7.208.136
      $ IP_EXT_LAST=10.7.208.148
    2. Create the networks

      1. Create internal tenant network

        $ openstack network create "$TENANT-net" --provider-network-type vlan

        $ TENANT_NET_ID=$(openstack network list | grep "$TENANT"|grep -v -i ext|awk '{print $2}')

      2. Create tenant network for Floating IPs
        $ openstack network create "$TENANT-net-ext" --provider-physical-network datacentre --provider-network-type flat --external
        $ TENANT_NET_EXT_ID=$(openstack network list | grep "$TENANT"| grep -i ext| awk '{print $2}')
    3. Create the subnet and get the ID

      1. Create internal tenant subnet

        $ openstack subnet create "$TENANT-subnet" --network "$TENANT-net" --subnet-range $TENANT_NET_CIDR
        $ TENANT_SUBNET_ID=$(openstack subnet list|grep -i $TENANT |grep -iv ext |cut -d "|" -f 2)

      2. Create tenant subnet for Floating IPs

        $ openstack subnet create "$TENANT-subnet-ext" --network "$TENANT-net-ext" --allocation-pool start=$IP_EXT_FIRST,end=$IP_EXT_LAST --subnet-range $TENANT_NET_CIDR_EXT

        $ TENANT_SUBNET_EXT_ID=$(openstack subnet list|grep $TENANT |grep ext |cut -d "|" -f 2)
    4. Create an HA Router and get the ID

      $ openstack router create "$TENANT-extnet"

      $ ROUTER_ID=$(openstack router list |grep $TENANT| cut -d "|" -f 2)

    5. Set the gw for the new router
      $ openstack router set $ROUTER_ID --external-gateway $TENANT_NET_EXT_ID
    6. Add a new interface in the main router
      $ openstack router add subnet $ROUTER_ID $TENANT_SUBNET_ID

    To automate the process, you can create a script, make it executable and run it:

    1. Create a script
      $ vi ~/create_networks.sh
    2. Copy-paste the commands below into that script

      #!/bin/bash
      source ~/mellanox-rhel-osp/deployment_sources/ovcloudrc

       

      TENANT="admin"
      TENANT_ID=$(openstack project list | awk "/\ $TENANT\ / { print \$2 }")
      TENANT_NET_CIDR="192.168.1.0/24"
      TENANT_NET_GW="192.168.1.1"
      TENANT_NET_CIDR_EXT="10.7.208.0/24"
      IP_EXT_FIRST=10.7.208.136
      IP_EXT_LAST=10.7.208.148

       

      # Create the networks

      # Create internal tenant network

      openstack network create "$TENANT-net" --provider-network-type vlan

      TENANT_NET_ID=$(openstack network list | grep "$TENANT"|grep -v -i ext|awk '{print $2}')

       

      # Create tenant network for Floating IPs

      openstack network create "$TENANT-net-ext" --provider-physical-network datacentre --provider-network-type flat --external

      TENANT_NET_EXT_ID=$(openstack network list | grep "$TENANT"| grep -i ext| awk '{print $2}')

       

      # Create the subnet and get the ID

      # Create internal tenant subnet

      openstack subnet create "$TENANT-subnet" --network "$TENANT-net" --subnet-range $TENANT_NET_CIDR

      TENANT_SUBNET_ID=$(openstack subnet list|grep -i $TENANT |grep -iv ext |cut -d "|" -f 2)

       

      # Create tenant subnet for Floating IPs

      openstack subnet create "$TENANT-subnet-ext" --network "$TENANT-net-ext" --allocation-pool start=$IP_EXT_FIRST,end=$IP_EXT_LAST --subnet-range $TENANT_NET_CIDR_EXT

      TENANT_SUBNET_EXT_ID=$(openstack subnet list|grep $TENANT |grep ext |cut -d "|" -f 2)

       

      # Create a Router and get the ID

      openstack router create "$TENANT-extnet"

      ROUTER_ID=$(openstack router list |grep $TENANT| cut -d "|" -f 2)

       

      # Set the gw for the new router

      openstack router set $ROUTER_ID --external-gateway $TENANT_NET_EXT_ID

       

      # Add a new interface in the main router

      openstack router add subnet $ROUTER_ID $TENANT_SUBNET_ID

    3. Make it executable

      $ sudo chmod +x ~/create_networks.sh
    4. Run it
      $ sh ~/create_networks.sh
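
      After the script finishes, you can verify the result with the overcloud credentials (assuming TENANT="admin" as in the script, so the router is named admin-extnet):

      $ source ~/mellanox-rhel-osp/deployment_sources/ovcloudrc
      $ openstack network list
      $ openstack subnet list
      $ openstack router show admin-extnet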

     

    Configuring GPU PassThrough

    Since TensorFlow uses GPU devices, we need to provide the VMs with access to GPU resources.

    By default, a VM does not have access to the hypervisor's hardware resources. To provide such access, we will use the PCI passthrough feature.

    For that we need to configure several components: the compute nodes' kernel parameters in the GRUB configuration, and the Nova configuration on the controller and compute nodes.

    Figure 10: Logical schema of TensorFlow VMs

     

    Configuring Compute Nodes

      1. On each compute node, check the device vendor and product IDs
        Log in to the compute node and run:
        # lspci -nn | grep -i nvidia
        03:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL [Tesla K40m] [10de:1023] (rev a1)

        83:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL [Tesla K40m] [10de:1023] (rev a1)
        The vendor ID is 10de and the product ID is 1023
      2. Validate which driver is used for GPU

        # lspci -nnk -d 10de:1023 |grep -A 1 -i nvidia
        03:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL [Tesla K40m] [10de:1023] (rev a1)

            Subsystem: NVIDIA Corporation 12GB Computational Accelerator [10de:097e]
            Kernel driver in use: nouveau
        83:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL [Tesla K40m] [10de:1023] (rev a1)
            Subsystem: NVIDIA Corporation 12GB Computational Accelerator [10de:097e]
            Kernel driver in use: nouveau

        The driver in use is nouveau
      3. As the nouveau driver cannot be used for passthrough, we need to use vfio-pci instead.
        For that we need to add the respective parameters to the kernel boot command line.

        # vi /etc/default/grub
        Find the GRUB_CMDLINE_LINUX line and add the following to the end of it:
        rd.driver.blacklist=nouveau pci-stub.ids=10de:1023
        As a result, the line should look like this:
        GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet intel_iommu=on default_hugepagesz=1GB hugepagesz=1G hugepages=12 rd.driver.blacklist=nouveau pci-stub.ids=10de:1023"
      4. Update existing grub configuration and reboot
        # sudo grub2-mkconfig -o /boot/grub2/grub.cfg
        # reboot
      5. Make sure that the proper driver is in use after the node has restarted
        # lspci -nnk -d 10de:1023 |grep -A 1 -i nvidia
        03:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL [Tesla K40m] [10de:1023] (rev a1)
            Subsystem: NVIDIA Corporation 12GB Computational Accelerator [10de:097e]
            Kernel driver in use: vfio-pci
        83:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL [Tesla K40m] [10de:1023] (rev a1)
            Subsystem: NVIDIA Corporation 12GB Computational Accelerator [10de:097e]
            Kernel driver in use: vfio-pci

        Now the proper driver, vfio-pci, is in use

      6. After the proper driver is configured, it is time to configure Nova

        On each compute node, edit nova.conf:
        # vi /etc/nova/nova.conf

        Find parameter

        passthrough_whitelist=[{"devname": "ens13f1", "physical_network": "default"}]

        Update it according to this:

        passthrough_whitelist=[{"devname": "ens13f1", "physical_network": "default"}, {"vendor_id": "10de", "product_id": "1023" }]

        Save

      7. Restart nova-compute service
        # systemctl restart openstack-nova-compute.service

     

    Configuring Controller nodes

      1. First we need to create an alias that will be used in the flavor.
        On each controller node edit nova.conf
        # vi /etc/nova/nova.conf
        Add parameter:
        alias = { "vendor_id":"10de", "product_id":"1023", "device_type":"type-PCI", "name":"nvidia" }
      2. Now we need to configure Nova Scheduler
        Find parameters available_filters and enabled_filters
        Make them look like this:
        available_filters=nova.scheduler.filters.all_filters
        available_filters=nova.scheduler.filters.pci_passthrough_filter.PciPassthroughFilter
        enabled_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter
        Now Nova Scheduler will know that there are devices available for PCI passthrough.
      3. Now all Nova services should be restarted
        # for i in `systemctl -a | grep -v inactive |grep nova|awk {'print $1'}`; do service $i restart; echo $i; done

     

    VM image preparation

    To prepare an Ubuntu 16.04 image, please refer to the OpenStack Virtual Machine Image Guide

    1. Bring the VM image into OpenStack
      1. Copy the prepared ubuntu.qcow2 image to the Director node using SCP.
        On the KVM node run:
        # scp ./ubuntu.qcow2 root@10.7.208.126:~/
        Don't forget to replace the Director's IP with your own.
      2. Upload image to the glance. On Director Node run:
        # su - stack
        # source ~/mellanox-rhel-osp/deployment_sources/ovcloudrc
        # sudo mv /root/ubuntu.qcow2 ~/
        # glance image-create --name ubuntu --disk-format=qcow2 --container-format=bare --file=~/ubuntu.qcow2
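        To confirm that the image was uploaded and is in the active state, list the images:
        # glance image-list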

     

    Preparing VMs to run TensorFlow

    To build and run TensorFlow, you will need two Worker VMs, which require SR-IOV and PCI passthrough

    1. Create flavor with PCI PassThrough enabled for worker VMs
      # openstack flavor create --vcpus 24 --ram 65536 --disk 25 --property "pci_passthrough:alias"="nvidia:2" m1.worker
    2. Check the ID of the tenant network created earlier and assign it to a variable
      # net_id=$(openstack network list|grep adm|grep -iv ext|cut -d "|" -f 2); echo $net_id
    3. Create a new direct port in that network and check its ID
      # port_id=`neutron port-create $net_id --name sriov_port --binding:vnic_type direct | grep "\ id\ " | awk '{ print $4 }'`; echo $port_id
    4. Bring up the Worker VM using the port ID just created
      # nova boot --flavor m1.worker --image ubuntu --nic port-id=$port_id WORKER-1
    5. Check that SR-IOV and PCI PassThrough work
      # lspci | egrep -i 'nvidia|mellanox'
      00:04.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4 Virtual Function]
      00:05.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
      00:06.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
    6. Repeat steps 3-5 to run and validate the second VM (WORKER-2)

     

    Now your OpenStack deployment is complete and ready to run TensorFlow.

    To configure and run TensorFlow on the newly created VMs, please refer to the Reference Deployment Guide for RDMA over Ethernet (RoCE) accelerated TensorFlow with an NVIDIA GPU Card over Mellanox 100 GbE Network.

     

    TensorFlow Benchmarks on Bare metal servers vs. Virtual Machines

    Here we provide our RDMA over Ethernet (RoCE) accelerated TensorFlow performance benchmark results for InceptionV3, ResNet-50, ResNet-152 and VGG 16 on bare metal servers and virtual machines.

    The benchmarks were run using both real and synthetic data. Testing with synthetic data was done using a tf.Variable set to the same shape as the data expected by each model for ImageNet.

    We start with synthetic data to remove disk I/O as a variable and to set a baseline. Real data is then used to verify that the TensorFlow input pipeline and the underlying disk I/O are saturating the compute units.

    The server hardware and configuration used for the bare metal and virtual machine benchmarks are identical.

     

    Details for our benchmarks

    Environment

    • Instance type: See setup overview
    • GPU: 2x NVIDIA® Tesla® K40
    • OS: Ubuntu 16.04 LTS
    • CUDA / cuDNN: 8.0 / 6.0
    • TensorFlow GitHub : r1.3 GA
    • Build Command: bazel build -c opt --copt=-march="broadwell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
    • Disk: Local NVMe
    • DataSet: ImageNet 2012
    • Test Date: Oct 2017

     

    The batch size and optimizer used for the tests are listed in the table.

     

    Option                InceptionV3  ResNet-50  ResNet-152  VGG 16
    Batch size per GPU    64           64         32          32
    Optimizer             sgd          sgd        sgd         sgd

     

    Configuration used for each model.

     

    Model        variable_update         local_parameter_device  cross_replica_sync
    InceptionV3  distributed_replicated  n/a                     True
    ResNet-50    distributed_replicated  n/a                     True
    ResNet-152   distributed_replicated  n/a                     True
    VGG 16       distributed_replicated  n/a                     True

     

    The server setup used for these runs included 4 worker servers and is described in the setup overview part of this document.
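
    For reference, the distributed runs follow the standard tf_cnn_benchmarks workflow from the TensorFlow benchmarks repository (r1.3 branch). The command below is a hypothetical single-worker invocation sketch built from the options in the tables above; the host names, ports, task index and data path are placeholders you must adapt to your own environment:

    $ python tf_cnn_benchmarks.py \
        --model=resnet50 \
        --batch_size=64 \
        --num_gpus=2 \
        --variable_update=distributed_replicated \
        --cross_replica_sync=True \
        --server_protocol=grpc+verbs \
        --job_name=worker --task_index=0 \
        --ps_hosts=<PS1>:2222,<PS2>:2222 \
        --worker_hosts=<WORKER1>:2223,<WORKER2>:2223 \
        --data_name=imagenet --data_dir=/path/to/imagenet

    Omitting --data_dir runs the synthetic-data variant referred to above, and --server_protocol=grpc+verbs is what enables the RDMA (RoCE) transport.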

     

    Results

    Training synthetic data

    Model        Batch size  Bare metal (images/sec)  Virtual Machine (images/sec)
    InceptionV3  64          124.70                   124.68
    ResNet-50    64          213.89                   231.89
    ResNet-152   32          76.96                    76.95
    VGG 16       32          74.40                    83.41

     

     

    Known issues

    The following are the known limitations of the RHEL-OSP 11 plugin:

    Table 5: Known issues

    #: 1
    Issue: The deploy.sh example script generates the password files in the deploy directory.
    Workaround: Use your own deployment command or the generated example from the deploy directory.