How to create a Linux Container (LXD) with RDMA-accelerated applications over a 100Gb InfiniBand Network

Version 11

    In this document we demonstrate a deployment procedure for RDMA-accelerated applications running in Linux containers (LXD) over a Mellanox end-to-end 100 Gb/s InfiniBand (IB) solution.

    This document describes the process of building an LXD container on physical servers running Ubuntu 16.04.2 LTS and LXD 2.16.

    We will show how to update and install the Mellanox software and firmware components on the host and in the LXD container.


    Setup Overview

     

    Equipment

     

     

    Server Logical Design

     

    Server Wiring

     

    In our reference setup we wire the 1st port to the InfiniBand switch and do not use the 2nd port.

     

     

     


    Network Configuration

    We will use two servers in our setup.

    Each server is connected to the SB7700 switch by a 100Gb IB copper cable. The switch port connectivity in our case is as follows:

    • 1st and 2nd ports – connected to the host servers

    Server names and network configuration are provided below:

    Server type    Server name    Internal network      External network
    Server 01      clx-mld-41     ib0: 12.12.12.41      eno1: from DHCP (reserved)
    Server 02      clx-mld-42     ib0: 12.12.12.42      eno1: from DHCP (reserved)


    Deployment Guide


    Prerequisites


    Update Ubuntu Software Packages

    To update/upgrade Ubuntu software packages, run the commands below.

    $ sudo apt-get update            # Fetches the list of available updates
    $ sudo apt-get upgrade -y        # Strictly upgrades the current packages

     


    Enable the Subnet Manager (SM) on the IB Switch

     

    Refer to the MLNX-OS User Manual to become familiar with the switch software (located at support.mellanox.com).
    Before starting to use the Mellanox switch, we recommend that you upgrade the switch to the latest MLNX-OS version.

    There are three options to select the best place to locate the SM:

    1. Enabling the SM on one of the managed switches. This is a very convenient and quick operation that makes InfiniBand ‘plug & play’.
    2. Running /etc/init.d/opensmd on one or more servers. It is recommended to run the SM on a server when there are 648 nodes or more.
    3. Using a dedicated server running the Unified Fabric Manager (UFM®) Appliance. UFM offers much more than the SM. UFM needs more compute power than the existing switches have, but does not require an expensive server; it does, however, represent an additional cost for the dedicated server.

    We'll explain options 1 and 2 only.

    Option 1: Configuring the SM on a switch running MLNX-OS® (applies to all Mellanox switch systems).
    To enable the SM on one of the managed switches, follow these steps.

    1. Log in to the switch and enter config mode:
      Mellanox MLNX-OS Switch Management

      switch login: admin
      Password:
      Last login: Wed Aug 12 23:39:01 on ttyS0

      Mellanox Switch

      switch [standalone: master] > enable
      switch [standalone: master] # conf t
      switch [standalone: master] (config)#
    2. Run the command:
      switch [standalone: master] (config)#ib sm
      switch [standalone: master] (config)#
    3. Check if the SM is running. Run:

      switch [standalone: master] (config)#show ib sm
      enable
      switch [standalone: master] (config)#

    To save the configuration (permanently), run:

    switch (config) # configuration write

     

     

    Option 2: Configuring the SM on a Server (skip this procedure if you enabled the SM on the switch)

    To start up OpenSM on a server, simply run opensm from the command line on your management node by typing:

    # opensm

    Or:

    Start OpenSM automatically on the head node by editing the /etc/opensm/opensm.conf file.

    Create a configuration file by running:

    # opensm --create-config /etc/opensm/opensm.conf

    Edit the /etc/opensm/opensm.conf file and set the following line:

    onboot=yes

    Upon initial installation, OpenSM is configured and running with a default routing algorithm. When running a multi-tier fat-tree cluster, it is recommended to change the following options to create the most efficient routing algorithm delivering the highest performance:

    --routing_engine=updn

    For full details on other configurable attributes of OpenSM, see the “OpenSM – Subnet Manager” chapter of the Mellanox OFED for Linux User Manual.
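    For example, when launching the SM manually, the routing engine can be selected on the command line (a minimal sketch; the same setting can also be placed in /etc/opensm/opensm.conf as routing_engine updn):

    # opensm --routing_engine=updn

    Once an SM is active anywhere in the fabric (switch or server), it can be verified from any host with MLNX_OFED installed by querying the subnet manager:

    # sminfo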

     

    Installing Mellanox OFED for Ubuntu on a Host

    This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with a Mellanox ConnectX®-5 adapter card installed. For more information, see the Mellanox OFED for Linux User Manual.

     

    Downloading Mellanox OFED

    1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed.
      # lspci -v | grep Mellanox
      The following example shows a system with an installed Mellanox HCA (a sample output is shown after this list).
    2. Download the ISO image that matches your OS to your host.
      The image’s name has the format
      MLNX_OFED_LINUX-<ver>-<OS label><CPUarch>.iso. You can download it from:
      http://www.mellanox.com > Products > Software > InfiniBand/VPI Drivers > Mellanox OFED Linux (MLNX_OFED) > Download.

    3. Use the MD5SUM utility to confirm the downloaded file’s integrity. Run the following command and compare the result to the value provided on the download page.

       

      # md5sum MLNX_OFED_LINUX-<ver>-<OS label>.tgz
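    For reference (step 1 above), on our systems the lspci check returns output similar to the lines below; slot numbers and device IDs may differ depending on your adapter:

    # lspci -v | grep Mellanox
    05:00.0 Infiniband controller: Mellanox Technologies Device 1019
    05:00.1 Infiniband controller: Mellanox Technologies Device 1019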

       

    Installing Mellanox OFED

    MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:

    • Discovers the currently installed kernel
    • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
    • Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
    • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

    The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

    1. Log into the installation machine as root.
    2. Copy the downloaded tgz file to /tmp and extract it:
      # cd /tmp
      # tar -xzvf MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64.tgz
      # cd MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64/
    3. Run the installation script.
      # ./mlnxofedinstall
    4. After the installation finishes successfully, restart the driver and reboot:

      # /etc/init.d/openibd restart

      # reboot

      By default, both ConnectX®-5 VPI ports are initialized as InfiniBand ports.

    5. Disable the unused 2nd port on the device (optional).
      Identify the PCI IDs of your NIC ports:

      # lspci | grep Mellanox

      05:00.0 Infiniband controller: Mellanox Technologies Device 1019

      05:00.1 Infiniband controller: Mellanox Technologies Device 1019

      Disable the 2nd port:
      # echo 0000:05:00.1 > /sys/bus/pci/drivers/mlx5_core/unbind
    6. Check that the ports' mode is InfiniBand:
      # ibv_devinfo

    7. If ibv_devinfo reports the ports with link_layer: Ethernet (rather than InfiniBand), you need to change the interface port type to InfiniBand.
      ConnectX®-5 ports can be individually configured to work as InfiniBand or Ethernet ports.
      Change the mode to InfiniBand using the mlxconfig tool after the driver is loaded.
      * LINK_TYPE_P1=1 is InfiniBand mode
      a. Start mst and list the device names:
      # mst start
      # mst status

      b. Change the port mode to InfiniBand:

      # mlxconfig -d /dev/mst/mt4121_pciconf0 s LINK_TYPE_P1=1
      Port 1 set to IB mode
      # reboot

      After each reboot, you need to disable the 2nd port again (see step 5).
      c. Query the InfiniBand devices and print the information available from userspace:

       

      # ibv_devinfo

       

    8. Run the ibdev2netdev utility to see all the associations between the network devices and the IB devices/ports, then assign an IP address to ib0:

      # ibdev2netdev

      # ifconfig ib0 12.12.12.41 netmask 255.255.255.0

    9. Add the lines below to the /etc/network/interfaces file, after the following existing lines:

      # vim /etc/network/interfaces

      auto eno1

      iface eno1 inet dhcp

      The new lines:
      auto ib0
      iface ib0 inet static
      address 12.12.12.41
      netmask 255.255.255.0
      Example:
      # vim /etc/network/interfaces

      auto eno1
      iface eno1 inet dhcp

      auto ib0
      iface ib0 inet static
      address 12.12.12.41
      netmask 255.255.255.0
    10. Check that the network configuration is set correctly (a short verification sketch follows this list).
      # ifconfig -a
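    As mentioned in step 10, the host-side IB setup can be verified end to end with a few standard MLNX_OFED utilities (a hedged sketch; 12.12.12.42 is the ib0 address of the second server from the table above, so run the ping from clx-mld-41):

    # ibstat                    # port state should be Active and link layer InfiniBand
    # ibdev2netdev              # confirms which mlx5 device backs ib0
    # ping -c 3 12.12.12.42     # reach the other server over the IPoIB interface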

       

     

    Installing and Configuring LXD

    Installing LXD

    To install LXD (current version 2.16), we recommend using the official Ubuntu PPA (Personal Package Archive):

    $ sudo apt-add-repository ppa:ubuntu-lxc/stable

    $ sudo apt update

    $ sudo apt dist-upgrade

    $ sudo apt install lxd

    Configuring LXD

    To configure storage and networking, go through the whole LXD step-by-step setup with:

    $ sudo lxd init

    Here is an example execution of the “init” command. In this example we configure the installation with the default "dir" storage backend and a “lxdbr0” bridge as a convenience.

    This bridge comes unconfigured by default, offering only IPv6 link-local connectivity through an HTTP proxy.

    ZFS is warmly recommended as the storage backend, as it supports all the features LXD needs to offer the fastest and most reliable container experience.
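    If you would rather use the ZFS backend than "dir", install the ZFS userspace tools first so that lxd init offers it as a storage option (a hedged sketch for Ubuntu 16.04):

    $ sudo apt install -y zfsutils-linux
    $ sudo lxd init              # choose "zfs" as the storage backend when prompted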

    Do you want to configure a new storage pool (yes/no) [default=yes]? Enter

    Name of the new storage pool [default=default]: Enter

    Name of the storage backend to use (dir, btrfs, lvm) [default=dir]: Enter

    Would you like LXD to be available over the network (yes/no) [default=no]? Enter

    Would you like stale cached images to be updated automatically (yes/no) [default=yes]? Enter

    Would you like to create a new network bridge (yes/no) [default=yes]? Enter

    What should the new bridge be called [default=lxdbr0]? Enter

    What IPv4 address should be used (CIDR subnet notation, "auto" or "none") [default=auto]? Enter

    What IPv6 address should be used (CIDR subnet notation, "auto" or "none") [default=auto]? none

                   

    LXD has been successfully configured.

    You can then look at the “lxdbr0” bridge config with:

    $ lxc network show lxdbr0

    Its output is shown below.

    config:

      ipv4.address: 10.141.11.1/24

      ipv4.nat: "true"

      ipv6.address: none

    description: ""

    name: lxdbr0

    type: bridge

    Preparing Container's Network

    Create a /etc/dnsmasq.conf.lab file:

    $ vim /etc/dnsmasq.conf.lab

    and add these lines:

    domain=lab-ml.cloudx.mlnx

    # verbose

    log-queries

    log-dhcp

    dhcp-option=6,8.8.8.8

    Run the following commands to change the IPv4 network and add the dnsmasq.conf.lab configuration:

    $ lxc network set lxdbr0 ipv4.address 10.10.41.1/24                                                         

    $ lxc network set lxdbr0 raw.dnsmasq "conf-file=/etc/dnsmasq.conf.lab" 

    and look at the “lxdbr0” bridge config with:

    $ lxc network show lxdbr0

    Its output is shown below.

    config:

      ipv4.address: 10.10.41.1/24

      ipv4.nat: "true"

      ipv6.address: none

      raw.dnsmasq: conf-file=/etc/dnsmasq.conf.lab

    description: ""

    name: lxdbr0

    type: bridge

    Changing the LXD service configuration for containers' static MAC and IP addresses (Optional)

    Run this procedure on each host.

    Edit the lxd service file:

    $ vim /lib/systemd/system/lxd.service

    Add the following ExecStartPost line to the [Service] section, as shown in the excerpt below. The example uses the c41 name prefix, the 00:16:3e:41:01 MAC prefix, and the 10.10.41.0/24 subnet for host clx-mld-41; change these accordingly on the other hosts:

    [Service]

     

    EnvironmentFile=-/etc/environment

    ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load

    ExecStart=/usr/bin/lxd --group lxd --logfile=/var/log/lxd/lxd.log

    ExecStartPost=/usr/bin/lxd waitready --timeout=600

    ExecStartPost=/bin/bash -c 'rm -f /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts && for i in {2..254}; do echo "00:16:3e:41:01:$(printf '%02x' $i),10.10.41.$i,c41$i" >> /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts ; done'
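    Editing /lib/systemd/system/lxd.service directly can be overwritten by package upgrades. As an alternative, the same ExecStartPost line can live in a systemd drop-in (the service status output below shows such a drop-in directory); a hedged sketch:

    $ sudo systemctl edit lxd

    In the editor, add a [Service] section containing the same ExecStartPost line as above, then continue with the daemon-reload and restart commands below.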

     

    Restart the lxd service:

    $ systemctl daemon-reload

    $ killall -SIGHUP dnsmasq

    $ service lxd restart

    $ service lxd status

     

    Check /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts file:

    $ cat /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts

    00:16:3e:41:01:02,10.10.41.2,c412
    00:16:3e:41:01:03,10.10.41.3,c413

    00:16:3e:41:01:04,10.10.41.4,c414

    00:16:3e:41:01:05,10.10.41.5,c415

    00:16:3e:41:01:06,10.10.41.6,c416  

    ...                                            

    If you don't see these entries, restart the lxd service and check again:

    $ service lxd restart

     

    Check LXD service status:

    $ service lxd status

    lxd.service - LXD - main daemon

       Loaded: loaded (/lib/systemd/system/lxd.service; indirect; vendor preset: enabled)

      Drop-In: /etc/systemd/system/lxd.service.d

               override.conf

       Active: active (running) since Thu 2017-08-10 14:57:33 IDT; 3min 38s ago

         Docs: man:lxd(1)

      Process: 6406 ExecStartPost=/bin/bash -c rm -f /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts && for i in {2..254};

      Process: 6326 ExecStartPost=/usr/bin/lxd waitready --timeout=600 (code=exited, status=0/SUCCESS)

      Process: 6314 ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load (code=exited, status=0/SUCCESS)

    Main PID: 6325 (lxd)

       

       Memory: 10.1M

          CPU: 324ms

       CGroup: /system.slice/lxd.service

               6325 /usr/bin/lxd --group lxd --logfile=/var/log/lxd/lxd.log

               6391 dnsmasq --strict-order --bind-interfaces --pid-file=/var/lib/lxd/networks/lxdbr0/dnsmasq.pid --e

     

    Aug 10 14:57:33 clx-mld-41 dnsmasq[6391]: using local addresses only for domain lxd

    Aug 10 14:57:33 clx-mld-41 dnsmasq[6391]: reading /etc/resolv.conf

    Aug 10 14:57:33 clx-mld-41 dnsmasq[6391]: using local addresses only for domain lxd

    Aug 10 14:57:33 clx-mld-41 dnsmasq[6391]: using nameserver 10.141.119.41#53

    Aug 10 14:57:33 clx-mld-41 dnsmasq[6391]: using nameserver 8.8.8.8#53

    Aug 10 14:57:33 clx-mld-41 dnsmasq[6391]: read /etc/hosts - 5 addresses

    Aug 10 14:57:33 clx-mld-41 dnsmasq-dhcp[6391]: read /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts

    Aug 10 14:57:33 clx-mld-41 dnsmasq[6391]: read /etc/hosts - 5 addresses

    Aug 10 14:57:33 clx-mld-41 dnsmasq-dhcp[6391]: read /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts

    Aug 10 14:57:33 clx-mld-41 systemd[1]: Started LXD - main daemon.

    Add static routes on each host to the other hosts' container subnets. For example, to reach the container subnet behind clx-mld-42 (10.10.42.0/24, via its ib0 address 12.12.12.42), run:

    $ sudo route add -net 10.10.42.0/24 gw 12.12.12.42

    $ sudo route

    Kernel IP routing table

    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface

    10.10.42.0      12.12.12.42    255.255.255.0   UG   0      0        0 ib1

    10.10.41.0      *              255.255.255.0   U    0      0        0 lxdbr0

    10.141.119.0    *              255.255.255.0   U    0      0        0 enp129s0f0

    12.12.12.0      *              255.255.255.0   U    0      0        0 ib1
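    The route added above does not survive a reboot. To make it persistent with the ifupdown configuration used earlier in this document, a post-up line can be appended to the ib0 stanza in /etc/network/interfaces (a hedged sketch for the host that needs to reach 10.10.42.0/24):

    auto ib0
    iface ib0 inet static
    address 12.12.12.41
    netmask 255.255.255.0
    post-up route add -net 10.10.42.0/24 gw 12.12.12.42 || true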

    Preparing the LXD Container

    By default, LXD creates unprivileged containers. This means that root in the container is a non-root UID on the host. It is privileged against the resources owned by the container, but unprivileged with respect to the host, making root in a container roughly equivalent to an unprivileged user on the host. (The main exception is the increased attack surface exposed through the system call interface)

    Briefly, in an unprivileged container, 65536 UIDs are 'shifted' into the container. For instance, UID 0 in the container may be 100000 on the host, UID 1 in the container is 100001, etc., up to 165535. The starting values for UIDs and GIDs, respectively, are determined by the 'root' entries in the /etc/subuid and /etc/subgid files.
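    For example, the default 'root' entries on an Ubuntu host with LXD installed typically look like the following (the exact ranges may differ on your system):

    $ grep root /etc/subuid /etc/subgid
    /etc/subuid:root:100000:65536
    /etc/subgid:root:100000:65536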

    We need to request that the container run without a UID mapping by setting the security.privileged flag to true (change it in the default profile):

    $ lxc profile set default security.privileged true

    Note however that in this case the root user in the container is the root user on the host.

     

    Running verbs and RDMA-based applications in a container requires access to the host OS's InfiniBand devices (the uverbs interface). This access can be granted to a container by running the following command (it changes the default profile):

    $ lxc profile device add default uverbs1 unix-char source=/dev/infiniband/uverbs1

    A host's InfiniBand devices can be seen by checking the contents of the /dev/infiniband/ folder.

    $ sudo  ls /dev/infiniband

    issm0 issm1 rdma_cm ucm0 ucm1 umad0 umad1 uverbs0 uverbs1

    $ sudo ibdev2netdev

    mlx5_0 port 1 ==> enp5s0f0 (Down)

    mlx5_1 port 1 ==> ib0 (Up)

    In our example, there are two mlx5 devices on the host, resulting in two ucm, umad, and uverbs interfaces in /dev/infiniband. At runtime, you choose which devices are exposed to which running containers. For our example, when running a single container, you may choose to expose only the second InfiniBand device (uverbs1) to the running container.
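    If your workload also needs the RDMA connection manager or MAD services (for example, rdma_cm-based applications or diagnostics such as ibstat), the corresponding device nodes can be added to the profile in the same way; a hedged example, not used in the profile shown below:

    $ lxc profile device add default rdma_cm unix-char source=/dev/infiniband/rdma_cm
    $ lxc profile device add default umad1 unix-char source=/dev/infiniband/umad1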

    To show default profile run:

    $ lxc profile show default

    You should see output similar to the following:

    config:

      environment.http_proxy: ""

      security.privileged: "true"

      user.network_mode: ""

    description: Default LXD profile

    devices:

      eth0:

        nictype: bridged

        parent: lxdbr0

        type: nic

      root:

        path: /

        pool: default

        type: disk

      uverbs1:

        source: /dev/infiniband/uverbs1

        type: unix-char

    name: default

    Creating a New Container

    The syntax to create a container from an image is:

    lxc init images:{distro}/{version}/{arch} {container-name-here}
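    For example, to create a container named test01 (a hypothetical name) from the public image server, you could run:

    $ lxc init images:ubuntu/xenial/amd64 test01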

    To create an Ubuntu 16.04 container, use the following command:

    $ lxc init ubuntu:16.04 c412

     

    Set a static MAC address for the container (matching the dnsmasq.hosts entry created earlier):

    $ lxc config set c412 volatile.eth0.hwaddr "00:16:3e:41:01:02"

     

    This will create a new Ubuntu 16.04 container, as can be confirmed with:

    $ lxc list

    To push the installation file to the container, use:

    $ lxc file push MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64.tgz c412/tmp/

    Another option is file sharing: mount a shared directory into the container to access the installer and example files.

    $ lxc config device add c412 installs disk source=/root/installs path=/root/installs

    Start the container:

    $ lxc start c412
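    After starting the container, you can confirm that it received the expected static lease (10.10.41.2, per the dnsmasq.hosts mapping created earlier):

    $ lxc list c412              # the IPv4 column should show 10.10.41.2 (eth0)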

     

    To log in and gain shell access in the container c412, enter:

    $ lxc exec c412 -- bash

    Setting Up the Container

     

    Update Ubuntu Software Packages

    To update/upgrade Ubuntu software packages, run the commands below.

    $ sudo apt-get update            # Fetches the list of available updates
    $ sudo apt-get upgrade -y        # Strictly upgrades the current packages

     

    Installing Mellanox OFED in a Container

    Verify that the system has a Mellanox network adapter (HCA/NIC) installed.

    # apt-get install pciutils
    # lspci -v | grep Mellanox

     

    Installing Mellanox OFED

    MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:

    • Discovers the currently installed kernel
    • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
    • Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
    • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

    The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

    1. Install required packages:
      # apt-get install -y net-tools ethtool perl lsb-release iproute2
    2. Log into the container as root, then extract the package that was pushed to /tmp earlier:

      # cd /tmp

      # tar -xzvf MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64.tgz

      # cd MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64/

    3. Run the installation script.

      # ./mlnxofedinstall --user-space-only --without-fw-update -q

    4. Check the MLNX_OFED version and the uverbs device:

      # ofed_info -s

      MLNX_OFED_LINUX-4.1-1.0.2.0:

      # ls /dev/infiniband/

      uverbs1

    5. Run a bandwidth stress test over IB from the container:

    Server

    ib_write_bw -a -d mlx5_1 &

    Client

    ib_write_bw -a -F $Server_IP -d mlx5_1 --report_gbits

    In this way, you can run bandwidth stress tests over IB between containers.
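    A latency test can be run the same way (a hedged example; ib_send_lat is part of the same perftest package installed with MLNX_OFED):

    Server

    ib_send_lat -d mlx5_1

    Client

    ib_send_lat -F $Server_IP -d mlx5_1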

    Done!