Docker RDMA SR-IOV Networking with ConnectX-4/ConnectX-5


    This post shows how to use Mellanox ConnectX-4/ConnectX-5 InfiniBand or RoCE HCAs with Docker containers.

     

     

    Overview

     

    RDMA support for Docker containers is provided through virtual RDMA devices (vHCAs) implemented using the SR-IOV capability of Mellanox ConnectX-4/ConnectX-5 HCAs.

    Virtual networking devices (IPoIB or Ethernet) are created from SR-IOV VFs of the HCA. Each such IPoIB or Ethernet netdevice and its vHCA is provisioned to a container using the SR-IOV networking plugin and the Docker runtime tool docker_rdma_sriov.

     

    Configuration and setup involve the following steps:

    1. Enable Virtualization (SR-IOV) in the BIOS (prerequisites)
    2. Enable SR-IOV in the HCA
    3. Enable Virtualization in OpenSM
    4. Install the SR-IOV plugin
    5. Install the docker_rdma_sriov tool
    6. Start the SR-IOV network plugin
    7. Create one or more tenant networks
    8. Start containers
    9. Delete networks

     

     

    Prerequisites

    Mellanox OFED 4.4 or higher must be installed on the host.
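
    You can verify the installed version with the ofed_info utility that ships with MLNX_OFED:

    # ofed_info -s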

     

    Configuration

    1.  Enable Virtualization (SR-IOV) in the BIOS (prerequisites)

         Make sure that SR-IOV is enabled in the BIOS of the specific server. Each server has different BIOS configuration options for virtualization. See HowTo Set Dell PowerEdge R730 BIOS parameters to support SR-IOV for BIOS configuration examples.

     

    2.  Enable SR-IOV in the HCA

    (a) Run MFT

    # mst start

    Starting MST (Mellanox Software Tools) driver set

    Loading MST PCI module - Success

    Loading MST PCI configuration module - Success

    Create devices

    (b) Locate the HCA device on the desired PCI slot by listing the MST devices:

    # mst status

    MST modules:

    ------------

        MST PCI module loaded

        MST PCI configuration module loaded

    MST devices:

    ------------

    /dev/mst/mt4115_pciconf0         - PCI configuration cycles access.

                                       domain:bus:dev.fn=0000:05:00.0 addr.reg=88 data.reg=92

                                       Chip revision is: 00

    (c) Query the status of the device

    #  mlxconfig -d /dev/mst/mt4115_pciconf0 q

     

    Device #1:

    ----------

    Device type:    ConnectX4      

    PCI device:     /dev/mst/mt4115_pciconf0

    Configurations:                              Current

             SRIOV_EN                            0              

             NUM_OF_VFS                          0              

             LINK_TYPE_P1                        2              

             LINK_TYPE_P2                        2              

    ...

    (d) Enable SR-IOV and set the desired number of VFs:

    • SRIOV_EN=1
    • NUM_OF_VFS=4 (this example uses 4 VFs)

    # mlxconfig -d /dev/mst/mt4115_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=4

     

    Device #1:

    ----------

    Device type:    ConnectX4      

    PCI device:     /dev/mst/mt4115_pciconf0

    Configurations:                              Current         New

             SRIOV_EN                            0               1              

             NUM_OF_VFS                          0               4              

             LINK_TYPE_P1                        2               2              

             LINK_TYPE_P2                        2               2              

    ...

    Apply new Configuration? ? (y/n) [n] : y

    Applying... Done!

    -I- Please reboot machine to load new configurations.

    Query the device again to confirm the new configuration:

    #  mlxconfig -d /dev/mst/mt4115_pciconf0 q

    Note: At this point, the VFs are not yet visible via lspci. They appear only once SR-IOV is enabled in the MLNX_OFED driver.

     

    (e) Reboot the server.
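
    After the reboot, the VFs can be verified. As the note above says, they appear in lspci only once SR-IOV is enabled in the driver; the sriov plugin does this automatically when a network is created (step 7), but for a quick manual check you can enable the VFs through the standard sysfs interface. A minimal sketch, assuming the PF netdevice is named ens2f0:

    # echo 4 > /sys/class/net/ens2f0/device/sriov_numvfs
    # lspci | grep Mellanox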

     

    3. Ensure that OpenSM is configured for virtualization mode.

    This step is needed only for InfiniBand; it is NOT needed for RoCE.

    (a) Make sure that OpenSM is enabled with virtualization. Open the file /etc/opensm/opensm.conf and add:

    virt_enabled 2

    Note: This is relevant only for the mlx5 driver, not for mlx4 (ConnectX-3/Pro).

    This parameter has the following configuration options:

    • 0: Ignore Virtualization - no virtualization support
    • 1: Disable Virtualization - disable virtualization on all virtualization-supporting ports
    • 2: Enable Virtualization - enable virtualization on all virtualization-supporting ports

    The default is 0 (ignore virtualization).

     

    (b) Restart OpenSM after the above configuration.

    Without this configuration, VF ports will remain in the down state.
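
    For example, on a systemd-based host where MLNX_OFED installs the opensmd service (the service name is an assumption; adjust for your distribution):

    # systemctl restart opensmd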

     

    4. Install or update SR-IOV plugin

    # docker pull rdma/sriov-plugin

     

    5. Install or update docker_rdma_sriov tool

    The current Linux kernel does not provide sufficient isolation of RDMA devices.

    This simple tool is a wrapper around the docker run command that, together with the sriov-plugin, provides each container with a dedicated RDMA device and netdevice.

    # docker pull rdma/container_tools_installer
    # docker run --net=host -v /usr/bin:/tmp rdma/container_tools_installer
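
    The installer writes the tools to the host's /usr/bin directory (mounted as /tmp inside the container), so a quick check that the installation succeeded is:

    # ls -l /usr/bin/docker_rdma_sriov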

     

    6. Start SR-IOV networking plugin

    # docker run -v /run/docker/plugins:/run/docker/plugins -v /etc/docker:/etc/docker -v /var/run:/var/run --net=host --privileged rdma/sriov-plugin
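
    This command runs the plugin in the foreground. To keep it running in the background instead, the standard Docker flags can be added, for example:

    # docker run -d --restart=always -v /run/docker/plugins:/run/docker/plugins -v /etc/docker:/etc/docker -v /var/run:/var/run --net=host --privileged rdma/sriov-plugin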

     

    7. Create one or more tenant networks

    # docker network create -d sriov --subnet=194.168.1.0/24 -o netdevice=ens2f0 mynet

    Here ens2f0 is a PF netdevice; change it to the right name for your system configuration. This command enables SR-IOV for the PF netdevice ens2f0 and performs the necessary configuration for InfiniBand or RoCE, such as VLAN, GUID, MAC address, privilege, and trust mode.
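
    The new network can be confirmed with the standard Docker commands:

    # docker network ls
    # docker network inspect mynet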

     

    8. Run a container

    The user must pick a free IP address in the subnet, because the sriov plugin is a local plugin that assigns IP addresses only at the per-host level. As a result, two hosts could end up with the same IP address.

    Therefore, the user must choose an IP address for each VF that is unique within the given subnet.

    # docker_rdma_sriov run --net=mynet --ip=194.168.1.2 -it mellanox/mlnx_ofed_linux-4.4-1.0.0.0-centos7.4 bash
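
    Inside the container, the provisioned vHCA and netdevice can be checked with the usual tools; assuming the image includes the MLNX_OFED user-space utilities (as the image above does), for example:

    # ibv_devices
    # ip addr show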

    A sample Dockerfile that builds Mellanox OFED on a CentOS or Ubuntu base image can be found in the Git repository below:

    GitHub - Mellanox/mofed_dockerfiles: MOFED Docker files

     

    9. Other network creation options

     

    # docker network create <…>

    Valid options during network creation are:

    (a) netdevice=<netdevice_name>

    This option tells the plugin which PF netdevice to use; the VFs of this PF will be provisioned to containers.

    Example: -o netdevice=eth0

     

    (b) vlan=<vlan_id>

    This option indicates whether all VFs belonging to a given network should have VLAN offload set. When VLAN offload is set, the HCA/NIC performs transparent VLAN insertion and removal on RDMA and non-RDMA packets.

    This allows tenants to be isolated from one another based on the VLAN ID.

    Example: -o vlan=100

     

    (c) privileged=<0|1>

    This option indicates whether all VFs provisioned in this network should be treated as privileged VFs. This is useful for DPDK or nested virtualization applications, where a VF may need to be privileged.

    Example: -o privileged=1

     

     

    (d) mode=sriov

    This indicates that the plugin should work in SR-IOV mode.

    Example: -o mode=sriov
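
    These options can be combined in a single command. For illustration (the subnet and the network name mynet2 are arbitrary example values):

    # docker network create -d sriov --subnet=194.168.2.0/24 -o netdevice=ens2f0 -o vlan=100 -o privileged=1 -o mode=sriov mynet2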

     

    10. RDMA statistics of a container

    # docker_rdma_sriov stats CONTAINER_ID
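
    11. Delete network

    When the containers using it have been stopped, a tenant network can be removed with the standard Docker command (shown for the mynet example created in step 7):

    # docker network rm mynet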