Docker RoCE SRIOV Networking using OVS with ConnectX4/ConnectX5

Version 20

    This post shows how to use Mellanox ConnectX-4/ConnectX-5 InfiniBand or RoCE HCAs with Docker Containers using OVS.

     

     

    Overview

     

    RDMA support for Docker containers is provided through virtual RDMA devices (vHCAs), implemented using the SR-IOV capability of the Mellanox ConnectX-4/ConnectX-5 HCAs.

    Virtual networking devices are made available through an IPoIB or Ethernet SR-IOV HCA and are created from SR-IOV VFs. Each such IPoIB or Ethernet networking device, together with its vHCA, is provisioned for a container using the SRIOV networking plugin and the Docker runtime tool docker_rdma_sriov.

     

    Configuration and setup involve the following steps.

    1. Enable Virtualization (SR-IOV) in the BIOS (prerequisites)
    2. Enable SR-IOV in the HCA
    3. Install SRIOV plugin
    4. Install docker_rdma_sriov tool
    5. Start SRIOV network plugin
    6. OVS configuration
    7. Perform SR-IOV configuration
    8. Netdevice OVS configuration for a VF
    9. Perform tc rule configuration
    10. Create one or more tenant networks
    11. Start Containers
    12. Run a container in DPDK mode
    13. View RDMA statistics of a container

     

     

    Prerequisites

    Mellanox OFED 4.4 or higher must be installed on the host.
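
    The version prerequisite can be checked with a small helper. This is a minimal sketch, assuming the installed version is reported by ofed_info -s in the form MLNX_OFED_LINUX-4.4-1.0.0.0: (the helper name ofed_version_ok is hypothetical and not part of MLNX_OFED).

```shell
# Succeeds (exit 0) when the given version string reports MLNX_OFED 4.4 or
# newer, fails with 1 when it is older, and with 2 when it cannot be parsed.
# Assumed input form (as printed by "ofed_info -s"): MLNX_OFED_LINUX-4.4-1.0.0.0:
ofed_version_ok() (
    ver=${1#MLNX_OFED_LINUX-}      # strip the product prefix
    major=${ver%%.*}               # text before the first dot
    rest=${ver#*.}                 # text after the first dot
    minor=${rest%%[!0-9]*}         # leading digits of that text
    case "$major" in ''|*[!0-9]*) exit 2 ;; esac
    case "$minor" in '') exit 2 ;; esac
    [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 4 ]; }
)

# Example use on a host (not executed here):
#   ofed_version_ok "$(ofed_info -s)" || echo "MLNX_OFED 4.4 or higher is required"
```

    Only the major and minor numbers are compared, since the prerequisite is stated as "4.4 or higher".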

    Configuration

    1.  Enable Virtualization (SR-IOV) in the BIOS (prerequisites)

         Make sure that SR-IOV is enabled in the BIOS of the specific server. Each server has different BIOS configuration options for virtualization. See HowTo Set Dell PowerEdge R730 BIOS parameters to support SR-IOV for BIOS configuration examples.

     

    2.  Enable SR-IOV in HCA

    (a) Start the MST service (part of the MFT package)

    # mst start

    Starting MST (Mellanox Software Tools) driver set

    Loading MST PCI module - Success

    Loading MST PCI configuration module - Success

    Create devices

    (b) Locate the HCA device on the desired PCI slot.

    # mst status

    MST modules:

    ------------

        MST PCI module loaded

        MST PCI configuration module loaded

    MST devices:

    ------------

    /dev/mst/mt4115_pciconf0         - PCI configuration cycles access.

                                       domain:bus:dev.fn=0000:05:00.0 addr.reg=88 data.reg=92

                                       Chip revision is: 00
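
    When a host has several adapters, the device path and its PCI address can be pulled out of the mst status output with a short filter. The sketch below assumes the output layout shown above; the function name mst_devices is hypothetical.

```shell
# Reads "mst status" output on stdin and prints one line per MST device:
# the device path followed by the PCI address (domain:bus:dev.fn) that
# appears on a later line of the same entry.
mst_devices() {
    awk '
        $1 ~ /^\/dev\/mst\// { dev = $1 }
        dev != "" && match($0, /dev\.fn=[0-9a-fA-F:.]+/) {
            # skip the 7 characters of "dev.fn=" and keep the address
            print dev, substr($0, RSTART + 7, RLENGTH - 7)
            dev = ""
        }
    '
}

# Example use on a host (not executed here):
#   mst status | mst_devices
```

    For the output shown above this yields the pair /dev/mst/mt4115_pciconf0 and 0000:05:00.0, which are the values used in the following steps.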

    (c) Query the Status of the device

    #  mlxconfig -d /dev/mst/mt4115_pciconf0 q

     

    Device #1:

    ----------

    Device type:    ConnectX4      

    PCI device:     /dev/mst/mt4115_pciconf0

    Configurations:                              Current

             SRIOV_EN                            0              

             NUM_OF_VFS                          0              

             LINK_TYPE_P1                        2              

             LINK_TYPE_P2                        2              

    ...
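
    The query output can also be checked from a script instead of by eye. The following is a minimal sketch, assuming the numeric two-column layout shown above (setting name, then current value); the function names mlxconfig_current and sriov_configured are hypothetical.

```shell
# Reads "mlxconfig -d <device> q" output on stdin and prints the current
# value of the setting named in $1. Fails if the setting is not listed.
mlxconfig_current() {
    awk -v key="$1" '$1 == key { print $2; found = 1; exit } END { exit !found }'
}

# Succeeds when the query output on stdin shows SRIOV_EN set to 1 and
# NUM_OF_VFS of at least $1. Assumes numeric values as in the output above.
sriov_configured() (
    want=$1
    out=$(cat)
    en=$(printf '%s\n' "$out" | mlxconfig_current SRIOV_EN) || exit 1
    n=$(printf '%s\n' "$out" | mlxconfig_current NUM_OF_VFS) || exit 1
    [ "$en" = 1 ] && [ "$n" -ge "$want" ]
)

# Example use on a host (not executed here):
#   mlxconfig -d /dev/mst/mt4115_pciconf0 q | sriov_configured 4 \
#       || echo "SR-IOV is not yet enabled in firmware"
```

    With the output shown above (SRIOV_EN 0, NUM_OF_VFS 0) the check fails, which indicates that SR-IOV still has to be enabled in the HCA.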

    (d) Enable SR-IOV and set the desired number of VFs.

    • SRIOV_EN=1
    • NUM_OF_VFS=4  ; This is an example with 4 VFs

    # mlxconfig -d /dev/mst/mt4115_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=4

     

    Device #1:

    ----------

    Device type:    ConnectX4      

    PCI device:     /dev/mst/mt4115_pciconf0

    Configurations:                              Current         New

             SRIOV_EN                            0               1              

             NUM_OF_VFS                          0               4              

             LINK_TYPE_P1                        2               2              

             LINK_TYPE_P2                        2               2              

    ...

    Apply new Configuration? ? (y/n) [n] : y

    Applying... Done!

    -I- Please reboot machine to load new configurations.

    #  mlxconfig -d /dev/mst/mt4115_pciconf0 q

    Note: At this point, the VFs are not yet visible via lspci. They appear only after SR-IOV is enabled in the MLNX_OFED driver.

     

    (e) Reboot the server.
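
    After the reboot, the VF counts of the PF can be read from the standard PCI SR-IOV sysfs attributes sriov_numvfs and sriov_totalvfs. This is a minimal sketch; the function name vf_counts and the SYSFS_ROOT override (used only to make the helper testable) are hypothetical.

```shell
# Prints "<enabled>/<total>" VF counts for the PF netdevice named in $1.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
vf_counts() (
    dev_dir="${SYSFS_ROOT:-/sys}/class/net/$1/device"
    if [ ! -r "$dev_dir/sriov_totalvfs" ]; then
        echo "$1: no SR-IOV attributes found" >&2
        exit 1
    fi
    total=$(cat "$dev_dir/sriov_totalvfs")   # VFs the device can expose
    num=$(cat "$dev_dir/sriov_numvfs")       # VFs currently enabled in the driver
    echo "$num/$total"
)

# Example use on a host (not executed here):
#   vf_counts ens2f0
```

    The enabled count is expected to stay at 0 until SR-IOV is enabled in the driver in a later step, which matches the note above about VFs not yet being visible via lspci.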

     

    3. Install SR-IOV plugin

    # docker pull rdma/sriov-plugin

     

    4. Install docker_rdma_sriov tool

    The current Linux kernel does not provide sufficient isolation of RDMA devices.

    This simple tool is a wrapper around the docker run command. Together with sriov-plugin, it gives each container a dedicated RDMA device and networking device.

    # docker pull rdma/container_tools_installer
    # docker run --net=host -v /usr/bin:/tmp rdma/container_tools_installer

     

    5. Start SR-IOV networking plugin

    # docker run -v /run/docker/plugins:/run/docker/plugins -v /etc/docker:/etc/docker -v /var/run:/var/run --net=host --privileged rdma/sriov-plugin

     

    6. OVS configuration (persistent across reboots)

    # systemctl enable openvswitch

    # systemctl start openvswitch

    # ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

    # ovs-vsctl add-br ovs-sriov

    # ovs-vsctl add-port ovs-sriov ens2f0

    6.1 TC configuration for PF netdevice

    # tc qdisc add dev ens2f0 ingress
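
    The commands of this step can be collected into one function so that the PF name is given only once. This is a minimal sketch: the run helper, the DRY_RUN switch and the function name configure_ovs_offload are hypothetical, and --may-exist is added to the ovs-vsctl calls so that repeating them does not fail when the bridge or port already exists.

```shell
# Runs a command, or only prints it when DRY_RUN=1 (hypothetical switch
# that allows the sequence to be reviewed before it is applied).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Applies the OVS and PF qdisc configuration of this step.
#   $1 = PF netdevice, $2 = bridge name (default: ovs-sriov)
configure_ovs_offload() (
    pf=$1
    bridge=${2:-ovs-sriov}
    run systemctl enable openvswitch &&
    run systemctl start openvswitch &&
    run ovs-vsctl set Open_vSwitch . other_config:hw-offload=true &&
    run ovs-vsctl --may-exist add-br "$bridge" &&
    run ovs-vsctl --may-exist add-port "$bridge" "$pf" &&
    run tc qdisc add dev "$pf" ingress
)

# Example use on a host (not executed here):
#   DRY_RUN=1; configure_ovs_offload ens2f0    # print the commands only
```

    The sequence stops at the first failing command. Note that tc qdisc add reports an error if the ingress qdisc already exists on the PF.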

     

    7. Perform SR-IOV configuration on every system (not persistent across reboots).

    Here ens2f0 is the PF netdevice, which must be configured using the following sequence.

     

    (a) SRIOV and switchdev mode configuration

    # docker_rdma_sriov sriov enable -n ens2f0

    # docker_rdma_sriov sriov unbind -n ens2f0

    # docker_rdma_sriov devlink set -n ens2f0 -m switchdev

    (b) Bind VFs of a PF

    # docker_rdma_sriov sriov bind -n ens2f0
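
    Because this configuration is lost on reboot, it can help to generate the sequence from the PF name and replay it, for example from a boot script. The sketch below only prints the commands listed above, in the same order (enable, unbind, switch to switchdev, bind). The function names are hypothetical, and the eswitch_mode helper assumes that devlink dev eswitch show prints a line of the form "pci/0000:05:00.0: mode switchdev ...".

```shell
# Prints, in order, the SR-IOV and switchdev commands for PF netdevice $1.
sriov_switchdev_cmds() {
    cat <<EOF
docker_rdma_sriov sriov enable -n $1
docker_rdma_sriov sriov unbind -n $1
docker_rdma_sriov devlink set -n $1 -m switchdev
docker_rdma_sriov sriov bind -n $1
EOF
}

# Extracts the eswitch mode from "devlink dev eswitch show pci/<addr>"
# output read on stdin (assumed layout: "<handle>: mode <mode> ...").
eswitch_mode() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "mode") { print $(i + 1); exit } }'
}

# Example use on a host (not executed here):
#   sriov_switchdev_cmds ens2f0 | sh -ex
#   devlink dev eswitch show pci/0000:05:00.0 | eswitch_mode
```

    Piping the generated commands to sh -ex stops at the first failure and echoes each command as it runs.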

     

    The following steps are per-VF configuration and usage steps.

     

    8. Netdevice OVS configuration for one VF (persistent across reboots)

    Add representor netdevice to the OVS bridge.

    # ovs-vsctl add-port ovs-sriov ens2f0_3

     

    9. Perform tc rules configuration for VF 3 using its representor netdevice.

    Ingress rule to receive traffic for the VF.

    # tc qdisc add dev ens2f0_3 ingress

    # vf_mac_addr=$(cat /sys/class/net/<VF_netdev_for_VF_3>/address)

    # tc filter add dev ens2f0 protocol ip parent ffff: flower ip_proto udp dst_port 4791 skip_sw dst_mac $vf_mac_addr action mirred egress redirect dev ens2f0_3

     

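    Egress rule to transmit traffic from the VF to the network. The dst_mac match 00:00:00:00:00:00/01:00:00:00:00:00 tests only the multicast bit of the destination MAC address, so the rule applies to any unicast destination, such as the MAC address of a VF on another host.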

    # tc filter add dev ens2f0_3 protocol ip parent ffff: flower skip_sw ip_proto udp dst_mac 00:00:00:00:00:00/01:00:00:00:00:00 dst_port 4791 action mirred egress redirect dev ens2f0

    Here ens2f0 is a PF netdevice and ens2f0_3 is the representor netdevice for VF 3 (VF indices start at 0).
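
    Since the same rules are needed for every VF in use, they can be generated from the PF name, the VF index and the VF MAC address. This is a minimal sketch that prints the commands of this step without running them. The function name vf_tc_rules is hypothetical, and the representor name is assumed to follow the <PF>_<index> pattern of ens2f0_3, which may differ on other systems. UDP port 4791 is the RoCE v2 destination port.

```shell
ROCE_V2_PORT=4791   # UDP destination port used by RoCE v2

# Prints the ingress qdisc command and both flower rules for one VF.
#   $1 = PF netdevice, $2 = VF index (starting at 0), $3 = VF MAC address
vf_tc_rules() (
    pf=$1
    vf=$2
    mac=$3
    rep="${pf}_${vf}"   # assumed representor naming, as in ens2f0_3
    echo "tc qdisc add dev $rep ingress"
    # Network to VF: match the VF MAC on the PF and redirect to the representor.
    echo "tc filter add dev $pf protocol ip parent ffff: flower ip_proto udp dst_port $ROCE_V2_PORT skip_sw dst_mac $mac action mirred egress redirect dev $rep"
    # VF to network: match any unicast destination and redirect to the PF.
    echo "tc filter add dev $rep protocol ip parent ffff: flower skip_sw ip_proto udp dst_mac 00:00:00:00:00:00/01:00:00:00:00:00 dst_port $ROCE_V2_PORT action mirred egress redirect dev $pf"
)

# Example use on a host (not executed here), with the VF netdevice name
# filled in by the user:
#   vf_tc_rules ens2f0 3 "$vf_mac_addr" | sh -ex
```

    Printing the rules first makes it easy to review them for each VF before applying them.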

     

    10. Create one or more tenant networks on each host (this configuration is persistent across reboots)

    # docker network create -d sriov --subnet=194.168.1.0/24 -o netdevice=ens2f0 mynet

    Here ens2f0 is a PF netdevice. Change it to the correct name for your system configuration. SR-IOV must already be enabled as explained above.

     

    11. Run a container

    The user must pick a free IP address in the subnet, because the sriov plugin is a local plugin that can assign IP addresses only at the per-host level. As a result, two hosts can end up with the same IP address.

    Therefore the user must choose a unique IP address for each VF in the given subnet.

    The user must also pick a free VF that matches the tc rules configured in the steps above.

    # docker pull mellanox/centos_7_2_mofed_4_4_0_1_9_0

    # docker_rdma_sriov run --net=mynet --vf=3 --ip=194.168.1.2 -it mellanox/mlnx_ofed_linux-4.4-1.0.0.0-centos7.4 bash
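
    One way to keep addresses unique across hosts is to derive them from a per-host index and the VF index. The scheme below is purely illustrative and is not something the sriov plugin provides: the function name vf_ip, the host index and the choice to start at .2 (leaving .1 unused) are all assumptions.

```shell
# Prints a unique address inside a /24 for a given host and VF.
#   $1 = first three octets of the subnet, e.g. 194.168.1
#   $2 = host index (starting at 0, assigned by the administrator)
#   $3 = VF index (starting at 0)
#   $4 = number of VFs per host
vf_ip() (
    prefix=$1
    host=$2
    vf=$3
    per_host=$4
    if [ "$vf" -ge "$per_host" ]; then
        echo "VF index $vf is out of range" >&2
        exit 1
    fi
    last=$(( host * per_host + vf + 2 ))   # .0 is the network, .1 is left unused
    if [ "$last" -gt 254 ]; then
        echo "no address left in $prefix.0/24" >&2
        exit 1
    fi
    echo "$prefix.$last"
)

# Example use on a host (not executed here):
#   docker_rdma_sriov run --net=mynet --vf=3 --ip="$(vf_ip 194.168.1 0 3 4)" -it <image> bash
```

    With 4 VFs per host, host 0 uses 194.168.1.2 to 194.168.1.5 and host 1 uses 194.168.1.6 to 194.168.1.9, so no two VFs share an address as long as every host has a distinct index.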

    A sample Dockerfile for building Mellanox OFED on a CentOS or Ubuntu base image can be found in the Mellanox/mofed_dockerfiles repository on GitHub (MOFED Docker files).

     

    12. Run a container in DPDK mode

    # docker pull mellanox/centos_7_2_mofed_4_4_0_1_9_0_dpdk

    # docker_rdma_sriov run --net=mynet --vf=5 --ip=194.168.1.2 --cap-add=NET_ADMIN -it mellanox/mlnx_ofed_linux-4.4-1.0.0.0-centos7.4_dpdk bash

    13. View RDMA statistics of a container

    # docker_rdma_sriov stats CONTAINER_ID