Kubernetes IPoIB/Ethernet RDMA SR-IOV Networking with ConnectX-4/ConnectX-5

Version 17

    This post shows how to use Mellanox ConnectX-4/ConnectX-5 InfiniBand or Ethernet HCAs to do RDMA and IP networking in a Kubernetes cluster.

    You must use Kubernetes version 1.10.3 or higher.

     

    Overview

    RDMA and IP networking in a Kubernetes cluster over an InfiniBand or Ethernet network is achieved by giving each Kubernetes Pod its own dedicated networking device.

    Virtual networking and RDMA devices are made available using an SR-IOV capable HCA, which provides an IPoIB or Ethernet netdevice and a virtual HCA (vHCA). These virtual networking and vHCA devices are created as SR-IOV VFs. Each such IPoIB/Ethernet netdevice and RDMA vHCA pair is provisioned to a Pod by the SR-IOV CNI software.

     

     

    Configuration and setup involve the following steps.

    1. Enable Virtualization (SR-IOV) in the BIOS (prerequisites)
    2. Enable SR-IOV in the HCA
    3. OpenSM virtualization configuration
    4. Check Kubernetes Cluster setup
    5. SR-IOV device plugin configuration and installation
      1. Configure PF netdevice name
      2. Create device plugin configuration
    6. SR-IOV CNI plugin installation and configuration
      1. Install SR-IOV CNI plugin
      2. Configure CNI plugin
    7. Check configuration
      1. Check device plugin configuration
      2. Check SR-IOV CNI installation
    8. Pod Configuration
    9. Debugging issues

     

    Prerequisites

    Mellanox OFED 4.4 or higher must be installed on the host.

     

    Configuration

    1.  Enable Virtualization (SR-IOV) in the BIOS (prerequisites)

         Make sure that SR-IOV is enabled in the BIOS of the specific server. Each server has different BIOS configuration options for virtualization. See HowTo Set Dell PowerEdge R730 BIOS parameters to support SR-IOV for BIOS configuration examples.

     

    2.  Enable SR-IOV in the HCA

    (a) Run MFT

    # mst start

    Starting MST (Mellanox Software Tools) driver set

    Loading MST PCI module - Success

    Loading MST PCI configuration module - Success

    Create devices

    (b) Locate the HCA device on the desired PCI slot by listing the MST devices.

    # mst status

    MST modules:

    ------------

        MST PCI module loaded

        MST PCI configuration module loaded

    MST devices:

    ------------

    /dev/mst/mt4115_pciconf0         - PCI configuration cycles access.

                                       domain:bus:dev.fn=0000:05:00.0 addr.reg=88 data.reg=92

                                       Chip revision is: 00

    (c) Query the status of the device:

    #  mlxconfig -d /dev/mst/mt4115_pciconf0 q

     

    Device #1:

    ----------

    Device type:    ConnectX4      

    PCI device:     /dev/mst/mt4115_pciconf0

    Configurations:                              Current

             SRIOV_EN                            0              

             NUM_OF_VFS                          0              

             LINK_TYPE_P1                        2              

             LINK_TYPE_P2                        2              

    ...

    (d) Enable SR-IOV and set the desired number of VFs.

    • SRIOV_EN=1
    • NUM_OF_VFS=4 (this example uses 4 VFs)

    # mlxconfig -d /dev/mst/mt4115_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=4

     

    Device #1:

    ----------

    Device type:    ConnectX4      

    PCI device:     /dev/mst/mt4115_pciconf0

    Configurations:                              Current         New

             SRIOV_EN                            0               1              

             NUM_OF_VFS                          0               4              

             LINK_TYPE_P1                        2               2              

             LINK_TYPE_P2                        2               2              

    ...

    Apply new Configuration? ? (y/n) [n] : y

    Applying... Done!

    -I- Please reboot machine to load new configurations.

    #  mlxconfig -d /dev/mst/mt4115_pciconf0 q

    Note: At this point, the VFs are not yet visible via lspci. They become visible only after SR-IOV is enabled in the MLNX_OFED driver.

    The Kubernetes SR-IOV RDMA device plugin enables SR-IOV and performs the necessary configuration for each VF. The user must not enable SR-IOV at the driver level.

     

    (e) Reboot the server.
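
    After the reboot, you can confirm that the PF is present and that no VFs are listed yet (15b3 is the Mellanox PCI vendor ID; the exact output varies by system):

    # lspci -d 15b3: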

     

    3. Ensure that OpenSM is configured for virtualization mode.

    (a) Make sure that OpenSM is enabled for virtualization. Open the file /etc/opensm/opensm.conf and add:

    virt_enabled 2

    Note-1: This is relevant only for the mlx5 driver, not for mlx4 (ConnectX-3/Pro).

    Note-2: This is relevant only for InfiniBand. It is not applicable for RoCE/Ethernet.

    This parameter has the following configuration options:

    • 0: Ignore Virtualization - No virtualization support (default)
    • 1: Disable Virtualization - Disable virtualization on all virtualization-supporting ports
    • 2: Enable Virtualization - Enable virtualization on all virtualization-supporting ports

    The default for this parameter is 0 (ignore virtualization).

     

    (b) Restart OpenSM after the above configuration.

    Without this configuration, VF ports will remain in the down state.
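
    For example, on a systemd-based host (the OpenSM service is typically named opensmd, but the name may differ by distribution):

    # grep virt_enabled /etc/opensm/opensm.conf
    virt_enabled 2
    # systemctl restart opensmd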

     

    4. Make sure the Kubernetes cluster is set up but its nodes are not yet Ready due to missing networking.

    Once the SR-IOV CNI configuration is done, the nodes become Ready. The SR-IOV CNI configuration is explained further below in this document.

    If the cluster is already up because of a previous networking configuration, it is unlikely that SR-IOV CNI will work, because Kubernetes picks up the first CNI configuration file in alphabetical order.

    In such a case, the user must back up (move aside) the existing networking CNI configuration before adding the SR-IOV CNI configuration.
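
    For example, assuming an existing flannel configuration (the filename is illustrative; check /etc/cni/net.d on your nodes):

    # mkdir -p /root/cni-backup
    # mv /etc/cni/net.d/10-flannel.conflist /root/cni-backup/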

    # kubectl get nodes

     

    5. SR-IOV device plugin configuration and installation

    The SR-IOV device plugin enables SR-IOV and performs the necessary VF configuration. The device plugin expects that, in a given Kubernetes cluster, the PF netdevice has the same name (such as ib0, ipoib0, or any other name that is unique within the cluster) on every node.

     

    (a) Configure PF netdevice name

         The user must configure the PF netdevice name to be the same on all the Kubernetes nodes. For example, to rename an IPoIB interface:

    # ip link set ib0 name ipoib0

    For an Ethernet interface, the command is similar:

    # ip link set ens2f0 name eth_roce0

    For persistent network device naming across system reboots, udev rules should be used.
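
    A minimal udev rule sketch for an IPoIB interface is shown below; the file name and hardware address are illustrative and must be adapted to your system (ATTR{type}=="32" matches InfiniBand link-layer devices):

    # /etc/udev/rules.d/70-persistent-ipoib.rules
    ACTION=="add", SUBSYSTEM=="net", ATTR{type}=="32", ATTR{address}=="?*00:02:c9:fa:c3:f1", NAME="ipoib0"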

     

    (b) Create device plugin configuration

    • Get template configuration file
    # wget https://cdn.rawgit.com/Mellanox/k8s-rdma-sriov-dev-plugin/7b27f8cf/example/sriov/rdma-sriov-node-config.yml
    • Edit the rdma-sriov-node-config.yml file to set the appropriate PF netdevice interface name, such as ipoib0 or eth_roce0 (the name configured in step 5(a)).
    • Example:

    "pfNetdevices": [ "ipoib0” ]

     

    (c) Create Kubernetes ConfigMap

    # kubectl create -f rdma-sriov-node-config.yml

    (d) Deploy device plugin

    # kubectl create -f https://cdn.rawgit.com/Mellanox/k8s-rdma-sriov-dev-plugin/048ceb52/example/device-plugin.yaml

    The device plugin reads the ConfigMap created in the previous step, creates the SR-IOV VF devices, and initializes them.

     

    6. SR-IOV CNI plugin installation and configuration

    (a) Download and install the SR-IOV CNI plugin

     

    # wget https://cdn.rawgit.com/Mellanox/sriov-cni/master/k8s-installer/k8s-sriov-cni-installer.yaml

    # kubectl apply -f k8s-sriov-cni-installer.yaml

    This installs the SR-IOV CNI binaries and creates an SR-IOV configuration template at /etc/cni/net.d/10-sriov-cni.conf.
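
    To verify on a node, check that the plugin binary and the configuration template are in place (assuming the default CNI binary directory /opt/cni/bin):

    # ls /opt/cni/bin/sriov
    # cat /etc/cni/net.d/10-sriov-cni.conf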

     

    (b) Configure the SR-IOV CNI plugin

    In /etc/cni/net.d/10-sriov-cni.conf, edit the interface name, IP address range, and gateway address. The if0 interface name must be the same as the PF netdevice name configured in the device plugin ConfigMap (rdma-sriov-node-config.yml) in step 5(b).

    /etc/cni/net.d/10-sriov-cni.conf looks like the example below. The user must set the fields described above.

    {

        "name": "mynet",

        "type": "sriov",

        "if0": "ipoib0",

        "ipam": {

            "type": "host-local",

            "subnet": "10.55.206.0/26",

            "routes": [

                { "dst": "0.0.0.0/0" }

            ],

            "gateway": "10.55.206.1"

        }

    }
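
    Note that host-local IPAM allocates addresses independently on each node. If multiple nodes share the same subnet, consider giving each node a non-overlapping range to avoid address collisions, for example (rangeStart/rangeEnd are standard host-local IPAM fields; the values here are illustrative):

    "ipam": {
        "type": "host-local",
        "subnet": "10.55.206.0/26",
        "rangeStart": "10.55.206.10",
        "rangeEnd": "10.55.206.20",
        "routes": [ { "dst": "0.0.0.0/0" } ],
        "gateway": "10.55.206.1"
    }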

     

    (c) Check that the Kubernetes cluster nodes are now Ready.

    # kubectl get nodes

     

    A sample output is shown below, with one master node (node2) and two worker nodes (node3 and node4).

     

     

    NAME    STATUS   ROLES    AGE   VERSION   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
    node2   Ready    master   1d    v1.10.4   <none>        Ubuntu 16.04.4 LTS   4.13.0-45-generic   docker://1.13.1
    node3   Ready    <none>   1d    v1.10.4   <none>        Ubuntu 16.04.4 LTS   4.13.0-45-generic   docker://1.13.1
    node4   Ready    <none>   1d    v1.10.4   <none>        Ubuntu 16.04.4 LTS   4.13.0-45-generic   docker://1.13.1

     

     

     

    7. Check configuration

    (a) Check that the SR-IOV device plugin and SR-IOV CNI DaemonSets and Pods are running:

    # kubectl get ds --namespace=kube-system
    # kubectl get pods --namespace=kube-system

     

    8. Pod Configuration

    Each Pod’s container configuration must include a resource limit indicating that it requires one virtual networking device:

    resources:

      limits:

        rdma/vhca: 1
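
    A minimal Pod sketch showing where this limit fits is given below; the Pod and image names are illustrative, and the IPC_LOCK capability is commonly required for RDMA memory registration:

    apiVersion: v1
    kind: Pod
    metadata:
      name: rdma-test-pod
    spec:
      containers:
      - name: rdma-app
        image: my-rdma-image:latest
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          limits:
            rdma/vhca: 1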

    A complete example Pod configuration can be found below.

     

    https://github.com/Mellanox/k8s-rdma-sriov-dev-plugin/blob/master/example/sriov/test-sriov-pod.yml

    A sample Dockerfile for building a Mellanox OFED image based on a CentOS or Ubuntu base image can be found in the git repository below.

     

    https://github.com/Mellanox/mofed_dockerfiles

     

    9. Debugging issues

    If Pods fail to start or VF resources are not allocated, the following commands are useful starting points:

    # systemctl status -l kubelet                                            (kubelet logs often show CNI or device plugin errors)
    # kubectl get configmap rdma-devices --namespace=kube-system -o json     (verify the device plugin configuration)
    # kubectl get ds --namespace=kube-system                                 (verify the device plugin and CNI DaemonSets are deployed)
    # kubectl get pods --namespace=kube-system                               (verify their Pods are running on every node)
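
    It can also help to verify that the device plugin has registered the rdma/vhca resource on each node (the node name is illustrative):

    # kubectl describe node node3 | grep rdma/vhca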