Reference Deployment Guide for a K8s Cluster with Mellanox RDMA Device Plugin and Multus CNI Plugin with Two Network Interfaces (Flannel and Mellanox SR-IOV) (Draft)

Version 22

     

    In this document we demonstrate a multi-rack deployment of a Kubernetes cluster over a Mellanox end-to-end Ethernet network.

    Our deployment uses bare-metal servers running the Ubuntu 16.04 server operating system, ConnectX-5 family NICs, Spectrum family switches, and LinkX family cables.

    NEO™, the Mellanox network orchestration and management software, will provision and operate the leaf-spine Ethernet fabric, including the switches, NICs, and cables.

    If you are not familiar with Kubernetes, please see the references below.

     

    References

     

    The deployment has been tested only with Kubernetes version 1.11.3.

    Please verify that you are using a server platform that supports SR-IOV.

    Consult your hardware manufacturer's documentation for the BIOS-specific settings required to enable SR-IOV networking.

     

     

    Introduction

    What is Kubernetes?

    Kubernetes is an open source system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications.

     

    Mellanox components overview and benefits

    Mellanox Spectrum switch family provides the most efficient network solutions for the ever-increasing performance demands of Data Center applications. Mellanox ConnectX network adapter family delivers industry-leading connectivity for performance-driven server and storage applications. These ConnectX adapter cards enable high bandwidth, coupled with ultra-low latency for diverse applications and systems, resulting in faster access and real-time responses. Mellanox NEO™ is a powerful platform for managing computing networks. It enables data center operators to efficiently provision, monitor and operate the modern data center fabric. Mellanox NEO-Host is a powerful solution for orchestration and management of host networking. NEO-Host is integrated with the Mellanox NEO™ and can be deployed on Linux hosts managed by NEO.

     

     

    Solution Design

    Our reference deployment will be based on common Spine/Leaf Data Center network architectures for the Kubernetes cluster. We propose one of today's most common implementations, which is based on layer-3 routing protocols, such as OSPF. Technological platforms such as overlay networks (VXLAN, GENEVE) or RDMA over Converged Ethernet version 2 (RoCEv2) add further capabilities to layer-3 networks.

     

     


     

     

    Solution Physical diagram and Network configuration

    The physical solution is based on the design described in "L3 network design with OSPF at scale with Mellanox NEO".

    Each server in the solution is connected to its leaf switch with a single physical interface.

    To provide connectivity for the second interface in the pod (the Mellanox SR-IOV interface, which is not managed by Kubernetes), we configured the same VLAN interface (VLAN 111) on each leaf switch, with unique network settings per leaf switch.

    We deployed a Docker container with a DHCP service on each leaf switch to provide IPAM (IP address management) for the Mellanox SR-IOV interfaces.

    The Docker deployment procedure is described in "How-to Deploy Docker Container with DHCP service over Mellanox ONYX on Mellanox Spectrum switches".

    A switch configuration example is provided below.

     

     

    Bill of Materials - BOM

     

     

     

    Server installation

     

    Ubuntu Server 16.04 is the chosen OS. Each server in this deployment is configured with a static IP address.

    We use three physical servers: one master and two workers, with the following names and addresses:

    Server name     IP Address
    clx-host-020    10.215.15.1/24
    clx-host-021    10.215.15.2/24
    clx-host-022    10.215.16.1/24
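
    Below is a minimal sketch of a static IP configuration on Ubuntu 16.04 (ifupdown, /etc/network/interfaces), using clx-host-020 as an example; the interface name and gateway address are assumptions for illustration and must match your actual cabling and leaf switch addressing:

    # /etc/network/interfaces (sketch for clx-host-020)
    auto ens2f0
    iface ens2f0 inet static
        address 10.215.15.1
        netmask 255.255.255.0
        # assumption: leaf switch gateway address for this rack
        gateway 10.215.15.254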

     

     

     

    This document does not cover the server storage aspect. You should configure the server storage components in accordance with your intended use.

     

    Enable intel_iommu in grub configuration

    Update the file /etc/default/grub with:

     

    GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"

     

    Then run

     

    # update-grub && reboot
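
    After the reboot, you can optionally verify that the IOMMU was enabled; for example:

    # grep intel_iommu /proc/cmdline
    # dmesg | grep -i -e DMAR -e IOMMU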

     

    Installing MOFED for Ubuntu

    This chapter describes the installation process of the MOFED Linux package on a single host machine.

    MOFED is an additional software component by Mellanox which provides the latest drivers and firmware versions.

    For more information, see the Mellanox OFED for Linux User Manual.

    Downloading Mellanox OFED

    1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed.

     

    # lspci -v | grep Mellanox

     

    The following example shows a system with an installed Mellanox HCA:

    2. Download the latest ISO image (matching your OS) into your server's share folder.

    The image name comes in the following format: MLNX_OFED_LINUX-<ver>-<OS label><CPUarch>.iso.

    You can download it from:

    http://www.mellanox.com > Products > Software > InfiniBand/VPI Drivers > Mellanox OFED Linux (MLNX_OFED) > Download.

    In our case we downloaded - MLNX_OFED_LINUX-4.4-2.0.7.0-ubuntu16.04-x86_64.iso.

    3. Use the MD5SUM utility to confirm the integrity of the downloaded file. Run the following command and compare the result to the value provided on the download page.

    # md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso

     

    Installing Mellanox OFED

    MLNX_OFED is installed by running the mlnxofedinstall script. This installation script performs the following:

    • Discovers the currently installed kernel
    • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
    • Installs the MLNX_OFED_LINUX binary RPMs (if they are available in the current kernel)
    • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

    The installation script removes all previously installed Mellanox OFED packages and re-installs their new versions. You will be prompted to acknowledge the deletion of the old packages.

    1. Log into the installation machine as root.

    2. Copy the downloaded ISO to /root

    3. Mount the ISO image on your machine.

    # mkdir /mnt/iso
    # mount -o loop /share/MLNX_OFED_LINUX-4.4-2.0.7.0-ubuntu16.04-x86_64.iso /mnt/iso
    # cd /mnt/iso

    4. Run the installation script and reboot

    # ./mlnxofedinstall

    # reboot

    5. Enable SR-IOV and set the desired number of VFs (virtual functions).

    Set the parameters SRIOV_EN=1 and NUM_OF_VFS=8; this is an example with 8 VFs.

    # mlxconfig -d /dev/mst/mt4121_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=8

    # reboot

     

    The number of VFs available for activation depends on your server hardware platform.
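
    Optionally, verify the firmware settings after the reboot. mlxconfig is part of the MFT tools installed with MLNX_OFED; if the /dev/mst device nodes are not present, run mst start first:

    # mst start
    # mlxconfig -d /dev/mst/mt4121_pciconf0 q | grep -E 'SRIOV_EN|NUM_OF_VFS'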

     

    K8s Cluster Deployment Guide

     

    Deployment has been tested only with Kubernetes version 1.11.3.

     

    All configuration files associated with this deployment (YAML, CONF) are provided in the Appendix below, with descriptions.

    All original YAML configuration files have been customized.

     

    Deployment steps

     

    1. On each server install:

    • Kernel, Docker, and components:

    # apt-get update && apt-get -y install linux-generic-hwe-16.04

    # apt-get -y upgrade && apt-get -y install apt-transport-https aufs-tools

    # apt-get -y install docker.io=17.03.2-0ubuntu2~16.04.1

    # systemctl start docker

    # systemctl enable docker

    • Kubernetes apt repository and components:

    # curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -

    # echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
    # apt-get update && apt-get install -y kubelet=1.11.3-00 kubeadm=1.11.3-00 kubectl=1.11.3-00 kubernetes-cni aufs-tools
    • Disable swap:
    # swapoff -a

    Please also comment out the swap entry in /etc/fstab so that swap remains disabled after a reboot (see the example below).
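
    For example, the swap entry can be commented out with a one-liner (a sketch; review /etc/fstab afterwards):

    # cp /etc/fstab /etc/fstab.bak
    # sed -i '/\sswap\s/ s/^/#/' /etc/fstab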

    2. Initialize master node:

    # kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=10.215.15.1 --token-ttl=0 --kubernetes-version stable-1.11

     

    Once complete, you will be presented with the exact “kubeadm join …” command that you need to execute on each worker node in order to join it to the cluster.

    Before joining the worker nodes, configure kubectl access on the master node:

    # mkdir -p $HOME/.kube

    # sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

    # sudo chown $(id -u):$(id -g) $HOME/.kube/config

     

    Alternatively, if you are the root user, you can run this command:

    # export KUBECONFIG=/etc/kubernetes/admin.conf

     

    3. Pod network add-on selection:

    At the end of the kubeadm initialization you must choose a pod network add-on in order to establish communication between deployed pods.

    In this deployment we choose Multus CNI with two network interfaces per pod: Flannel and Mellanox SR-IOV.

    The following diagram shows network control flow with Multus CNI plugin. (Diagram credit: Intel):

     

    4. Joining a worker node:

    • Please execute on each worker node the “kubeadm join …” command specified above.
    • Please label each SR-IOV-enabled node with "name=sriov-cni-ds". In our case:

    kubectl label nodes clx-host-021 name=sriov-cni-ds

    kubectl label nodes clx-host-022 name=sriov-cni-ds
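
    The labels can be verified with:

    kubectl get nodes -l name=sriov-cni-ds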

     

    5. CNI's and Mellanox Device plugin deployment steps:

    • Mellanox SRIOV device plugin configuration and installation

    Please download the rdma-sriov-node-config.yml configuration file from the Appendix.

    Edit the file to specify the appropriate PF netdevice interface name, such as ib0, ipoib0, or ens2f0.

    Example in our case:
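
    A minimal sketch of the edited ConfigMap is shown below; the ConfigMap name, namespace, and config.json layout follow the upstream k8s-rdma-sriov-dev-plugin project (assumptions here; check them against the Appendix file), and the PF netdevice ens2f0 matches the device plugin log shown later in this document:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rdma-devices        # assumption: name used by the upstream plugin example
      namespace: kube-system
    data:
      config.json: |
        {
          "mode": "sriov",
          "pfNetdevices": ["ens2f0"]
        }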

    Create Kubernetes Configmap:

    kubectl create -f rdma-sriov-node-config.yml

    Please download device-plugin.yaml and deploy the device plugin with the command below:

    kubectl create -f device-plugin.yaml

    This reads the ConfigMap created in the previous step, then creates and initializes the SR-IOV VF devices.

    A detailed installation procedure is described in Kubernetes IPoIB/Ethernet RDMA SR-IOV Networking with ConnectX4/ConnectX5, step #5.
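
    Once the device plugin DaemonSet pods are running, the SR-IOV devices should appear as an allocatable resource on each SR-IOV-enabled worker node. In the upstream k8s-rdma-sriov-dev-plugin this resource is named rdma/hca (an assumption here; confirm the exact name against the Appendix files), so a quick check is:

    kubectl describe node clx-host-021 | grep -i rdma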

     

    • Multus may be deployed as a DaemonSet. Flannel is deployed as the pod-to-pod network that is used as our "default network".
      First, download the customized multus-daemonset.yml and flannel-daemonset.yml from the Appendix, then apply both files with the command below:
    # cat ./{multus-daemonset.yml,flannel-daemonset.yml} | kubectl apply -f -

    A detailed installation procedure is described in GitHub - intel/multus-cni: Multi-homed pod cni.

    The source DaemonSet files can be found at https://github.com/intel/multus-cni/tree/master/images
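
    To confirm that the Multus and Flannel pods are up and that a CNI configuration has been written on each node (the pod name patterns below are an assumption based on the upstream DaemonSets):

    # kubectl get pods -n kube-system -o wide | grep -E 'multus|flannel'
    # ls /etc/cni/net.d/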

     

    • Mellanox SR-IOV CNI plugin installation and configuration

    Please download the customized k8s-sriov-cni-installer.yaml from the Appendix and install the Mellanox SR-IOV CNI plugin with the command below:

     

    # kubectl apply -f k8s-sriov-cni-installer.yaml

     

    A detailed installation procedure is described in Kubernetes IPoIB/Ethernet RDMA SR-IOV Networking with ConnectX4/ConnectX5, step #6.
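
    To confirm that the SR-IOV CNI plugin binary was installed on the SR-IOV-enabled nodes (assuming the installer DaemonSet copies the plugin into /opt/cni/bin on each node, as the upstream installer does), run on a worker node:

    # ls /opt/cni/bin/ | grep -i sriov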

     

    • Final CNI configuration with DHCP IPAM

    For the final CNI configuration we used the DHCP CNI plugin from GitHub - OpenSourceLAN/dhcp-cni-plugin: Run your Kubernetes containers with DHCP from your LAN, customized for our needs.

    Please download the customized dhcp-daemonset.yaml from the Appendix and apply it with the command below:

     

    # kubectl apply -f dhcp-daemonset.yaml
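
    At this point the CNI components and the device plugin should all be running as DaemonSets; assuming the customized manifests deploy into the kube-system namespace (as the upstream ones do), a quick overview is available with:

    # kubectl get daemonset -n kube-system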

     

    Check setup deployment

    1. Check node status
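
    For example:

    # kubectl get nodes -o wide

    All nodes should report the Ready status.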

     

     

    2. Check deployed pods in kube-system namespace
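
    For example:

    # kubectl get pods -n kube-system -o wide

    All kube-system pods, including the Multus, Flannel, SR-IOV CNI, DHCP, and device plugin DaemonSet pods, should be in the Running state.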

     

     

     

    3. Check device plugin pod status

    Example log from a pod with the Mellanox device plugin successfully activated:

    # kubectl logs rdma-sriov-dp-ds-mghnl -n kube-system

    2018/10/29 08:37:10 Starting K8s RDMA SRIOV Device Plugin version= 0.2

    2018/10/29 08:37:10 Starting FS watcher.

    2018/10/29 08:37:10 Starting OS watcher.

    2018/10/29 08:37:10 Reading /k8s-rdma-sriov-dev-plugin/config.json

    2018/10/29 08:37:10 loaded config: {"mode":"sriov","pfNetdevices":["ens2f0"]}

    2018/10/29 08:37:10 sriov device mode

    Configuring SRIOV on ndev= ens2f0 6

    max_vfs = 8

    cur_vfs = 8

    vf = &{7 virtfn7 true false}

    vf = &{5 virtfn5 true false}

    vf = &{3 virtfn3 true false}

    vf = &{1 virtfn1 true false}

    vf = &{6 virtfn6 true false}

    vf = &{4 virtfn4 true false}

    vf = &{2 virtfn2 true false}

    vf = &{0 virtfn0 true false}

    2018/10/29 08:37:22 Starting to serve on /var/lib/kubelet/device-plugins/rdma-sriov-dp.sock

    2018/10/29 08:37:22 Registered device plugin with Kubelet

    exposing devices: [&Device{ID:ea:d6:ac:23:44:57,Health:Healthy,} &Device{ID:9a:12:40:2a:4a:70,Health:Healthy,} &Device{ID:0e:96:b3:41:99:a3,Health:Healthy,} &Device{ID:ae:7c:3c:40:be:89,Health:Healthy,} &Device{ID:e2:79:48:de:c2:fa,Health:Healthy,} &Device{ID:ea:dd:89:a9:09:40,Health:Healthy,} &Device{ID:96:10:d3:21:9f:0f,Health:Healthy,} &Device{ID:66:50:49:50:39:f3,Health:Healthy,}]

     

    4. Kubelet configuration files for the master and worker nodes. The master node is excluded from the device plugin and Mellanox CNI deployments.

    For the master node, the Kubelet configuration file should look like this:

     

    For a worker node, the Kubelet configuration file should look like this:

     

    Test pod deployment and benchmarks

    1. Test pod deployment
      Please download the test pod deployment YAML file (test-sriov-dep.yml) from the Appendix and apply it with the following command:
      kubectl apply -f test-sriov-dep.yml
      After the deployment finishes, verify that the test pods reach the Running state.

      A sketch of such a test deployment file is shown below.
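
    The authoritative test-sriov-dep.yml is provided in the Appendix; the following is only a minimal sketch of such a deployment. The container image, the rdma/hca resource name, and the IPC_LOCK capability are assumptions for illustration and should be aligned with the Appendix file; attachment of the secondary SR-IOV interface (net1) is handled by the Multus/SR-IOV CNI configuration deployed above:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mofed-test-pod
    spec:
      replicas: 2                      # two test pods, expected to land on separate worker nodes
      selector:
        matchLabels:
          app: mofed-test
      template:
        metadata:
          labels:
            app: mofed-test
        spec:
          containers:
          - name: mofed-test
            image: mofed-test-image:latest   # hypothetical image with MOFED user-space tools (ibdev2netdev, ib_write_bw)
            command: ["sh", "-c", "sleep infinity"]
            securityContext:
              capabilities:
                add: ["IPC_LOCK"]      # required for RDMA memory registration inside the pod
            resources:
              limits:
                rdma/hca: 1            # assumption: resource advertised by the Mellanox RDMA device plugin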



    2. Benchmark results
      For the benchmark comparison we used standard tools such as ib_write_bw.
      Before running this tool, we must determine which VF device is connected to the selected pod.
      This can be done with a command like:

      # kubectl exec -it mofed-test-pod-576584fbf8-np7k2 ibdev2netdev
      mlx5_5 port 1 ==> net1 (Up)

      From the output we can see that the pod is connected to the VF exposed as device mlx5_5.
      In our case we run the test on two pods deployed on separate worker nodes:

     

    • Server side

    # kubectl exec -it mofed-test-pod-576584fbf8-np7k2 /bin/bash

     

    (pod)# ibdev2netdev

    mlx5_5 port 1 ==> net1 (Up)


    (pod)# ip a s dev net1

    145: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000

    link/ether f6:0c:d4:e3:e2:53 brd ff:ff:ff:ff:ff:ff

    inet 10.215.222.4/24 brd 10.215.222.255 scope global net1

    valid_lft forever preferred_lft forever

    inet6 fe80::f40c:d4ff:fee3:e253/64 scope link

    valid_lft forever preferred_lft forever


    (pod)# ib_write_bw -d mlx5_5 -a -F --report_gbits

     

    • Client side:

     

    # kubectl exec -it mofed-test-pod-576584fbf8-fmhtf /bin/bash

     

    (pod)# ibdev2netdev

    mlx5_3 port 1 ==> net1 (Up)

     

    (pod)# ib_write_bw -d mlx5_3 -a -F 10.215.222.4 --report_gbits

    Benchmark test output looks like:

     

     

     

    Appendix

     

    1. Spine switch configuration files
      Attached below: Spine-1.txt and Spine-2.txt.
    2. Leaf switch configuration files with the Docker DHCP service configuration
      Attached below: Leaf-1.txt and Leaf-2.txt for the leaf switches.
      dhcp.conf examples are provided below:

      dhcp.conf file for Leaf-1 switch:

      #dhcp.conf file for Leaf-1 DHCP service

      interface=swid0_eth.111
      dhcp-range=vlan111,10.215.111.1,10.215.111.127,255.255.255.0,60m
      dhcp-option=option:classless-static-route,10.215.222.0/24,10.215.111.254


      log-queries
      log-dhcp
      dhcp-sequential-ip

       

       

      dhcp.conf file for Leaf-2 switch:

      #dhcp.conf file for Leaf-2 DHCP service

      interface=swid0_eth.111
      dhcp-range=vlan111,10.215.222.1,10.215.222.127,255.255.255.0,60m

      dhcp-option=option:classless-static-route,10.215.111.0/24,10.215.222.254

       

      log-queries
      log-dhcp
      dhcp-sequential-ip

    3. Kubernetes deployment YAML files attached below:
      • multus-daemonset.yml and flannel-daemonset.yml - Multus with Flannel daemonsets
      • rdma-sriov-node-config.yml - configmap deployment file for Mellanox device plugin
      • device-plugin.yaml - Mellanox device plugin daemonset
      • k8s-sriov-cni-installer.yaml - Mellanox SRIOV CNI plugin daemonset
      • dhcp-daemonset.yaml - DHCP IPAM management for Mellanox SRIOV interface daemonset
      • test-sriov-dep.yml - example of RDMA enabled pod deployment