Reference Deployment Guide for Kubernetes Cluster with Contiv CNI plugin over Mellanox 25Gb and 100Gb Ethernet Solutions

Version 11

    In this document we will demonstrate a multi-rack deployment of a Kubernetes cluster over a Mellanox end-to-end Ethernet network. We will use bare-metal servers running Ubuntu 16.04, ConnectX-5 family NICs, Spectrum family switches and LinkX product family cables. NEO™, the Mellanox Network Orchestration and Management Software, will provision and operate the Leaf-Spine Ethernet fabric, including switches, NICs and cables.

     


    Introduction

    What is Kubernetes?

    Kubernetes is an open source system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications.

     

    Mellanox components overview and benefits

    Mellanox Spectrum switch family provides the most efficient network solutions for the ever-increasing performance demands of Data Center applications. Mellanox ConnectX network adapter family delivers industry-leading connectivity for performance-driven server and storage applications. These ConnectX adapter cards enable high bandwidth, coupled with ultra-low latency for diverse applications and systems, resulting in faster access and real-time responses. Mellanox NEO™ is a powerful platform for managing computing networks. It enables data center operators to efficiently provision, monitor and operate the modern data center fabric. Mellanox NEO-Host is a powerful solution for orchestration and management of host networking. NEO-Host is integrated with the Mellanox NEO™ and can be deployed on Linux hosts managed by NEO.

     


    Solution Design

    Our reference deployment will be based on common Spine/Leaf Data Center network architectures for the Kubernetes cluster. We propose one of today's most common implementations, which is based on layer-3 routing protocols, such as OSPF. Technological platforms such as overlay networks (VXLAN, GENEVE) or RDMA over Converged Ethernet version 2 (RoCEv2) add further capabilities to layer-3 networks.

     

     


     

     

    Hardware Configuration

    Bill of Materials

    We propose the following two network solutions, built from the Mellanox Ethernet switch models below; choose between them based on scale and blocking ratio considerations:

     

    Solution parts        Solution 1 - Small Scale         Solution 2 - Large Scale
    Spine switch          SN2100 Open Ethernet Switch      SN2700 Open Ethernet Switch
    Leaf switch           SN2010 Ethernet Switch           SN2410 Ethernet Switch
    Blocking ratio        9:8                              3:1
    Max Nodes             108                              672
    Nodes per Rack        18                               48
    Max Racks             6                                14
    Network adapter       1 per host: ConnectX-5 Dual SFP28 Port (both solutions)
    Server-Leaf cables    1 per host: SFP28 25GbE Passive Copper Cable (both solutions)
    Leaf-Spine cables     Quantity depends on solution and blocking ratio.
                          Solution 1: Mellanox MCP1600-E001 Passive Copper Cable, IB EDR up to 100Gb/s, QSFP, LSZH, 1m, 30AWG
                          Solution 2: QSFP28 100GbE Passive Copper Cable

     

     

    Physical diagrams for both solutions

    Please see below the physical diagram for each solution. In both solutions the Spine switches are placed only in the first rack. We use the NEO management software to provision, configure and monitor the network fabric, and NEO-Host to configure and monitor the host network.

     

    Solution 1 - Small Scale

    Leaf-Spine topology: SN2100 as the Spine switch and SN2010 as the Leaf switch.

    This solution allows scaling up to 108 nodes: 6 racks, 18 servers per rack, with a 9:8 blocking ratio.

    A single 25GbE port connects each server to its Leaf switch, using an SFP28 25GbE Passive Copper Cable.

    4 x 100GbE ports connect each Leaf switch to the Spine switches, 2 x 100GbE ports per Spine switch, using QSFP28 100GbE Passive Copper Cables.

    A dedicated management port on each Mellanox switch is connected to the Switch Management Network.

     

     

     

    Solution 2 - Large Scale

    Leaf-Spine topology: SN2700 as the Spine switch and SN2410 as the Leaf switch.

    This solution allows scaling up to 672 nodes: 14 racks, 48 servers per rack, with a 3:1 blocking ratio.

    A single 25GbE connection from each server to its Leaf switch, using an SFP28 25GbE Passive Copper Cable.

    4 x 100GbE connections from each Leaf switch to the Spine switches, 2 x 100GbE per Spine switch, using QSFP28 100GbE Passive Copper Cables.

    A dedicated management port on each Mellanox switch is connected to the Switch Management Network.

     

     

    Mellanox NEO must have access to the switch and host management networks in order to provision, operate and orchestrate the end-to-end Ethernet fabric. NEO uses the 25GbE main host network interface to communicate with the NEO-Host agent (see the physical network diagrams in the previous section).

     

    In this document we do not cover connectivity to the corporate network.

    We strongly recommend using out-of-band management for the Mellanox switches. Use a dedicated management port on each switch.

     

     

     

    Network Configuration

    NEO Virtual Appliance

    NEO software is available for download as a CentOS/Red Hat installation package, as well as a Virtual Appliance for various virtualization platforms. The NEO Virtual Appliance is available in file formats compatible with leading virtualization platforms, including VMware ESXi, Microsoft Hyper-V, Nutanix AHV, Red Hat Virtualization, IBM PowerKVM, and more.

     

    NEO Logical Schema

    Please see below the logical connectivity schematic of all Mellanox software and hardware components. MOFED and NEO-Host are optional Mellanox software components installed on the hosts.

     

     

     

    Downloading Mellanox NEO

    Mellanox NEO is available for download from the Mellanox NEO™ product page.

     

     

    After you fill out a short form, download instructions will be sent to you by email.

     

    Installing Virtual Appliance

    Please read the Mellanox NEO Quick Start Guide for detailed installation instructions. This Quick Start Guide provides step-by-step instructions for the Mellanox NEO™ software installation and Virtual Appliance deployment.

    In our example we use a NEO Virtual Appliance installed on the Microsoft Hyper-V platform.

    Once the NEO VM is installed, connect to the appliance console and use the following default credentials to log into your VM:

    • Username: root
    • Password: 123456

    Once logged in, you will see the following appliance information screen:

    The MAC address assigned to the VM must have a DHCP record in order to receive an IP address.

     

     

    Switch OS installation / configuration

    Please start from the How To Get Started with Mellanox switches guide if you are unfamiliar with Mellanox switch software. For more information, please refer to the Mellanox Onyx User Manual located at support.mellanox.com or www.mellanox.com -> Switches. Before beginning to use the Mellanox switches, we recommend that you upgrade the switches to the latest Mellanox Onyx™ version. You can download the latest version from myMellanox, the Mellanox support site (requires an active support subscription).

     

    Fabric configuration

    In this guide the Ethernet switch fabric is configured as a Layer-3 Ethernet network. There are two ways to configure switches:

    • CLI-based configuration performed manually on each switch (a brief CLI sketch follows the note below).
    • Wizard-based configuration using Mellanox NEO.

           If you are not familiar with Mellanox NEO, please refer to Mellanox NEO Solutions.
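    For the CLI-based option, below is a minimal sketch of the per-switch commands that turn one uplink into a routed port and place it into OSPF area 0.0.0.0. The interface number and IP address are illustrative only, and the exact command syntax may vary between Mellanox Onyx versions; treat this as an assumption-level example and consult the Mellanox Onyx User Manual for the authoritative syntax.

      switch (config) # protocol ospf
      switch (config) # router ospf
      switch (config) # interface ethernet 1/1 no switchport force
      switch (config) # interface ethernet 1/1 ip address 10.10.10.1 /30
      switch (config) # interface ethernet 1/1 ip ospf area 0.0.0.0
      switch (config) # configuration write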

     

     

     

    Example configuration of Mellanox NEO and Switch UI for Solution 2 - Large Scale

    This example shows the multi-rack connectivity configuration of two Leaf switches to both Spine switches. Each Leaf switch is configured with a single VLAN. Each Leaf switch is connected to the Spine switches by 4 cables, 2 cables per Spine switch. Please see the cross-switch port connectivity table below:

     

    Interface type    Spine-1 switch    Spine-2 switch    Leaf-1 switch    Leaf-2 switch
    OSPF              Ports 1-2         -                 Ports 49-50      -
    OSPF              -                 Ports 1-2         Ports 51-52      -
    OSPF              Ports 3-4         -                 -                Ports 49-50
    OSPF              -                 Ports 3-4         -                Ports 51-52

     

     

     

     

     

     

     

    You can easily extend this setup by adding additional Leaf switches to the OSPF area and connecting the corresponding Spine and Leaf interfaces. The steps below describe how to configure the Ethernet switch fabric using Mellanox NEO and the switch UI.

     

    1. Log into the Mellanox NEO Web UI using the following default credentials:

    • Username: admin
    • Password: 123456

    The Mellanox NEO URL can be found on the appliance console information screen.

    2. Register devices.

         Register all switches via the "Add Devices" wizard in Managed Elements.

     

     

    3. Configure the Mellanox Onyx switches for LLDP discovery.

         Run the "Enable Link Layer Discovery..." provisioning task from the Task tab on all switches.

     

    4. Create an OSPF area with the "L3 Network Provisioning" setup:

         Add the "L3 Network Provisioning" service in the Virtual Modular Switch service section under the Service tab. Fill out the required fields in order to complete the wizard.

     

    5. Once configured, review the port status on each switch. All configured and connected ports should show a green status.

     

     

    Server installation

     

    Ubuntu Server 16.04 is the chosen OS. Each server in this deployment gets its network settings from a DHCP server, which distributes network configuration parameters such as IP addresses, DNS server addresses and host names. We use three physical servers: one Kubernetes master and two worker nodes.

    This document does not cover the server storage aspect. You should configure the server storage components in accordance with your intended use.

     

    Installing MOFED for Ubuntu (optional)

    This chapter describes the installation process of the MOFED Linux package on a single host machine.

    MOFED is an additional software component by Mellanox which provides the latest drivers and firmware versions.

    For more information, see the Mellanox OFED for Linux User Manual.

     

    Downloading Mellanox OFED

     

    1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed:
      # lspci -v | grep Mellanox
      The output should list the installed Mellanox adapter; a ConnectX-5 NIC appears as a Mellanox Technologies Ethernet controller entry.

     

      2. Download the ISO image (according to your OS) into your server's shared folder.

    The image name comes in the following format:
    MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso (for example, MLNX_OFED_LINUX-4.2-1.0.0.0-ubuntu16.04-x86_64.iso). You can download it from:
    http://www.mellanox.com > Products > Software > InfiniBand/VPI Drivers > Mellanox OFED Linux (MLNX_OFED) > Download.

     

    3. Use the MD5SUM utility to confirm the integrity of the downloaded file. Run the following command and compare the result to the value provided on the download page.   

    # md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso

     

    Installing Mellanox OFED

    MLNX_OFED is installed by running the mlnxofedinstall script. This installation script performs the following:

    • Discovers the currently installed kernel
    • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
    • Installs the MLNX_OFED_LINUX binary packages (if they are available for the current kernel)
    • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

     

    The installation script removes all previously installed Mellanox OFED packages and re-installs their new versions. You will be prompted to acknowledge the deletion of the old packages.

     

    1. Log into the installation machine as root.
    2. Copy the downloaded ISO to /root
    3. Mount the ISO image on your machine.
      # mkdir /mnt/iso
      # mount -o loop /share/MLNX_OFED_LINUX-4.2-1.0.0.0-ubuntu16.04-x86_64.iso /mnt/iso
      # cd /mnt/iso
    4. Run the installation script.
      # ./mlnxofedinstall
    5. Reboot after the installation completes successfully.
      # reboot
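
    After the reboot, you can optionally confirm that the Mellanox driver is bound to the 25GbE interface. This is a minimal sanity check; the interface name ens2f0 is the one that appears later in this guide's routing output and may differ on your hosts.

      # ethtool -i ens2f0

    For ConnectX-5 adapters the driver field should report mlx5_core, along with the firmware version installed by the script.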

     

    Installing Mellanox NEO-Host (optional)

     

    Downloading Mellanox NEO-Host

    NEO-Host is available for download on MyMellanox.

    1. Log into MyMellanox.

    2. Go to Software -> Management Software -> Mellanox NEO-Host.

    3. Click “Downloads”.

    4. Download the software image.

     

    Unpacking and installing Mellanox NEO-Host

    To install NEO-Host:

    1. Copy the downloaded file to the /tmp directory:

    cp neohost-backend-<version>.tgz /tmp

    2. Untar the downloaded file:

    cd /tmp

    tar xvzf neohost-backend-<version>.tgz

    3. Run the Installation Script

    cd neohost-backend-<version>

    ./install-neohost.sh

     

     

    K8s Cluster Deployment Guide

     

    Deployment steps

     

    1. On each server install Docker:

    # apt-get update && apt-get -y install linux-generic-hwe-16.04

    # apt-get -y upgrade && apt-get -y install apt-transport-https

    # apt-get -y install docker.io=17.03.2-0ubuntu2~16.04.1

    # systemctl start docker

    # systemctl enable docker

    2. On each server, add the Kubernetes apt repository and install the Kubernetes components:

    # curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -

    # echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list

    # apt-get update && apt-get install -y kubelet kubeadm kubectl kubernetes-cni
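
    Optionally, you can hold the Kubernetes packages at their installed versions so that a later apt-get upgrade does not move the cluster to an unplanned release. This is a common practice rather than part of the original procedure:

    # apt-mark hold kubelet kubeadm kubectl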

    3. Disable swap:

     

    # swapoff -a
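
    Note that swapoff -a only disables swap until the next reboot. One common way to make the change persistent, assuming swap is defined in /etc/fstab, is to comment out the swap entry:

    # sed -i '/ swap / s/^/#/' /etc/fstab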

     

    4. Initialize master node:

    # kubeadm init

     

       Once complete, you will be presented with the exact "kubeadm join …" command that you need to execute on each worker node in order to join it to the cluster. Its general form is shown below.
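
       For reference, the printed command has the following general form (placeholders only; use the exact token and hash values from your own kubeadm init output):

    # kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>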

       Before you join any nodes, configure the kubectl environment on the master:

    # mkdir -p $HOME/.kube

    # sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

    # sudo chown $(id -u):$(id -g) $HOME/.kube/config

     

         Alternatively, if you are the root user, you can run this command:

    # export KUBECONFIG=/etc/kubernetes/admin.conf

     

    5. Pod network add-on selection:

         At the end of the kubeadm initialization, you must choose a Pod network add-on in order to establish communication between deployed pods.

         In this deployment we choose Contiv as the Pod network add-on.

         Contiv supports CNI based Kubernetes networking architecture. It consists of two major components:

      • Netmaster
      • Netplugin (Contiv Host Agent)

         The following Contiv architectural diagram shows how Netmaster and Netplugin provide the Contiv solution (Diagram credit: Contiv):

      

    6. Deploying a Contiv pod network add-on with VXLAN overlay:

         In our deployment we use the same network interface for the control and data planes. In this example we use the Contiv-1.1.7 installer version.

         Contiv installation steps on the master node:

      • Download installer bundle
        curl -L -O https://github.com/contiv/install/releases/download/1.1.7/contiv-1.1.7.tgz
      • Extract installer bundle
        tar oxf contiv-1.1.7.tgz
      • Change directories to the extracted folder
        cd contiv-1.1.7
      • Install Contiv with the VXLAN overlay
        # ./install/k8s/install.sh -n 10.215.15.1
        10.215.15.1 - IP address of Master node
      • After the installation is complete, review the final output of the process:
        Installation is complete
        =========================================================

        Contiv UI is available at https://10.215.15.1:10000
        Please use the first run wizard or configure the setup as follows:
          Configure forwarding mode (optional, default is routing).
           netctl global set --fwd-mode routing
           Configure ACI mode (optional)
           netctl global set --fabric-mode aci --vlan-range <start>-<end>
           Create a default network
           netctl net create -t default --subnet=<CIDR> default-net
           For example, netctl net create -t default --subnet=20.1.1.0/24 -g 20.1.1.1 default-net

        =========================================================

         

      • Create default network with netctl - a command line client for Contiv netplugin.
        netctl net create -t default --subnet=20.1.1.0/24 -g 20.1.1.1 default-net

                   netctl is a utility used for creating, reading and modifying Contiv objects.

     

              More information about integrating Contiv into Kubernetes can be found at Tutorials - Contiv.

     

    7. Join the worker nodes:

         On each worker node, execute the "kubeadm join …" command generated by kubeadm init on the master (step 4).

     

    8. Check your deployment:

         Please execute the following command on the master node:

    # kubectl get nodes -o wide

     

    The command output should look like the following:

     

    NAME           STATUS    ROLES     AGE       VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME

    clx-host-020   Ready     master    7d        v1.11.2   10.215.15.1   <none>        Ubuntu 16.04.5 LTS   4.15.0-32-generic   docker://17.3.2

    clx-host-021   Ready     <none>    7d        v1.11.2   10.215.15.2   <none>        Ubuntu 16.04.5 LTS   4.15.0-32-generic   docker://17.3.2

    clx-host-022   Ready     <none>    7d        v1.11.2   10.215.16.1   <none>        Ubuntu 16.04.5 LTS   4.15.0-32-generic   docker://17.3.2

     

    As specified above, each Worker node is connected to a different Leaf switch.

     

    Checking Contiv and related services:

    # kubectl get pod -n kube-system -o wide

    NAME                                   READY     STATUS    RESTARTS   AGE       IP            NODE           NOMINATED NODE

    contiv-etcd-nvcww                      1/1       Running   0          7d        10.215.15.1   clx-host-020   <none>

    contiv-netmaster-frkjs                 3/3       Running   0          7d        10.215.15.1   clx-host-020   <none>

    contiv-netplugin-pq2dq                 2/2       Running   0          7d        10.215.15.2   clx-host-021   <none>

    contiv-netplugin-rtch5                 2/2       Running   0          7d        10.215.15.1   clx-host-020   <none>

    contiv-netplugin-zb8hx                 2/2       Running   0          7d        10.215.16.1   clx-host-022   <none>

    etcd-clx-host-020                      1/1       Running   0          7d        10.215.15.1   clx-host-020   <none>

    kube-apiserver-clx-host-020            1/1       Running   0          7d        10.215.15.1   clx-host-020   <none>

    kube-controller-manager-clx-host-020   1/1       Running   0          7d        10.215.15.1   clx-host-020   <none>

    kube-proxy-82v8p                       1/1       Running   0          7d        10.215.15.1   clx-host-020   <none>

    kube-proxy-cs929                       1/1       Running   0          7d        10.215.16.1   clx-host-022   <none>

    kube-proxy-j478c                       1/1       Running   0          7d        10.215.15.2   clx-host-021   <none>

    kube-scheduler-clx-host-020            1/1       Running   0          7d        10.215.15.1   clx-host-020   <none>

     

    The contiv-etcd, contiv-netmaster and contiv-netplugin pods should all appear in Running status.

     

    Checking created tenants and available networks:

    # netctl net ls -a

    Tenant   Network      Nw Type  Encap type  Packet tag  Subnet        Gateway    IPv6Subnet  IPv6Gateway  Cfgd Tag

    ------   -------      -------  ----------  ----------  -------       ------     ----------  -----------  ---------

    default  contivh1     infra    vxlan       0           132.1.1.0/24  132.1.1.1                          

    default  default-net  data     vxlan       0           20.1.1.0/24   20.1.1.1 

     

     

    # netctl net inspect contivh1

    {

      "Config": {

        "key": "default:contivh1",

        "encap": "vxlan",

        "gateway": "132.1.1.1",

        "networkName": "contivh1",

        "nwType": "infra",

        "subnet": "132.1.1.0/24",

        "tenantName": "default",

        "link-sets": {},

        "links": {

          "Tenant": {

            "type": "tenant",

            "key": "default"

          }

        }

      },

      "Oper": {

        "allocatedAddressesCount": 3,

        "allocatedIPAddresses": "132.1.1.1-132.1.1.4",

        "availableIPAddresses": "132.1.1.5-132.1.1.254",

        "endpoints": [

          {

            "endpointID": "clx-host-021",

            "homingHost": "clx-host-021",

            "ipAddress": [

              "132.1.1.2",

              ""

            ],

            "labels": "map[]",

            "macAddress": "02:02:84:01:01:02",

            "network": "contivh1.default"

          },

          {

            "endpointID": "clx-host-020",

            "homingHost": "clx-host-020",

            "ipAddress": [

              "132.1.1.3",

              ""

            ],

            "labels": "map[]",

            "macAddress": "02:02:84:01:01:03",

            "network": "contivh1.default"

          },

          {

            "endpointID": "clx-host-022",

            "homingHost": "clx-host-022",

            "ipAddress": [

              "132.1.1.4",

              ""

            ],

            "labels": "map[]",

            "macAddress": "02:02:84:01:01:04",

            "network": "contivh1.default"

          }

        ],

        "externalPktTag": 1,

        "networkTag": "contivh1.default",

        "numEndpoints": 3,

        "pktTag": 1

      }

    }

     

    Routing configuration:

    # ip route

    default via 10.215.15.254 dev ens2f0 onlink

    10.96.0.0/12 via 132.1.1.3 dev contivh1

    10.215.15.0/24 dev ens2f0  proto kernel  scope link  src 10.215.15.1

    20.1.1.0/24 via 132.1.1.3 dev contivh1

    132.1.1.0/24 dev contivh1  proto kernel  scope link  src 132.1.1.3

    172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1 linkdown

    172.19.0.0/16 dev contivh0  proto kernel  scope link  src 172.19.255.254

    Contiv uses the contivh0 interface as the host port for routing external traffic. It adds a post-routing rule to iptables on the host in order to masquerade traffic coming through contivh0.

    The contivh1 interface allows the host to access the container/pod networks in routing mode.
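
    To see the masquerade rule that Contiv installed for contivh0 traffic, you can optionally list the NAT table on the host (a standard iptables check, not part of the original procedure) and look for the MASQUERADE entry associated with the 172.19.0.0/16 subnet shown in the routing table above:

    # iptables -t nat -L POSTROUTING -n -v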

    Using multiple tenants:

     

    In order to run pods that belong to a specific tenant, network, and endpoint group, use the io.contiv.tenant, io.contiv.network and io.contiv.net-group labels, respectively, in the pod's YAML configuration, as sketched below.
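
    As an illustration, the pod template metadata could select a tenant and network with labels like the following. This is only a sketch: the tenant "blue" and the network "blue-net" are hypothetical names that must already exist (created beforehand with netctl); the label keys are the ones listed above.

      metadata:
        labels:
          io.contiv.tenant: blue        # hypothetical tenant created with netctl
          io.contiv.network: blue-net   # hypothetical network inside that tenant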

    Checking connectivity between pods and from pod to external network:

    We deployed an iperf3 deployment with two pod replicas (see the 2pod-iperf3.yaml example at the end of this document):

    # kubectl apply -f 2pod-iperf3.yaml

     

         Show deployed pods:

    # kubectl get pod -l app=iperf3 -o wide

    NAME                              READY     STATUS    RESTARTS   AGE       IP         NODE           NOMINATED NODE

    dep-iperf3-2pod-9c8d98db8-5m8xj   1/1       Running   0          7d        20.1.1.2   clx-host-022   <none>

    dep-iperf3-2pod-9c8d98db8-kjhbf   1/1       Running   0          7d        20.1.1.3   clx-host-021   <none>

     

         Check connectivity from the first pod to the second pod:

    # kubectl exec -it dep-iperf3-2pod-9c8d98db8-5m8xj -- ping -c 3 20.1.1.3

    PING 20.1.1.3 (20.1.1.3): 56 data bytes

    64 bytes from 20.1.1.3: icmp_seq=0 ttl=64 time=1.868 ms

    64 bytes from 20.1.1.3: icmp_seq=1 ttl=64 time=0.319 ms

    64 bytes from 20.1.1.3: icmp_seq=2 ttl=64 time=0.253 ms

    --- 20.1.1.3 ping statistics ---

    3 packets transmitted, 3 packets received, 0% packet loss

    round-trip min/avg/max/stddev = 0.253/0.813/1.868/0.746 ms

         Ping an external server (Google's public DNS server):

    # kubectl exec -it dep-iperf3-2pod-9c8d98db8-5m8xj -- ping -c 3 8.8.8.8

    PING 8.8.8.8 (8.8.8.8): 56 data bytes

    64 bytes from 8.8.8.8: icmp_seq=0 ttl=112 time=63.560 ms

    64 bytes from 8.8.8.8: icmp_seq=1 ttl=112 time=62.350 ms

    64 bytes from 8.8.8.8: icmp_seq=2 ttl=112 time=62.310 ms

    --- 8.8.8.8 ping statistics ---

    3 packets transmitted, 3 packets received, 0% packet loss

    round-trip min/avg/max/stddev = 62.310/62.740/63.560/0.580 ms

     

    Benchmarks

     

    To compare performance, we ran TCP throughput benchmarks with iperf3 between the two physical worker nodes and between the two containers deployed on those nodes (example commands below).
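
    One way to reproduce the container-to-container measurement with the pods deployed above (a sketch; the pod names and the IP 20.1.1.2 are the ones shown in the kubectl get pod output):

    Start an iperf3 server in the first pod:

    # kubectl exec -it dep-iperf3-2pod-9c8d98db8-5m8xj -- iperf3 -s

    From a second terminal, run the client in the other pod against the first pod's IP:

    # kubectl exec -it dep-iperf3-2pod-9c8d98db8-kjhbf -- iperf3 -c 20.1.1.2

    The host-to-host test is run the same way, with iperf3 -s and iperf3 -c <node IP> executed directly on the two worker nodes.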

    Please see below benchmark results between two worker nodes and two containers.

     

    Throughput test mode    Single stream throughput, Gbit/s
    Host                    19.0
    Container               11.0

     

     

     

     

     

    See below the configuration file used for the container deployment, 2pod-iperf3.yaml:

     

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: dep-iperf3-2pod
      labels:
        app: iperf3
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: iperf3
      template:
        metadata:
          labels:
            app: iperf3
        spec:
          containers:
          - image: networkstatic/iperf3
            name: app-iperf3
            command:
              - sh
              - -c
              - |
                sleep 1000000