In this document we demonstrate a multi-rack deployment of a Kubernetes cluster over a Mellanox end-to-end Ethernet network. We use bare-metal servers running Ubuntu 16.04, ConnectX-5 family NICs, Spectrum family switches and LinkX product family cables. NEO™ – Mellanox Network Orchestration and Management Software – provisions and operates the Leaf-Spine Ethernet fabric, including switches, NICs and cables.
- Solution Design
- Hardware Configuration
- Network Configuration
- Server Installation
- K8s Cluster Deployment Guide
- Mellanox Scale-Out Open Ethernet Products
- Mellanox Onyx™ Advanced Ethernet Operating System
- Mellanox OpenFabrics Enterprise Distribution for Linux (MLNX_OFED)
What is Kubernetes?
Kubernetes is an open source system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications.
Mellanox components overview and benefits
Mellanox Spectrum switch family provides the most efficient network solutions for the ever-increasing performance demands of Data Center applications. Mellanox ConnectX network adapter family delivers industry-leading connectivity for performance-driven server and storage applications. These ConnectX adapter cards enable high bandwidth, coupled with ultra-low latency for diverse applications and systems, resulting in faster access and real-time responses. Mellanox NEO™ is a powerful platform for managing computing networks. It enables data center operators to efficiently provision, monitor and operate the modern data center fabric. Mellanox NEO-Host is a powerful solution for orchestration and management of host networking. NEO-Host is integrated with the Mellanox NEO™ and can be deployed on Linux hosts managed by NEO.
- Mellanox Scale-Out SN2000 Ethernet Switch Series
- ConnectX®-5 EN Adapters Supporting 100Gb/s Ethernet
- LinkX® Ethernet Cables and Transceivers
- Mellanox NEO™
Our reference deployment will be based on common Spine/Leaf Data Center network architectures for the Kubernetes cluster. We propose one of today's most common implementations, which is based on layer-3 routing protocols, such as OSPF. Technological platforms such as overlay networks (VXLAN, GENEVE) or RDMA over Converged Ethernet version 2 (RoCEv2) add further capabilities to layer-3 networks.
Bill of Materials
We propose the following two network solutions for the Mellanox Ethernet switch models below, depending on scale and blocking ratio considerations:
| |Solution 1 - Small Scale|Solution 2 - Large Scale|
|Spine switch|SN2100 Open Ethernet Switch|SN2700 Open Ethernet Switch|
|Leaf switch|SN2010 Ethernet Switch|SN2410 Ethernet Switch|
|Nodes per Rack|up to 18|up to 48|
|ConnectX-5 Dual SFP28 Port|1 per host|1 per host|
|SFP28 25GbE Passive Copper Cable|1 per host|1 per host|
|QSFP28 100GbE Passive Copper Cable|depends on solution and blocking ratio|depends on solution and blocking ratio|
Physical diagrams for both solutions
Please see below the physical diagram for each solution. In both solutions, the Spine switches are placed only in the first rack. We use NEO management software to provision, configure and monitor our network fabric, as well as NEO-Host to configure and monitor the host network.
Solution 1 - Small Scale
Leaf-Spine topology: SN2100 as Spine and SN2010 as Leaf switch.
This solution allows scaling of up to 108 nodes: 6 racks, 18 servers per rack with a 9:8 blocking ratio.
A single 25GbE port connects each server to its Leaf switch, using an SFP28 25GbE Passive Copper Cable.
Each Leaf switch connects to the Spine layer with 4 x 100GbE ports (2 x 100GbE ports per Spine switch), using QSFP28 100GbE Passive Copper Cables.
A dedicated management port on each Mellanox switch is connected to the Switch Management Network.
Solution 2 - Large Scale
Leaf-Spine topology: SN2700 as Spine and SN2410 as Leaf switch.
This solution allows scaling of up to 672 nodes: 14 racks, 48 servers per rack with a 3:1 blocking ratio.
A single 25GbE port connects each server to its Leaf switch, using an SFP28 25GbE Passive Copper Cable.
Each Leaf switch connects to the Spine layer with 4 x 100GbE ports (2 x 100GbE ports per Spine switch), using QSFP28 100GbE Passive Copper Cables.
A dedicated management port on each Mellanox switch is connected to the Switch Management Network.
Mellanox NEO must have access to the switch and host management networks in order to provision, operate and orchestrate the end-to-end Ethernet fabric. NEO uses the main 25GbE host network interface to communicate with the NEO-Host agent (see the physical network diagrams below).
In this document we do not cover connectivity to the corporate network.
We strongly recommend using out-of-band management for Mellanox switches. Use a dedicated management port on each switch.
NEO Virtual Appliance
NEO software is available for download as a CentOS/Red Hat installation package as well as a Virtual Appliance for various virtualization platforms. The NEO Virtual Appliance is available in file formats compatible with leading virtualization platforms, including VMware ESXi, Microsoft Hyper-V, Nutanix AHV, Red Hat Virtualization, IBM PowerKVM, and more.
NEO Logical Schema
Please see below the logical connectivity schematic between all Mellanox software and hardware components. MOFED and NEO-Host are optional Mellanox software components for host installation.
Downloading Mellanox NEO
Mellanox NEO is available for download from the Mellanox NEO™ product page.
After you fill out a short form, download instructions will be sent to you by email.
Installing Virtual Appliance
Please read the Mellanox NEO Quick Start Guide for detailed installation instructions. This Quick Start Guide provides step-by-step instructions for the Mellanox NEO™ software installation and Virtual Appliance deployment.
Once the NEO VM is installed, you can connect to the appliance console and use the following default credentials to log into your VM:
- Username: root
- Password: 123456
Once logged in, you will see the following appliance information screen:
The MAC address assigned to the VM must have a DHCP record in order to receive an IP address.
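For example, assuming an ISC DHCP server, a static reservation for the appliance could look like the following sketch; the MAC address, IP address and host name below are placeholders:
host neo-appliance {
  hardware ethernet 00:50:56:01:02:03;
  fixed-address 192.168.1.50;
  option host-name "neo";
}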
Switch OS installation / configuration
Please start from the How To Get Started with Mellanox switches guide if you are unfamiliar with Mellanox switch software. For more information please refer to the Mellanox Onyx User Manual located at support.mellanox.com or www.mellanox.com -> Switches. Before beginning to use the Mellanox switches, we recommend that you upgrade the switches to the latest Mellanox Onyx™ version. You can download this version from myMellanox - the Mellanox Support site (requires active support subscription).
In this guide the Ethernet switch fabric is configured as a Layer-3 Ethernet network. There are two ways to configure switches:
- CLI-based configuration performed manually on each switch.
- Wizard based configuration using Mellanox NEO.
If you are not familiar with Mellanox NEO, please refer to Mellanox NEO Solutions.
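For the CLI-based option, a minimal per-switch sketch of a routed uplink participating in OSPF might look as follows in Mellanox Onyx; the interface number, IP addressing and area are placeholders, and the exact syntax may vary between Onyx releases (see the Mellanox Onyx User Manual):
switch (config) # protocol ospf
switch (config) # router ospf
switch (config) # interface ethernet 1/49 no switchport force
switch (config) # interface ethernet 1/49 ip address 10.10.10.1 255.255.255.252
switch (config) # interface ethernet 1/49 ip ospf area 0.0.0.0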
Example configuration of Mellanox NEO and Switch UI for Solution 2 - Large Scale
This example shows the multi-rack connectivity configuration of two Leaf switches to both Spine switches. Each Leaf switch is configured with a single VLAN. Each Leaf switch is connected by 4 cables to both Spine switches - 2 cables per Spine switch. Please see the cross-switch port connectivity table below:
|OSPF uplink|Spine switch ports|Leaf switch ports|
|Leaf to Spine #1|Ports 3-4|Ports 49-50|
|Leaf to Spine #2|Ports 3-4|Ports 51-52|
You can easily extend this topology by adding an additional Leaf switch to the OSPF area and connecting the corresponding Spine and Leaf interfaces. Below is a description of how to configure the Ethernet switch fabric using Mellanox NEO and the switch UI.
1. Log into the Mellanox NEO Web UI using the following credentials (default):
- Username: admin
- Password: 123456
The Mellanox NEO URL can be found on the appliance console information screen.
2. Register devices.
Register all switches via the "Add Devices" wizard in Managed Elements.
3. Configure Mellanox Onyx switches for LLDP discovery.
Run the "Enable Link Layer Discovery..." provisioning task from the Task tab on all switches.
4. Create an OSPF area with the "L3 Network Provisioning" setup:
Add the "L3 Network Provisioning" service in the Virtual Modular Switch service section under the Service tab. Fill out the required fields in order to complete the wizard.
5. Once configured, review the port status on each switch. All configured and connected ports should show a green (up) status.
Ubuntu Server 16.04 is the chosen OS. Each server in this deployment gets its network settings from the DHCP server, which distributes network configuration parameters such as IP addresses, DNS server addresses and server names. In this deployment we use three physical servers: one master and two workers.
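For reference, a minimal DHCP configuration of the 25GbE data interface in /etc/network/interfaces on Ubuntu 16.04 could look like this; ens2f0 is the interface name that appears later in this guide, so adjust it to your system:
auto ens2f0
iface ens2f0 inet dhcp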
This document does not cover the server storage aspect. You should configure the server storage components in accordance with your intended use.
Installing MOFED for Ubuntu (optional)
This chapter describes the installation process of the MOFED Linux package on a single host machine.
MOFED is an additional software component by Mellanox which provides the latest drivers and firmware versions.
For more information click on Mellanox OFED for Linux User Manual.
Downloading Mellanox OFED
1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed.
# lspci -v | grep Mellanox
The following example shows a system with an installed Mellanox HCA:
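Representative output for a dual-port ConnectX-5 adapter (the PCI addresses and exact device names will vary):
08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]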
2. Download the ISO image (according to your OS) to your server's share folder.
The image name comes in the following format:
MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso. You can download it from:
http://www.mellanox.com > Products > Software > InfiniBand/VPI Drivers > Mellanox OFED Linux (MLNX_OFED) > Download.
3. Use the MD5SUM utility to confirm the integrity of the downloaded file. Run the following command and compare the result to the value provided on the download page.
# md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso
Installing Mellanox OFED
MLNX_OFED is installed by running the mlnxofedinstall script. This installation script performs the following:
- Discovers the currently installed kernel
- Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
- Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
- Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware
The installation script removes all previously installed Mellanox OFED packages and re-installs their new versions. You will be prompted to acknowledge the deletion of the old packages.
- Log into the installation machine as root.
- Copy the downloaded ISO to /root
- Mount the ISO image on your machine.
# mkdir /mnt/iso
# mount -o loop /share/MLNX_OFED_LINUX-4.2-<build>-ubuntu16.04-x86_64.iso /mnt/iso
# cd /mnt/iso
- Run the installation script.
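# ./mlnxofedinstall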
- Reboot after the installation completes successfully.
Installing Mellanox NEO-Host (optional)
Downloading Mellanox NEO-Host
NEO-Host is available for download on MyMellanox.
1. Log into MyMellanox.
2. Go to Software -> Management Software -> Mellanox NEO-Host.
3. Click “Downloads”.
4. Download the software image.
Unpacking and installing Mellanox NEO-Host
To install NEO-Host:
1. Copy the downloaded file to the /tmp directory:
cp neohost-backend-<version>.tgz /tmp
2. Untar the downloaded file:
tar xvzf neohost-backend-<version>.tgz
3. Run the installation script.
K8s Cluster Deployment Guide
1. On each server install Docker:
# apt-get update && apt-get -y install linux-generic-hwe-16.04
# apt-get -y upgrade && apt-get -y install apt-transport-https
# apt-get -y install docker.io=17.03.2-0ubuntu2~16.04.1
# systemctl start docker
# systemctl enable docker
2. On each server, add the Kubernetes apt repository and install the components:
# curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
# echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
# apt-get update && apt-get install -y kubelet kubeadm kubectl kubernetes-cni
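Optionally, hold these packages at their installed versions to avoid unintended upgrades (a common practice, not required by this guide):
# apt-mark hold kubelet kubeadm kubectl kubernetes-cni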
3. Disable swap:
# swapoff -a
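To keep swap disabled across reboots, you can also comment out the swap entry in /etc/fstab, for example:
# sed -i '/ swap / s/^/#/' /etc/fstab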
4. Initialize master node:
# kubeadm init
Once complete, you will be presented with the exact “kubeadm join …” command that you need to execute on each worker node in order to join it to the cluster.
Before joining nodes, configure your kubectl environment on the master node:
# mkdir -p $HOME/.kube
# sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
# sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run this command:
# export KUBECONFIG=/etc/kubernetes/admin.conf
5. Pod network add-on selection:
At the end of the kubeadm initialization, you must choose a Pod network add-on in order to establish communication between deployed pods.
In this deployment we choose Contiv as the Pod network add-on.
Contiv supports CNI based Kubernetes networking architecture. It consists of two major components:
- Netmaster (Contiv master)
- Netplugin (Contiv Host Agent)
The following Contiv architectural diagram shows how Netmaster and Netplugin provide the Contiv solution (Diagram credit: Contiv):
6. Deploying a Contiv pod network add-on with VXLAN overlay:
In our deployment we use the same network interface for the control and data planes. In this example we use the Contiv-1.1.7 installer version.
Please see the Contiv installation steps on the Master node below:
- Download installer bundle
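The bundle can be fetched from the Contiv GitHub releases page; the exact URL below follows the release naming and is an assumption:
curl -L -O https://github.com/contiv/install/releases/download/1.1.7/contiv-1.1.7.tgz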
- Extract installer bundle
tar oxf contiv-1.1.7.tgz
- Change directories to the extracted folder
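cd contiv-1.1.7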
- Install Contiv with the VXLAN overlay
# ./install/k8s/install.sh -n 10.215.15.1
10.215.15.1 is the IP address of the Master node.
- After the installation is complete, review the final output of the process:
Installation is complete
Contiv UI is available at https://10.215.15.1:10000
Please use the first run wizard or configure the setup as follows:
Configure forwarding mode (optional, default is routing).
netctl global set --fwd-mode routing
Configure ACI mode (optional)
netctl global set --fabric-mode aci --vlan-range <start>-<end>
Create a default network
netctl net create -t default --subnet=<CIDR> default-net
For example, netctl net create -t default --subnet=22.214.171.124/24 -g 126.96.36.199 default-net
- Create the default network with netctl, a command-line client for the Contiv netplugin:
netctl net create -t default --subnet=188.8.131.52/24 -g 184.108.40.206 default-net
netctl is a utility used for creating, reading and modifying Contiv objects.
More information about integrating Contiv into Kubernetes can be found at Tutorials - Contiv.
7. Joining a worker node:
Execute the “kubeadm join …” command specified above on each worker node.
8. Check your deployment:
Please execute the following command on the master node:
# kubectl get nodes -o wide
The command output should look like the following:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
clx-host-020 Ready master 7d v1.11.2 10.215.15.1 <none> Ubuntu 16.04.5 LTS 4.15.0-32-generic docker://17.3.2
clx-host-021 Ready <none> 7d v1.11.2 10.215.15.2 <none> Ubuntu 16.04.5 LTS 4.15.0-32-generic docker://17.3.2
clx-host-022 Ready <none> 7d v1.11.2 10.215.16.1 <none> Ubuntu 16.04.5 LTS 4.15.0-32-generic docker://17.3.2
As specified above, each Worker node is connected to a different Leaf switch.
Checking Contiv and related services:
# kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
contiv-etcd-nvcww 1/1 Running 0 7d 10.215.15.1 clx-host-020 <none>
contiv-netmaster-frkjs 3/3 Running 0 7d 10.215.15.1 clx-host-020 <none>
contiv-netplugin-pq2dq 2/2 Running 0 7d 10.215.15.2 clx-host-021 <none>
contiv-netplugin-rtch5 2/2 Running 0 7d 10.215.15.1 clx-host-020 <none>
contiv-netplugin-zb8hx 2/2 Running 0 7d 10.215.16.1 clx-host-022 <none>
etcd-clx-host-020 1/1 Running 0 7d 10.215.15.1 clx-host-020 <none>
kube-apiserver-clx-host-020 1/1 Running 0 7d 10.215.15.1 clx-host-020 <none>
kube-controller-manager-clx-host-020 1/1 Running 0 7d 10.215.15.1 clx-host-020 <none>
kube-proxy-82v8p 1/1 Running 0 7d 10.215.15.1 clx-host-020 <none>
kube-proxy-cs929 1/1 Running 0 7d 10.215.16.1 clx-host-022 <none>
kube-proxy-j478c 1/1 Running 0 7d 10.215.15.2 clx-host-021 <none>
kube-scheduler-clx-host-020 1/1 Running 0 7d 10.215.15.1 clx-host-020 <none>
The contiv-etcd, contiv-netmaster and contiv-netplugin pods should appear in Running status.
Checking created tenants and available networks:
# netctl net ls -a
Tenant Network Nw Type Encap type Packet tag Subnet Gateway IPv6Subnet IPv6Gateway Cfgd Tag
------ ------- ------- ---------- ---------- ------- ------ ---------- ----------- ---------
default contivh1 infra vxlan 0 220.127.116.11/24 18.104.22.168
default default-net data vxlan 0 22.214.171.124/24 126.96.36.199
# netctl net inspect contivh1
# ip route
default via 10.215.15.254 dev ens2f0 onlink
10.96.0.0/12 via 188.8.131.52 dev contivh1
10.215.15.0/24 dev ens2f0 proto kernel scope link src 10.215.15.1
184.108.40.206/24 via 220.127.116.11 dev contivh1
18.104.22.168/24 dev contivh1 proto kernel scope link src 22.214.171.124
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.19.0.0/16 dev contivh0 proto kernel scope link src 172.19.255.254
Contiv uses the contivh0 interface as the host port for routing external traffic. It adds a post routing rule to iptables on the host in order to masquerade traffic coming through contivh0.
The contivh1 interface allows the host to access the container/pod networks in routing mode.
Using multiple tenants:
In order to run pods belonging to a specific tenant, network, and endpoint group, use the io.contiv.tenant, io.contiv.network and io.contiv.net-group labels, respectively, in the pod's YAML configuration.
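For example, a minimal pod specification that pins a pod to a specific Contiv tenant and network could look like the following sketch; the tenant and network names are placeholders, and the io.contiv.net-group label can be added in the same way:
apiVersion: v1
kind: Pod
metadata:
  name: iperf3-tenant-demo
  labels:
    io.contiv.tenant: default          # Contiv tenant name (placeholder)
    io.contiv.network: default-net     # Contiv network created with netctl
spec:
  containers:
  - name: iperf3
    image: networkstatic/iperf3
    args: ["-s"]                       # run iperf3 in server mode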
Checking connectivity between pods and from pod to external network:
We deployed an iperf3 pod with two replicas (see the YAML example below):
# kubectl apply -f 2pod-iperf3.yaml
Show deployed pods:
# kubectl get pod -l app=iperf3 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
dep-iperf3-2pod-9c8d98db8-5m8xj 1/1 Running 0 7d 126.96.36.199 clx-host-022 <none>
dep-iperf3-2pod-9c8d98db8-kjhbf 1/1 Running 0 7d 188.8.131.52 clx-host-021 <none>
Checking connectivity between the two pods:
# kubectl exec -it dep-iperf3-2pod-9c8d98db8-5m8xj -- ping -c 3 184.108.40.206
PING 220.127.116.11 (18.104.22.168): 56 data bytes
64 bytes from 22.214.171.124: icmp_seq=0 ttl=64 time=1.868 ms
64 bytes from 126.96.36.199: icmp_seq=1 ttl=64 time=0.319 ms
64 bytes from 188.8.131.52: icmp_seq=2 ttl=64 time=0.253 ms
--- 184.108.40.206 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.253/0.813/1.868/0.746 ms
Ping external server (Google's DNS server):
# kubectl exec -it dep-iperf3-2pod-9c8d98db8-5m8xj -- ping -c 3 220.127.116.11
PING 18.104.22.168 (22.214.171.124): 56 data bytes
64 bytes from 126.96.36.199: icmp_seq=0 ttl=112 time=63.560 ms
64 bytes from 188.8.131.52: icmp_seq=1 ttl=112 time=62.350 ms
64 bytes from 184.108.40.206: icmp_seq=2 ttl=112 time=62.310 ms
--- 220.127.116.11 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 62.310/62.740/63.560/0.580 ms
To compare performance, we ran TCP throughput benchmarks with iperf3 between two physical nodes and between two containers deployed on those nodes. Please see below the benchmark results for the two worker nodes and the two containers.
|Throughput test mode||Single stream throughput, Gbits/s|
See below the configuration file for the container deployment - 2pod-iperf3.yaml:
- image: networkstatic/iperf3
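Based on this image and on the deployment name, label and replica count shown earlier, a minimal sketch of the full 2pod-iperf3.yaml could look as follows; the iperf3 server argument and container port are assumptions:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dep-iperf3-2pod
spec:
  replicas: 2
  selector:
    matchLabels:
      app: iperf3
  template:
    metadata:
      labels:
        app: iperf3
    spec:
      containers:
      - name: iperf3
        image: networkstatic/iperf3
        args: ["-s"]              # run iperf3 in server mode
        ports:
        - containerPort: 5201     # default iperf3 port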