VXLAN Considerations for ConnectX-3 Pro

Version 19

    This document discusses VXLAN considerations in general, and specifically when using Mellanox ConnectX-3 Pro adapter cards.

     



    What is VXLAN?

    Virtual Extensible LAN (VXLAN) is a network virtualization technology that addresses the scalability problems associated with large cloud computing deployments. It tunnels Ethernet frames within Ethernet + IP + UDP frames.

     

    It enables creating millions of virtual L2 networks over traditional IP networks

    • Can serve tenants in a cloud provider infrastructure

    It can span local or wide area networks

    • Can migrate the entire network between cloud providers/sites, for example in case of a disaster
    • Can create logical L2 networks which span multiple locations (similar to VPNs)

    It can cross routers

    • Leverages L3 network scalability and protocols (OSPF, BGP, ECMP)


    VXLAN packet encapsulation structure:

    28.png
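
    As a minimal illustration of this layout, the Python sketch below builds the 8-byte VXLAN header (field sizes per RFC 7348) and prepends it, together with pre-built outer Ethernet/IP/UDP headers, to an inner L2 frame. The encapsulate() helper and its parameters are illustrative only, not taken from any driver or OVS code.

        import struct

        VXLAN_UDP_PORT = 4789    # IANA-assigned VXLAN destination UDP port
        VXLAN_FLAG_VNI = 0x08    # "I" flag: a valid VNI is present

        def vxlan_header(vni):
            # 8 bytes: flags (1) + reserved (3) + VNI (3) + reserved (1)
            return struct.pack("!B3s3sB", VXLAN_FLAG_VNI, b"\x00" * 3,
                               vni.to_bytes(3, "big"), 0)

        def encapsulate(inner_frame, vni, outer_eth, outer_ip, outer_udp):
            # Nesting order on the wire: outer Ethernet + IP + UDP + VXLAN + inner frame.
            # outer_eth/outer_ip/outer_udp are assumed to be pre-built byte strings;
            # a real VTEP derives the outer UDP source port from a hash of the inner
            # headers so that ECMP in the underlay can spread flows.
            return outer_eth + outer_ip + outer_udp + vxlan_header(vni) + inner_frame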

     

     

     

    VXLAN overlay performance challenge:

    The hypervisor IP stack and standard NICs are not aware of the guest (inner) TCP/IP traffic.

    Common hardware offload techniques such as:

    • TCP segmentation/re-assembly
    • RX/TX checksum offload
    • CPU core scaling (RSS/TSS)

    do not operate on the VM TCP/IP packets (the inner payload), which leads to significant CPU overhead and much lower performance.

    The Mellanox ConnectX-3 Pro adapter card offloads these tasks to hardware, significantly reducing CPU overhead and improving performance.
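
    One quick way to see whether a NIC exposes encapsulation-aware offloads is to inspect its feature flags with ethtool. The sketch below (Python calling ethtool, with a hypothetical interface name ens2) filters the tunnel-related features reported by ethtool -k:

        import subprocess

        IFACE = "ens2"  # hypothetical interface name; use your ConnectX-3 Pro netdev

        # Feature names as printed by `ethtool -k`; tx-udp_tnl-segmentation indicates
        # hardware Large-Send support for UDP-tunneled (e.g. VXLAN) traffic.
        FEATURES = ("rx-checksumming", "tx-checksumming", "tx-udp_tnl-segmentation")

        output = subprocess.run(["ethtool", "-k", IFACE],
                                capture_output=True, text=True, check=True).stdout
        for line in output.splitlines():
            if line.split(":")[0].strip() in FEATURES:
                print(line.strip())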


    Typical Architecture and Deployment:

    The following figure shows the layout of three tenants, colored red, blue, and green.

    • There are three nodes: the two on the right act as hypervisors, while the one on the left acts as the gateway (network) node.
    • Each VM running on a hypervisor can be connected to one or more networks.
    • Some of the networks may have an external interface via a router or the network node.
    • The network management can be done via OpenStack Neutron, OpenDaylight (ODL), VMware NSX, or similar tools.

    27.png


    Para-Virtualized (PV) connectivity scheme between the VM and the host network stack on the hypervisor node

    • The VM runs the front-end PV Ethernet driver (e.g., KVM's virtio-net), while the back-end (e.g., KVM's vhost) runs in the hypervisor kernel.
    • The back-end driver communicates with the tap device instance that serves this VM.
    • The tap instance is plugged into an OVS kernel datapath (DP) and acts as a virtual port.
    • The OVS port that performs the VXLAN encapsulation/decapsulation is added to the datapath.
    • The user-space OVS daemon programs DP flow rules that cause the VM traffic to be forwarded to/from the VXLAN port.
    • The OVS VXLAN port sends the encapsulated packet down to the UDP/IP stack, which in turn sends it to the NIC driver (and the other way around).
    • The NIC applies TCP stateless offloads (RX/TX checksum, Large-Send, RSS, etc.) on the VXLAN traffic.
    • Refer to HowTo Configure VXLAN for ConnectX-3 Pro (Linux OVS) for additional examples and configuration options; a minimal setup sketch also follows this list.
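
    A minimal configuration sketch of this scheme, assuming hypothetical names (OVS bridge br-vxlan, tap device tap0 already created for the VM, remote VTEP 192.168.1.2, VNI 100); the full procedure is described in the HowTo referenced above:

        import subprocess

        def sh(cmd):
            """Run a configuration command and fail loudly if it errors."""
            subprocess.run(cmd, shell=True, check=True)

        sh("ovs-vsctl add-br br-vxlan")            # OVS bridge backed by the kernel datapath
        sh("ovs-vsctl add-port br-vxlan tap0")     # plug the VM's tap device into the DP
        # VXLAN port that performs the encapsulation/decapsulation
        sh("ovs-vsctl add-port br-vxlan vxlan0 -- "
           "set interface vxlan0 type=vxlan "
           "options:remote_ip=192.168.1.2 options:key=100")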

     

    Legacy connectivity scheme on gateway (network) node

    • An OVS datapath instance is created and a VXLAN port is added, as in the hypervisor case.
    • A veth (virtual Ethernet) NIC pair is created; this pair serves as a channel within the network node to move packets between the router module and the OVS DP instance (see the sketch after this list).
    • One of the veth devices is added as a virtual port to the OVS datapath.
    • The other veth device is used by the router application.
    • The NIC applies TCP stateless offloads (RX/TX checksum, Large-Send/Receive, RSS, etc.) on the VXLAN traffic.
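
    A minimal sketch of the veth wiring on the network node, again with hypothetical names (veth-ovs/veth-rt for the pair, br-vxlan for the OVS bridge created as above):

        import subprocess

        def sh(cmd):
            subprocess.run(cmd, shell=True, check=True)

        sh("ip link add veth-ovs type veth peer name veth-rt")  # create the veth pair
        sh("ip link set veth-ovs up")
        sh("ip link set veth-rt up")                            # veth-rt stays with the router
        sh("ovs-vsctl add-port br-vxlan veth-ovs")              # veth-ovs becomes an OVS virtual port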

     

    DPDK connectivity scheme on gateway (network) node

    • A user-space DPDK application receives all packets arriving on the NIC.
    • The NIC HW applies TCP stateless offloads (RX/TX checksum, Large-Send/Receive, RSS, etc.) on the VXLAN traffic.
    • The application performs decapsulation and sends the packet to the outer (external) network; see the sketch after this list.
    • In the other direction, the application performs encapsulation and sends the packet to the inner (overlay) network through DPDK.
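
    The decapsulation step itself is just header stripping. A real gateway does this in C on DPDK packet buffers; the Python sketch below only illustrates the byte layout, assuming an untagged outer Ethernet header and an IPv4 header without options:

        # Outer headers assumed here: Ethernet (14 B) + IPv4 (20 B) + UDP (8 B) + VXLAN (8 B).
        OUTER_LEN = 14 + 20 + 8 + 8

        def decapsulate(outer_frame):
            """Return (vni, inner_frame) for a received VXLAN packet."""
            vxlan = outer_frame[14 + 20 + 8 : OUTER_LEN]
            if not vxlan[0] & 0x08:                    # "I" flag must be set
                raise ValueError("not a VXLAN packet with a valid VNI")
            vni = int.from_bytes(vxlan[4:7], "big")    # 24-bit VXLAN Network Identifier
            return vni, outer_frame[OUTER_LEN:]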

     

    PV connectivity scheme between the VM and hypervisor for the case of user-space DPDK offloaded OVS datapath

    • The VM runs the front-end PV Ethernet driver (e.g., KVM's virtio-net).
    • The back-end PV driver (e.g., KVM's vhost) runs within a user-space process on the hypervisor.
    • The back-end driver communicates with the OVS user-space datapath (DP), which is implemented over DPDK.
    • The OVS port that performs the VXLAN encapsulation/decapsulation is added to the datapath.
    • The user-space OVS daemon programs DP flow rules that cause the VM traffic to be forwarded to/from the VXLAN port.
    • The OVS VXLAN port sends the encapsulated packet through the DPDK API and the Mellanox Poll-Mode Driver (PMD).
    • The NIC HW applies TCP stateless offloads (RX/TX checksum, Large-Send, RSS, etc.) on the VXLAN traffic.
    • This configuration includes elements (user-space vhost and OVS datapath) which are currently under development by the open-source community; a hypothetical configuration sketch follows this list.
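
    A hypothetical configuration sketch for this scheme (an OVS build with DPDK support, a user-space datapath via datapath_type=netdev, and a dpdkvhostuser port toward the VM); the exact port types and options depend on the OVS-DPDK version, which, as noted above, was still evolving at the time of writing:

        import subprocess

        def sh(cmd):
            subprocess.run(cmd, shell=True, check=True)

        # User-space (netdev) datapath instead of the kernel datapath
        sh("ovs-vsctl add-br br-dpdk -- set bridge br-dpdk datapath_type=netdev")
        # Back-end PV (vhost-user) port toward the VM
        sh("ovs-vsctl add-port br-dpdk vhost-user-1 -- "
           "set Interface vhost-user-1 type=dpdkvhostuser")
        # VXLAN port performing the encapsulation/decapsulation
        sh("ovs-vsctl add-port br-dpdk vxlan0 -- "
           "set interface vxlan0 type=vxlan "
           "options:remote_ip=192.168.1.2 options:key=100")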



     

    L3 destination learning challenge:

    Given a guest Ethernet frame, how does the hypervisor find the destination attributes (the destination IP and MAC of the remote endpoint)? This problem is commonly called an "L3 miss".

    There are two ways to solve this problem:

    1. Use IP multicast to locate the remote hypervisor.

    2. Use static forwarding tables (proactively filled OVS tables that map the guest destination MAC to the remote destination IP); see the sketch below.
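
    A minimal sketch of option 2, assuming a controller (e.g., Neutron L2 population) proactively pushes the entries; the table maps (VNI, guest destination MAC) to the remote hypervisor's VTEP IP, and a missing entry is exactly the "L3 miss" case. All names and values are illustrative.

        # (VNI, inner destination MAC) -> remote VTEP (hypervisor) IP; example values only.
        FORWARDING_TABLE = {
            (100, "52:54:00:aa:bb:01"): "192.168.1.2",
            (100, "52:54:00:aa:bb:02"): "192.168.1.3",
        }

        def lookup_vtep(vni, inner_dmac):
            """Resolve the outer destination IP for a guest frame, or report an L3 miss."""
            try:
                return FORWARDING_TABLE[(vni, inner_dmac.lower())]
            except KeyError:
                # L3 miss: either flood via IP multicast (option 1) or have the
                # controller populate the entry (option 2).
                raise LookupError(f"L3 miss for VNI {vni}, MAC {inner_dmac}")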

     

    IP multicast is not popular in most networks (it requires PIM to run on the routers), and most virtualization systems have a solution that does not involve multicast, for example:

    • OpenStack controller with the Neutron L2 population technique
    • OpenStack controller talking to an ODL (OpenDaylight) controller using the OpenFlow / OVSDB protocols
    • IBM DOVE (for ESX and Linux)
    • Nicira NSX (for ESX)

     

    Linux Implementation details:

    • Supported under Linux OVS (Open vSwitch) and Linux Bridge.
    • Can serve VMs (via tap/para-virtualization) or gateways (via the hypervisor and a veth interface).
    • The Mellanox ConnectX-3 Pro upstream drivers support the native Linux hardware-offloaded tunneling APIs.
    • Mellanox and its partners fixed/enhanced several kernel modules (UDP offload, OVS, GRO, etc.) to maximize the user benefit; those fixes are available in upstream kernel 3.14 and above.
    • The overall solution is provided by RHEL 7, Ubuntu 14.04 or SLES 12 (inbox kernel).

     

    Performance:

     

    Configuration:

    Host: Supermicro SuperServer SYS-6027R-TRF

    CPU: Intel(R) Xeon(R) CPU E5-2697 @ 2.7GHz

    Number of cores: 24

    PCIe: Gen3

    Width: x8

    OS Distribution: RHEL7.0 kernel 3.10.0-105.el7.x86_64

    VM (Guest OS): RHEL6.4

    VM MTU: 1450

    Mellanox 40GbE Adapter P/N: MCX354A-FCCT

    Driver version: MLNX_OFED_LINUX-2.2-0.0.6

    ConnectX Firmware: 2.31.2600

     

    30.png

     

    VXLAN Detailed Packet format:

    26.png