Reference Deployment Guide for Red Hat OpenStack Platform 13 Over Mellanox 100Gb Ethernet Solutions with ASAP2 OVS HW Acceleration for VXLAN Traffic

Version 38

     

    References

     

    Introduction

    Solution Overview

    The Red Hat OpenStack Platform (OSP) is a production-ready foundation that enables the creation, deployment, scale, and management of a secure and reliable public or private OpenStack-based cloud. As in many OpenStack cloud deployments today, Open vSwitch (OVS) is one of the most popular virtual switch platforms.

    OVS hardware offload enables acceleration of the data path (fast path) for high-throughput flows, while retaining the unmodified standard OVS control path for flexibility and programming of match-action rules.

     

    The OSP13 release introduces Open vSwitch (OVS) HW offload technologies as Technology Preview features.

    Mellanox Accelerated Switching and Packet Processing (ASAP2) is the Mellanox OVS HW offload implementation supported by RH-OSP13. ASAP2 completely and transparently offloads networking functions such as overlays, routing, security and load balancing to the adapter’s embedded switch (e-switch), achieving a significant performance boost in terms of higher packet throughput and lower latency, as well as improved cloud efficiency.

     

    In the guide below, we will introduce an RH-OSP13 solution utilizing OVS acceleration based on ASAP2 technology.

     

    Mellanox Components: Overview and Benefits

    • Mellanox Spectrum Switch family provides the most efficient network solutions for the ever-increasing performance demands of data center applications.
    • Mellanox ConnectX Network Adapter family delivers industry-leading connectivity for performance-driven server and storage applications. ConnectX adapter cards enable high bandwidth, coupled with ultra-low latency for diverse applications and systems, resulting in faster access and real-time responses.
    • Mellanox Accelerated Switching and Packet Processing (ASAP2) technology combines the performance and efficiency of server/storage networking hardware with the flexibility of virtual switching software. ASAP2 offers up to 10 times better performance than non-offloaded OVS solutions, delivering software-defined networks with the highest total infrastructure efficiency, deployment flexibility and operational simplicity. (Introduced starting in ConnectX-4 Lx NICs.)
    • Mellanox NEO™ is a powerful platform for managing computing networks. It enables data center operators to efficiently provision, monitor and operate the modern data center fabric.
    • Mellanox LinkX Cables and Transceivers family provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400Gb interconnect products for Cloud, Web 2.0, Enterprise, telco, and storage data center applications. They are often used to link top-of-rack switches downwards to servers, storage and appliances, and upwards in switch-to-switch applications.

     

    Solution Design

    • The RH-OSP13 cloud is deployed at large scale over multiple racks interconnected in a Spine/Leaf network architecture.
    • Each Compute/Controller node is equipped with a dual-port 100GbE NIC, of which one port is dedicated to tenant data traffic and the other to storage and control traffic.

    • Composable custom networks are used for network isolation between the racks. In our case, an L3 OSPF underlay is used to route between the networks; however, another fabric infrastructure could be used as desired.
    • ASAP2-enabled Compute nodes are located in different racks and maintain VXLAN tunnels as an overlay for tenant VM traffic.
    • OVS ASAP2 data plane acceleration is used by the Compute nodes to offload the CPU-intensive VXLAN traffic, avoiding the encapsulation/decapsulation performance penalty and achieving high throughput.

    • Switches are configured and managed by NEO.
    • OpenStack Neutron is used as an SDN controller.

    HW Configuration

     

    BOM

     

    Notes:

    • The BOM above refers to the maximal large-scale configuration with a blocking ratio of 3:1.
    • It is possible to change the blocking ratio in order to obtain a different capacity.
    • The SN2100 switch shares the same feature set as the SN2700 and can be used in this solution when lower capacity is required.
    • The 2-Rack BOM is used in the solution example described below.

    Large Scale Overview

    Maximal Scale Diagram


    Solution Example

    We have chosen the key features below as a baseline to demonstrate the accelerated RH-OSP solution.

    Solution Scale

    • 2 x racks with a custom network set per rack
    • 2 x SN2700 switches as Spine switches
    • 2 x SN2100 switches as Leaf switches, 1 per rack
    • 5 nodes in rack 1 (3 x Controller, 2 x Compute)
    • 2 nodes in rack 2 (2 x Compute)
    • All nodes are connected to the Leaf switches using 2 x 100GbE ports per node
    • Leaf switches are connected to each Spine switch using a single 100GbE port

    Physical Diagram

    Network Diagram

    Notes:

    • The Storage network is configured; however, no storage nodes are used.
    • Compute nodes reach the external network via the Undercloud node.

     

     

    The configuration steps below refer to a solution example based on 2 racks.

    Network Configuration Steps

    Physical Configuration

    • Connect the switches to the switch management network.
    • Interconnect the switches using 100Gb/s cables.

    • Connect the Controller/Compute servers to the relevant networks according to the following table:

    Role          Leaf Switch Location
    ------------  --------------------
    Controller 1  Rack 1
    Controller 2  Rack 1
    Controller 3  Rack 1
    Compute 1     Rack 1
    Compute 2     Rack 2
    Compute 3     Rack 1
    Compute 4     Rack 2

    • Connect the Undercloud Director server to the IPMI/PXE/External networks.

    OSPF Configuration

    Interface Configuration

    • Set VLANs and VLAN interfaces on the Leaf switches according to the following table:

     

    Network Name    Network Set  Leaf Switch Location  Network Details  Switch Interface IP  VLAN ID  Switchport Mode
    --------------  -----------  --------------------  ---------------  -------------------  -------  ---------------
    Storage         1            Rack 1                172.16.0.0/24    172.16.0.1           11       hybrid
    Storage_Mgmt    1            Rack 1                172.17.0.0/24    172.17.0.1           21       hybrid
    Internal API    1            Rack 1                172.18.0.0/24    172.18.0.1           31       hybrid
    Tenant          1            Rack 1                172.19.0.0/24    172.19.0.1           41       access
    Storage_2       2            Rack 2                172.16.2.0/24    172.16.2.1           12       hybrid
    Storage_Mgmt_2  2            Rack 2                172.17.2.0/24    172.17.2.1           22       hybrid
    Internal API_2  2            Rack 2                172.18.2.0/24    172.18.2.1           32       hybrid
    Tenant_2        2            Rack 2                172.19.2.0/24    172.19.2.1           42       access
     

    • Use Mellanox NEO to provision the VLANs and interfaces via the pre-defined Provisioning Tasks:
      • Add-VLAN to create the VLANs and set names.
      • Set-Access-VLAN-Port to set access VLAN on the tenant network ports.
      • Set-Hybrid-Vlan-Port to allow the required VLANs on the storage/storage_mgmt/internal API networks ports.
      • Add-VLAN-To-OSPF-Area for distribution of the networks over OSPF.
      • Add VLAN IP Address to set IP per VLAN (currently no pre-defined template).

     

     

     

    For example, in order to set the port hybrid mode and allowed VLAN via the pre-defined Provisioning Tasks:

    Note that since there is currently no pre-defined provisioning template for configuring the VLAN interface IP address, you can manually add the IP configuration into the “Add-VLAN-To-OSPF-Area” template and use it to define both IP addresses and OSPF distribution, for example:

     

    Solution Configuration and Deployment Steps

    Prerequisites

    Hardware specifications must be identical for servers with the same role (Compute/Controller, etc.).

     

    Server Preparation

    For all servers, make sure that in BIOS settings:

    • SR-IOV is enabled (an optional OS-level check is sketched below)
    • Network boot is set on the interface connected to the PXE network
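
    If you prefer to double-check these settings from a booted OS rather than from the BIOS menus, a minimal sketch follows, assuming an Intel host and the example PCI address 0000:07:00.0 used later in this guide (adjust both to your environment):

    # Confirm that the IOMMU (VT-d) is active - DMAR/IOMMU messages should appear
    [root@host ~]# dmesg | grep -i -e DMAR -e IOMMU

    # Confirm that the NIC exposes the SR-IOV capability
    [root@host ~]# lspci -s 07:00.0 -vvv | grep -i "SR-IOV"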

     

    NIC Preparations

    SRIOV configuration is disabled on ConnectX-5 NICs by default and must be enabled for every NIC used by a Compute node.

    In order to enable and configure it, insert the Compute NIC into a test server with an installed OS, and follow the steps below:

    • Verify that the firmware version is 16.21.2030 or newer:

    [root@host ~]# ethtool -i ens2f0

    driver: mlx5_core

    version: 5.0-0

    firmware-version: 16.22.1002 (MT_0000000009)

    expansion-rom-version:

    bus-info: 0000:07:00.0

    supports-statistics: yes

    supports-test: yes

    supports-eeprom-access: no

    supports-register-dump: no

    supports-priv-flags: yes

    If the firmware is older, download and burn new firmware as explained in How to Install Mellanox OFED on Linux (Rev 4.4-2.0.7.0).
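
    As a hedged alternative sketch (not the official update procedure referenced above), the mstflint package installed in the next step can also be used to query the running firmware and to burn a downloaded image; the image file name below is a placeholder:

    # Query the firmware currently running on the device
    [root@host ~]# mstflint -d 0000:07:00.0 query

    # Burn a firmware image downloaded for your exact card model (placeholder file name)
    [root@host ~]# mstflint -d 0000:07:00.0 -i fw-ConnectX5-example.bin burn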

     

    • Install the mstflint package:
    [root@host ~]# yum install mstflint

     

    • Identify the PCI ID of the first 100G port and enable SRIOV:

    [root@host ~]# lspci | grep -i mel

    07:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

    07:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

    [root@host ~]#

    [root@host ~]# mstconfig -d 0000:07:00.0 query | grep -i sriov

    SRIOV_EN False(0)

    SRIOV_IB_ROUTING_MODE_P1 GID(0)

    SRIOV_IB_ROUTING_MODE_P2 GID(0)

    [root@host ~]# mstconfig -d 0000:07:00.0 set SRIOV_EN=1

    Device #1:

    ----------

     

    Device type: ConnectX5

    PCI device: 0000:07:00.0

     

    Configurations: Next Boot New

    SRIOV_EN False(0) True(1)

     

    Apply new Configuration? ? (y/n) [n] : y

    Applying... Done!

    -I- Please reboot machine to load new configurations.

    • Set the number of VFs to a high value, such as 64, and reboot the server to apply the new configuration:

    [root@host ~]# mstconfig -d 0000:07:00.0 query | grep -i vfs

    NUM_OF_VFS 0

    [root@host ~]#

    [root@host ~]#

    [root@host ~]# mstconfig -d 0000:07:00.0 set NUM_OF_VFS=64

     

    Device #1:

    ----------

     

    Device type: ConnectX5

    PCI device: 0000:07:00.0

     

    Configurations: Next Boot New

    NUM_OF_VFS 0 64

     

    Apply new Configuration? ? (y/n) [n] : y

    Applying... Done!

    -I- Please reboot machine to load new configurations.

    [root@host ~]# reboot

    • Confirm the new settings were applied using the mstconfig query commands shown above (a compact one-liner is sketched after this list).
    • Insert the NIC back into the Compute node.
    • Repeat the procedure above for every Compute node NIC used in our setup.
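
    As a compact post-reboot check (a sketch assuming the same PCI address 0000:07:00.0 as in the examples above), both settings can be queried in a single command; expect SRIOV_EN True(1) and NUM_OF_VFS 64:

    [root@host ~]# mstconfig -d 0000:07:00.0 query | grep -Ei 'SRIOV_EN|NUM_OF_VFS'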

    Notes:

    • In our solution, the first port of the two 100G ports in every NIC is used for the ASAP2 accelerated data plane. This is the reason we enabled SRIOV only on the first Mellanox NIC PCI device (07:00.0 in the example above).
    • There are future plans to support an automated procedure to update and configure the NICs on the Compute nodes from the Undercloud.

     

    Accelerated RH-OSP Installation and Deployment Steps

    • Install the Red Hat Enterprise Linux 7.5 OS on the Undercloud server and set an IP address on its interface connected to the External network; make sure it has Internet connectivity.
    • Install the Undercloud and the director as instructed in section 4 of the Red Hat OSP DIRECTOR INSTALLATION AND USAGE guide: Director Installation and Usage - Red Hat Customer Portal
      • Our undercloud.conf file is attached as a reference.
    • Configure a container image source as instructed in section 5 of the guide.
      • Our solution is using undercloud as a local registry.
    • Register the nodes of the overcloud as instructed in section 6.1.
      • Our instackenv.json file is attached as a reference.
    • Inspect the hardware of the nodes as instructed in section 6.2.
      • Once introspection is completed, it is recommended to confirm for each node that the desired root disk was detected, since cloud deployment can fail later because of insufficient disk space. Use the following command to check the size of the disk selected as root:

    (undercloud) [stack@rhosp-director ~]$ openstack baremetal node show 92c4c1cb-ce7d-48d4-a2d9-75b2651db097 | grep properties

    | properties | {u'memory_mb': u'131072', u'cpu_arch': u'x86_64', u'local_gb': u'418', u'cpus': u'24', u'capabilities': u'boot_option:local'}

    • The “local_gb” value represents the detected root disk size. If the disk size is lower than expected, use the procedure described in section 6.6 to define the root disk for the node. Note that an additional introspection cycle is required for the node after the root disk is changed.
    • Verify that all nodes were registered properly and changed their state to “available” before proceeding to the next step:

    (undercloud) [stack@rhosp-director ~]$ openstack baremetal node list

    +--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+

    | UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |

    +--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+

    | d1fca940-e341-491b-8afd-0cf6d748aa29 | controller-1 | None | power off | available | False |

    | 6b24d02c-3fd2-4e55-a730-c45008f01723 | controller-2 | None | power off | available | False |

    | 098c3e2d-1c70-41d2-983b-6c266387de0b | controller-3 | None | power off | available | False |

    | 91492c2a-b26c-49ef-9d4e-e492a1578076 | compute-1 | None | power off | available | False |

    | cdf9e0ec-e3cb-4005-86f6-d40e684a9b19 | compute-2 | None | power off | available | False |

    | 92c4c1cb-ce7d-48d4-a2d9-75b2651db097 | compute-3 | None | power off | available | False |

    | bb5e829a-834b-4eb1-b733-0012ce9d5f00 | compute-4 | None | power off | available | False |

    +--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+

    • Tagging Nodes into Profiles
      • Tag the Controller nodes into the default “control” profile:

    (undercloud) [stack@rhosp-director ~]$ openstack baremetal node set --property capabilities='profile:control,boot_option:local' controller-1

    (undercloud) [stack@rhosp-director ~]$ openstack baremetal node set --property capabilities='profile:control,boot_option:local' controller-2

    (undercloud) [stack@rhosp-director ~]$ openstack baremetal node set --property capabilities='profile:control,boot_option:local' controller-3

    • Create two new compute flavors, one per rack (compute-r1, compute-r2), and attach the flavors to profiles with correlated names:

    (undercloud) [stack@rhosp-director ~]$ openstack flavor create --id auto --ram 4096 --disk 40 --vcpus 1 compute-r1

    (undercloud) [stack@rhosp-director ~]$ openstack flavor set --property "capabilities:boot_option"="local" --property "capabilities:profile"="compute-r1" --property "resources:CUSTOM_BAREMETAL"="1" --property "resources:DISK_GB"="0" --property "resources:MEMORY_MB"="0" --property "resources:VCPU"="0" compute-r1

     

    (undercloud) [stack@rhosp-director ~]$ openstack flavor create --id auto --ram 4096 --disk 40 --vcpus 1 compute-r2

    (undercloud) [stack@rhosp-director ~]$ openstack flavor set --property "capabilities:boot_option"="local" --property "capabilities:profile"="compute-r2" --property "resources:CUSTOM_BAREMETAL"="1" --property "resources:DISK_GB"="0" --property "resources:MEMORY_MB"="0" --property "resources:VCPU"="0" compute-r2

    • Tag Compute nodes 1 and 3 into the “compute-r1” profile to associate them with Rack 1, and Compute nodes 2 and 4 into the “compute-r2” profile to associate them with Rack 2:

    (undercloud) [stack@rhosp-director ~]$ openstack baremetal node set --property capabilities='profile:compute-r1,boot_option:local' compute-1

    (undercloud) [stack@rhosp-director ~]$ openstack baremetal node set --property capabilities='profile:compute-r1,boot_option:local' compute-3

    (undercloud) [stack@rhosp-director ~]$ openstack baremetal node set --property capabilities='profile:compute-r2,boot_option:local' compute-2

    (undercloud) [stack@rhosp-director ~]$ openstack baremetal node set --property capabilities='profile:compute-r2,boot_option:local' compute-4

    • Verify profile tagging per node using the command below:

    (undercloud) [stack@rhosp-director ~]$ openstack overcloud profiles list

    +--------------------------------------+--------------+-----------------+-----------------+-------------------+

    | Node UUID | Node Name | Provision State | Current Profile | Possible Profiles |

    +--------------------------------------+--------------+-----------------+-----------------+-------------------+

    | d1fca940-e341-491b-8afd-0cf6d748aa29 | controller-1 | available | control | |

    | 6b24d02c-3fd2-4e55-a730-c45008f01723 | controller-2 | available | control | |

    | 098c3e2d-1c70-41d2-983b-6c266387de0b | controller-3 | available | control | |

    | 91492c2a-b26c-49ef-9d4e-e492a1578076 | compute-1 | available | compute-r1 | |

    | cdf9e0ec-e3cb-4005-86f6-d40e684a9b19 | compute-2 | available | compute-r2 | |

    | 92c4c1cb-ce7d-48d4-a2d9-75b2651db097 | compute-3 | available | compute-r1 | |

    | bb5e829a-834b-4eb1-b733-0012ce9d5f00 | compute-4 | available | compute-r2 | |

    +--------------------------------------+--------------+-----------------+-----------------+-------------------+

    Note: It is possible to tag the nodes into profiles in the instackenv.json file during node registration (section 6.1) instead of running the tag command per node; however, flavors and profiles must be created in any case.

    Note: The configuration file examples in the following sections are partial and were employed to highlight specific sections. The full configuration files are attached to this document.

    • Role definitions:
      • Create the /home/stack/templates/ directory and generate inside it a new roles file (named roles_data.yaml) with two types of roles using the following command:

    (undercloud) [stack@rhosp-director ~]$ mkdir /home/stack/templates

    (undercloud) [stack@rhosp-director ~]$ cd /home/stack/templates/

    (undercloud) [stack@rhosp-director templates]$ openstack overcloud roles generate -o roles_data.yaml Controller ComputeSriov

    • Edit the file by changing ComputeSriov to ComputeSriov1:

    .
    .
    ###############################################################################
    # Role: ComputeSriov1                                                         #
    ###############################################################################
    - name: ComputeSriov1
      description: |
        Compute SR-IOV Role R1
      CountDefault: 1
      networks:
        - InternalApi
        - Tenant
        - Storage
      HostnameFormatDefault: '%stackname%-computesriov1-%index%'
      disable_upgrade_deployment: True
      ServicesDefault:
    .
    .

    • Clone the entire ComputeSriov1 role section, change it to ComputeSriov2, and change its networks to represent the network set on the second rack:

    .
    .
    ###############################################################################
    # Role: ComputeSriov2                                                         #
    ###############################################################################
    - name: ComputeSriov2
      description: |
        Compute SR-IOV Role R2
      CountDefault: 1
      networks:
        - InternalApi_2
        - Tenant_2
        - Storage_2
      HostnameFormatDefault: '%stackname%-computesriov2-%index%'
      disable_upgrade_deployment: True
      ServicesDefault:
    .
    .

    • Now the roles_data.yaml file includes 3 types of roles: Controller and ComputeSriov1, which are associated with the Rack 1 network set, and ComputeSriov2, which is associated with the Rack 2 network set.
    • The full configuration file is attached to this document for your convenience.
    • Environment File for Defining Node Counts and Flavors:

      • Create /home/stack/templates/node-info.yaml, as explained in section 6.7, and edit it to include the node count and correlated flavor per role.
      • Full configuration file:

    parameter_defaults:
      OvercloudControllerFlavor: control
      OvercloudComputeSriov1Flavor: compute-r1
      OvercloudComputeSriov2Flavor: compute-r2
      ControllerCount: 3
      ComputeSriov1Count: 2
      ComputeSriov2Count: 2

    • Mellanox NICs Listing
      • Run the following command to go over all registered nodes and identify the interface names of the dual-port Mellanox 100GbE NIC:

    (undercloud) [stack@rhosp-director templates]$ for node in $(openstack baremetal node list --fields uuid -f value) ; do openstack baremetal introspection interface list $node ; done

    .

    .

    +-----------+-------------------+----------------------+-------------------+----------------+

    | Interface | MAC Address | Switch Port VLAN IDs | Switch Chassis ID | Switch Port ID |

    +-----------+-------------------+----------------------+-------------------+----------------+

    | eno1 | ec:b1:d7:83:11:b8 | [] | 94:57:a5:25:fa:80 | 29 |

    | eno2 | ec:b1:d7:83:11:b9 | [] | None | None |

    | eno3 | ec:b1:d7:83:11:ba | [] | None | None |

    | eno4 | ec:b1:d7:83:11:bb | [] | None | None |

    | ens1f1 | ec:0d:9a:7d:81:b3 | [] | 24:8a:07:7f:ef:00 | Eth1/14 |

    | ens1f0 | ec:0d:9a:7d:81:b2 | [] | 24:8a:07:7f:ef:00 | Eth1/1 |

    +-----------+-------------------+----------------------+-------------------+----------------+

    Note: Names must be identical for all nodes, or at least for all nodes sharing the same role. In our case, it is ens2f0/ens2f1 on Controller nodes, and ens1f0/ens1f1 on Compute nodes.
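
    To narrow the listing down to the 100GbE ports only, a hedged variation of the loop above can filter on the interface name prefixes used in our setup (ens1f/ens2f); adjust the pattern to your hardware:

    (undercloud) [stack@rhosp-director templates]$ for node in $(openstack baremetal node list --fields uuid -f value) ; do echo "== $node" ; openstack baremetal introspection interface list $node -f value | grep -E '^ens[12]f' ; done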

    • HW Offload Configuration File
      • Locate /usr/share/openstack-tripleo-heat-templates/environments/ovs-hw-offload.yaml file and edit it according to the following guidelines per ComputeSriov role:
        • Set offload enabled
        • Set kernel args for huge pages
        • Set the desired interface for accelerated data plane (ens1f0 in our case)
        • Set the desired VF count (64 in our example)
        • Set the correct Nova PCI Passthrough devname, and physical_network: null
        • Set ExtraConfig for correlation between the role and the correct tenant/api network set
      • The full configuration file is attached to this document; see the example below (settings for each role appear in separate blocks). A post-deployment verification sketch follows the example.

    # A Heat environment file that enables OVS Hardware Offload in the overcloud.
    # This works by configuring SR-IOV NIC with switchdev and OVS Hardware Offload on
    # compute nodes. The feature is supported in OVS 2.8.0.

    resource_registry:
      OS::TripleO::Services::NeutronSriovHostConfig: ../puppet/services/neutron-sriov-host-config.yaml

    parameter_defaults:

      NovaSchedulerDefaultFilters: ['RetryFilter','AvailabilityZoneFilter','RamFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter']
      NovaSchedulerAvailableFilters: ["nova.scheduler.filters.all_filters","nova.scheduler.filters.pci_passthrough_filter.PciPassthroughFilter"]

      # Kernel arguments for ComputeSriov1 nodes
      ComputeSriov1Parameters:
        KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=16 intel_iommu=on iommu=pt"
        OvsHwOffload: True
        # Number of VFs that needs to be configured for a physical interface
        NeutronSriovNumVFs: ["ens1f0:64:switchdev"]
        # Mapping of SR-IOV PF interface to neutron physical_network.
        # In case of Vxlan/GRE physical_network should be null.
        # In case of flat/vlan the physical_network should be as configured in neutron.
        NovaPCIPassthrough:
          - devname: "ens1f0"
            physical_network: null
        NovaReservedHostMemory: 4096

      # Extra config for mapping the ovs local_ip to the relevant tenant network
      ComputeSriov1ExtraConfig:
        nova::vncproxy::host: "%{hiera('internal_api')}"
        neutron::agents::ml2::ovs::local_ip: "%{hiera('tenant')}"

      # Kernel arguments for ComputeSriov2 nodes
      ComputeSriov2Parameters:
        KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=16 intel_iommu=on iommu=pt"
        OvsHwOffload: True
        # Number of VFs that needs to be configured for a physical interface
        NeutronSriovNumVFs: ["ens1f0:64:switchdev"]
        # Mapping of SR-IOV PF interface to neutron physical_network.
        # In case of Vxlan/GRE physical_network should be null.
        # In case of flat/vlan the physical_network should be as configured in neutron.
        NovaPCIPassthrough:
          - devname: "ens1f0"
            physical_network: null
        NovaReservedHostMemory: 4096

      # Extra config for mapping the ovs local_ip to the relevant tenant network
      ComputeSriov2ExtraConfig:
        nova::vncproxy::host: "%{hiera('internal_api_2')}"
        neutron::agents::ml2::ovs::local_ip: "%{hiera('tenant_2')}"
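
    Once the overcloud is deployed (see the deployment steps below), the parameters above can be sanity-checked on a ComputeSriov node; this is a sketch that assumes the interface name ens1f0 and PCI address 0000:07:00.0 from our example, and an illustrative node hostname:

    # Number of VFs created on the PF (expected: 64)
    [root@overcloud-computesriov1-0 ~]# cat /sys/class/net/ens1f0/device/sriov_numvfs

    # e-switch mode of the NIC (expected: switchdev)
    [root@overcloud-computesriov1-0 ~]# devlink dev eswitch show pci/0000:07:00.0

    # OVS hardware offload flag (expected: "true")
    [root@overcloud-computesriov1-0 ~]# ovs-vsctl get Open_vSwitch . other_config:hw-offload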

     

    • Network Configuration File:
      • Locate /usr/share/openstack-tripleo-heat-templates/network_data.yaml file and edit it according to the following guidelines:
        • Set External network parameters (subnet, allocation pool, default GW) - marked in yellow in the example below.
        • Set the Rack 1 network set parameters to match the subnets/VLANs configured on the Rack 1 Leaf switch - marked in blue in the example below.
          • Make sure you use the network names you specified in roles_data.yaml for the Controller/ComputeSriov1 role networks.
        • Create a second set of networks to match the subnets/VLANs configured on the Rack 2 Leaf switch - marked in green in the example below.
          • Make sure you use the network names you specified in roles_data.yaml for the ComputeSriov2 role networks.
        • Disable the “management” network, as it is not used in our example - marked in red in the example below.
        • The configuration is based on the following matrix, matching the Leaf switch configuration applied in the Network Configuration section above:

     

    Network Name    Network Set  Network Location  Network Details  VLAN  Network Allocation Pool
    --------------  -----------  ----------------  ---------------  ----  -----------------------
    Storage         1            Rack 1            172.16.0.0/24    11    172.16.0.100-250
    Storage_Mgmt    1            Rack 1            172.17.0.0/24    21    172.17.0.100-250
    Internal API    1            Rack 1            172.18.0.0/24    31    172.18.0.100-250
    Tenant          1            Rack 1            172.19.0.0/24    41    172.19.0.100-250
    Storage_2       2            Rack 2            172.16.2.0/24    12    172.16.2.100-250
    Storage_Mgmt_2  2            Rack 2            172.17.2.0/24    22    172.17.2.100-250
    Internal API_2  2            Rack 2            172.18.2.0/24    32    172.18.2.100-250
    Tenant_2        2            Rack 2            172.19.2.0/24    42    172.19.2.100-250
    External        -            Public Switch     10.7.208.0/24    -     10.7.208.10-21
     

    • Full configuration file is attached to this document
    • Partial example for one of the configured networks (Storage network - 2 sets), External network and Management network configuration:

    .
    .
    - name: Storage
      vip: true
      vlan: 11
      name_lower: storage
      ip_subnet: '172.16.0.0/24'
      allocation_pools: [{'start': '172.16.0.100', 'end': '172.16.0.250'}]
      ipv6_subnet: 'fd00:fd00:fd00:1100::/64'
      ipv6_allocation_pools: [{'start': 'fd00:fd00:fd00:1100::10', 'end': 'fd00:fd00:fd00:1100:ffff:ffff:ffff:fffe'}]
    .
    .
    - name: Storage_2
      vip: true
      vlan: 12
      name_lower: storage_2
      ip_subnet: '172.16.2.0/24'
      allocation_pools: [{'start': '172.16.2.100', 'end': '172.16.2.250'}]
      ipv6_subnet: 'fd00:fd00:fd00:1200::/64'
      ipv6_allocation_pools: [{'start': 'fd00:fd00:fd00:1200::10', 'end': 'fd00:fd00:fd00:1200:ffff:ffff:ffff:fffe'}]
    .
    .
    - name: External
      vip: true
      name_lower: external
      vlan: 10
      ip_subnet: '10.7.208.0/24'
      allocation_pools: [{'start': '10.7.208.10', 'end': '10.7.208.21'}]
      gateway_ip: '10.7.208.1'
      ipv6_subnet: '2001:db8:fd00:1000::/64'
      ipv6_allocation_pools: [{'start': '2001:db8:fd00:1000::10', 'end': '2001:db8:fd00:1000:ffff:ffff:ffff:fffe'}]
      gateway_ipv6: '2001:db8:fd00:1000::1'

    - name: Management
      # Management network is enabled by default for backwards-compatibility, but
      # is not included in any roles by default. Add to role definitions to use.
      enabled: false
    .
    .

     

    • Deploying a plan from existing templates
      • Use the following command to create a plan called “asap-plan”:
    (undercloud) [stack@rhosp-director templates]$ openstack overcloud plan create --templates /usr/share/openstack-tripleo-heat-templates asap-plan
    • Create a dedicated folder and save the plan files into it:

    (undercloud) [stack@rhosp-director templates]$ mkdir /home/stack/asap-plan

    (undercloud) [stack@rhosp-director templates]$ cd /home/stack/asap-plan

    (undercloud) [stack@rhosp-director asap-plan]$ openstack container save asap-plan

    • Editing plan files to be used in deployment
      • Copy the following files into the /home/stack/templates directory
        • /home/stack/asap-plan/environments/network-environment.yaml
        • /home/stack/asap-plan/network/config/single-nic-vlans/controller.yaml
        • /home/stack/asap-plan/network/config/single-nic-vlans/computesriov1.yaml
        • /home/stack/asap-plan/network/config/single-nic-vlans/computesriov2.yaml
      • Edit /home/stack/templates/network-environment.yaml according to the following guidelines:
        • Set the role file locations under resource_registry section - marked in yellow in the example below.
        • Set the Undercloud control plane IP as the default route for this network - marked in blue in the example below.
        • Set the required DNS servers for the setup nodes - marked in green in the example below.
        • See example below. Full configuration file is attached to this document.

    #This file is an example of an environment file for defining the isolated
    #networks and related parameters.
    resource_registry:
      # Network Interface templates to use (these files must exist). You can
      # override these by including one of the net-*.yaml environment files,
      # such as net-bond-with-vlans.yaml, or modifying the list here.
      # Port assignments for the Controller
      OS::TripleO::Controller::Net::SoftwareConfig:
        /home/stack/templates/controller.yaml
      # Port assignments for the ComputeSriov1
      OS::TripleO::ComputeSriov1::Net::SoftwareConfig:
        /home/stack/templates/computesriov1.yaml
      # Port assignments for the ComputeSriov2
      OS::TripleO::ComputeSriov2::Net::SoftwareConfig:
        /home/stack/templates/computesriov2.yaml

    parameter_defaults:
      # This section is where deployment-specific configuration is done
      # CIDR subnet mask length for provisioning network
      ControlPlaneSubnetCidr: '24'
      # Gateway router for the provisioning network (or Undercloud IP)
      ControlPlaneDefaultRoute: 192.168.24.1
    .
    .
      # Define the DNS servers (maximum 2) for the overcloud nodes
      DnsServers: ["10.7.77.192","10.7.77.135"]

     

    • Edit /home/stack/templates/controller.yaml according to the following guidelines:
      • Set the location of run-os-net-config.sh script - marked in yellow in the example below.
      • Set a Supernet and GW per network to allow routing between network sets located in different racks. The GW is the IP address configured on the Leaf switch VLAN interface facing this network. The Supernet and gateways for the two tenant networks are marked in green in the example below.
      • Set the type, networks and routes for each interface used by Controller nodes - marked in blue in the example below. In our example, the Controller nodes use:
        • Dedicated 1G interface (type “interface”) for provisioning (PXE) network.
        • Dedicated 1G interface (type “ovs_bridge”) for External network. This network has a default GW configured.
        • Dedicated 100G interface (type “interface” without vlans) for data plane (Tenant) network in Rack 1. The network is associated with a supernet and has a route allowing it to reach other networks in the same supernet located in different racks.
        • Dedicated 100G interface (type “ovs_bridge”) with vlans for Storage/StorageMgmt/InternalApi networks in Rack 1. Each network is associated with a supernet and has a route allowing it to reach other networks in the same supernet located in different racks.
        • See example below. Full configuration file is attached to this document.

    .
    .
      TenantSupernet:
        default: '172.19.0.0/16'
        description: Supernet that contains Tenant subnets for all roles.
        type: string
      TenantGateway:
        default: '172.19.0.1'
        description: Router gateway on tenant network
        type: string
      Tenant_2Gateway:
        default: '172.19.2.1'
        description: Router gateway on tenant_2 network
        type: string
    .
    .
    resources:
      OsNetConfigImpl:
        type: OS::Heat::SoftwareConfig
        properties:
          group: script
          config:
            str_replace:
              template:
                get_file: /usr/share/openstack-tripleo-heat-templates/network/scripts/run-os-net-config.sh
              params:
                $network_config:
                  network_config:
    .
    .
                  # NIC 3 - Data Plane (Tenant net)
                  - type: ovs_bridge
                    name: br-sriov
                    use_dhcp: false
                    members:
                    - type: interface
                      name: ens2f0
                      addresses:
                      - ip_netmask:
                          get_param: TenantIpSubnet
                      routes:
                      - ip_netmask:
                          get_param: TenantSupernet
                        next_hop:
                          get_param: TenantGateway
    .
    .

    • Edit /home/stack/templates/computesriov1.yaml according to the following guidelines:
      • Set the location of run-os-net-config.sh script - not mentioned in the example below, see example above or full configuration file.
      • Set a Supernet and GW per network to allow routing between network sets located in different racks. The GW is the IP address configured on the Leaf switch VLAN interface facing this network - not mentioned in the example below, see the example above or the full configuration file.
      • Set the type, networks and routes for each interface used by Compute nodes in Rack 1. In our example, those ComputeSriov1 nodes use:
        • Dedicated 1G interface (type “interface”) for provisioning (PXE) network - marked in yellow in the example below.
        • Dedicated 100G interface (type “interface” without vlans) for data plane (Tenant) network in Rack 1. The network is associated with a supernet and has a route allowing it to reach other networks in the same supernet located in different racks - marked in blue in the example below.
        • Dedicated 100G interface (type “ovs_bridge”) with vlans for Storage/InternalApi networks in Rack 1. Each network is associated with a supernet and has a route allowing it to reach other networks in the same supernet located in different racks - not mentioned in the example below, see full configuration file.

    .
    .
                  # NIC 1 - Provisioning net
                  - type: interface
                    name: eno1
                    use_dhcp: false
                    dns_servers:
                      get_param: DnsServers
                    addresses:
                    - ip_netmask:
                        list_join:
                        - /
                        - - get_param: ControlPlaneIp
                          - get_param: ControlPlaneSubnetCidr
                    routes:
                    - ip_netmask: 169.254.169.254/32
                      next_hop:
                        get_param: EC2MetadataIp
                    - default: true
                      next_hop:
                        get_param: ControlPlaneDefaultRoute

                  # NIC 2 - ASAP2 Data Plane (Tenant net)
                  - type: ovs_bridge
                    name: br-sriov
                    use_dhcp: false
                    members:
                    - type: interface
                      name: ens1f0
                      addresses:
                      - ip_netmask:
                          get_param: TenantIpSubnet
                      routes:
                      - ip_netmask:
                          get_param: TenantSupernet
                        next_hop:
                          get_param: TenantGateway
    .
    .

     

     

    • Edit /home/stack/templates/computesriov2.yaml according to the following guidelines:
      • Set the location of run-os-net-config.sh script - not mentioned in the example below, see example above or full configuration file.
      • Set a Supernet and GW per network to allow routing between network sets located in different racks. The GW is the IP address configured on the Leaf switch VLAN interface facing this network - not mentioned in the example below, see the example above or the full configuration file.
      • Set the type, networks and routes for each interface used by Compute nodes in Rack 2. In our example, those ComputeSriov2 nodes use:
        • Dedicated 1G interface (type “interface”) for provisioning (PXE) network - not mentioned in the example below, see example above or full configuration file.
        • Dedicated 100G interface (type “interface” without vlans) for data plane (Tenant) network in Rack 2. The network is associated with a supernet and has a route allowing it to reach other networks in the same supernet located in different racks - marked in yellow in the example below.
        • Dedicated 100G interface (type “ovs_bridge”) with vlans for Storage/InternalApi networks in Rack 2. Each network is associated with a supernet and has a route allowing it to reach other networks in the same supernet located in different racks - not mentioned in the example below, see example above or full configuration file.
        • See example below. Full configuration file is attached to this document.

    .
    .
                  # NIC 2 - ASAP2 Data Plane (Tenant net)
                  - type: ovs_bridge
                    name: br-sriov
                    use_dhcp: false
                    members:
                    - type: interface
                      name: ens1f0
                      addresses:
                      - ip_netmask:
                          get_param: Tenant_2IpSubnet
                      routes:
                      - ip_netmask:
                          get_param: TenantSupernet
                        next_hop:
                          get_param: Tenant_2Gateway
    .
    .

     

    • Deploying the overcloud
      • Now we are ready to deploy an overcloud based on our customized configuration files
      • The cloud will be deployed with:
        • 3 controllers associated with Rack 1 networks
        • 2 Compute nodes associated with Rack 1 networks with ASAP2 OVS HW offload
        • 2 Compute nodes associated with Rack 2 networks with ASAP2 OVS HW offload
        • Routes to allow connectivity between racks/networks
        • VXLAN overlay tunnels between all the nodes
      • Before starting the deployment, verify connectivity over the OSPF underlay fabric between the Leaf switches' VLAN interfaces facing the nodes in each rack. Without inter-rack connectivity for all networks, the overcloud deployment will fail.
      • In order to start the overcloud deployment, issue the command below (custom environment files are marked in yellow in the example below).

    Notes:

    • Do not change the order of the environment files.
    • Make sure that the NTP server specified in the deploy command is accessible and can provide time to the undercloud node - marked in blue in the example below.
    • The overcloud_images.yaml file used in the deploy command is created during undercloud installation, verify its existence in the specified location - marked in blue in the example below.
    • The network-isolation.yaml file specified in the deploy command is created automatically during deployment from j2.yaml template file - marked in blue in the example below.

    (undercloud) [stack@rhosp-director templates]$ openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates \

    --libvirt-type kvm \

    -n /usr/share/openstack-tripleo-heat-templates/network_data.yaml \

    -r /home/stack/templates/roles_data.yaml \

    --timeout 90 \

    --validation-warnings-fatal \

    --ntp-server 0.asia.pool.ntp.org \

    -e /home/stack/templates/node-info.yaml \

    -e /home/stack/templates/overcloud_images.yaml \

    -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \

    -e /home/stack/templates/network-environment.yaml \

    -e /usr/share/openstack-tripleo-heat-templates/environments/ovs-hw-offload.yaml \

    -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml \

    -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml
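
    Once the deploy command completes, a quick hedged check from the undercloud (standard TripleO/Heat commands, not specific to this solution) confirms that the overcloud stack reached CREATE_COMPLETE and that the overcloudrc credentials file was generated:

    # Expect the "overcloud" stack in CREATE_COMPLETE state
    (undercloud) [stack@rhosp-director templates]$ openstack stack list

    (undercloud) [stack@rhosp-director templates]$ ls -l /home/stack/overcloudrc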

    • Overcloud VXLAN Configuration Validation
      • Once the cloud is deployed, log in to the overcloud nodes and verify that VXLAN tunnels have been established between each node and the rest of the overcloud nodes over the routed Tenant networks.
      • In the example below, we can see that VXLAN tunnels are maintained at the OVS level between a node located in Rack 2 (tenant network 172.19.2.0/24, marked in yellow in the example below) and all other nodes located in Rack 1 (tenant network 172.19.0.0/24, marked in blue in the example below), in addition to the one located in its own rack (marked in green in the example below).

    (undercloud) [stack@rhosp-director ~]$ openstack server list

    +--------------------------------------+---------------------------+--------+------------------------+----------------+------------+

    | ID | Name | Status | Networks | Image | Flavor |

    +--------------------------------------+---------------------------+--------+------------------------+----------------+------------+

    | 35d3b3b6-b867-4408-bfc3-b3d25395450d | overcloud-controller-0 | ACTIVE | ctlplane=192.168.24.19 | overcloud-full | control |

    | 0af372ed-4c5c-41fb-882a-c8a61cc01ba9 | overcloud-controller-1 | ACTIVE | ctlplane=192.168.24.20 | overcloud-full | control |

    | 3c189bb9-fd2f-451d-b2f8-4d17d7fa0381 | overcloud-computesriov1-1 | ACTIVE | ctlplane=192.168.24.13 | overcloud-full | compute-r1 |

    | 7eebc6f0-95af-4ec6-bf44-4db817bc4029 | overcloud-computesriov2-1 | ACTIVE | ctlplane=192.168.24.6 | overcloud-full | compute-r2 |

    | ebc7c38b-6221-45c9-b5ca-98023f5bbebc | overcloud-controller-2 | ACTIVE | ctlplane=192.168.24.17 | overcloud-full | control |

    | 7c700b2c-6a9f-480f-ada5-11866a891f04 | overcloud-computesriov2-0 | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | compute-r2 |

    | 971e8651-4059-42b9-834d-74449007343d | overcloud-computesriov1-0 | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | compute-r1 |

    +--------------------------------------+---------------------------+--------+------------------------+----------------+------------+

     

    (undercloud) [stack@rhosp-director ~]$ ssh heat-admin@192.168.24.12

    [heat-admin@overcloud-computesriov2-0 ~]$ sudo su

    [root@overcloud-computesriov2-0 heat-admin]# ovs-vsctl show

    .

    .

        Bridge br-tun
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port "vxlan-ac130068"
                Interface "vxlan-ac130068"
                    type: vxlan
                    options: {df_default="true", in_key=flow, local_ip="172.19.2.102", out_key=flow, remote_ip="172.19.0.104"}
            Port "vxlan-ac130070"
                Interface "vxlan-ac130070"
                    type: vxlan
                    options: {df_default="true", in_key=flow, local_ip="172.19.2.102", out_key=flow, remote_ip="172.19.0.112"}
            Port "vxlan-ac13026c"
                Interface "vxlan-ac13026c"
                    type: vxlan
                    options: {df_default="true", in_key=flow, local_ip="172.19.2.102", out_key=flow, remote_ip="172.19.2.108"}
            Port "vxlan-ac13006b"
                Interface "vxlan-ac13006b"
                    type: vxlan
                    options: {df_default="true", in_key=flow, local_ip="172.19.2.102", out_key=flow, remote_ip="172.19.0.107"}
            Port br-tun
                Interface br-tun
                    type: internal
            Port patch-int
                Interface patch-int
                    type: patch
                    options: {peer=patch-tun}
            Port "vxlan-ac130064"
                Interface "vxlan-ac130064"
                    type: vxlan
                    options: {df_default="true", in_key=flow, local_ip="172.19.2.102", out_key=flow, remote_ip="172.19.0.100"}
            Port "vxlan-ac130065"
                Interface "vxlan-ac130065"
                    type: vxlan
                    options: {df_default="true", in_key=flow, local_ip="172.19.2.102", out_key=flow, remote_ip="172.19.0.101"}

    .

    .

     

    • Overcloud Host Aggregate Configuration
      • In order to enable the option to specify the target rack for VM creation, a Host Aggregate per rack must be configured first.
      • Log in to the Overcloud dashboard and create a Host Aggregate for Rack 1. Add Compute nodes 1 and 3 to it. You can identify the relevant hypervisors by their hostnames, which indicate their role/rack location.
      • Create a Host Aggregate for Rack 2 and add Compute nodes 2 and 4 to it (a CLI alternative is sketched below).
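
    If you prefer the CLI over the dashboard, a hedged equivalent is sketched below (run after sourcing overcloudrc); the hypervisor hostnames are illustrative and should be taken from "openstack hypervisor list", and creating the aggregates with availability zones (rack1/rack2) is our own choice so that instances can later be scheduled per rack:

    (overcloud) [stack@rhosp-director ~]$ openstack aggregate create --zone rack1 rack1

    (overcloud) [stack@rhosp-director ~]$ openstack aggregate add host rack1 overcloud-computesriov1-0.localdomain

    (overcloud) [stack@rhosp-director ~]$ openstack aggregate add host rack1 overcloud-computesriov1-1.localdomain

    (overcloud) [stack@rhosp-director ~]$ openstack aggregate create --zone rack2 rack2

    (overcloud) [stack@rhosp-director ~]$ openstack aggregate add host rack2 overcloud-computesriov2-0.localdomain

    (overcloud) [stack@rhosp-director ~]$ openstack aggregate add host rack2 overcloud-computesriov2-1.localdomain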

     

    • Overcloud Instance Creation with ASAP2-based Ports
      • Create a Flavor as desired.
      • Upload an Image – use an updated OS image which includes the latest Mellanox drivers.
      • Create a VXLAN overlay private network to be used by the instances (use CLI commands only) - marked in yellow in the example below.

    (undercloud) [stack@rhosp-director ~]$ source overcloudrc

    (overcloud) [stack@rhosp-director ~]$ openstack network create private --provider-network-type vxlan --share

    • Create a subnet and assign it to the private network:
    (overcloud) [stack@rhosp-director ~]$ openstack subnet create private_subnet --dhcp --network private --subnet-range 11.11.11.0/24
    • Create ports with ASAP2 capabilities (use CLI commands only). Two direct ports are created in the example below, each one marked in a different color:

    (overcloud) [stack@rhosp-director ~]$ openstack port create direct1 --vnic-type=direct --network private --binding-profile '{"capabilities":["switchdev"]}'

    (overcloud) [stack@rhosp-director ~]$ openstack port create direct2 --vnic-type=direct --network private --binding-profile '{"capabilities":["switchdev"]}'

    • Spawn an instance with ASAP2 ports on each rack. Use the pre-allocated ports only, without attaching a network directly, as shown below:
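
    Since the screenshot for this step is not reproduced here, a hedged CLI sketch follows; the flavor and image names are placeholders, and the availability zones assume the per-rack aggregates created above:

    (overcloud) [stack@rhosp-director ~]$ openstack server create --flavor m1.small --image mlnx-guest-image --port direct1 --availability-zone rack1 vm-rack1

    (overcloud) [stack@rhosp-director ~]$ openstack server create --flavor m1.small --image mlnx-guest-image --port direct2 --availability-zone rack2 vm-rack2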

     

    • OVS ASAP2 Offload Validation
      • Ping or run traffic between the instances. The traffic will go over the OVS VXLAN overlay network and will be accelerated by ASAP2 HW offload into the NIC.
      • SSH into the Compute nodes that host the instances and issue the following command to see the accelerated bi-directional traffic flows that were offloaded to the NIC using ASAP2 - each direction is marked in a different color in the example below.

    [root@overcloud-computesriov2-0 heat-admin]# ovs-dpctl dump-flows type=offloaded --name

    in_port(eth3),eth(src=fa:16:3e:15:e5:a8,dst=fa:16:3e:01:b3:aa),eth_type(0x0800),ipv4(frag=no), packets:1764662605, bytes:194112828502, used:0.470s, actions:set(tunnel(tun_id=0xf,src=172.19.2.102,dst=172.19.0.104,tp_dst=4789,flags(key))),vxlan_sys_4789

    tunnel(tun_id=0xf,src=172.19.0.104,dst=172.19.2.102,tp_dst=4789,flags(+key)),in_port(vxlan_sys_4789),eth(src=fa:16:3e:01:b3:aa,dst=fa:16:3e:15:e5:a8),eth_type(0x0800),ipv4(frag=no), packets:1760910540, bytes:105654631616, used:0.470s, actions:eth3

    tunnel(tun_id=0xf,src=172.19.0.112,dst=172.19.2.102,tp_dst=4789,flags(+key)),in_port(vxlan_sys_4789),eth(src=fa:16:3e:5f:6d:3e,dst=fa:16:3e:58:26:bf),eth_type(0x0806), packets:2, bytes:84, used:8.950s, actions:eth1

    tunnel(tun_id=0xf,src=172.19.0.107,dst=172.19.2.102,tp_dst=4789,flags(+key)),in_port(vxlan_sys_4789),eth(src=fa:16:3e:62:be:6c,dst=fa:16:3e:15:e5:a8),eth_type(0x0806), packets:2, bytes:84, used:0.470s, actions:eth3
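
    As a hedged counter-check, flows that stayed in the kernel (non-offloaded) datapath can be listed with type=ovs using the same command family as above; ideally only a small number of exception flows should appear there while the bulk of the traffic is offloaded:

    [root@overcloud-computesriov2-0 heat-admin]# ovs-dpctl dump-flows type=ovs --name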