HowTo Launch VM over OVS-DPDK-16.07 Using Mellanox ConnectX-4


    This post shows how to launch a Virtual Machine (VM) over OVS-DPDK 16.07 using Mellanox ConnectX-4 adapters.

    In this example MLNX_OFED 3.3 was used.

     


    Prerequisites

    1. Install MLNX_OFED and verify with the ofed_info command that the installed version is 3.3:

    # ofed_info -s

    Note: MLNX_OFED 3.4 was not tested with DPDK 16.07.

     

    2. Check CPU support for 1G hugepages by checking for the pdpe1gb flag:

    # cat /proc/cpuinfo | grep pdpe1gb
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid

    Make sure that the flag list includes the pdpe1gb flag.

     

    3. Find the NUMA number per PCI slot:

    # mst status -v

    MST modules:
    ------------
        MST PCI module is not loaded
        MST PCI configuration module is not loaded
    PCI devices:
    ------------
    DEVICE_TYPE             MST      PCI       RDMA    NET                      NUMA
    ConnectX4LX(rev:0)      NA       03:00.0   mlx5_1  net-ens2f0                0

     

    4. Check the QEMU version. It must be version 2.1 or later.

    # qemu-system-x86_64 --version
    QEMU emulator version 2.7.50 (v2.7.0-456-gffd455a), Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers

     

    To download a newer QEMU version, refer to the official QEMU site:

    http://wiki.qemu.org/Download

     

    Configuration

     

    Configure the grub File and Mount hugepages

    1. Update the grub.conf file.

    Note: Updating grub files is different for each Linux OS distribution. Refer to OS documentation.

     

    Add the following parameters to the GRUB_CMDLINE_LINUX line in the grub configuration file:

    "intel_iommu=on default_hugepagesz=1G hugepagesz=1G hugepages=8"

     

    This line defines the hugepage size and quantity. Most Intel processors support 2 MB and 1 GB hugepage sizes. It is recommended to leave at least 2 GB of RAM free for the OS.
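
    On most distributions the GRUB_CMDLINE_LINUX line is found in /etc/default/grub (or in grub.conf, depending on the distribution). After adding the parameters, the line might look similar to the following sketch; keep any parameters that already exist on the line:

    GRUB_CMDLINE_LINUX="intel_iommu=on default_hugepagesz=1G hugepagesz=1G hugepages=8"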

     

    2. Regenerate the grub configuration so that the new kernel parameters take effect.
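
    The exact command depends on the distribution; for example (assuming standard grub2 locations):

    # update-grub                                (Debian/Ubuntu)
    # grub2-mkconfig -o /boot/grub2/grub.cfg     (RHEL/CentOS)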

     

    3. Reboot the server. The configuration starts after you reboot.

     

    4. Check that hugepages are loaded correctly after the reboot:

    # cat /proc/meminfo | grep Hug

    AnonHugePages:   2314240 kB
    HugePages_Total:       8
    HugePages_Free:        8
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:    1048576 kB

    The output shows eight free hugepages, each 1 GB in size.

     

    5. Mount 1G hugepages.

    # mkdir -p /dev/hugepages
    # mount -t hugetlbfs -o pagesize=1G none /dev/hugepages

     

    Note: The mount of hugepages is not persistent. You must mount hugepages after each reboot.
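
    If you prefer not to remount after every reboot, a hugetlbfs entry can be added to /etc/fstab instead. A minimal sketch, assuming the mount point used above:

    none /dev/hugepages hugetlbfs pagesize=1G 0 0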

     

    DPDK Configuration

    1. Download and extract the DPDK 16.07 package.

    # cd /usr/src/
    # wget http://dpdk.org/browse/dpdk/snapshot/dpdk-16.07.zip
    # unzip dpdk-16.07.zip

     

    2. Set DPDK environment variables as follows:

    # export DPDK_DIR=/usr/src/dpdk-16.07
    # cd $DPDK_DIR
    # export DPDK_TARGET=x86_64-ivshmem-linuxapp-gcc
    # export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET

    Note: The IVSHMEM target allows 1G hugepages to be shared between the host and guest machines.

     

    3. Modify the compilation settings so that they support the ConnectX-4 interface.

    # echo CONFIG_RTE_BUILD_COMBINE_LIBS=y >>  $DPDK_DIR/config/common_linuxapp
    # sed -i 's/\(CONFIG_RTE_LIBRTE_MLX5_PMD=\)n/\1y/g' $DPDK_DIR/config/common_base
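
    To confirm that the mlx5 PMD was enabled before building, check the setting; it should now read CONFIG_RTE_LIBRTE_MLX5_PMD=y:

    # grep CONFIG_RTE_LIBRTE_MLX5_PMD $DPDK_DIR/config/common_base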

     

    4. Compile DPDK.

    # make -j install T=$DPDK_TARGET DESTDIR=install

     

    OVS Configuration

    1. Download OVS (version 2.6.1 or later).

    # cd /usr/src/
    # wget http://openvswitch.org/releases/openvswitch-2.6.1.tar.gz

    2. Extract the package and set environment variables for compiling OVS against DPDK 16.07.

    # tar xf openvswitch-2.6.1.tar.gz

    # export OVS_DIR=/usr/src/openvswitch-2.6.1
    # cd $OVS_DIR

     

    3. Compile OVS.

    # ./boot.sh
    # ./configure --with-dpdk=$DPDK_BUILD
    # make -j LDFLAGS=-libverbs
    # make install

     

    4. Reset the OVS environment.

    # pkill -9 ovs
    # rm -rf /usr/local/var/run/openvswitch/
    # rm -rf /usr/local/etc/openvswitch/
    # rm -f /usr/local/etc/openvswitch/conf.db
    # mkdir -p /usr/local/var/run/openvswitch/
    # mkdir -p /usr/local/etc/openvswitch/
    # rm -f /tmp/conf.db

     

    5. Specify the initial Open vSwitch (OVS) database to use:

    # mkdir -p /usr/local/etc/openvswitch
    # mkdir -p /usr/local/var/run/openvswitch
    # ovsdb-tool create /usr/local/etc/openvswitch/conf.db /usr/local/share/openvswitch/vswitch.ovsschema
    # ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach
    # export DB_SOCK=/usr/local/var/run/openvswitch/db.sock

     

    6. Configure OVS to support DPDK ports:

    # ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true

     

    7. Allocate hugepage memory to DPDK: 2G on NUMA node 0 (dpdk-socket-mem).

    # ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="2048,0"

    Note: Use the NUMA node found in the Prerequisites section. This example allocates 2G (2048 MB) of hugepages on NUMA node 0.

     

    8. Whitelist the DPDK interface to be used, following the PCI ID shown in step 3 of the Prerequisites section.

    # ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-extra="-w 3:00.0"

    9. Set the core mask to enable several PMD threads. In this example cores 1 and 2 are used (0x6 = binary 0110).

    # ovs-vsctl --no-wait set Open_vSwitch . other_config:pmd-cpu-mask=6
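
    Before starting the vswitchd daemon, you can optionally dump the other_config column to confirm that dpdk-init, dpdk-socket-mem, dpdk-extra, and pmd-cpu-mask were stored as intended:

    # ovs-vsctl --no-wait get Open_vSwitch . other_config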

     

    10. Start the vswitchd daemon:

    # ovs-vsctl --no-wait init
    # ovs-vswitchd unix:$DB_SOCK --pidfile --detach --log-file=/var/log/ovs-vswitchd.log

     

    Note: Each time you reboot or OVS terminates, you must rebuild the OVS environment by repeating steps 4-10 of this section and then re-creating the bridge, ports, and flow rules (steps 11-15). A restart sketch follows.
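
    A minimal restart sketch, assuming the same paths, PCI ID, socket-mem, and core mask used above (adjust to your environment):

    # pkill -9 ovs
    # rm -rf /usr/local/var/run/openvswitch/ /usr/local/etc/openvswitch/
    # mkdir -p /usr/local/var/run/openvswitch/ /usr/local/etc/openvswitch/
    # ovsdb-tool create /usr/local/etc/openvswitch/conf.db /usr/local/share/openvswitch/vswitch.ovsschema
    # ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach
    # export DB_SOCK=/usr/local/var/run/openvswitch/db.sock
    # ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
    # ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="2048,0"
    # ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-extra="-w 3:00.0"
    # ovs-vsctl --no-wait set Open_vSwitch . other_config:pmd-cpu-mask=6
    # ovs-vsctl --no-wait init
    # ovs-vswitchd unix:$DB_SOCK --pidfile --detach --log-file=/var/log/ovs-vswitchd.log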

     

    11. Create an OVS bridge.

    # ovs-vsctl add-br br1 -- set bridge br1 datapath_type=netdev

     

    12. Create a DPDK port (dpdk0) with two RX queues using the n_rxq=2 option.

    # ovs-vsctl add-port br1 dpdk0 -- set Interface dpdk0 type=dpdk ofport_request=1
    # ovs-vsctl set Interface dpdk0 options:n_rxq=2 other_config:pmd-rxq-affinity="0:1,1:2"

    RX queues 0 and 1 are pinned to host PMD cores 1 and 2, respectively.

     

    13. Create a vhost-user port toward the guest machine with two RX queues and core affinity:

    # ovs-vsctl add-port br1 vhost-user1 -- set Interface vhost-user1 type=dpdkvhostuser ofport_request=2
    # ovs-vsctl set Interface vhost-user1 options:n_rxq=2 other_config:pmd-rxq-affinity="0:1,1:2"

    RX queues 0 and 1 are pinned to host PMD cores 1 and 2, respectively.
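
    After the ports are created, the queue-to-PMD assignment can be checked with ovs-appctl (a quick sanity check; the output format varies between OVS versions):

    # ovs-appctl dpif-netdev/pmd-rxq-show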

     

    Notes:

    • Make sure that you have enough free cores for host and guest PMDs, depending on the amount of RX queues configured. For the configuration above, two PMD cores are required for host and guest (four cores total).

    • Use the top command to see that two PMD cores are running at 100% CPU usage.

    • If you do not use ofport_request in the OVS command, OVS selects an arbitrary port ID.

     

    14. Set environment variables that hold the OpenFlow port IDs.

    # DPDK0_INDEX=$(echo `ovs-ofctl show br1 | grep dpdk0 | cut -d '(' -f 1`)

    # VHOST_USER1_INDEX=$(echo `ovs-ofctl show br1 | grep vhost-user1 | cut -d '(' -f 1`)

    Verify the assigned IDs:

    # echo $DPDK0_INDEX
    1
    # echo $VHOST_USER1_INDEX
    2

     

    15. Set OVS flow rules:

    # ovs-ofctl add-flow br1 in_port=$DPDK0_INDEX,action=output:$VHOST_USER1_INDEX
    # ovs-ofctl add-flow br1 in_port=$VHOST_USER1_INDEX,action=output:$DPDK0_INDEX

    Verify that the new flows were added:

    # ovs-ofctl dump-flows br1
    NXST_FLOW reply (xid=0x4):
    cookie=0x0, duration=4.903s, table=0, n_packets=8035688, n_bytes=482141280, idle_age=0, in_port=1 actions=output:2
    cookie=0x0, duration=3.622s, table=0, n_packets=0, n_bytes=0, idle_age=3, in_port=2 actions=output:1
    cookie=0x0, duration=353.725s, table=0, n_packets=284039649, n_bytes=17042378940, idle_age=5, priority=0 actions=NORMAL

     

    VM Configuration

    1. Launch a guest machine.

    # echo 'info cpus' | \
    numactl --cpunodebind 0 --membind 0 -- \
    /usr/src/qemu/x86_64-softmmu/qemu-system-x86_64 \
    -enable-kvm \
    -name gen-l-vrt-019-006-Ubuntu-15.10 \
    -cpu host -m 6G \
    -realtime mlock=off \
    -smp 8,sockets=8,cores=1,threads=1 \
    -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/gen-l-vrt-019-006-Ubuntu-15.10.monitor,server,nowait \
    -drive file=/images/gen-l-vrt-019-006/gen-l-vrt-019-006.img,if=none,id=drive-ide0-0-0,format=qcow2 \
    -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 \
    -netdev tap,id=hostnet0,script=no,downscript=no \
    -device e1000,netdev=hostnet0,id=net0,mac=00:50:56:1b:b2:05,bus=pci.0,addr=0x3 \
    -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user1 \
    -netdev type=vhost-user,id=iface1,chardev=char1,vhostforce,queues=2 \
    -device virtio-net-pci,netdev=iface1,mac=12:34:00:00:50:2c,csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off,mrg_rxbuf=off,mq=on,vectors=6 \
    -object memory-backend-file,id=mem,size=6144M,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem \
    -mem-prealloc \
    -monitor stdio \
    > /tmp/qemu_cpu_info.txt &

    The 'info cpus' monitor command is piped in so that the vCPU thread IDs are written to /tmp/qemu_cpu_info.txt for the affinity step below, and numactl binds QEMU to NUMA node 0.

     

    • Choose the number of vhost-user queues to match the RX queues configured in OVS. In this example queues=2.
    • In some environments the external e1000 interface fails to come up. If it does, keep script=no,downscript=no on the tap netdev (as in the command above) and, while the guest boots, bring the interface up manually on the host (in this example br0 is the main host bridge to the public network):
      # brctl addif br0 tap0
      # ifconfig tap0 up
    • The shared memory size (size= in the memory-backend-file object) must equal the guest memory given with -m. The total must not exceed the amount of free hugepage memory on the host (see the quick checks after this list).
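
    If the guest fails to start, it is worth confirming that the vhost-user socket created by OVS exists and that enough free 1G hugepages remain on the host for the 6G guest (quick checks, assuming the paths used above):

    # ls -l /usr/local/var/run/openvswitch/vhost-user1
    # grep HugePages_Free /proc/meminfo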

     

    2. Set CPU affinity for the eight vCPU threads:

    # a=( $(cat /tmp/qemu_cpu_info.txt  | grep thread_id | cut -d '=' -f 3 | tr -d '\r' ) )
    # taskset -p 0x008  ${a[0]}
    # taskset -p 0x010  ${a[1]}
    # taskset -p 0x020  ${a[2]}
    # taskset -p 0x040  ${a[3]}
    # taskset -p 0x080  ${a[4]}
    # taskset -p 0x100  ${a[5]}
    # taskset -p 0x200  ${a[6]}
    # taskset -p 0x400  ${a[7]}

     

    Make sure that the cores chosen for affinity do not overlap with the host PMD cores configured in step 9 of the OVS Configuration section. In this example eight cores (3-10) are pinned for the guest machine.
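
    To confirm that the pinning took effect, taskset can report the current affinity of each thread (a quick check using the same thread-ID array):

    # for tid in "${a[@]}"; do taskset -cp $tid; done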

     

    3. Configure the guest machine to have 1G hugepages.

    Refer to the 'Configure the grub File and Mount hugepages' section above.

     

    4. Load the guest DPDK driver to use the virtio interface.

    It is assumed that the guest image already includes the compiled DPDK driver.

    For further information on how to compile DPDK 16.07 for the guest machine, refer to:

    Compiling the DPDK Target from Source — Data Plane Development Kit 16.07.0 documentation

     

    5. Find the virtio interface bus number:

    # lspci -nn | grep -i virtio

    Example of command output:

    00:04.0 Ethernet controller [0200]: Red Hat, Inc Virtio network device [1af4:1000]

     

    6. Load the DPDK driver:

    # modprobe uio
    # insmod /usr/src/dpdk-16.07/x86_64-ivshmem-linuxapp-gcc/kmod/igb_uio.ko

    7. Bind the DPDK driver to the PCI slot of the virtio interface:

    # /usr/src/dpdk-16.07/tools/dpdk-devbind.py --bind=igb_uio 0000:00:04.0
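
    To verify the binding, the same script can list the driver each network device is bound to:

    # /usr/src/dpdk-16.07/tools/dpdk-devbind.py --status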

     

    Verification

    1. Run testpmd to loop traffic on a single port with UDP RSS.

    The example that follows runs with 2G of hugepage memory (-m 2048) and four forwarding cores (--nb-cores=4).

    # /usr/src/dpdk-16.07/x86_64-ivshmem-linuxapp-gcc/app/testpmd -v -c 0x1f  -n 4 -m 2048 -- --burst=64 --rxq=2 --txq=2 --nb-cores=4 -a -i --mbcache=256 --rss-udp --port-topology=chained

    Expected output:

    EAL: Detected 8 lcore(s)
    EAL: RTE Version: 'DPDK 16.07.0'
    EAL: Probing VFIO support...
    EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
    PMD: bnxt_rte_pmd_init() called for (null)
    EAL: PCI device 0000:00:04.0 on NUMA socket -1
    EAL:   probe driver: 1af4:1000 rte_virtio_pmd
    EAL: No probed ethernet devices
    Auto-start selected
    Interactive-mode selected
    USER1: create a new mbuf pool <mbuf_pool_socket_0>: n=180224, size=2176, socket=0
    Done
    Start automatic packet forwarding
    io packet forwarding - ports=0 - cores=0 - streams=0 - NUMA support disabled, MP over anonymous pages disabled
    io packet forwarding - CRC stripping disabled - packets/burst=64
    nb forwarding cores=4 - nb forwarding ports=0
    RX queues=2 - RX desc=128 - RX free threshold=0
    RX threshold registers: pthresh=0 hthresh=0 wthresh=0
    TX queues=2 - TX desc=512 - TX free threshold=0
    TX threshold registers: pthresh=0 hthresh=0 wthresh=0
    TX RS bit threshold=0 - TXQ flags=0x0

     

    The number of --rxq and --txq queues must equal the number of queues defined in the 'Launch a guest machine' step above (queues=2).

    Note: UDP RSS takes effect only if you inject traffic with varying UDP source ports.
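
    Once traffic is injected, per-port counters can be inspected from the interactive testpmd prompt, for example:

    testpmd> show port stats all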

     

    Performance Tuning Recommendations

    1. Configure the grub.conf file (in the GRUB_CMDLINE_LINUX line) to isolate and remove interrupts from the PMD CPUs.

    Do not include core 0 in the isolated set.

    "isolcpus=1-8 nohz_full=1-8 rcu_nocbs=1-8"

     

    2. Set scaling_governor to performance mode:

    # for (( i=0; i<$(cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | wc -l); i++ )); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done 

     

    3. Stop the irq balancing service:

    # service irqbalance stop
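
    To keep irqbalance from starting again after a reboot, it can also be disabled at boot; the exact command depends on the init system, for example:

    # systemctl disable irqbalance          (systemd-based systems)
    # chkconfig irqbalance off              (SysV init systems)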

     

    4. Disable transparent hugepages and related kernel memory-management features:

    # echo never > /sys/kernel/mm/transparent_hugepage/defrag
    # echo never > /sys/kernel/mm/transparent_hugepage/enabled
    # echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
    # sysctl -w vm.zone_reclaim_mode=0
    # sysctl -w vm.swappiness=0

     

    5. Inside the VM, disable KSM (kernel samepage merging):

    # echo 0 > /sys/kernel/mm/ksm/run