VMA Performance Tuning Guide

Version 19

    This post provides guidelines for improving performance with VMA. It is intended for administrators who are familiar with VMA, and should be used in conjunction with the VMA User Manual and the VMA Release Notes. You can minimize latency by tuning VMA parameters. It is recommended to test VMA performance tuning on the actual application. Try the following VMA parameters one by one, and in combination, to find the optimum for your application. For more information about each parameter, see the VMA User Manual.

     


     

    Server Tuning

    A process running with VMA traffic offloading should run on a dedicated core operating at its highest frequency. This requires configuring the BIOS, GRUB, and the system power-management services.

     

    BIOS Tuning

    Follow the recommendations in Understanding BIOS Configuration for Performance Tuning:

     

    1. Disable hyper-threading and virtualization (enable virtualization if you are using VMs).

    2. Focus power management on minimal system intervention and management. Set the Maximum Performance profile if it is available on the server.

    3. Enable P-states and (unrestricted) Turbo mode.

    4. Disable C-states (or change the preference to C0/C1) and T-states (very important for high-bandwidth applications such as media).

    5. It is better to enable Turbo mode on only a minimal number of cores.

     

    C-state

    To disable C-states on your system, add the following code snippet to your application, or run it in a separate process while your application is running:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int set_low_latency()
    {
        /* Writing 0 to /dev/cpu_dma_latency requests zero DMA latency,
           which prevents the CPUs from entering deeper C-states. */
        uint32_t lat = 0;
        int fd = open("/dev/cpu_dma_latency", O_RDWR);
        if (fd == -1) {
            fprintf(stderr, "Failed to open cpu_dma_latency: error %s\n", strerror(errno));
            return fd;
        }
        write(fd, &lat, sizeof(lat));
        /* Keep fd open: the request is dropped when the descriptor is closed. */
        return fd;
    }
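
    Note that the low-latency request remains in effect only while the file descriptor is open, so keep it open for the lifetime of your application.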

     

    CPU Frequency

    Check the maximum available frequency for the CPU (useful commands: cpupower frequency-info, lshw, lscpu).

    Monitor the CPU for activity, and check the current frequency of the cores.

     

    Useful commands to extract CPU core status:

    $ cat /proc/cpuinfo | sed -n '/^processor\|^cpu MHz/p'
    $ turbostat --interval 1

    The goal is to run the dedicated cores at their highest available frequency and to prevent the OS from scheduling other work on them while VMA uses them.

     

    [Figures: core CPU frequency not optimized vs. core CPU frequency optimized]

     

    Linux grub.conf file Configuration

    The grub.conf configuration depends on the kernel version, distribution, and server configuration.

    Add the following parameters to the kernel/linux lines (not all are required; consider each individually). A combined example follows the table.

     

    Flag                     Example (use carefully; depends on the system)

    intel_pstate             intel_pstate=enable
    intel_idle.max_cstate    intel_idle.max_cstate=0
    mce                      mce=ignore_ce
    processor.max_cstate     processor.max_cstate=0
    idle                     idle=poll
    isolcpus                 isolcpus=1-6 (see: CPU affinity)
                             Isolates the specified cores from the general scheduler.
    nohz_full                nohz_full=1-6 (see: CPU affinity)
                             Frequent clock ticks cause latency; select the cores on which
                             to reduce tick interrupts (this cannot be done on all cores).
    rcu_nocbs                rcu_nocbs=1-6 (see: CPU affinity)
                             The specified CPUs are added to the RCU-offload list. RCU never
                             prevents offloaded CPUs from entering dyntick-idle or
                             adaptive-tick mode.
    nosoftlockup             nosoftlockup
    nmi_watchdog             nmi_watchdog=0
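
    As a combined illustration, the kernel command line in /etc/default/grub might look like the following (a sketch only; the core ranges and the grub2-mkconfig invocation are assumptions that depend on your distribution and CPU layout):

    GRUB_CMDLINE_LINUX="intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll mce=ignore_ce isolcpus=1-6 nohz_full=1-6 rcu_nocbs=1-6 nosoftlockup nmi_watchdog=0"

    $ grub2-mkconfig -o /boot/grub2/grub.cfg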

     

     

    Other OS Tuning

    1. Disable all services that are not essential to the required task, for example: cups, gpm, ip6tables, mdmonitor, mdmpd, bluetooth, iptables, irqbalance, sysstat.
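
    On a systemd-based distribution, this might look like the following (the two service names are taken from the list above; verify which services exist on your system):

    $ systemctl disable --now irqbalance bluetooth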

     

    2. The following services should be enabled, if available: cpuspeed, nscd, crond, ntpd, ntp, network, tuned.

     

    3. Set IRQ (interrupt request) affinity; refer to What is IRQ Affinity?
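
    For example, a NIC interrupt can be pinned to a dedicated core by writing a CPU mask to procfs (the IRQ number 42 below is hypothetical; look up the real one in /proc/interrupts):

    $ echo 80 > /proc/irq/42/smp_affinity      # mask 0x80 = CPU 7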

     

    4. Set a system profile focused on network performance/latency:

    $ tuned-adm profile latency-performance

    $ cpupower frequency-set --governor performance

     

    5. To verify that tuned is running with the correct policy:

    $ tuned-adm active

    See How To Set CPU Scaling Governor to Max Performance (scaling_governor)

     

    6. Turn off NUMA balancing:

    $ echo 0 > /proc/sys/kernel/numa_balancing

     

    7. Configure tuned.conf

    Add to tuned.conf:

    [bootloader]

    cmdline = audit=0 idle=poll nosoftlockup mce=ignore_ce

    In tuned-main.conf, change sleep_interval from 1 to 100:

    sleep_interval = 100

     

    8. Recommended configurations for reducing scheduler preemption and task migration:

    $ echo 100000000 > /proc/sys/kernel/sched_min_granularity_ns

    $ echo 50000000  > /proc/sys/kernel/sched_migration_cost_ns

     

    9. Other recommended configurations for reducing system interference:

    $ echo 0 > /proc/sys/vm/swappiness          (equivalent: sysctl -w vm.swappiness=0)

    $ sysctl -w vm.zone_reclaim_mode=0

    $ echo never > /sys/kernel/mm/transparent_hugepage/enabled
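
    To keep the vm settings across reboots, the same values can also be placed in /etc/sysctl.conf (shown here as an illustration):

    vm.swappiness = 0
    vm.zone_reclaim_mode = 0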

     

    VMA Tuning

     

    Prerequisite

    For specific VMA configurations, set the parameters on the command line when you run VMA, together with LD_PRELOAD, for example:

    $ VMA_SPEC=latency LD_PRELOAD=libvma.so ./my-application

    Selecting the Right NUMA Node and Cores

    On a machine with two NUMA nodes, it is important to select the NUMA node closest to the NIC being used.

    To find the NUMA node closest to the card:

    $  sudo mst status -v

    [Figure: mst status -v output showing the NUMA node of each device]

     

    Checking which cores are located on each NUMA node:

    $  lscpu

    [Figure: lscpu output showing which cores belong to each NUMA node]

     

    Defining the NUMA node and cores:

    In the example below, the process is bound to NUMA node 1 and to the specific cores 7 and 9:

    $ VMA_SPEC=latency LD_PRELOAD=$VMA_LOAD numactl --cpunodebind=1 taskset -c 7,9 sockperf ...

     

    Tuning VMA for Latency

    1. When running VMA, use CPU affinity to pin the process to the dedicated cores; refer to What is CPU Affinity?

     

    2. Use the VMA_SPEC flag for common VMA latency configurations.

         From the VMA User Manual, VMA_SPEC is a VMA predefined specification profile:

     

    3. Latency profile spec: optimizes latency for all use cases. The system is tuned to keep a balance between the kernel and VMA. Note: it may not reach the maximum bandwidth.

     

    4. Multi-ring latency spec: optimized for latency-sensitive use cases in which two applications communicate using send-only and receive-only TCP sockets.

         Examples: VMA_SPEC=latency, VMA_SPEC=multi_ring_latency
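
    For instance, to run an application with the multi-ring latency profile (reusing the invocation style shown under Prerequisite):

    $ VMA_SPEC=multi_ring_latency LD_PRELOAD=libvma.so ./my-application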

     

    Memory Allocation Type

    Using Huge Pages can improve system performance by reducing the amount of system resources required to access page table entries.

    Before running VMA, enable huge pages in the kernel and in VMA, for example:

    $ echo 1000000000 > /proc/sys/kernel/shmmax

    $ echo 800 > /proc/sys/vm/nr_hugepages

     

    Note: Increase the amount of shared memory (bytes) and the number of Huge Pages if you receive a warning about an insufficient number of huge pages allocated in the system.

     

    When VMA_MEM_ALLOC_TYPE=2 is set, VMA attempts to allocate data buffers as Huge Pages.
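
    For example (the application name reuses the earlier examples):

    $ VMA_MEM_ALLOC_TYPE=2 LD_PRELOAD=libvma.so ./my-application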

     

    Reducing Memory Footprint

    A smaller memory footprint reduces cache misses, thereby improving performance. Configure the following parameters to reduce the memory footprint:

    If your application uses small messages, reduce the VMA MTU using VMA_MTU=200. The default number of RX buffers is 200K; reduce it to 30-60K using, for example, VMA_RX_BUFS=30000.

     

    Note: This value must not be less than the value of VMA_RX_WRE times the number of offloaded interfaces. The same can be done for TX buffers by changing VMA_TX_BUFS and VMA_TX_WRE.
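
    A combined example for a small-message application (the values are the ones suggested above; tune them to your workload):

    $ VMA_MTU=200 VMA_RX_BUFS=30000 LD_PRELOAD=libvma.so ./my-application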

     

     

    Polling Configurations

    You can improve performance by setting the following polling configurations. Increase the number of times to unsuccessfully poll an Rx for VMA packets before going to sleep, using VMA_RX_POLL=200000, or use infinite polling with VMA_RX_POLL=-1. This setting is recommended when Rx-path latency is critical and CPU usage is not.

     

    Increase the duration in microseconds (usec) to poll the hardware on the Rx path before blocking for an interrupt, using VMA_SELECT_POLL=100000, or use infinite polling with VMA_SELECT_POLL=-1. This setting increases the number of times the select path successfully receives poll hits, which improves latency at the cost of increased CPU utilization.

     

    Disable polling of the OS by setting VMA_RX_POLL_OS_RATIO=0 and VMA_SELECT_POLL_OS=0. When these are disabled, only offloaded sockets are polled.
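
    A combined latency-oriented polling setup might look like this (all parameter values come from the text above):

    $ VMA_RX_POLL=-1 VMA_SELECT_POLL=-1 VMA_RX_POLL_OS_RATIO=0 VMA_SELECT_POLL_OS=0 LD_PRELOAD=libvma.so ./my-application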

     

     

    Handling Single-Threaded Processes

    You can improve performance for single-threaded processes by changing the threading parameter to VMA_THREAD_MODE=0. This setting eliminates VMA locks, improving performance.

    Set VMA_MEM_ALLOC_TYPE=2 so that VMA attempts to allocate data buffers as Huge Pages (see Memory Allocation Type above).
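
    For example, for a single-threaded application:

    $ VMA_THREAD_MODE=0 VMA_MEM_ALLOC_TYPE=2 LD_PRELOAD=libvma.so ./my-application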