VMA Performance Tuning Guide (old)

Version 6

    Note: This post is outdated. Refer to the VMA Performance Tuning Guide instead.

     

     

    This post provides guidelines for improving performance with VMA. It is intended for administrators who are familiar with VMA and should be used in conjunction with the VMA User Manual and the VMA Release Notes.

    You can minimize latency by tuning VMA parameters. It is recommended to test VMA performance tuning on an actual application.

    We suggest that you try the following VMA parameters one by one and in combination to find the optimum for your application.

    For more information about each parameter, see the VMA User Manual.

     


     

    General

    To tune VMA, set VMA configuration parameters as environment variables on the command line that launches your application, after LD_PRELOAD. For example:

    # LD_PRELOAD=libvma.so VMA_MTU=200 ./my-application

     

    Memory Allocation Type

    We recommend using contiguous pages (the default). To use huge pages instead, do the following:

    Before running VMA, raise the kernel shared memory limit and reserve huge pages, for example:

    # echo 1000000000 > /proc/sys/kernel/shmmax

    # echo 800 > /proc/sys/vm/nr_hugepages

     

    Note: If you receive a warning that too few huge pages are allocated in the system, increase the shared memory limit (shmmax, in bytes) and the number of huge pages (nr_hugepages).

     

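    Set VMA_MEM_ALLOC_TYPE to the huge pages allocation type (2 in the VMA releases that document this parameter; confirm the value in the VMA User Manual for your version). When set, VMA attempts to allocate data buffers as huge pages.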
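    The huge page steps above can be checked before launching the application. The following is a minimal sketch, assuming a Linux host whose /proc/meminfo reports a HugePages_Total line. The helper names hugepages_total and have_hugepages, and their optional file argument, are hypothetical and not part of VMA.

```shell
#!/bin/sh
# Hypothetical helper: print the HugePages_Total value from a meminfo-style file.
# Defaults to /proc/meminfo; the optional argument exists only to ease testing.
hugepages_total() {
    awk '/^HugePages_Total:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: succeed if at least $1 huge pages are configured.
# $2 optionally names a meminfo-style file, as above.
have_hugepages() {
    total=$(hugepages_total "${2:-/proc/meminfo}")
    # Treat a missing HugePages_Total line as zero configured pages.
    [ "${total:-0}" -ge "$1" ]
}
```

    A possible use, assuming that VMA_MEM_ALLOC_TYPE=2 selects huge pages in your VMA version:

    # have_hugepages 800 && LD_PRELOAD=libvma.so VMA_MEM_ALLOC_TYPE=2 ./my-application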

     

    Reducing Memory Footprint

    A smaller memory footprint reduces cache misses thereby improving performance. Configure the following parameters to reduce the memory footprint:

    If your application uses small messages, reduce the VMA MTU using VMA_MTU=200.

    The default number of RX buffers is 200K. Reduce the number of RX buffers to between 30K and 60K, for example using VMA_RX_BUFS=30000.

     

    Note: VMA_RX_BUFS must not be less than the value of VMA_RX_WRE multiplied by the number of offloaded interfaces. TX buffers can be reduced in the same way by changing VMA_TX_BUFS and VMA_TX_WRE.
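    The lower bound in the note above is simple arithmetic and can be checked before choosing a buffer count. The following is a minimal sketch. The helper names min_rx_bufs and rx_bufs_ok are hypothetical, and the numbers in the usage lines are illustrative rather than VMA defaults.

```shell
#!/bin/sh
# Hypothetical helper: smallest VMA_RX_BUFS permitted by the note above,
# that is, VMA_RX_WRE ($1) multiplied by the number of offloaded interfaces ($2).
min_rx_bufs() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: succeed if a proposed VMA_RX_BUFS ($1) is large enough
# for the given VMA_RX_WRE ($2) and number of offloaded interfaces ($3).
rx_bufs_ok() {
    [ "$1" -ge "$(min_rx_bufs "$2" "$3")" ]
}
```

    For example, with an illustrative VMA_RX_WRE of 16000 and two offloaded interfaces, the minimum is 32000, so VMA_RX_BUFS=30000 would be too small unless VMA_RX_WRE is lowered as well:

    # min_rx_bufs 16000 2
    # rx_bufs_ok 30000 16000 2 || echo "VMA_RX_BUFS too small"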

     

    Polling Configurations

    You can improve performance by setting the following polling configurations.

    Increase the number of times VMA polls the Rx path unsuccessfully before going to sleep, using VMA_RX_POLL=200000, or enable infinite polling using VMA_RX_POLL=-1. This setting is recommended when Rx path latency is critical and CPU usage is not.

     

    Increase the duration, in microseconds (usec), for which VMA polls the hardware on the Rx path before blocking for an interrupt, using VMA_SELECT_POLL=100000, or enable infinite polling using VMA_SELECT_POLL=-1. This setting increases the number of poll hits on the select path, which improves latency at the cost of higher CPU utilization.

     

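    Disable polling of OS (non-offloaded) sockets by setting VMA_RX_POLL_OS_RATIO and VMA_SELECT_POLL_OS to 0. When these are disabled, only offloaded sockets are polled. (The VMA README lists the select-path parameter as VMA_SELECT_POLL_OS_RATIO; check the VMA User Manual for the name used by your version.)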
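    The polling settings above can be collected in one place so that they are easy to try together. The following is a minimal sketch. The helper name vma_polling_env is hypothetical, and the parameter names are taken from the text above and should be checked against the VMA User Manual for your version.

```shell
#!/bin/sh
# Hypothetical helper: print the polling settings described above as
# VAR=value pairs, one per line.
#   $1: VMA_RX_POLL value     (defaults to -1, infinite polling)
#   $2: VMA_SELECT_POLL value (defaults to -1, infinite polling)
# OS polling is disabled (0) so that only offloaded sockets are polled.
vma_polling_env() {
    printf 'VMA_RX_POLL=%s\n' "${1:--1}"
    printf 'VMA_SELECT_POLL=%s\n' "${2:--1}"
    printf 'VMA_RX_POLL_OS_RATIO=0\n'
    printf 'VMA_SELECT_POLL_OS=0\n'
}
```

    A possible use, passing the generated pairs to env ahead of LD_PRELOAD:

    # env $(vma_polling_env 200000 100000) LD_PRELOAD=libvma.so ./my-application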

     

    Handling Single-Threaded Processes

    You can improve performance for single-threaded processes by setting the threading parameter to VMA_THREAD_MODE=0. This setting helps to eliminate VMA locks.