Rivermax Linux Performance Tuning Guide

Version 2

    This post provides guidelines for improving performance with Rivermax for Linux. It is intended for Rivermax users and should be used in conjunction with the Rivermax User Manual and the Rivermax Release Notes. Tuning your server maximizes throughput, letting you achieve more with a single Mellanox NIC. Most of these recommendations were tested by our performance team, but we encourage you to test their influence on your own setup.

     


    Server Tuning

    For a server to receive high bandwidth, verify that the PCIe bus is configured with sufficient link width (more than 50 Gb/s requires x16).

    To check the supported and negotiated PCIe width, do the following:

    1. run

         sudo mst status -v

    2. locate the PCI address you are using according to the network interface.

    3. run

         sudo lspci -vvv -s[PCI_ADDR]

    4. verify that the Width reported on the LnkSta line matches the Width on the LnkCap line.

     

    See the example below, where the supported and negotiated widths are both x16:

    [root@r-aa-a]$ sudo mst status -v

    MST modules:

    ------------

        MST PCI module is not loaded

        MST PCI configuration module loaded

    PCI devices:

    ------------

    DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA

    ConnectX5(rev:0)        /dev/mst/mt4121_pciconf0.1    03:00.1   mlx5_1          net-ens1f1                0

    ConnectX5(rev:0)        /dev/mst/mt4121_pciconf0      03:00.0   mlx5_0          net-ens1f0                0

    [root@r-aa-a]$ sudo lspci -vvv -s03:00.0

    LnkCap: Port #0, Speed unknown, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited

    LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

     

     

    BIOS Tuning

    Follow the recommendations in Understanding BIOS Configuration for Performance Tuning:

     

    1. Disable hyper-threading and virtualization (re-enable virtualization if you are using VMs).

    2. Focus power management on minimal system intervention. Set the Maximum Performance profile if available on the server.

    3. Enable P-states and (unrestricted) Turbo Mode.

    4. Disable C-states (or restrict them to C0/C1) and T-states (very important for high-bandwidth applications such as media).

    5. Enabling Turbo mode on only a minimal number of cores is preferable.

     

    C-state

    To disable C-states system-wide, add the code snippet below to your application, or run it in a separate process while your application is running. Writing 0 to /dev/cpu_dma_latency requests the minimum CPU DMA latency, which prevents the cores from entering deep C-states:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Open /dev/cpu_dma_latency and request zero latency; the request
     * remains active only while the returned file descriptor stays open. */
    int set_low_latency(void)
    {
        uint32_t lat = 0;
        int fd = open("/dev/cpu_dma_latency", O_RDWR);
        if (fd == -1) {
            fprintf(stderr, "Failed to open cpu_dma_latency: error %s\n", strerror(errno));
            return fd;
        }
        write(fd, &lat, sizeof(lat));
        return fd;
    }
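    Keep the returned file descriptor open for the lifetime of the application; the kernel reverts to the default C-state policy as soon as the descriptor is closed, so do not close it until shutdown.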

    Pause Frames

    Disable Ethernet flow control (pause frames) on the interface:

    ethtool -A [interface] rx off tx off
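    To confirm the change took effect, query the current pause-frame settings (ens1f0 is the example interface from the output above):

    $ ethtool -a ens1f0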

    CPU Frequency

    Check the maximum available frequency for the CPU (useful commands: cpupower frequency-info, lshw, lscpu).

    Monitor the CPU for activity, and check the current frequency of the cores.

     

    Useful commands to extract CPU core status:

    $ cat /proc/cpuinfo | sed -n '/^processor\|^cpu MHz/p'
    $ turbostat --interval 1

    Our goal is to raise the dedicated cores to their highest available frequency and to prevent the OS from scheduling other work on them while Rivermax uses them.
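    One way to do this, sketched below, is to apply the performance governor to the dedicated cores only; the core list 1-6 is an example and should match your isolcpus configuration:

    $ sudo cpupower -c 1-6 frequency-set --governor performance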

     

    [Figure: core CPU frequency not optimized (left) vs. core CPU frequency optimized (right)]

     

    Linux grub.conf file Configuration

    The grub.conf configuration depends on the kernel version, distribution, and server configuration.

    Add the following parameters to the kernel/linux lines (not all are required; consider each individually):

     

    Flag                   Example (use carefully, depends on system)   Notes
    intel_pstate           intel_pstate=enable
    intel_idle.max_cstate  intel_idle.max_cstate=0
    mce                    mce=ignore_ce
    processor.max_cstate   processor.max_cstate=0
    idle                   idle=poll
    isolcpus               isolcpus=1-6 (see: CPU affinity)             Isolates specific cores from the general scheduler
    nohz_full              nohz_full=1-6 (see: CPU affinity)            Frequent clock ticks cause latency; select the cores on which to reduce tick interrupts (this cannot be done on all cores)
    rcu_nocbs              rcu_nocbs=1-6 (see: CPU affinity)            The specified CPUs are placed on the offloaded list; RCU never prevents offloaded CPUs from entering dyntick-idle or adaptive-tick mode
    nosoftlockup           nosoftlockup
    nmi_watchdog           nmi_watchdog=0
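    As a concrete sketch for a grub2-based distribution, the parameters are typically appended to GRUB_CMDLINE_LINUX in /etc/default/grub (the core list 1-6 is an example, and only the flags relevant to your system should be kept):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll mce=ignore_ce isolcpus=1-6 nohz_full=1-6 rcu_nocbs=1-6 nosoftlockup nmi_watchdog=0"

    # Regenerate the grub configuration and reboot (the grub.cfg path differs on EFI systems):
    $ sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    $ sudo reboot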

     

     

    Other OS Tuning

    1. Disable all services not essential to the required task, for example: cups, gpm, ip6tables, mdmonitor, mdmpd, bluetooth, iptables, irqbalance, sysstat (see the example below).
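    For example, with systemd (service names vary by distribution; irqbalance in particular conflicts with the manual IRQ affinity set in step 3):

    $ sudo systemctl disable --now irqbalance

    $ sudo systemctl disable --now bluetooth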

     

    2. The following services should be enabled, if available: cpuspeed, nscd, crond, ntpd, ntp, network, tuned

     

    3. Set IRQ (interrupt request) affinity; refer to What is IRQ Affinity? A generic sketch follows.
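    A minimal generic sketch (the IRQ number 50 and CPU mask 0x4 are placeholders; list your interface's IRQs in /proc/interrupts first):

    $ grep ens1f0 /proc/interrupts

    $ echo 4 > /proc/irq/50/smp_affinity    # pin IRQ 50 to CPU 2 (hex bitmask)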

     

    4. Set a system profile focused on network performance/latency:

    $ tuned-adm profile network-throughput

    $ cpupower frequency-set --governor performance

     

    5. To check that tuned is running with the correct profile:

    $ tuned-adm active

    See How To Set CPU Scaling Governor to Max Performance (scaling_governor)

     

    6. Turn off NUMA balancing:

    $ echo 0 > /proc/sys/kernel/numa_balancing

     

    7. Configure tuned.conf

    Add to tuned.conf:

    [bootloader]

    cmdline = audit=0 idle=poll nosoftlockup mce=ignore_ce

    Change in tuned-main.conf (raising these intervals reduces tuned's own periodic wakeups, which gives the biggest improvement in latency outliers):

    # How long to sleep before checking for events (in seconds);
    # a higher number means lower overhead but a longer response time.
    sleep_interval = 1  ===> change to 100

    # Update interval for dynamic tunings (in seconds);
    # it must be a multiple of sleep_interval.
    update_interval = 10  ===> change to 10000

     

    8. Recommended configurations for reducing scheduler interference:

    $ echo 100000000 > /proc/sys/kernel/sched_min_granularity_ns

    $ echo 50000000  > /proc/sys/kernel/sched_migration_cost_ns

     

    9. Other recommended configurations for reducing swapping and memory-management overhead:

    $ sysctl -w vm.swappiness=0

    $ sysctl -w vm.zone_reclaim_mode=0

    $ echo never > /sys/kernel/mm/transparent_hugepage/enabled
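    Note that these echo/sysctl settings do not persist across reboots. A sketch for persisting the sysctl values (the file name is illustrative; any file under /etc/sysctl.d/ is applied at boot):

    # /etc/sysctl.d/99-rivermax-tuning.conf
    vm.swappiness = 0
    vm.zone_reclaim_mode = 0
    kernel.numa_balancing = 0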

     

    Selecting the right NUMA and Core

    On a machine with two NUMA nodes, it is important to select cores on the NUMA node closest to the NIC being used.

    In order to find the NUMA closest to the card:

    $  sudo mst status -v

    [Figure: mst status -v output; the NUMA column shows the node for each device]

     

    Checking which core is located on each NUMA:

    $  lscpu

    [Figure: lscpu output showing which CPU cores belong to each NUMA node]
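    The NUMA node can also be read directly from sysfs, and the application can then be bound to it. A sketch following the sample output above (device mlx5_0 on node 0; the application name is a placeholder):

    $ cat /sys/class/infiniband/mlx5_0/device/numa_node
    0

    $ numactl --cpunodebind=0 --membind=0 ./your_rivermax_app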

     

    Huge pages

    Using Huge Pages can improve system performance by reducing the amount of system resources required to access page table entries.

    Before running Rivermax, enable hugepages:

    $ echo 1000000000 > /proc/sys/kernel/shmmax

    $ echo 800 > /proc/sys/vm/nr_hugepages
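    To verify the allocation took effect:

    $ grep Huge /proc/meminfo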