NBDX Performance Tuning and Benchmark

Version 10

    This post describes a simple setup, tuning procedure, and benchmark steps for both the server and the client to optimize NBDX performance in a multi-core Linux environment over an InfiniBand link layer.

     

    Note: Since the NVMe over Fabrics specification was published, nbdX has become obsolete. For more information about NVMe over Fabrics configuration, refer to HowTo Configure NVMe over Fabrics.

     

    References

    • HowTo Configure NVMe over Fabrics
    • BIOS Performance Tuning Example
    • Mellanox Performance Tuning Guide
    • MSI-HOWTO.txt (Linux kernel documentation)
    • ISER Performance Tuning and Benchmark
    • Storage and File Systems (Red Hat Enterprise Linux Performance Tuning Guide)

    BIOS Tuning

    It is recommended to tune the BIOS for maximum performance. For further information, refer to BIOS Performance Tuning Example and the Mellanox Performance Tuning Guide.

     

    IRQ Balancer

    This is relevant to both the client and the server. On Linux, interrupts are handled by the kernel. irqbalance is a daemon responsible for distributing interrupts across processors. When working with a modern device that uses per-core interrupts (via MSI-X), performance is in most cases better when the IRQ balancer is disabled and the IRQ affinity is set manually.

    Note: The IRQ balancer is ON by default on most systems; disabling it may harm the performance of other devices in the system that are not optimized for a multi-core environment.
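
    For example, on a RHEL 7 system the balancer state can be checked and the daemon stopped as follows (a minimal sketch; systemctl and the legacy service command are interchangeable here):

    # Check whether the irqbalance daemon is running, then stop it for the benchmark.
    systemctl status irqbalance
    systemctl stop irqbalance        # equivalent to: service irqbalance stop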

     

    IRQ Affinity Settings

    This is relevant to both the client and the server. Make sure to spread the MSI-X interrupt vectors across all system cores.
    For further information on MSI-X and IRQ affinity, refer to MSI-HOWTO.txt.
    For more information on block-layer tuning, see ISER Performance Tuning and Benchmark.
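
    If the set_irq_affinity_cpulist.sh helper shipped with MLNX_OFED is not available, the per-vector affinity can also be set manually through procfs. The following is a minimal sketch; the mlx5 pattern used to match the device's interrupt vectors and the 48-core count are assumptions taken from the benchmark setup below and should be adapted to your system.

    # Manually spread the device's MSI-X vectors round-robin across CPUs 0-47 (sketch).
    cpu=0
    for irq in $(grep mlx5 /proc/interrupts | awk -F: '{print $1}')
    do
        echo $cpu > /proc/irq/$irq/smp_affinity_list   # pin this vector to a single core
        cpu=$(( (cpu + 1) % 48 ))
    done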

     

    Also see Red Hat's performance tuning recommendations in Storage and File Systems, specifically sections 5.1.2 and 5.3.6.

     

    Benchmark Setting Example

    Use this example to reach 5.7 million IOPS over a single-port, back-to-back (B2B) connection between an NBDX client and raio_server.

     

    Hardware:
    • HP ProLiant DL380 Gen9
    • 128GB RAM
    • 2x NUMA nodes
    • 2x threads per core (hyper-threading enabled)
    • 12x cores per socket
    • 48x CPUs: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
    • PCIe Gen-3
    • ConnectX-4
    • Single link, 100Gb EDR
    • Firmware version: 12.12.1100

     

    Software:
    • OS: Red Hat Enterprise Linux Server release 7.1
    • MLNX_OFED_LINUX-3.1-1.0.3
    • Kernel: 3.10.0-229.el7.x86_64
    • Accelio for_next branch, commit 9cea8291787b72a746e42964d5de42d6d48f0e0d
    • fio-2.2.9-45-gc6d1
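
    The versions listed above can be verified on a given system with standard commands (a sketch; ofed_info is installed as part of MLNX_OFED):

    # Verify the installed software stack.
    uname -r          # kernel version
    ofed_info -s      # MLNX_OFED version
    fio --version     # fio version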

     

    Note: The NBDX code was modified specifically for these performance tests so that it compiles against the inbox RH7.1 kernel. Without this modification, the NBDX code cannot be compiled against the inbox RH7.1 kernel; it is therefore recommended to install kernel 3.19 or above in order to run NBDX over RH7.1.

     

    Perform the following configuration on the server side:

    1. Stop the IRQ balancer.

    # service irqbalance stop

     

    2. Spread the MSI-X interrupt vectors across all system cores.

    # set_irq_affinity_cpulist.sh 0-47 ib0

     

    3. Create 12 raio_server instances (each using a different CPU mask) with 4 threads per process.

    port=5000
    mask=f
    for i in `seq 12`
    do
        # Each instance binds to the next group of 4 cores and listens on its own port.
        raio_server -a 12.137.166.1 -p $port -t rdma -c $mask -f 0 -n 4 &
        mask+=0                # append a hex 0: shifts the 4-core mask up by 4 cores
        port=$((port+100))
    done
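
    To confirm that all 12 raio_server instances are up and bound to the intended cores, standard process tools can be used (a sketch; exact output varies):

    # List the raio_server processes with their -p/-c arguments and show each one's CPU affinity.
    ps -ef | grep '[r]aio_server'
    for pid in $(pgrep raio_server); do taskset -pc $pid; done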

     

    Perform the following configuration on the client side:

    1. Stop the IRQ balancer.

    # service irqbalance stop

     

    2. Spread the MSI-X interrupt vectors across all system cores.

    # set_irq_affinity_cpulist.sh 0-47 ib0

     

    3. Create 12 NBDX hosts (one for each raio_server process).

    port=5000
    for i in `seq 12`
    do
        # One NBDX host per raio_server instance, matched by port.
        nbdxadm -o create_host -i $i -p "12.137.166.1:$port"
        port=$((port+100))
    done
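
    Whether each host connected to its raio_server instance can usually be confirmed from the kernel log (a sketch; the exact nbdx log messages are not reproduced here):

    # Look for nbdx-related messages after creating the hosts.
    dmesg | grep -i nbdx | tail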

     

    4. Create 12 NBDX devices (one for each NBDX host).

    for i in `seq 12`
    do
        nbdxadm -o create_device -i $i -d $i -f /dev/null
    done
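
    Before running the benchmark, verify that the 12 block devices were created (a sketch; the /dev/nbdx<N> naming matches the fio command in the next step):

    # Verify that the NBDX block devices are visible to the block layer.
    ls -l /dev/nbdx*
    grep nbdx /proc/partitions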

     

    5. Run IO using the fio command on all devices.

    # fio --group_reporting --rw=randread --bs=512 --numjobs=4 --iodepth=64 --runtime=99999999 --time_based --loops=1 \
        --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --exitall \
        --name task_1 --filename=/dev/nbdx1 \
        --name task_2 --filename=/dev/nbdx2 \
        --name task_3 --filename=/dev/nbdx3 \
        --name task_4 --filename=/dev/nbdx4 \
        --name task_5 --filename=/dev/nbdx5 \
        --name task_6 --filename=/dev/nbdx6 \
        --name task_7 --filename=/dev/nbdx7 \
        --name task_8 --filename=/dev/nbdx8 \
        --name task_9 --filename=/dev/nbdx9 \
        --name task_10 --filename=/dev/nbdx10 \
        --name task_11 --filename=/dev/nbdx11 \
        --name task_12 --filename=/dev/nbdx12
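
    While fio is running, per-device throughput can be watched from a second terminal, for example with iostat from the sysstat package (a sketch):

    # Watch per-device read IOPS (r/s); the sum over nbdx1..nbdx12 should approach the expected aggregate.
    iostat -x 1 | grep nbdx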