iSER Performance Tuning and Benchmark


    This post shows a simple tuning procedure and benchmark steps for both target and initiator to optimize iSER block storage performance in a multi-core Linux environment.

     


    Overview

    Hyper-threading

    This is relevant for both Initiators and Targets. Disabling hyper-threading can improve performance. This setting is done via the BIOS configuration (see the Performance Tuning Guide for more details).
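    With hyper-threading disabled, lscpu should report a single thread per core; a quick sanity check after the BIOS change:

    # lscpu | grep 'Thread(s) per core'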

     

    IRQ balancer

    This is relevant for both Initiators and Targets. On Linux, interrupts are handled by the kernel. irqbalancer is a process that is responsible for balancing interrupts across processors. When working with a modern device that utilizes per-core interrupt (via MSIX), in most cases performance will be better when disabling this IRQ balancer (and setting IRQ affinity as explained next).
    IRQ balancer is ON by default on most systems. Note that disabling the IRQ may hurt performance of other devices in the system that are not optimized to a multi-core environment.
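    For example, on RHEL-style systems the daemon can be stopped for the current session and prevented from starting at boot (use whichever command matches your init system):

    # service irqbalance stop
    # chkconfig irqbalance off        (SysV init, e.g. RHEL 6)
    # systemctl disable irqbalance    (systemd, e.g. RHEL 7)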

    IRQ affinity settings

    This is relevant for both Initiators and Targets. Spread the MSI-X interrupt vectors across all system cores.
    See more on MSI-X and IRQ affinity in MSI-HOWTO.txt [LWN.net].
    For example, the following bash script evenly spreads the IRQs of all mlx4 and mlx5 devices across the available cores. It should be run once after boot. Note that if the IRQ balancer is not stopped, it may interfere with these settings.

    #!/bin/bash
    # Spread the IRQs of all mlx4/mlx5 devices evenly across the available cores.
    IRQS=$(grep -E 'mlx4|mlx5' /proc/interrupts | awk '{print $1}' | sed 's/://')
    cores=($(seq 1 $(grep -c processor /proc/cpuinfo)))
    i=0
    for IRQ in $IRQS
    do
        core=${cores[$i]}
        # Build a hex CPU mask with only this core's bit set.
        let "mask=2**(core-1)"
        echo $(printf "%x" $mask) > /proc/irq/$IRQ/smp_affinity
        let "i+=1"
        # Wrap around after the last core.
        if [[ $i == ${#cores[@]} ]]; then i=0; fi
    done
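    To verify the assignment, print the affinity mask of each Mellanox IRQ and check that the masks differ:

    for IRQ in $(grep -E 'mlx4|mlx5' /proc/interrupts | awk '{print $1}' | sed 's/://')
    do
        echo "IRQ $IRQ: $(cat /proc/irq/$IRQ/smp_affinity)"
    done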

    Alternatively, Mellanox OFED provides scripts that apply these settings automatically (see the Performance Tuning Guide).

     

    CPU scaling

    This is relevant for both Initiators and Targets. Set the CPU scaling_governor parameter to performance (if supported). Note that this will make the CPU consume more power, even during idle times.

    # for c in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do echo performance > $c; done

     


    Block layer staging

    This is relevant for Initiators only. For each block device, set:

    # echo noop > /sys/block/$dev/queue/scheduler

     

    This sets the IO scheduler to no-operation. IO schedulers try to accelerate HDD access times by minimizing seeks. When working with SAN targets, it is normally better to let the target machine perform these optimizations if needed (a single LUN is normally not backed by a single HDD...). In addition, SSDs do not suffer from seek time.

    # echo 2 > /sys/block/$dev/queue/nomerges

    Normally the block layer tries to merge IOs to consecutive offsets. On fast SAN networks it may be better not to merge, and save the CPU cycles.

    # echo 0 > /sys/block/$dev/queue/add_random

     

    The system uses block device events to gather entropy for its random number generator. Some CPU utilization can be saved by turning this off.

    # echo 1 > /sys/block/$dev/queue/rq_affinity

     

    Deliver IO completion on the same core that handled the request.
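    The four settings above can be applied in one pass. A minimal sketch, where DEVS is a placeholder for your list of block device names:

    # DEVS is a placeholder, e.g. DEVS="sdb sdc sdd"
    for dev in $DEVS
    do
        echo noop > /sys/block/$dev/queue/scheduler
        echo 2 > /sys/block/$dev/queue/nomerges
        echo 0 > /sys/block/$dev/queue/add_random
        echo 1 > /sys/block/$dev/queue/rq_affinity
    done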

    For more information, see Red Hat's performance tuning recommendations:

    Chapter 5. Storage and File Systems, and specifically sections 5.1.2 and 5.3.6.

     

    Huge pages settings

    This is relevant for Targets only. If you are running a user-space target such as TGT, set a large number of HugePages in order to improve the cache hit rate. For kernel-space targets such as LIO and SCST, it is recommended to reduce the number of huge pages to a minimum (even 0) in order to leave more room for the page-cache.
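    To check how many huge pages are currently reserved (HugePages_Total reflects the vm.nr_hugepages value):

    # grep HugePages_Total /proc/meminfo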

    Benchmark Example

    1.85 MIOPs between a single iSER initiator and a single iSER target (TGT).

     

    Hardware:
    • HP ProLiant SL230s Gen8
    • 32GB RAM
    • 2x NUMA nodes
    • 16x Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (8 CPUs per NUMA node)
    • PCIe Gen-3
    • ConnectX-3
    • Single link 40GbE/56Gb FDR
    • Firmware version: 2.32.1078

     

    Software:
    • Initiator:
      • OS: Red Hat Enterprise Linux Server release 6.4 (Santiago)
      • Kernel: 2.6.32-358.el6.x86_64
      • iser-1.5.0 (provided by Mellanox)
      • libaio-0.3.107-10.el6.x86_64
      • fio-2.1.4-1.el6.rf.x86_64
    • Target:
      • OS: Red Hat Enterprise Linux Server release 7.0 (Maipo)
      • Kernel: 3.10.0-123.el7.x86_64
      • scsi-target-utils-1.0.44-0.x86_64

     

    NOTE: In order to reach this IOPs level over RoCE, a mlx4_core source code modification is needed (not yet available in MLNX OFED or upstream). The modification allows RDMA applications to share completion vectors with mlx4_en. Without it, iSER can only use 3 completion vectors and will not scale up to 2M IOPs. On InfiniBand no modification is required.

     

    TGT server

    1. Disable hyper-threading in the BIOS
    2. Stop IRQ balancer

    # service irqbalance stop

    3. Spread MSI-X interrupt vectors across all system cores as explained above.
    4. Set the CPU scaling_governor parameter to performance

    # for c in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do echo performance > $c; done

     

    5. Set nr_hugepages

    If you are running a user-space target such as TGT, set a large number of HugePages (helps cache hit rates):

    # echo 3000 > /proc/sys/vm/nr_hugepages

    For a kernel-space target such as LIO or SCST, set a low (or zero) number of HugePages (leaves more room for the page-cache):

    # echo 0 > /proc/sys/vm/nr_hugepages

     

    6. Create 2 TGT instances (each using a different CQ vector) on the TGT server.

    port=3260
    for i in 0 1
    do
        # Pin each tgtd instance to its own core; each instance uses its own CQ vector.
        taskset -c $i tgtd -C $port --iscsi portal=*:$port --iser port=$port cq_vector=$i
        let "port+=1"
    done

     

    7. Create 16 logical targets (8 per TGT instance) on the TGT server.

    for port in 3260 3261
    do
        for i in `seq 1 8`
        do
            tgt-setup-lun -n tgt-$i -d /tmp/null -b null -t iser -C $port
        done
    done

     

    Note: the "-C $port" parameter is supported since tgt v1.0.31.
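    To verify that each instance exposes its 8 targets, you can query both control ports. This is a minimal sketch assuming tgtadm accepts the same "-C $port" control-port parameter as tgt-setup-lun:

    for port in 3260 3261
    do
        # assumption: -C selects the control port of the matching tgtd instance
        tgtadm -C $port --lld iser --mode target --op show
    done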

     

    Initiator

    1. Disable hyper-threading in the BIOS

     

    2. Stop IRQ balancer

    # service irqbalance stop          

     

    3. Spread MSI-X interrupt vectors across all system cores as explained above.

     

    4. Set the CPU scaling_governor parameter to performance

    # for c in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do echo performance > $c; done

    5. Disable memory registration for contiguous memory regions

    # modprobe ib_iser always_register=N
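    To confirm the parameter took effect, read it back through sysfs, where kernel module parameters are exposed:

    # cat /sys/module/ib_iser/parameters/always_register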

     

    6. Discover and log in to all targets

    for port in 3260 3261
    do
        iscsiadm -m discovery -t st -p <ip>:$port -I iser -l
    done
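    After the login completes, the initiator should show 16 iSER sessions (one per logical target):

    # iscsiadm -m session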

     

     

    7. For each block device, minimize staging effects (all four settings from the Block layer staging section above):

    for dev in $DEVS
    do
        echo noop > /sys/block/$dev/queue/scheduler
        echo 2 > /sys/block/$dev/queue/nomerges
        echo 0 > /sys/block/$dev/queue/add_random
        echo 1 > /sys/block/$dev/queue/rq_affinity
    done
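    Here DEVS is assumed to hold the short device names of the iSER-attached disks; with the enumeration used in step 8 below it would be:

    # assumption: the 16 iSER LUNs appear as sdb..sdq, matching the fio example in step 8
    DEVS="sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq"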

    8. Run IO on all devices using fio

    # fio --rw=randread --bs=512 --numjobs=1 --iodepth=128 --runtime=99999999 --time_based --loops=1 \
      --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --exitall \
      --name task_sdb --filename=/dev/sdb \
      --name task_sdc --filename=/dev/sdc \
      --name task_sdd --filename=/dev/sdd \
      --name task_sde --filename=/dev/sde \
      --name task_sdf --filename=/dev/sdf \
      --name task_sdg --filename=/dev/sdg \
      --name task_sdh --filename=/dev/sdh \
      --name task_sdi --filename=/dev/sdi \
      --name task_sdj --filename=/dev/sdj \
      --name task_sdk --filename=/dev/sdk \
      --name task_sdl --filename=/dev/sdl \
      --name task_sdm --filename=/dev/sdm \
      --name task_sdn --filename=/dev/sdn \
      --name task_sdo --filename=/dev/sdo \
      --name task_sdp --filename=/dev/sdp \
      --name task_sdq --filename=/dev/sdq
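    Listing 16 --name/--filename pairs by hand is error-prone. The sketch below builds the same job list from the DEVS variable used in step 7; it is an equivalent alternative, not part of the original procedure:

    args=""
    for dev in $DEVS
    do
        # one fio job per iSER-attached device
        args="$args --name task_$dev --filename=/dev/$dev"
    done
    fio --rw=randread --bs=512 --numjobs=1 --iodepth=128 --runtime=99999999 --time_based --loops=1 \
        --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --exitall \
        $args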