Simple NVMe-oF Target Offload Benchmark


    This post describes an NVMe-over-Fabrics (NVMe-oF) target offload benchmark test that demonstrates several performance improvements: reduced CPU utilization (0% in the I/O path) and fewer interrupts and context switches.

    Results were achieved using PCIe peer-to-peer capabilities to transfer data between the network and the NVMe drive without any software involvement.

     

    [Figure: P2P_DIAGRAM.PNG — PCIe peer-to-peer data path between the network adapter and the NVMe drive]

     


    Setup

    In this benchmark test, we use two servers, each installed with a ConnectX-5 dual-port adapter, connected to each other back to back on both ports.

    For the test, we configure one port to run NVMe-oF target offload, while the other port runs NVMe-oF without offload.

    Make sure you install MLNX_OFED 4.1 or later.

     

     

    Target Configuration

    To improve performance, before configuring the NVMe-oF target, set the mem/memmap kernel boot parameters so that the nvmet_rdma module can use unmapped contiguous memory.

    For example:

    In a system with 64 GB of RAM, set mem=59392M memmap=59392M and reboot your server (after booting again, you can verify the reduced amount with the "grep MemTotal /proc/meminfo" command).
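    One way to apply these boot parameters is via GRUB. This is only a sketch: the file paths and the grub2-mkconfig command assume a RHEL/CentOS-style GRUB2 setup, so adjust them for your distribution.

    ```shell
    # Append the parameters to the kernel command line in /etc/default/grub:
    #   GRUB_CMDLINE_LINUX="... mem=59392M memmap=59392M"
    # Then regenerate the GRUB configuration and reboot:
    grub2-mkconfig -o /boot/grub2/grub.cfg
    reboot

    # After the reboot, verify that the kernel sees only the reduced memory:
    grep MemTotal /proc/meminfo
    ```
    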

    Here is an example script for such a configuration.

    #!/bin/bash

     

    echo "...Setting up NVME Target"

     

    echo "1. enabling nvme modules"

    modprobe nvme num_p2p_queues=2

    modprobe nvmet

    # offload_mem_start points into the physical memory region excluded from the kernel via the mem=/memmap= boot parameters
    modprobe nvmet-rdma offload_mem_start=0xf00000000 offload_mem_size=2048 offload_buffer_size=512

     

    echo "2. enable subsystem-1 port 1 IP 3.3.3.6 - NVME target offload"

     

    mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem

    echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_allow_any_host

    mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1

    echo -n /dev/nvme0n1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/device_path

    echo "0000:85:00.0" > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/pci_device_path

    echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/enable

    mkdir /sys/kernel/config/nvmet/ports/1

    echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid

    echo 3.3.3.6 > /sys/kernel/config/nvmet/ports/1/addr_traddr

    echo "rdma" > /sys/kernel/config/nvmet/ports/1/addr_trtype

    echo "ipv4" > /sys/kernel/config/nvmet/ports/1/addr_adrfam

    ln -s /sys/kernel/config/nvmet/subsystems/testsubsystem/ /sys/kernel/config/nvmet/ports/1/subsystems/testsubsystem

     

     

    echo "3. enable subsystem-2 port 2 IP 4.4.4.6 - NVME target not offload"

     

    mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem2

    echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem2/attr_allow_any_host

    mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem2/namespaces/2

    echo -n /dev/nvme0n1 > /sys/kernel/config/nvmet/subsystems/testsubsystem2/namespaces/2/device_path

    echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem2/namespaces/2/enable

    mkdir /sys/kernel/config/nvmet/ports/2

    echo 4420 > /sys/kernel/config/nvmet/ports/2/addr_trsvcid

    echo 4.4.4.6 > /sys/kernel/config/nvmet/ports/2/addr_traddr

    echo "rdma" > /sys/kernel/config/nvmet/ports/2/addr_trtype

    echo "ipv4" > /sys/kernel/config/nvmet/ports/2/addr_adrfam

    ln -s /sys/kernel/config/nvmet/subsystems/testsubsystem2/ /sys/kernel/config/nvmet/ports/2/subsystems/testsubsystem2

     

    echo "... done"

     

    Client Connection

    Here is an example script for client connectivity.

     

    modprobe nvme

    modprobe nvme-rdma

     

    nvme discover -t rdma -a 3.3.3.6 -s 4420

    nvme connect -t rdma -n testsubsystem -a 3.3.3.6 -s 4420

     

    nvme discover -t rdma -a 4.4.4.6 -s 4420

    nvme connect -t rdma -n testsubsystem2 -a 4.4.4.6 -s 4420

     

     

     

    Benchmark

     

    1. Run nvme list to see the connected devices.

    # nvme list

    Node             SN                   Model                                    Namespace Usage                      Format           FW Rev 

    ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------

    /dev/nvme0n1     ed7ad76581c89bde     Linux                                    1         480.10  GB / 480.10  GB    512   B +  0 B   4.8.7

    /dev/nvme1n1     a9ae518980908038     Linux                                    2         480.10  GB / 480.10  GB    512   B +  0 B   4.8.7

     

    In this example, we created two devices:

    • /dev/nvme0n1: the offloaded NVMe target device

    • /dev/nvme1n1: the NVMe target device that is not offloaded

     

    There are several ways to run benchmark traffic and check CPU utilization; in this example, we use fio for benchmark testing and top + vmstat for monitoring.

     

    2. Run fio on the client server against the offloaded device /dev/nvme0n1.

     

    # fio --bs=64k --numjobs=16 --iodepth=4 --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --time_based --runtime=60 --filename=/dev/nvme0n1  --name=read-phase --rw=randread

     

    3. Run the top and vmstat tools on the target server.

     

    top

    top - 16:30:49 up 40 min,  2 users,  load average: 0.02, 0.02, 0.00

    Tasks: 733 total,   1 running, 732 sleeping,   0 stopped,   0 zombie

    %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

    KiB Mem : 13191878+total, 12301135+free,  8292068 used,   615368 buff/cache

    KiB Swap:  2047996 total,  2047996 free,        0 used. 12255074+avail Mem

     

     

       PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                               

         8 root      20   0       0      0      0 S   0.3  0.0   0:01.19 rcu_sched                                                                                                                             

      1022 root      20   0       0      0      0 S   0.3  0.0   0:02.66 kworker/u288:6                                                                                                                        

      9969 root      20   0   43124   4032   2752 R   0.3  0.0   0:00.06 top                                                                                                                                   

         1 root      20   0   45984   9904   3764 S   0.0  0.0   0:09.79 systemd                                                                                                                               

         2 root      20   0       0      0      0 S   0.0  0.0   0:00.01 kthreadd                                                                                                                              

         3 root      20   0       0      0      0 S   0.0  0.0   0:00.06 ksoftirqd/0                                                                                                                           

         5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H                                                                                                                          

         9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh 

     

    Nothing special, almost zero CPU load.

     

    vmstat -t 1

    # vmstat -t 1

    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----

    r  b   swpd   free      buff  cache   si   so    bi    bo   in   cs us sy id wa  st            PDT

    0  0      0 123014016  35272 580972    0    0     0     0  490 1535  0  0 100  0  0 2017-06-23 16:31:57

    0  0      0 123014016  35272 580972    0    0     0     0  438 1517  0  0 100  0  0 2017-06-23 16:31:58

    0  0      0 123013872  35272 580972    0    0     0     0  474 1505  0  0 100  0  0 2017-06-23 16:31:59

    0  0      0 123013872  35272 580972    0    0     0     0  407 1405  0  0 100  0  0 2017-06-23 16:32:00

    0  0      0 123013872  35272 580972    0    0     0     0  429 1322  0  0 100  0  0 2017-06-23 16:32:01

    0  0      0 123013872  35272 580972    0    0     0     0  434 1536  0  0 100  0  0 2017-06-23 16:32:02

    0  0      0 123013872  35272 580972    0    0     0     0  478 1505  0  0 100  0  0 2017-06-23 16:32:03

    0  0      0 123013984  35272 580972    0    0     0     0  437 1547  0  0 100  0  0 2017-06-23 16:32:04

    0  0      0 123014032  35272 580940    0    0     0     0  482 1528  0  0 100  0  0 2017-06-23 16:32:05

     

    Check the columns:

    • in (interrupts): ~450
    • cs (context switches): ~1,500

     

    4. Run fio on the client server against the non-offloaded NVMe device /dev/nvme1n1 (on the other port).

    # fio --bs=64k --numjobs=16 --iodepth=4 --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --time_based --runtime=60 --filename=/dev/nvme1n1  --name=read-phase --rw=randread

     

    5. Run the top and vmstat tools on the target server.

     

    top

    top - 16:38:45 up 48 min,  2 users,  load average: 0.05, 0.03, 0.00

    Tasks: 730 total,   2 running, 728 sleeping,   0 stopped,   0 zombie

    %Cpu(s):  0.0 us,  0.2 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

    KiB Mem : 13191878+total, 12301384+free,  8288616 used,   616332 buff/cache

    KiB Swap:  2047996 total,  2047996 free,        0 used. 12255352+avail Mem

     

     

       PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                               

      6572 root       0 -20       0      0      0 R   2.0  0.0   0:00.55 kworker/42:1H                                                                                                                         

      1488 root       0 -20       0      0      0 S   0.7  0.0   0:00.27 kworker/48:1H                                                                                                                         

      1489 root       0 -20       0      0      0 S   0.7  0.0   0:00.57 kworker/25:1H                                                                                                                         

      1830 root       0 -20       0      0      0 S   0.7  0.0   0:00.28 kworker/18:1H                                                                                                                         

      1987 root       0 -20       0      0      0 S   0.7  0.0   0:00.48 kworker/16:1H                                                                                                                         

      2532 root       0 -20       0      0      0 S   0.7  0.0   0:00.67 kworker/17:1H                                                                                                                         

      2707 root       0 -20       0      0      0 S   0.7  0.0   0:00.06 kworker/46:1H                                                                                                                         

      2785 root       0 -20       0      0      0 S   0.7  0.0   0:00.56 kworker/23:1H                                                                                                                         

      6402 root       0 -20       0      0      0 S   0.7  0.0   0:00.66 kworker/53:1H                                                                                                                         

      6592 root       0 -20       0      0      0 S   0.7  0.0   0:00.06 kworker/55:1H                                                                                                                         

      8128 root       0 -20       0      0      0 S   0.7  0.0   0:00.56 kworker/54:1H                                                                                                                         

      8129 root       0 -20       0      0      0 S   0.7  0.0   0:00.27 kworker/51:1H                                                                                                                         

      1026 root      20   0       0      0      0 S   0.3  0.0   0:04.71 kworker/u288:8                                                                                                                        

      1684 root      20   0   20228   3612   2312 S   0.3  0.0   0:04.36 irqbalance                                                                                                                            

      1769 root      20   0    4380   1432   1344 S   0.3  0.0   0:00.54 rngd                                                                                                                                  

      2953 root       0 -20       0      0      0 S   0.3  0.0   0:00.16 kworker/19:1H                                                                                                                         

      5120 root       0 -20       0      0      0 S   0.3  0.0   0:00.52 kworker/20:1H                                                                                                                         

      6521 root      20   0  145440   8784   7464 S   0.3  0.0   0:00.58 sshd                                                                                                                                  

      7638 root       0 -20       0      0      0 S   0.3  0.0   0:00.42 kworker/27:1H            

     

    Suddenly, you get a list of kworker processes that consume CPU cycles.

     

    vmstat -t 1

    # vmstat -t 1

    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp------------

    r  b   swpd   free      buff  cache   si   so    bi     bo   in   cs    us sy id wa st              PDT

    0  0      0 123020864  35316 581580    0    0 295168     0 26762 19392  0  0 100  0  0 2017-06-23 16:40:39

    0  0      0 123021352  35316 581644    0    0 296576     0 26743 19856  0  0 100  0  0 2017-06-23 16:40:40

    1  0      0 123019936  35316 581724    0    0 300992     0 27711 21045  0  0 99   0  0 2017-06-23 16:40:41

    0  0      0 123020544  35316 581668    0    0 298432     0 26983 19744  0  0 100  0  0 2017-06-23 16:40:42

    1  0      0 123020656  35316 581668    0    0 295232    16 26580 19549  0  0 100  0  0 2017-06-23 16:40:43

    2  0      0 123021088  35316 581668    0    0 310976     0 27920 20365  0  0 100  0  0 2017-06-23 16:40:44

    0  0      0 123020960  35316 581668    0    0 289120     0 26297 19258  0  0 100  0  0 2017-06-23 16:40:45

    0  0      0 123020832  35324 581660    0    0 236448    16 21475 16191  0  0 100  0  0 2017-06-23 16:40:46

     

    Check the columns:

    • in (interrupts): ~27,000
    • cs (context switches): ~20,000

     

    Compare the results: with target offload, the host CPU stays idle (~450 interrupts and ~1,500 context switches per second), while the same workload on the non-offloaded port generates ~27,000 interrupts and ~20,000 context switches per second.
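    Rather than eyeballing the vmstat output, the two runs can be compared numerically. Below is a small helper sketch (not part of the original post) that averages the "in" and "cs" columns of a captured vmstat log; the sample lines and the /tmp/vmstat_offload.log file name are illustrative, taken from the offloaded run above.

    ```shell
    # Sample data from the offloaded run; in practice, capture a real log with
    # something like: vmstat -t 1 60 > /tmp/vmstat_offload.log
    cat > /tmp/vmstat_offload.log <<'EOF'
     0  0      0 123014016  35272 580972    0    0     0     0  490 1535  0  0 100  0  0 2017-06-23 16:31:57
     0  0      0 123014016  35272 580972    0    0     0     0  438 1517  0  0 100  0  0 2017-06-23 16:31:58
     0  0      0 123013872  35272 580972    0    0     0     0  474 1505  0  0 100  0  0 2017-06-23 16:31:59
    EOF

    # Data lines start with a digit; "in" is field 11 and "cs" is field 12.
    awk '$1 ~ /^[0-9]+$/ { in_sum += $11; cs_sum += $12; n++ }
         END { printf "samples=%d avg_in=%d avg_cs=%d\n", n, in_sum/n, cs_sum/n }' \
        /tmp/vmstat_offload.log
    ```

    Running the same command on a log captured during the non-offloaded run makes the gap in interrupt and context-switch rates directly comparable.
    
    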