HowTo Configure NVMe over Fabrics

Version 56

    This post is a quick guide to bringing up an NVMe over Fabrics (NVMEoF) host-to-target association using the RDMA transport layer.

    NVMEoF can run over any RDMA-capable adapter (e.g. ConnectX-3/ConnectX-4) using an InfiniBand or RoCE link layer.

     


     

    Note: This post focuses on the NVMEoF configuration of the target and the host, and assumes that the RDMA layer is already enabled and working. Refer to RDMA/RoCE Solutions for topics related to the RDMA layer.

     


     

    Setup

    • Two servers, one configured as NVMe target, and the other used as NVMe host (initiator).
    • In this example, the servers were configured with CentOS v7.2 and kernel v4.8.7.

     

     

    Configuration Video By Mellanox Academy

     

     

     

    Before you Start

     

    Using MLNX_OFED

    Note that MLNX_OFED does not have to be installed on the servers. If MLNX_OFED is needed, install v3.4.2 or later.

    See HowTo Install MLNX_OFED Driver and make sure to install it with the --add-kernel-support and --with-nvmf flags.

     

    # ./mlnxofedinstall --add-kernel-support --with-nvmf

     

     

    Benchmarks

    Make sure that the RDMA layer is configured correctly and running.

    Test RDMA performance using one of the standard methods; see, for example: HowTo Enable, Verify and Troubleshoot RDMA.

    If MLNX_OFED is not installed for RDMA benchmark testing, follow HowTo Enable Perftest Package for Upstream Kernel to verify that the RDMA layer is working correctly using the perftest package (ib_send_bw, ib_write_bw, ...).
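
    For example, a quick RDMA bandwidth sanity check with the perftest tools could look like the following sketch. The device name (mlx5_0) and the IP address here are placeholders, not values taken from this setup; adjust them to your adapters.

    On one server (e.g. the target), start the ib_write_bw server side:

    # ib_write_bw -d mlx5_0

    On the other server, run the client side against the first server's IP:

    # ib_write_bw -d mlx5_0 1.1.1.1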

     

    InfiniBand Network Considerations

    This post uses an Ethernet network in its examples. The InfiniBand configuration is almost identical, since NVMEoF is agnostic to the underlying link layer.

    To enable NVMEoF over an InfiniBand network (see the sketch after this list):

    • Set the port type to InfiniBand.
    • Make sure that a Subnet Manager (SM) is running in the fabric.
    • Align the configuration (e.g. the IP address) with the IPoIB interface (e.g. ib0 instead of enp2s0f0 in the example below).
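
    A minimal sketch of these steps, assuming the IPoIB interface is named ib0 and OpenSM is started locally (the exact way to start OpenSM may differ per distribution or MLNX_OFED installation):

    # ibv_devinfo | grep -i -e link_layer -e state        <-- expect link_layer InfiniBand and PORT_ACTIVE

    # opensm -B        <-- or start the OpenSM service provided by your distribution

    # ip addr add 1.1.1.1/24 dev ib0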

     

    Prerequisites

    1. Follow HowTo Compile Linux Kernel for NVMe over Fabrics and make sure that the NVMe modules are available on both the client and target servers.
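
    As a quick check (assuming the kernel config file is available under /boot, as on CentOS), verify that the NVMe over Fabrics options were built and that the modules resolve:

    # grep -E "NVME_RDMA|NVME_TARGET" /boot/config-$(uname -r)

    # modinfo nvme_rdma nvmet_rdma | grep filename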

     

    2. Make sure that the mlx4 (ConnectX-3/ConnectX-3 Pro) or mlx5 (ConnectX-4/ConnectX-4 Lx) drivers are loaded.

     

    mlx4 Driver Example:

    # modprobe mlx4_core

     

    # lsmod | grep mlx

    mlx4_ib               148806  0

    ib_core               195846  13 rdma_cm,ib_cm,iw_cm,rpcrdma,mlx4_ib,ib_srp,ib_ucm,ib_iser,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib,ib_isert

    mlx4_en                97313  0

    ptp                    12434  1 mlx4_en

    mlx4_core             294165  2 mlx4_en,mlx4_ib

     

    mlx5 Driver Example:

    # modprobe mlx5_core

     

    # lsmod | grep mlx

    mlx5_ib               167936  0

    ib_core               208896  14 ib_iser,ib_cm,rdma_cm,ib_umad,ib_srp,ib_isert,ib_uverbs,rpcrdma,ib_ipoib,iw_cm,mlx5_ib,ib_srpt,ib_ucm,rdma_ucm

    mlx5_core             188416  1 mlx5_ib

     

    3. On the target server, load nvmet and nvmet-rdma kernel modules.

    # modprobe nvmet

    # modprobe nvmet-rdma

    # modprobe nvme-rdma        <-- This is to run a client on the target server (if needed)

     

    # lsmod | grep nvme

    nvmet_rdma             24576  1

    nvmet                  49152  7 nvmet_rdma

    rdma_cm                53248  2 rdma_ucm,nvmet_rdma

    ib_core               237568  11 ib_cm,rdma_cm,ib_umad,ib_uverbs,ib_ipoib,iw_cm,mlx5_ib,ib_ucm,rdma_ucm,nvmet_rdma,mlx4_ib

    mlx_compat             16384  16 ib_cm,rdma_cm,ib_umad,ib_core,ib_uverbs,nvmet,mlx4_en,ib_ipoib,mlx5_core,iw_cm,mlx5_ib,mlx4_core,ib_ucm,rdma_ucm,nvmet_rdma,mlx4_ib

    nvme                   28672  2

    nvme_core              36864  3 nvme

     

    4. On the client server, load nvme-rdma kernel module.

    # modprobe nvme-rdma


    # lsmod | grep nvme

    nvme_rdma              28672  0

    nvme_fabrics           20480  1 nvme_rdma

    nvme                   28672  0

    nvme_core              49152  3 nvme_fabrics,nvme_rdma,nvme

    rdma_cm                53248  2 nvme_rdma,rdma_ucm

    ib_core               237568  11 ib_cm,rdma_cm,ib_umad,nvme_rdma,ib_uverbs,ib_ipoib,iw_cm,mlx5_ib,ib_ucm,rdma_ucm,mlx4_ib

    mlx_compat             16384  18 ib_cm,rdma_cm,ib_umad,nvme_fabrics,ib_core,nvme_rdma,ib_uverbs,nvme,nvme_core,mlx4_en,ib_ipoib,mlx5_core,iw_cm,mlx5_ib,mlx4_core,ib_ucm,rdma_ucm,mlx4_ib

     

    NVME Target Configuration

     

    Prerequisites

    • The NVMEoF target requires a block device to be used as a backing store (a null block device can also be used as a backing store for benchmarking).
    • Make sure you have a suitable block device to assign to the nvmet-rdma subsystem.
    • Using the IB tools (e.g. ibv_devinfo), check that your target has a valid RDMA device with an IP address configured; see the verification example below.
    • If you use InfiniBand, make sure that OpenSM is running.

    For more information about the NVMe subsystem, refer to: http://www.nvmexpress.org/specifications/
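
    A quick way to verify these prerequisites might look like the following (the interface name enp2s0f0 matches the example below; adjust it to your setup):

    # ibv_devinfo | grep -e hca_id -e state        <-- RDMA device present and PORT_ACTIVE

    # ip addr show enp2s0f0        <-- IP address configured on the port

    # lsblk        <-- a block device available for the backing store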

     

    1. Create the nvmet-rdma subsystem using the ‘mkdir /sys/kernel/config/nvmet/subsystems/<name_of_subsystem>’ command. Any subsystem name can be used.

    # mkdir /sys/kernel/config/nvmet/subsystems/nvme-subsystem-name

    # cd /sys/kernel/config/nvmet/subsystems/nvme-subsystem-name

     

    2. Allow any host to connect to this target.

    # echo 1 > attr_allow_any_host

     

    Note: ACLs are supported; a minimal example is sketched below, but they are not covered in detail in this post.
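
    For reference only, a host ACL is configured through the same configfs tree by registering the host NQN and linking it to the subsystem instead of allowing any host. The following is a minimal sketch; the host NQN below is a placeholder (on the host, nvme-cli typically keeps the value in /etc/nvme/hostnqn):

    # echo 0 > attr_allow_any_host

    # mkdir /sys/kernel/config/nvmet/hosts/nqn.2014-08.com.example:host1

    # ln -s /sys/kernel/config/nvmet/hosts/nqn.2014-08.com.example:host1 /sys/kernel/config/nvmet/subsystems/nvme-subsystem-name/allowed_hosts/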

     

    3. Create a namespace inside the subsystem using the ‘mkdir namespaces/<ns_num>’ command, where ns_num is the number of the namespace to create (similar to a LUN).

    # mkdir namespaces/10

    # cd namespaces/10

     

    4. Set the path to the NVMe device (e.g. /dev/nvme0n1) and enable the namespace.

    # echo -n /dev/nvme0n1 > device_path

    # echo 1 > enable

     

    Note: The enable command will fail if you do not have an NVMe device installed. For NVMEoF network benchmarking, you can use a null block device instead.

    # modprobe null_blk nr_devices=1

     

    #  ls /dev/nullb0

    /dev/nullb0

     

    # echo -n /dev/nullb0 > device_path

    # echo 1 > enable

     

    5. Create a directory for an NVMe port using the ‘mkdir /sys/kernel/config/nvmet/ports/<port_number>’ command. Any port number can be used.

    # mkdir /sys/kernel/config/nvmet/ports/1

    # cd /sys/kernel/config/nvmet/ports/1

     

    6. Set the IP address of the relevant port using the ‘echo <ip_address> > addr_traddr’ command (traddr is the transport address).

     

    Set the IP address on the Mellanox adapter. For example:

    # ip addr add 1.1.1.1/24 dev enp2s0f0

     

    The address you configured on the port should be the address on which the NVMe target listens (1.1.1.1 in this example). Run:

    # echo 1.1.1.1 > addr_traddr

     

    7. Set RDMA as the transport type and set the RDMA transport service port. Any port number can be used. In the following example, the RDMA port is 4420 (the default IANA assignment, see here).

    # echo rdma > addr_trtype

    # echo 4420 > addr_trsvcid

     

    8. Set IPv4 as the Address Family of the port:

    # echo ipv4 > addr_adrfam

     

    9. Create a soft link to bind the subsystem to the port:

    # ln -s /sys/kernel/config/nvmet/subsystems/nvme-subsystem-name   /sys/kernel/config/nvmet/ports/1/subsystems/nvme-subsystem-name

     

    10. Check dmesg to make sure that the NVMe target is listening on the port:

    # dmesg | grep "enabling port"

    [ 1066.294179] nvmet_rdma: enabling port 1 (1.1.1.1:4420)

     

    At this point, the NVMe target is ready to accept connection requests.
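
    Optionally, review the resulting configfs layout to confirm the values written in the steps above:

    # cat /sys/kernel/config/nvmet/ports/1/addr_traddr /sys/kernel/config/nvmet/ports/1/addr_trsvcid

    # ls /sys/kernel/config/nvmet/ports/1/subsystems/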

     

    NVMe Client (Initiator) Configuration

     

    NVMe has a user-space utility, nvme-cli, for executing NVMe commands. This tool supports the NVMEoF functionality and is essential for some of the operations below.

     

    1. Install nvme-cli. Clone nvme-cli from its Git repository.

    # git clone https://github.com/linux-nvme/nvme-cli.git

    Cloning into 'nvme-cli'...

    remote: Counting objects: 1741, done.

    remote: Total 1741 (delta 0), reused 0 (delta 0), pack-reused 1741

    Receiving objects: 100% (1741/1741), 862.69 KiB | 384.00 KiB/s, done.

    Resolving deltas: 100% (1188/1188), done.

     

    2. Compile and install nvme-cli. Run make and make install:

    # cd nvme-cli

    # make

    ...

    # make install

    ...

     

    3. Verify the installation by running the nvme command:

    # nvme

    nvme-0.8

    usage: nvme <command> [<device>] [<args>]

     

    The '<device>' may be either an NVMe character device (ex: /dev/nvme0) or an

    nvme block device (ex: /dev/nvme0n1).

     

    The following are all implemented sub-commands:

      list            List all NVMe devices and namespaces on machine

      id-ctrl         Send NVMe Identify Controller

      id-ns           Send NVMe Identify Namespace, display structure

      list-ns         Send NVMe Identify List, display structure

      create-ns       Creates a namespace with the provided parameters

      delete-ns       Deletes a namespace from the controller

      attach-ns       Attaches a namespace to requested controller(s)

      detach-ns       Detaches a namespace from requested controller(s)

      list-ctrl       Send NVMe Identify Controller List, display structure

      get-ns-id       Retrieve the namespace ID of opened block device

      get-log         Generic NVMe get log, returns log in raw format

      fw-log          Retrieve FW Log, show it

      smart-log       Retrieve SMART Log, show it

      smart-log-add   Retrieve additional SMART Log, show it

      error-log       Retrieve Error Log, show it

      get-feature     Get feature and show the resulting value

      set-feature     Set a feature and show the resulting value

      format          Format namespace with new block format

      fw-activate     Activate new firmware slot

      fw-download     Download new firmware

      admin-passthru  Submit arbitrary admin command, return results

      io-passthru     Submit an arbitrary IO command, return results

      security-send   Submit a Security Send command, return results

      security-recv   Submit a Security Receive command, return results

      resv-acquire    Submit a Reservation Acquire, return results

      resv-register   Submit a Reservation Register, return results

      resv-release    Submit a Reservation Release, return results

      resv-report     Submit a Reservation Report, return results

      dsm             Submit a Data Set Management command, return results

      flush           Submit a Flush command, return results

      compare         Submit a Compare command, return results

      read            Submit a read command, return results

      write           Submit a write command, return results

      write-zeroes    Submit a write zeroes command, return results

      write-uncor     Submit a write uncorrectable command, return results

      reset           Resets the controller

      subsystem-reset Resets the controller

      show-regs       Shows the controller registers. Requires admin character device

      discover        Discover NVMeoF subsystems

      connect-all     Discover and Connect to NVMeoF subsystems

      connect         Connect to NVMeoF subsystem

      disconnect      Disconnect from NVMeoF subsystem

      version         Shows the program version

      help            Display this help

     

    See 'nvme help <command>' for more information on a specific command

     

    The following are all installed plugin extensions:

      intel           Intel vendor specific extensions

      lnvm            LightNVM specific extensions

     

    See 'nvme <plugin> help' for more information on a plugin

     

    4. Re-check that the nvme-rdma module is loaded. If not, load it using ‘modprobe nvme-rdma’.

    # lsmod | grep nvme

    nvme_rdma              19605  0

    nvme_fabrics           10929  1 nvme_rdma

    nvme_core              43067  2 nvme_fabrics,nvme_rdma

    rdma_cm                45356  5 rpcrdma,nvme_rdma,ib_iser,rdma_ucm,ib_isert

    ib_core               195846  14 rdma_cm,ib_cm,iw_cm,rpcrdma,mlx4_ib,ib_srp,ib_ucm,nvme_rdma,ib_iser,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib,ib_isert

     

    5. Discover the available subsystems on the NVMEoF target using the ‘nvme discover -t rdma -a <target_ip_address> -s <port_number>’ command.

    Make sure to use the IP address of the target port.

    # nvme discover -t rdma -a 1.1.1.1 -s 4420

     

     

    Discovery Log Number of Records 1, Generation counter 1

    =====Discovery Log Entry 0======

    trtype:  rdma

    adrfam:  ipv4

    subtype: nvme subsystem

    treq:    not specified

    portid:  1

    trsvcid: 4420

     

    subnqn:  nvme-subsystem-name

    traddr:  1.1.1.1

     

    rdma_prtype: not specified

    rdma_qptype: connected

    rdma_cms:    rdma-cm

    rdma_pkey: 0x0000

     

    Note: Make sure you are aware of the subnqn name. In this case, the value is nvme-subsystem-name.

     

    6. Connect to the discovered subsystem using the command: ‘nvme connect -t rdma -n <discovered_sub_nqn> -a <target_ip_address> -s <port_number>’

    # nvme connect -t rdma -n nvme-subsystem-name -a 1.1.1.1 -s 4420

     

    # lsblk

    NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT

    sda               8:0    0 930.4G  0 disk

    ├─sda2            8:2    0 929.9G  0 part

    │ ├─centos-swap 253:1    0  31.5G  0 lvm  [SWAP]

    │ ├─centos-home 253:2    0   100G  0 lvm  /home

    │ └─centos-root 253:0    0 798.4G  0 lvm  /

    └─sda1            8:1    0   500M  0 part /boot

     

    nvme0n1         259:0    0   250G  0 disk

     

    Note: The nvme0n1 block device was created.

     

    7. To disconnect from the target, run the nvme disconnect command:

    # nvme disconnect -d /dev/nvme0n1

     

    Fast Startup and Persistent Configuration Scripts

     

    Target Configuration

    1. Create a persistent interface configuration:

    # cat /etc/sysconfig/network-scripts/ifcfg-enp2s0f0

    DEVICE=enp2s0f0

    BOOTPROTO=static

    IPADDR=1.1.1.1

    NETMASK=255.255.255.0

    ONBOOT=yes

     

    2. Copy the following script to /etc/rc.d/rc.local (or create a startup script for Linux, see here):

    #!/bin/bash

    ...

    # NVME Target Configuration

    # Assuming the following:

    # Interface is enp2s0f0

    # IP is 1.1.1.1/24

    # link is Up

    # Using NULL Block device nullb0

    # Change the parameters below to suit your setup

     

    modprobe mlx5_core

    modprobe nvmet

    modprobe nvmet-rdma

    modprobe nvme-rdma

    modprobe null_blk nr_devices=1

     

    mkdir /sys/kernel/config/nvmet/subsystems/nvme-subsystem-name

    cd /sys/kernel/config/nvmet/subsystems/nvme-subsystem-name

    echo 1 > attr_allow_any_host

    mkdir namespaces/10

    cd namespaces/10

    echo -n /dev/nullb0 > device_path

    echo 1 > enable

    mkdir /sys/kernel/config/nvmet/ports/1

    cd /sys/kernel/config/nvmet/ports/1

    echo 1.1.1.1 > addr_traddr

    echo rdma > addr_trtype

    echo 4420 > addr_trsvcid

    echo ipv4 > addr_adrfam

    ln -s /sys/kernel/config/nvmet/subsystems/nvme-subsystem-name   /sys/kernel/config/nvmet/ports/1/subsystems/nvme-subsystem-name

     

    # End of NVME Target Configuration

     

    3. Make sure that the mode is +x:

    # chmod ugo+x /etc/rc.d/rc.local

     

    4. Reboot the server:

    # reboot

     

    5. Verify that the target is enabled on the interface:

    # lsmod | grep nvme

    nvme_rdma              28672  0

    nvme_fabrics           20480  1 nvme_rdma

    nvme_core              45056  2 nvme_fabrics,nvme_rdma

    nvmet_rdma             24576  1

    nvmet                  49152  7 nvmet_rdma

    rdma_cm                53248  3 nvme_rdma,rdma_ucm,nvmet_rdma

    ib_core               147456  14 ib_cm,rdma_cm,ib_umad,nvme_rdma,ib_uverbs,ib_mad,ib_ipoib,ib_sa,iw_cm,mlx5_ib,ib_ucm,rdma_ucm,nvmet_rdma,mlx4_ib

     

    # dmesg | grep "enabling port"

    [   55.766228] enabling port 1 (1.1.1.1:4420)

     

    Client Configuration

    1. Create a persistent interface configuration:

    # cat /etc/sysconfig/network-scripts/ifcfg-enp2s0f0

    DEVICE=enp2s0f0

    BOOTPROTO=static

    IPADDR=1.1.1.2

    NETMASK=255.255.255.0

    ONBOOT=yes

     

    2. Copy the following script to /etc/rc.d/rc.local (or create a startup script for Linux, see here).

    Note: nvme-cli should be installed (see the first step under the NVMe Client (Initiator) Configuration section above).

    #!/bin/bash

    ...

    # NVME Client Configuration

    # Assuming the following:

    #   Interface is enp2s0f0

    #   IP is 1.1.1.2/24, remote target is 1.1.1.1/24

    #   link is Up

    #   nvme-cli is installed

     

    modprobe mlx5_core

    modprobe nvme-rdma

     

    nvme discover -t rdma -a 1.1.1.1 -s 4420

    nvme connect -t rdma -n nvme-subsystem-name -a 1.1.1.1 -s 4420

     

    # End of NVME Client Configuration

     

    3. Make sure that the mode is +x:

    # chmod ugo+x /etc/rc.d/rc.local

     

    4. Reboot the server. Make sure that the target is enabled and up:

    # reboot

     

    5. Run lsmod and lsblk:

    # lsmod | grep nvme

    nvme_rdma              28672  0

    nvme_fabrics           20480  1 nvme_rdma

    nvme_core              45056  2 nvme_fabrics,nvme_rdma

    rdma_cm                53248  2 nvme_rdma,rdma_ucm

    ib_core               147456  13 ib_cm,rdma_cm,ib_umad,nvme_rdma,ib_uverbs,ib_mad,ib_ipoib,ib_sa,iw_cm,mlx5_ib,ib_ucm,rdma_ucm,mlx4_ib

    mlx_compat             16384  19 ib_cm,rdma_cm,ib_umad,ib_core,nvme_rdma,ib_uverbs,ib_mad,ib_addr,mlx4_en,ib_ipoib,mlx5_core,ib_sa,iw_cm,mlx5_ib,mlx4_core,ib_ucm,rdma_ucm,ib_netlink,mlx4_ib

     

    # lsblk

    NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT

    sda               8:0    0 930.4G  0 disk

    ├─sda2            8:2    0 929.9G  0 part

    │ ├─centos-swap 253:1    0  31.5G  0 lvm  [SWAP]

    │ ├─centos-home 253:2    0   100G  0 lvm  /home

    │ └─centos-root 253:0    0 798.4G  0 lvm  /

    └─sda1            8:1    0   500M  0 part /boot

    nvme0n1         259:0    0   250G  0 disk

     

    Useful Commands

     

    nvme list

    Run this from the client to see the list of NVMe devices currently connected.

    # nvme list

    Node             SN                   Model                                    Namespace Usage                      Format           FW Rev 

    ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------

    /dev/nvme0n1     3b605a467714f272     Linux                                    10        268.44  GB / 268.44  GB    512   B +  0 B   4.8.7

     

     

    Benchmarking

    After establishing a connection between the NVMEoF host (initiator) and the NVMEoF target, a new NVMe block device appears under /dev on the initiator side. This block device represents the remote backing store of the connected subsystem.

     

    Perform a simple traffic test on the block device to make sure everything is working properly. Use the fio command (install the fio package if it is not available) or any other traffic generator.

    Note: Make sure to update the filename parameter to suit the nvme device created in your system.

    # fio --bs=64k --numjobs=16 --iodepth=4 --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --time_based --runtime=60 --filename=/dev/nvme0n1  --name=read-phase --rw=randread

     

    For more details about fio installation and usage, see: HowTo Install Flexible I/O (Fio) for Storage Benchmarking I/O Testing.

     

    Troubleshooting

    1. If creating the soft link fails, the dmesg output looks like this:

    # dmesg | grep nvmet_rdma

    [  462.992749] nvmet_rdma: binding CM ID to 1.1.1.1:4420 failed (-19)

    [ 8552.951381] nvmet_rdma: binding CM ID to 1.1.1.1:4420 failed (-99)

    Check the IP connectivity on the interface, ping the target address, and try again.
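
    For example, confirm that the address is configured on the target interface and reachable from the host (values from the example setup above):

    # ip addr show enp2s0f0

    # ping -c 3 1.1.1.1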

     

    2. The RDMA performance tools may not work by default. Follow HowTo Enable Perftest Package for Upstream Kernel to make sure that the relevant modules and user-space libraries are enabled.

     

    3. The nvme disconnect -n nvme-subsystem-name command may fail due to a bug in nvme-cli. In that case, use nvme disconnect -d /dev/nvme0n1 instead.

    # nvme disconnect -n nvme-subsystem-name

     

    4. If you cannot load the nvme-rdma module, make sure you installed MLNX_OFED with the --with-nvmf flag.
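
    Two quick checks that may help here, assuming MLNX_OFED is installed: print the installed MLNX_OFED version, and see which nvme_rdma module the system resolves (if the reported path does not point at the MLNX_OFED-installed module tree, the driver stack was likely built without NVMe-oF support):

    # ofed_info -s

    # modinfo nvme_rdma | grep filename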