HowTo Configure NVMe over Fabrics (NVMe-oF) Target Offload

Version 19

    This post shows how to configure NVMe over Fabrics (NVMe-oF) target offload on Linux using a ConnectX-5 (or later) adapter.

    This feature is available with MLNX_OFED 4.1 or later. The firmware version should be 16.20.1010 or later.
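
    A quick way to verify the installed driver and firmware versions (assuming MLNX_OFED is already installed; both ofed_info and ibv_devinfo ship with it):

    # ofed_info -s

    # ibv_devinfo | grep fw_ver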

     


    Setup

    For the target setup, you will need a server equipped with NVMe device(s) and a ConnectX-5 (or later) adapter.

    The client side (NVMe-oF host) has no limitation on HCA type.

    Below is an example using the RoCE link layer with a ConnectX-5 target and a ConnectX-4/ConnectX-5 client.

     

    Prerequisites

    1. The supported OS list, user manual and release notes can be found on the official Mellanox website: http://www.mellanox.com/

     

    2. Kernel: 4.8.0 or later.
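
    The running kernel version can be confirmed with:

    # uname -r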

     

    3. Install MLNX_OFED 4.1 (or later) on the target server. See HowTo Install MLNX_OFED Driver and make sure to install it with the --with-nvmf flag (and with --add-kernel-support if needed).

    # ./mlnxofedinstall --add-kernel-support --with-nvmf

     

    4. Before you start, make sure that a basic RDMA client-server application (e.g. rping) works properly between the client and the target.
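
    For example, a minimal rping check (rping is part of the librdmacm utilities; the IP used here is the target interface IP from this post, and the iteration count is just an illustration):

    On the target:

    # rping -s -a 2.2.2.6 -v -C 10

    On the client:

    # rping -c -a 2.2.2.6 -v -C 10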

     

    Configuration

     

    1. Set the num_p2p_queues module parameter when loading the nvme module.

    This module parameter defines the number of extra I/O queues that each NVMe device will try to create for peer-to-peer data transfer.

    # modprobe nvme num_p2p_queues=1

    The actual number of I/O queues that your device can use for peer-to-peer data transfer can be queried by reading the num_p2p_queues sysfs entry.

    Example:

    # cat /sys/block/<nvme_device>/device/num_p2p_queues

    Note: If you plan to configure high availability (e.g. using multipath), set this parameter to 2 (one for each NVMe-oF port + subsystem pair).
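
    To make this setting persistent across reboots, the parameter can also be placed in a modprobe configuration file (the file name below is arbitrary). Since the nvme module is often loaded from the initramfs, the initramfs may need to be regenerated afterwards (e.g. dracut -f or update-initramfs -u, depending on the distribution):

    # echo "options nvme num_p2p_queues=2" > /etc/modprobe.d/nvme-offload.conf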

     

    2. Load the nvmet and nvmet-rdma modules.

    # modprobe nvmet

    # modprobe nvmet-rdma

    Note: To increase performance, you can use the offload_mem_start, offload_mem_size and offload_buffer_size module parameters of the nvmet_rdma module.

    These parameters should describe a contiguous memory region that is unmapped/unused by the kernel (it can be reserved using the mem/memmap boot parameters, for example). The module uses this chunk of memory starting at offload_mem_start and ending at offload_mem_start + offload_mem_size (in MiB). The memory chunk is divided among N offload contexts (N = offload_mem_size/offload_buffer_size), so the first N offload contexts created will each get offload_buffer_size (in MiB) of contiguous memory for peer-to-peer transactions.

     

    For example, on a system with 64 GB of RAM, setting the boot parameters mem=59392M memmap=59392M limits the kernel to 58 GB and leaves the rest unmapped:

    # modprobe nvmet_rdma offload_mem_start=0xf00000000 offload_mem_size=2048 offload_buffer_size=256

    In this example, the start address is set to 0xF00000000 (60 GB) and 256 MiB is allocated for each offload context (N = 2048/256 = 8 offload contexts).

     

    Note: In order to use the static staging buffer feature, you must supply an unmapped physical address.
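
    The boot parameters from the example above are typically added through the bootloader. A sketch for a GRUB2-based system (file locations and the mkconfig command are distribution dependent): add mem=59392M memmap=59392M to GRUB_CMDLINE_LINUX in /etc/default/grub, regenerate the GRUB configuration and reboot.

    # grub2-mkconfig -o /boot/grub2/grub.cfg

    # reboot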

     

    3. Create a new subsystem.

    # mkdir /sys/kernel/config/nvmet/subsystems/<name_of_subsystem>

    Example:

    # mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem

     

    Note: We use testsubsystem as our subsystem name across all the steps below.

     

    4. Allow any host to connect to this target.

    # echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_allow_any_host

     

    Note: ACL mode can also be used; a sketch is shown below.
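
    A sketch of ACL mode (the host NQN below is a placeholder; on the client it can typically be found in /etc/nvme/hostnqn): disable allow_any_host, register the host NQN and link it to the subsystem.

    # echo 0 > /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_allow_any_host

    # mkdir /sys/kernel/config/nvmet/hosts/<host_nqn>

    # ln -s /sys/kernel/config/nvmet/hosts/<host_nqn> /sys/kernel/config/nvmet/subsystems/testsubsystem/allowed_hosts/<host_nqn>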

     

    5. Create a namespace associated with the subsystem.

    # mkdir /sys/kernel/config/nvmet/subsystems/<name_of_subsystem>/namespaces/<nsid>

     

    Example:

    # mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1

     

    Note: The offloaded subsystem must be associated with namespaces attached to the same physical NVMe device.

     

    6. Set the path to the backing store NVMe device (for example: /dev/nvme0n1).

    # echo -n /dev/nvme0n1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/device_path

     

    7. Set the PCI address of the device backing the namespace (for /dev/nvme0n1 in this setup, the PCI address is 0000:85:00.0; see the hint after the example below for how to find it).

    # echo "<domain>:<bus>:<slot>.<func>" > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/pci_device_path

     

    Example:

    # echo "0000:85:00.0" > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/pci_device_path

     

    8. Enable the namespace:

    # echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/enable

     

    Note: An NVMe-oF subsystem becomes offloaded only if all namespaces associated with it are offloaded. Currently, an offloaded subsystem can be associated with only one namespace.

     

    9. Create an nvmet port. For example, use the following attributes:

    • Port: 4420
    • IP address: the interface IP
    • Transport type: rdma
    • Address family: ipv4

     

    Example:

    # mkdir /sys/kernel/config/nvmet/ports/1

    # echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid

    # echo 2.2.2.6 > /sys/kernel/config/nvmet/ports/1/addr_traddr

    # echo "rdma" > /sys/kernel/config/nvmet/ports/1/addr_trtype

    # echo "ipv4" > /sys/kernel/config/nvmet/ports/1/addr_adrfam

     

    10. Enable the nvmet port by linking the subsystem to it.

    # ln -s /sys/kernel/config/nvmet/subsystems/testsubsystem/ /sys/kernel/config/nvmet/ports/1/subsystems/testsubsystem

     

    Note: An NVMe-oF port becomes offloaded only if all subsystems associated with it are offloaded and the port is capable of performing peer-to-peer transactions.

    Currently, an offloaded port can be associated with only one subsystem.
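
    After creating the symlink, the kernel log can be checked to confirm that the port was enabled and that no errors were reported by the nvmet/nvmet_rdma modules (the exact messages vary between driver versions):

    # dmesg | grep -i nvmet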

     

    Script Example

    modprobe nvme num_p2p_queues=1

    modprobe nvmet

    modprobe nvmet-rdma

    mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem

    echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_allow_any_host

    mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1

    echo -n /dev/nvme0n1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/device_path

    echo "0000:85:00.0" > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/pci_device_path

    echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/enable

    mkdir /sys/kernel/config/nvmet/ports/1

    echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid

    echo 2.2.2.6 > /sys/kernel/config/nvmet/ports/1/addr_traddr

    echo "rdma" > /sys/kernel/config/nvmet/ports/1/addr_trtype

    echo "ipv4" > /sys/kernel/config/nvmet/ports/1/addr_adrfam

    ln -s /sys/kernel/config/nvmet/subsystems/testsubsystem/ /sys/kernel/config/nvmet/ports/1/subsystems/testsubsystem
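
    For completeness, a teardown sketch that undoes the configuration above; the configfs entries must be removed in reverse order before unloading the modules:

    # unlink the subsystem from the port and remove the port
    rm /sys/kernel/config/nvmet/ports/1/subsystems/testsubsystem
    rmdir /sys/kernel/config/nvmet/ports/1

    # disable and remove the namespace, then the subsystem
    echo 0 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/enable
    rmdir /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1
    rmdir /sys/kernel/config/nvmet/subsystems/testsubsystem

    # unload the target modules
    modprobe -r nvmet-rdma nvmet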

     

    Client Connection

    There are no changes or limitations in the way the NVMe-oF host is configured.

    For example, follow the steps below to connect to the offloaded NVMe-oF target created earlier.

     

    1. Load the following modules.

    # modprobe nvme

    # modprobe nvme-rdma

     

    2. Discover the target.

    # nvme discover -t rdma -a 2.2.2.6 -s 4420

     

    Discovery Log Number of Records 1, Generation counter 1

    =====Discovery Log Entry 0======

    trtype:  rdma

    adrfam:  ipv4

    subtype: nvme subsystem

    treq:    not specified

    portid:  1

    trsvcid: 4420

     

    subnqn:  testsubsystem

    traddr:  2.2.2.6

     

    rdma_prtype: infiniband

    rdma_qptype: datagram

    rdma_cms:    unrecognized

    rdma_pkey: 0x0000

     

    3. Connect to the target.

    # nvme connect -t rdma -n testsubsystem -a 2.2.2.6 -s 4420

     

    # nvme list

    Node             SN                   Model                                    Namespace Usage                      Format           FW Rev 

    ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------

    /dev/nvme0n1     d731aa3b5a15c48      Linux                                    1         480.10  GB / 480.10  GB    512   B +  0 B   4.8.7
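
    To disconnect from the target when done (standard nvme-cli usage):

    # nvme disconnect -n testsubsystem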

     

    Benchmark Tests

    A simple test can be found in Simple NVMe-oF Target Offload Benchmark.