HowTo Configure NVMe over Fabrics (NVMe-oF) Target Offload


    This post shows how to configure NVMe over Fabrics (NVMe-oF) target offload for Linux OS using a ConnectX-5 (or later) adapter.

    This feature is available with MLNX_OFED 4.1 or later. The firmware version should be 16.20.1010 or later.

     


    Setup

    For the target setup, you will need a server equipped with NVMe device(s) and ConnectX-5 (or later) adapter.

    The client side (NVMe-oF host) has no limitations regarding the HCA type.

    The example below uses the RoCE link layer with a ConnectX-5 target and a ConnectX-4/5 client.

     

    Prerequisites

    1. The supported OS list, user manual, and release notes can be found on the official Mellanox website: http://www.mellanox.com/

     

    2. Kernel: 4.8.0 and later.
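
    You can verify the running kernel version with:

    # uname -r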

     

    3. Install MLNX_OFED 4.1 (or later) on the target server. See HowTo Install MLNX_OFED Driver, and make sure to install it with the --with-nvmf flag (and with --add-kernel-support if needed).

    # ./mlnxofedinstall --add-kernel-support --with-nvmf

     

    4. Before you start, make sure that a basic RDMA client-server application works properly (e.g. rping).
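
    For example, a quick sanity check with rping (part of the librdmacm utilities), using the target interface IP from this post. On the target side:

    # rping -s -a 2.2.2.6 -v

    On the client side:

    # rping -c -a 2.2.2.6 -v -C 10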

     

    Configuration

     

    1. Set the num_p2p_queues module parameter when loading the nvme module.

    This module parameter defines the number of extra I/O queues that each NVMe device will try to create for peer-to-peer data transfer.

    # modprobe nvme num_p2p_queues=1

    The actual number of I/O queues that your device can use for peer-to-peer transfers can be queried by reading the num_p2p_queues sysfs entry.

    Example:

    # cat /sys/block/<nvme_device>/device/num_p2p_queues
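
    For instance, with the /dev/nvme0n1 device used in this post, and assuming the device was able to create the single requested peer-to-peer queue:

    # cat /sys/block/nvme0n1/device/num_p2p_queues
    1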

    Note: If you are planning to configure high availability (e.g. using multipath), you will need to set this parameter to 2 (one for each NVMe-oF port + subsystem pair).
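
    For example, for such a high-availability configuration:

    # modprobe nvme num_p2p_queues=2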

     

    2. Load the nvmet and nvmet-rdma modules.

    # modprobe nvmet

    # modprobe nvmet-rdma

    Note: To increase performance, you can use the offload_mem_start, offload_mem_size, and offload_buffer_size module parameters of the nvmet-rdma module.

    These parameters should describe a contiguous memory region that is unmapped/unused by the kernel (it can be reserved using the mem/memmap boot parameters, for example). The module uses this chunk of memory starting at the offload_mem_start address and ending at offload_mem_start + offload_mem_size (in MiB). This memory chunk is divided among N offload contexts (N = offload_mem_size/offload_buffer_size). Thus, the first N offload contexts created will enjoy the benefit of offload_buffer_size (in MiB) of contiguous memory for peer-to-peer transactions.

     

    For example, in a system with 64GB of RAM, we set the mem=59392M memmap=59392M boot parameters, which limits the kernel to the first 58GB of memory and leaves the rest unmapped:

    # modprobe nvmet_rdma offload_mem_start=0xf00000000 offload_mem_size=2048 offload_buffer_size=256

    In this example, we set the start address to 0xF00000000 (60GB) and allocate 256MB for each offload context (total chunks N = 2048/256 = 8).
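
    One way to reserve this memory region is through the kernel command line. A minimal sketch for a GRUB-based distribution (file locations and the config-update command vary by distribution; grub2-mkconfig is assumed here):

    # grep GRUB_CMDLINE_LINUX /etc/default/grub
    GRUB_CMDLINE_LINUX="... mem=59392M memmap=59392M"
    # grub2-mkconfig -o /boot/grub2/grub.cfg
    # reboot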

     

    Note: Starting from MLNX_OFED 4.3, the offload_buffer_size parameter is also used for the dynamic staging buffer. The default value is 128 MiB and the minimal value is 16 MiB.

     

    3. Create a new subsystem.

    # mkdir /sys/kernel/config/nvmet/subsystems/<name_of_subsystem>

    Example:

    # mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem

     

    Note: We use testsubsystem as our subsystem name across all the steps below.

     

    4. Allow any host to connect to this target.

    # echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_allow_any_host

     

    Note: ACL mode can also be used.

     

    5. Mark the subsystem as an offloaded subsystem.

    # echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_offload

     

    Note: Starting from MLNX_OFED 4.3, the offload feature is configured at the subsystem level. Prior to that, instead of step 5, the path to the PCI device should be set at the namespace level.

    Run the following command after step 7 to set the path to the PCI device (e.g. for /dev/nvme0n1, the PCI path is 0000:85:00.0):

    # echo "<domain>:<bus>:<slot>.<func>" > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/pci_device_path

    Example:

    # echo "0000:85:00.0" > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/pci_device_path

     

    6. Create a namespace associated with the subsystem.

    # mkdir /sys/kernel/config/nvmet/subsystems/<name_of_subsystem>/namespaces/<nsid>

     

    Example:

    # mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1

     

    Note: The offloaded subsystem must be associated with namespaces attached to the same physical NVMe device.

     

    7. Set the path to the backing store NVMe device (for example: /dev/nvme0n1).

    # echo -n /dev/nvme0n1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/device_path

     

    8. Enable the namespace:

    # echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/enable

     

    Note: Currently, an offloaded subsystem can be associated with only one namespace.

     

    9. Create an nvmet port. For example, use the following attributes:

    • Port: 4420
    • IP address: the interface IP
    • Transport type: rdma
    • Address family: ipv4

     

    Example:

    # mkdir /sys/kernel/config/nvmet/ports/1

    # echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid

    # echo 2.2.2.6 > /sys/kernel/config/nvmet/ports/1/addr_traddr

    # echo "rdma" > /sys/kernel/config/nvmet/ports/1/addr_trtype

    # echo "ipv4" > /sys/kernel/config/nvmet/ports/1/addr_adrfam

     

    10. Enable the nvmet port by linking the subsystem to it:

    # ln -s /sys/kernel/config/nvmet/subsystems/testsubsystem/ /sys/kernel/config/nvmet/ports/1/subsystems/testsubsystem

     

    Note: A port becomes offloaded only if all the subsystems associated with it are offloaded and the port is capable of performing peer-to-peer transactions.

    Currently, an offloaded port can be associated with only one subsystem.
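
    After the symlink is created, you can check the kernel log to confirm that the nvmet port was enabled (the exact message depends on the kernel/driver version):

    # dmesg | grep nvmet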

     

    Script Example

    modprobe nvme num_p2p_queues=1

    modprobe nvmet

    modprobe nvmet-rdma

    mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem

    echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_allow_any_host

    echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_offload

    mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1

    echo -n /dev/nvme0n1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/device_path

    echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/enable

    mkdir /sys/kernel/config/nvmet/ports/1

    echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid

    echo 2.2.2.6 > /sys/kernel/config/nvmet/ports/1/addr_traddr

    echo "rdma" > /sys/kernel/config/nvmet/ports/1/addr_trtype

    echo "ipv4" > /sys/kernel/config/nvmet/ports/1/addr_adrfam

    ln -s /sys/kernel/config/nvmet/subsystems/testsubsystem/ /sys/kernel/config/nvmet/ports/1/subsystems/testsubsystem
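
    A matching teardown sketch for the target, removing the configuration in reverse order (assuming the same names used above):

    rm /sys/kernel/config/nvmet/ports/1/subsystems/testsubsystem
    rmdir /sys/kernel/config/nvmet/ports/1
    echo 0 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/enable
    rmdir /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1
    rmdir /sys/kernel/config/nvmet/subsystems/testsubsystem
    modprobe -r nvmet-rdma
    modprobe -r nvmet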

     

    Client Connection

    There are no changes or limitations to the way the NVMe-oF host is configured.

    For example, follow the steps below to connect to the offloaded NVMe-oF target created above.

     

    1. Load the following modules.

    # modprobe nvme

    # modprobe nvme-rdma

     

    2. Discover the device

    # nvme discover -t rdma -a 2.2.2.6 -s 4420

     

    Discovery Log Number of Records 1, Generation counter 1

    =====Discovery Log Entry 0======

    trtype:  rdma

    adrfam:  ipv4

    subtype: nvme subsystem

    treq:    not specified

    portid:  1

    trsvcid: 4420

     

    subnqn:  testsubsystem

    traddr:  2.2.2.6

     

    rdma_prtype: infiniband

    rdma_qptype: datagram

    rdma_cms:    unrecognized

    rdma_pkey: 0x0000

     

    3. Connect

    # nvme connect -t rdma -n testsubsystem -a 2.2.2.6 -s 4420

     

    # nvme list

    Node             SN                   Model                                    Namespace Usage                      Format           FW Rev 

    ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------

    /dev/nvme0n1     d731aa3b5a15c48      Linux                                    1         480.10  GB / 480.10  GB    512   B +  0 B   4.8.7
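
    When the device is no longer needed, the host can disconnect by subsystem NQN:

    # nvme disconnect -n testsubsystem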

     

    Benchmark Tests

    A simple test can be found in the Simple NVMe-oF Target Offload Benchmark post.
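
    As a quick sanity benchmark from the host, a 4KB random-read fio job against the connected device can be used (a minimal sketch, assuming fio is installed and /dev/nvme0n1 is the connected NVMe-oF device):

    # fio --name=randread --filename=/dev/nvme0n1 --rw=randread --bs=4k --ioengine=libaio --iodepth=32 --numjobs=4 --direct=1 --runtime=60 --time_based --group_reporting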