3 Replies Latest reply on Dec 13, 2017 1:35 AM by aphreet

    NVMeOF SLES 12 SP3 :  Initiator with 36 cores unable to discover/connect to target

    madhankb

      Hi,

      I am trying NVMeOF with RoCE on SLES 12 SP3 using the document

      HowTo Configure NVMe over Fabrics

       

      I notice that whenever the initiator has more than 32 cores, it is unable to discover or connect to the target. The same procedure works fine when the number of cores is 32 or fewer.

      The dmesg output for a failed attempt:

       

      kernel: [  373.418811] nvme_fabrics: unknown parameter or missing value 'hostid=a61ecf3f-2925-49a7-9304-cea147f61ae' in ctrl creation request

       

      For a successful connection:

       

      [51354.292021] nvme nvme0: creating 32 I/O queues.

      [51354.879684] nvme nvme0: new ctrl: NQN "mcx", addr 192.168.0.1:4420

       

      Is there a parameter for the mlx5_core/nvme_rdma/nvmet_rdma drivers that limits the number of cores they use, so that fewer I/O queues are created and discovery/connection succeeds? I can't disable cores or hyperthreading in the BIOS/UEFI because other applications are running on the host.

       

      Appreciate any pointers/help!

        • Re: NVMeOF SLES 12 SP3 :  Initiator with 36 cores unable to discover/connect to target
          aviap

          Based on the error output you posted, I suggest adding an option to the nvme connect command that appears to be missing from your invocation: "--nr-io-queues"

          This option specifies the number of io queues to allocate.

          Have you tried this option?

          For example: # nvme connect --transport=rdma --nr-io-queues=36 --trsvcid=4420 --traddr=10.0.1.14 --nqn=test-nvm

           

          Otherwise, you get the default, which is “num_online_cpus” (the number of controller I/O queues that will be established), and this may explain the error you got:

          “nvme_fabrics: unknown parameter or missing value 'hostid=a61ecf3f-2925-49a7-9304-cea147f61ae' in ctrl creation request”

          Read more in the patch: [PATCH v2] nvme-cli/fabrics: Add nr_io_queues parameter to connect command

          +++++++++++++++++++++++++++++++++++++++++++++++++++++++++

          default:

          + pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",

          + p);

          + ret = -EINVAL;

          + goto out;

          +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

           

          Hope this helps

            • Re: NVMeOF SLES 12 SP3 :  Initiator with 36 cores unable to discover/connect to target
              madhankb

              Hi,

               

              Adding the parameter didn't help. It still gives the same error:

               

              athena:~ # nvme  discover -t rdma -a 192.168.0.1 -s 4420

              Failed to write to /dev/nvme-fabrics: Invalid argument

              athena:~ # dmesg |tail -1

              [ 1408.720843] nvme_fabrics: unknown parameter or missing value 'hostid=a61ecf3f-2925-49a7-9304-cea147f61ae' in ctrl creation request

              athena:~ # nvme connect -t rdma --nr-io-queues=32 -a 192.168.0.1 -s 4420 -n mcx

              Failed to write to /dev/nvme-fabrics: Invalid argument

              athena:~ # !dm

              dmesg |tail -1

              [ 1437.914081] nvme_fabrics: unknown parameter or missing value 'hostid=a61ecf3f-2925-49a7-9304-cea147f61ae' in ctrl creation request

                • Re: NVMeOF SLES 12 SP3 :  Initiator with 36 cores unable to discover/connect to target
                  aphreet

                  We faced the same problem on SLES 12 SP3. We found two issues with the NVMe-oF initiator in the SP3 release version.

                   

                  First, kernel 4.4.73-5-default does not recognize the hostid argument, which causes the error message you observe. This was fixed in later updates; 4.4.92-6.18-default does not have the issue.
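
A quick way to tell whether a host is still on the affected kernel is to compare `uname -r` against the release reported above as working. This is a minimal sketch, assuming GNU `sort -V` is available; the helper name `kernel_at_least` is hypothetical, and 4.4.92-6.18-default is only the version known from this thread to accept hostid, not necessarily the first fixed one.

```shell
# Hypothetical helper: succeeds if version $1 sorts at or after version $2.
# Relies on the -V (version sort) option of GNU sort.
kernel_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the running kernel with the release reported as working in this thread.
if kernel_at_least "$(uname -r)" "4.4.92-6.18-default"; then
    echo "kernel is at or past the release reported to accept hostid"
else
    echo "kernel is older than 4.4.92-6.18-default; update before retrying"
fi
```

The comparison is purely textual version ordering, so it says nothing about backported fixes in differently numbered kernels.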

                   

                  The second issue is in nvme-cli. As you may notice, the last character of the hostid is truncated: 'hostid=a61ecf3f-2925-49a7-9304-cea147f61ae'. A UUID string is 36 characters long, so a buffer of size 36 holds only 35 characters plus the terminating NUL, and the kernel module rejects the shortened host ID. The root cause is in the nvme-cli patch that adds hostid support. It can be fixed with a simple patch added to the nvme-cli source RPM:

                   

                  diff -crB nvme-cli-v1.2/linux/nvme.h nvme-cli-v1.2.patched/linux/nvme.h

                  *** nvme-cli-v1.2/linux/nvme.h Thu Dec  7 09:42:00 2017

                  --- nvme-cli-v1.2.patched/linux/nvme.h Thu Dec  7 09:50:32 2017

                  ***************

                  *** 23,29 ****

                    /* However the max length of a qualified name is another size */

                    #define NVMF_NQN_SIZE 223

                   

                  ! #define NVMF_HOSTID_SIZE        36

                    #define NVMF_TRSVCID_SIZE 32

                    #define NVMF_TRADDR_SIZE 256

                    #define NVMF_TSAS_SIZE 256

                  --- 23,29 ----

                    /* However the max length of a qualified name is another size */

                    #define NVMF_NQN_SIZE 223

                   

                  ! #define NVMF_HOSTID_SIZE        37

                    #define NVMF_TRSVCID_SIZE 32

                    #define NVMF_TRADDR_SIZE 256

                    #define NVMF_TSAS_SIZE 256
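
The truncation can also be spotted without reading the source: a UUID in its usual text form is 8-4-4-4-12 hexadecimal digits, 36 characters in total. The sketch below checks a hostid string against that shape. The function name `is_full_uuid` is hypothetical, and the check only validates the format of the string, not how nvme-cli produces it.

```shell
# Hypothetical helper: succeeds only if $1 is a complete 36-character UUID string.
is_full_uuid() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
}

# The hostid from the dmesg output in this thread has only 11 digits in its last group.
hostid='a61ecf3f-2925-49a7-9304-cea147f61ae'
if is_full_uuid "$hostid"; then
    echo "hostid looks complete"
else
    echo "hostid is malformed (${#hostid} characters, expected 36)"
fi
```

For the value logged in this thread the check fails and reports 35 characters, which matches the one-byte shortfall that the patch above corrects.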