1 Reply Latest reply on Oct 11, 2018 7:41 PM by zhangsuo

    ConnectX-5 error: Failed to write to /dev/nvme-fabrics: Invalid cross-device link

    yaolin

      I have two ConnectX-5 NICs in my PC (Ubuntu 18.04, kernel 4.15.0-36), each on a different subnet (192.168.1.100/24 and 192.168.2.100/24). I have four NVMe-oF targets and I try to connect to them from my PC:

       

      sudo nvme connect -t rdma -a 192.168.2.52 -n nqn.2018-09.com.52 -s 4420

      sudo nvme connect -t rdma -a 192.168.1.9 -n nqn.2018-09.com.9 -s 4420

      sudo nvme connect -t rdma -a 192.168.2.54 -n nqn.2018-09.com.54 -s 4420

      sudo nvme connect -t rdma -a 192.168.1.2 -n nqn.2018-09.com.2 -s 4420

      Failed to write to /dev/nvme-fabrics: Invalid cross-device link

       

      I disconnect from all these targets and reboot the PC. Then I try to connect to them in a different order:

       

      sudo nvme connect -t rdma -a 192.168.1.2 -n nqn.2018-09.com.2 -s 4420

      sudo nvme connect -t rdma -a 192.168.1.9 -n nqn.2018-09.com.9 -s 4420

      sudo nvme connect -t rdma -a 192.168.2.52 -n nqn.2018-09.com.52 -s 4420

      Failed to write to /dev/nvme-fabrics: Invalid cross-device link
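
The two attempts above differ only in the order of the connects. A minimal sketch for replaying a chosen order and stopping at the first failure follows. The `nvme connect` flags are the ones used in the post; the function name `connect_in_order` and the `NVME_CMD` override (useful for a dry run) are illustrative assumptions, not part of nvme-cli.

```shell
#!/usr/bin/env bash
# Replay NVMe-oF connects in a fixed order and stop at the first failure.
# NVME_CMD is a hypothetical override: set NVME_CMD="echo nvme" for a dry run.
NVME_CMD=${NVME_CMD:-"sudo nvme"}

connect_in_order() {
    local target addr nqn
    for target in "$@"; do
        addr=${target%%,*}   # text before the comma: target IP address
        nqn=${target#*,}     # text after the comma: subsystem NQN
        # Same flags as in the post: RDMA transport, service id 4420.
        if ! $NVME_CMD connect -t rdma -a "$addr" -n "$nqn" -s 4420; then
            echo "connect to $addr ($nqn) failed" >&2
            return 1
        fi
    done
}

# Order used in the second attempt; uncomment to run it:
# connect_in_order 192.168.1.2,nqn.2018-09.com.2 \
#                  192.168.1.9,nqn.2018-09.com.9 \
#                  192.168.2.52,nqn.2018-09.com.52
```

Each argument is an `address,NQN` pair, so trying another order only means reordering the arguments.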

       

      I googled a bit. There seem to be two reported instances of this error message involving Mellanox NICs, but I don't understand the nature of the error and I haven't found any workaround. Any suggestions? Here's some info from my PC.

       

       

      yao@Host1:~$ lspci | grep Mellan

      15:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

      21:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
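
Since the two NICs sit on different subnets, it may help to confirm which local interface the kernel would choose for each target address. A small sketch using `ip route get` follows. The helper names `route_dev` and `show_target_routes` are made up for illustration, and the sample line in the note below is hypothetical, because the post does not show which interface holds which subnet.

```shell
#!/usr/bin/env bash
# Print the outgoing interface named in `ip route get` output read from stdin.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Report the interface chosen for each target address given as an argument.
show_target_routes() {
    local addr
    for addr in "$@"; do
        echo "$addr -> $(ip route get "$addr" | route_dev)"
    done
}

# Example call with the four target addresses from the post:
# show_target_routes 192.168.2.52 192.168.1.9 192.168.2.54 192.168.1.2
```

For a hypothetical route line such as `192.168.2.52 dev enp33s0 src 192.168.2.100`, `route_dev` prints `enp33s0`.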

       

      yao@Host1:~$ lspci -vvv -s 15:00.0

      15:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

      Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]

      Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+

      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

      Latency: 0, Cache Line Size: 32 bytes

      Interrupt: pin A routed to IRQ 33

      NUMA node: 0

      Region 0: Memory at 387ffe000000 (64-bit, prefetchable) [size=32M]

      Expansion ROM at 90500000 [disabled] [size=1M]

      Capabilities: <access denied>

      Kernel driver in use: mlx5_core

      Kernel modules: mlx5_core

       

      yao@Host1:~$ sudo lsmod | grep mlx

      mlx5_ib               196608  0

      ib_core               225280  9 ib_cm,rdma_cm,ib_umad,nvme_rdma,ib_uverbs,iw_cm,mlx5_ib,ib_ucm,rdma_ucm

      mlx5_core             544768  1 mlx5_ib

      mlxfw                  20480  1 mlx5_core

      devlink                45056  1 mlx5_core

      ptp                    20480  2 e1000e,mlx5_core

       

      yao@Host1:~$ modinfo mlx5_core

      filename:       /lib/modules/4.15.0-36-generic/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko

      version:        5.0-0

      license:        Dual BSD/GPL

      description:    Mellanox Connect-IB, ConnectX-4 core driver

      author:         Eli Cohen <eli@mellanox.com>

      srcversion:     C271CE9036D77E924A8E038

      alias:          pci:v000015B3d0000A2D3sv*sd*bc*sc*i*

      alias:          pci:v000015B3d0000A2D2sv*sd*bc*sc*i*

      alias:          pci:v000015B3d0000101Csv*sd*bc*sc*i*

      alias:          pci:v000015B3d0000101Bsv*sd*bc*sc*i*

      alias:          pci:v000015B3d0000101Asv*sd*bc*sc*i*

      alias:          pci:v000015B3d00001019sv*sd*bc*sc*i*

      alias:          pci:v000015B3d00001018sv*sd*bc*sc*i*

      alias:          pci:v000015B3d00001017sv*sd*bc*sc*i*

      alias:          pci:v000015B3d00001016sv*sd*bc*sc*i*

      alias:          pci:v000015B3d00001015sv*sd*bc*sc*i*

      alias:          pci:v000015B3d00001014sv*sd*bc*sc*i*

      alias:          pci:v000015B3d00001013sv*sd*bc*sc*i*

      alias:          pci:v000015B3d00001012sv*sd*bc*sc*i*

      alias:          pci:v000015B3d00001011sv*sd*bc*sc*i*

      depends:        devlink,ptp,mlxfw

      retpoline:      Y

      intree:         Y

      name:           mlx5_core

      vermagic:       4.15.0-36-generic SMP mod_unload

      signat:         PKCS#7

      signer:        

      sig_key:       

      sig_hashalgo:   md4

      parm:           debug_mask:debug mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both. Default=0 (uint)

      parm:           prof_sel:profile selector. Valid range 0 - 2 (uint)

       

       

      yao@Host1:~$ dmesg

      ...

      [   78.772669] nvme nvme0: queue_size 128 > ctrl maxcmd 64, clamping down

      [   78.856378] nvme nvme0: creating 8 I/O queues.

      [   88.297468] nvme nvme0: new ctrl: NQN "nqn.2018-09.com.52", addr 192.168.2.52:4420

      [  101.561197] nvme nvme1: queue_size 128 > ctrl maxcmd 64, clamping down

      [  101.644852] nvme nvme1: creating 8 I/O queues.

      [  111.083806] nvme nvme1: new ctrl: NQN "nqn.2018-09.com.9", addr 192.168.1.9:4420

      [  151.368016] nvme nvme2: queue_size 128 > ctrl maxcmd 64, clamping down

      [  151.451717] nvme nvme2: creating 8 I/O queues.

      [  160.893710] nvme nvme2: new ctrl: NQN "nqn.2018-09.com.54", addr 192.168.2.54:4420

      [  169.789368] nvme nvme3: queue_size 128 > ctrl maxcmd 64, clamping down

      [  169.873068] nvme nvme3: creating 8 I/O queues.

      [  177.657661] nvme nvme3: Connect command failed, error wo/DNR bit: -16402

      [  177.657669] nvme nvme3: failed to connect queue: 4 ret=-18

      [  177.951379] nvme nvme3: Reconnecting in 10 seconds...

      [  188.138167] general protection fault: 0000 [#1] SMP PTI

      [  188.138172] Modules linked in: nvme_rdma rdma_ucm rdma_cm nvme_fabrics nvme_core ib_ucm ib_uverbs ib_umad iw_cm ib_cm nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec aes_x86_64 crypto_simd glue_helper cryptd snd_hda_core snd_hwdep intel_cstate snd_pcm cp210x snd_seq_midi snd_seq_midi_event joydev input_leds snd_rawmidi usbserial snd_seq snd_seq_device snd_timer snd mei_me soundcore wmi_bmof hp_wmi sparse_keymap ioatdma mac_hid intel_rapl_perf mei dca intel_wmi_thunderbolt shpchp serio_raw sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 mlx5_ib ib_core amdgpu chash hid_generic usbhid hid

      [  188.138248]  radeon i2c_algo_bit ttm mlx5_core drm_kms_helper syscopyarea e1000e sysfillrect mlxfw sysimgblt devlink ahci fb_sys_fops ptp psmouse drm pps_core libahci wmi

      [  188.138272] CPU: 0 PID: 390 Comm: kworker/u56:7 Not tainted 4.15.0-36-generic #39-Ubuntu

      [  188.138275] Hardware name: HP HP Z4 G4 Workstation/81C5, BIOS P62 v01.51 05/08/2018

      [  188.138283] Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]

      [  188.138290] RIP: 0010:nvme_rdma_alloc_queue+0x3c/0x190 [nvme_rdma]

      [  188.138294] RSP: 0018:ffffc04c041e3e08 EFLAGS: 00010286

      [  188.138298] RAX: 0000000000000000 RBX: 890a8eecb83679a9 RCX: ffff9f9b5ec10820

      [  188.138301] RDX: ffffffffc0cd5600 RSI: ffffffffc0cd43ab RDI: ffff9f9ad037c000

      [  188.138304] RBP: ffffc04c041e3e28 R08: 000000000000020c R09: 0000000000000000

      [  188.138307] R10: 0000000000000000 R11: 000000000000020f R12: ffff9f9ad037c000

      [  188.138309] R13: 0000000000000000 R14: 0000000000000020 R15: 0000000000000000

      [  188.138313] FS:  0000000000000000(0000) GS:ffff9f9b5f200000(0000) knlGS:0000000000000000

      [  188.138316] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

      [  188.138319] CR2: 00007f347e159fb8 CR3: 00000001a740a006 CR4: 00000000003606f0

      [  188.138323] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

      [  188.138325] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

      [  188.138327] Call Trace:

      [  188.138335]  nvme_rdma_configure_admin_queue+0x22/0x2d0 [nvme_rdma]

      [  188.138341]  nvme_rdma_reconnect_ctrl_work+0x27/0xd0 [nvme_rdma]

      [  188.138349]  process_one_work+0x1de/0x410

      [  188.138354]  worker_thread+0x32/0x410

      [  188.138361]  kthread+0x121/0x140

      [  188.138365]  ? process_one_work+0x410/0x410

      [  188.138370]  ? kthread_create_worker_on_cpu+0x70/0x70

      [  188.138378]  ret_from_fork+0x35/0x40

      [  188.138381] Code: 89 e5 41 56 41 55 41 54 53 48 8d 1c c5 00 00 00 00 49 89 fc 49 89 c5 49 89 d6 48 29 c3 48 c7 c2 00 56 cd c0 48 c1 e3 04 48 03 1f <48> 89 7b 18 48 8d 7b 58 c7 43 50 00 00 00 00 e8 50 05 40 ce 45

      [  188.138443] RIP: nvme_rdma_alloc_queue+0x3c/0x190 [nvme_rdma] RSP: ffffc04c041e3e08

      [  188.138447] ---[ end trace c9efe5e9bc3591f2 ]---
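
The two numbers in the log above, -16402 and -18, appear to be related. A sketch of the arithmetic follows, assuming the driver printed a negative errno after masking off the NVMe "do not retry" status bit, taken here to be 0x4000 (`NVME_SC_DNR` in the kernel's nvme.h), and assuming errno 18 is `EXDEV` as on Linux. The variable names are illustrative.

```shell
#!/usr/bin/env bash
logged=-16402          # from "Connect command failed, error wo/DNR bit: -16402"
dnr_bit=$(( 0x4000 ))  # assumption: NVME_SC_DNR is bit 14
# Setting the masked-off bit again recovers the value before masking.
restored=$(( logged | dnr_bit ))
echo "restored=$restored"
# Assumption: errno 18 is EXDEV on Linux ("Invalid cross-device link").
if [ "$restored" -eq -18 ]; then
    echo "matches ret=-18 (EXDEV)"
fi
```

Under these assumptions the script prints `restored=-18`, the same value as in "failed to connect queue: 4 ret=-18", and "Invalid cross-device link" is the message text for `EXDEV` that `nvme connect` reports.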

       

      yao@Host1:~$ dmesg | grep mlx

      [    2.510581] mlx5_core 0000:15:00.0: enabling device (0100 -> 0102)

      [    2.510732] mlx5_core 0000:15:00.0: firmware version: 16.21.2010

      [    4.055064] mlx5_core 0000:15:00.0: Port module event: module 0, Cable plugged

      [    4.061558] mlx5_core 0000:21:00.0: enabling device (0100 -> 0102)

      [    4.061775] mlx5_core 0000:21:00.0: firmware version: 16.21.2010

      [    4.966172] mlx5_core 0000:21:00.0: Port module event: module 0, Cable plugged

      [    4.972503] mlx5_core 0000:15:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(64) RxCqeCmprss(0)

      [    5.110943] mlx5_core 0000:21:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(64) RxCqeCmprss(0)

      [    5.247925] mlx5_core 0000:15:00.0 enp21s0: renamed from eth0

      [    5.248600] mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0

      [    5.275912] mlx5_core 0000:21:00.0 enp33s0: renamed from eth1

      [   23.736990] mlx5_core 0000:21:00.0 enp33s0: Link up

      [   23.953415] mlx5_core 0000:15:00.0 enp21s0: Link up

      [  188.138172] Modules linked in: nvme_rdma rdma_ucm rdma_cm nvme_fabrics nvme_core ib_ucm ib_uverbs ib_umad iw_cm ib_cm nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec aes_x86_64 crypto_simd glue_helper cryptd snd_hda_core snd_hwdep intel_cstate snd_pcm cp210x snd_seq_midi snd_seq_midi_event joydev input_leds snd_rawmidi usbserial snd_seq snd_seq_device snd_timer snd mei_me soundcore wmi_bmof hp_wmi sparse_keymap ioatdma mac_hid intel_rapl_perf mei dca intel_wmi_thunderbolt shpchp serio_raw sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 mlx5_ib ib_core amdgpu chash hid_generic usbhid hid

      [  188.138248]  radeon i2c_algo_bit ttm mlx5_core drm_kms_helper syscopyarea e1000e sysfillrect mlxfw sysimgblt devlink ahci fb_sys_fops ptp psmouse drm pps_core libahci wmi

      [  662.506623] Modules linked in: cfg80211 nvme_rdma rdma_ucm rdma_cm nvme_fabrics nvme_core ib_ucm ib_uverbs ib_umad iw_cm ib_cm nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec aes_x86_64 crypto_simd glue_helper cryptd snd_hda_core snd_hwdep intel_cstate snd_pcm cp210x snd_seq_midi snd_seq_midi_event joydev input_leds snd_rawmidi usbserial snd_seq snd_seq_device snd_timer snd mei_me soundcore wmi_bmof hp_wmi sparse_keymap ioatdma mac_hid intel_rapl_perf mei dca intel_wmi_thunderbolt shpchp serio_raw sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 mlx5_ib ib_core amdgpu chash hid_generic