1 Reply Latest reply on Jul 21, 2016 4:24 PM by rage@mellanox.com

    Trouble with ConnectX-3 VPI VFs with SR-IOV

    lasser

      Hi,

       

      I am trying to get VFs working on the IB card to pass through to KVM guests. Following the steps in "HowTo Configure SR-IOV for ConnectX-3 with KVM (InfiniBand)", I run into trouble after restarting openibd in the step "Enable SR-IOV on the MLNX_OFED Driver", with the following snippets from the dmesg output (see attachment for further detail):

       

      [  37.547412] mlx4_core: device is working in RoCE mode: Roce V1

      [   37.572033] mlx4_core: gid_type 1 for UD QPs is not supported by the devicegid_type 0 was chosen instead

      [   37.623776] mlx4_core: UD QP Gid type is: V1

      [   39.430768] mlx4_core 0000:41:00.0: Enabling SR-IOV with 4 VFs

      [   39.562398] pci 0000:41:00.1: [15b3:1004] type 00 class 0x028000

      [   39.569757] mlx4_core: Initializing 0000:41:00.1

      [   39.597827] mlx4_core 0000:41:00.1: enabling device (0000 -> 0002)

      [   39.627547] mlx4_core 0000:41:00.1: Detected virtual function - running in slave mode

      [   39.684547] mlx4_core 0000:41:00.1: PF is not ready - Deferring probe

      [   39.714917] pci 0000:41:00.1: Driver mlx4_core requests probe deferral

      [   39.744881] pci 0000:41:00.2: [15b3:1004] type 00 class 0x028000

      [   39.752156] mlx4_core: Initializing 0000:41:00.2

      [   39.782028] mlx4_core 0000:41:00.2: enabling device (0000 -> 0002)

      [   39.813140] mlx4_core 0000:41:00.2: Skipping virtual function:2

      [   39.843525] pci 0000:41:00.3: [15b3:1004] type 00 class 0x028000

      [   39.850805] mlx4_core: Initializing 0000:41:00.3

      [   39.879927] mlx4_core 0000:41:00.3: enabling device (0000 -> 0002)

      [   39.909787] mlx4_core 0000:41:00.3: Skipping virtual function:3

      [   39.939078] pci 0000:41:00.4: [15b3:1004] type 00 class 0x028000

      [   39.946361] mlx4_core: Initializing 0000:41:00.4

      [   39.974914] mlx4_core 0000:41:00.4: enabling device (0000 -> 0002)

      [   40.004714] mlx4_core 0000:41:00.4: Skipping virtual function:4

      [   40.033411] mlx4_core 0000:41:00.0: Running in master mode

       

      --- Stacks of MSI/MSI-X messages later ---

       

      [   40.582243] mlx4_core: Initializing 0000:41:00.1

      [   40.610237] mlx4_core 0000:41:00.1: enabling device (0000 -> 0002)

      [   40.639442] mlx4_core 0000:41:00.1: Detected virtual function - running in slave mode

      [   40.694489] mlx4_core 0000:41:00.1: Sending reset

      [   40.722845] mlx4_core 0000:41:00.0: Received reset from slave:1

      [   40.750438] mlx4_core 0000:41:00.1: Sending vhcr0

      [   40.777898] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x00000037f7bde000 flags=0x0050]

      [   40.833233] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x00000037f7bde040 flags=0x0050]

      [   40.890985] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x00000037f7bde080 flags=0x0050]

      [   40.949797] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x00000037f7bde0c0 flags=0x0050]

      [   46.047238] mlx4_core 0000:41:00.0: command 0x2e failed: fw status = 0x1

      [   46.077884] mlx4_core 0000:41:00.0: mlx4_master_process_vhcr: Failed reading vhcr ret: 0xfffffffb

      [   46.139267] mlx4_core 0000:41:00.0: Failed processing vhcr for slave:1, resetting slave

      [   46.203088] mlx4_core 0000:41:00.0: Turn on internal error to force reset, slave=1, cmd=0x5

      [   46.268572] mlx4_core 0000:41:00.0: slave:1 is out of sync, cmd=0x5, last command=0x0, reset is needed

      [   46.336826] mlx4_core 0000:41:00.0: Turn on internal error to force reset, slave=1, cmd=0x5

      [   46.406515] mlx4_core 0000:41:00.0: slave:1 is out of sync, cmd=0x5, last command=0x0, reset is needed

      [   46.476511] mlx4_core 0000:41:00.0: Turn on internal error to force reset, slave=1, cmd=0x5

      [   46.546482] mlx4_core 0000:41:00.1: HCA minimum page size:1

      [   46.582122] mlx4_core 0000:41:00.0: slave:1 is out of sync, cmd=0x5, last command=0x0, reset is needed

      [   46.653173] mlx4_core 0000:41:00.0: Turn on internal error to force reset, slave=1, cmd=0x5

      [   46.725318] mlx4_core 0000:41:00.1: The host supports neither eth nor rdma interfaces

      [   46.799557] mlx4_core 0000:41:00.1: QUERY_FUNC_CAP general command failed, aborting (-93)

      [   46.873709] mlx4_core 0000:41:00.1: Failed to obtain slave caps

      [   46.911030] mlx4_core 0000:41:00.0: Received reset from slave:1

      [   46.948493] mlx4_core: probe of 0000:41:00.1 failed with error -93

       

      I am concerned about the AMD-Vi messages; googling doesn't turn up many relevant answers. I am running Ubuntu Trusty 14.04 (3.16 kernel; I also tried 4.2) with the latest OFED 3.3 (I tried 3.2 as well).
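
To make the fault pattern easier to see, here is a minimal Python sketch that pulls the IO_PAGE_FAULT events out of the dmesg text quoted above. The function name `parse_io_page_faults` is made up for illustration, and the regular expression assumes the exact message layout shown in this log; other kernels may format these events differently.

```python
import re

# Matches the AMD-Vi event lines exactly as they appear in the dmesg output above.
FAULT_RE = re.compile(
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"device=(?P<device>\S+) "
    r"domain=(?P<domain>0x[0-9a-fA-F]+) "
    r"address=(?P<address>0x[0-9a-fA-F]+) "
    r"flags=(?P<flags>0x[0-9a-fA-F]+)\]"
)


def parse_io_page_faults(dmesg_text):
    """Return one dict per IO_PAGE_FAULT event, with numeric fields as ints."""
    faults = []
    for m in FAULT_RE.finditer(dmesg_text):
        faults.append({
            "device": m.group("device"),
            "domain": int(m.group("domain"), 16),
            "address": int(m.group("address"), 16),
            "flags": int(m.group("flags"), 16),
        })
    return faults


# The four events copied from the log above.
SAMPLE = """
[   40.777898] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x00000037f7bde000 flags=0x0050]
[   40.833233] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x00000037f7bde040 flags=0x0050]
[   40.890985] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x00000037f7bde080 flags=0x0050]
[   40.949797] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.1 domain=0x0000 address=0x00000037f7bde0c0 flags=0x0050]
"""

faults = parse_io_page_faults(SAMPLE)
# Which PCI functions faulted, and how far apart consecutive fault addresses are.
devices = sorted({f["device"] for f in faults})
strides = [b["address"] - a["address"] for a, b in zip(faults, faults[1:])]
print(devices, [hex(s) for s in strides])
```

On the four events above, this reports a single faulting function (41:00.1, the first VF) and consecutive fault addresses 0x40 bytes apart. It only summarizes the log; it does not say why the accesses were rejected.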

       

      The card is a dual port CX3 VPI with port 1 connected at FDR:

      PSID:                MT_1090120019

       

      The hypervisor is a Dell C6145 sled with the latest firmware. SR-IOV is enabled in the BIOS, and the IOMMU is enabled in grub. I'm coming from Intel land and am not too familiar with AMD. Does this look right, or should I be seeing something additional regarding IOMMU/HW virt/SR-IOV?

       

      [    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.16.0-71-generic root=UUID=bc67403d-a8e1-4e30-bf48-36ffeecd04e0 ro iommu=pt

      [    4.167159] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40

      [    4.167163] AMD-Vi: Found IOMMU at 0000:40:00.2 cap 0x40

      [    4.167166] AMD-Vi: Interrupt remapping enabled

      [    4.167664] AMD-Vi: Initialized for Passthrough Mode
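
For comparing boot settings across hosts, a small Python sketch like the following lists the IOMMU-related tokens on a kernel command line. The helper name `iommu_params` is hypothetical, and it only reports what is set; it makes no claim about which settings this platform actually needs.

```python
def iommu_params(cmdline):
    """Collect kernel command-line tokens whose key mentions 'iommu'.

    Returns a dict mapping each key to its value, or to None for a bare
    flag that has no '=value' part.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if "iommu" in key:
            params[key] = value if sep else None
    return params


# The command line copied from the dmesg output above.
CMDLINE = (
    "BOOT_IMAGE=/boot/vmlinuz-3.16.0-71-generic "
    "root=UUID=bc67403d-a8e1-4e30-bf48-36ffeecd04e0 ro iommu=pt"
)

print(iommu_params(CMDLINE))
```

For the command line shown here, the only IOMMU-related token is `iommu=pt`, which matches the "Initialized for Passthrough Mode" message. On a live system the same function could be fed the contents of `/proc/cmdline`.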

       

      The VFs do show up in lspci, but they seem non-functional:

       

      41:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

      41:00.1 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

      41:00.2 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

      41:00.3 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

      41:00.4 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

       

      modprobe options for mlx4_core:

       

      options mlx4_core num_vfs=4 port_type_array=1,1 probe_vf=1

      (changing to probe_vf=0 doesn't help; with probe_vf=1 no interfaces appear)
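
When comparing configurations between attempts, it can help to turn the options line into a dictionary. Below is a minimal Python sketch; `parse_modprobe_options` is a made-up helper name, and it assumes the simple `options <module> key=value ...` layout used above, with no quoting or line continuations.

```python
def parse_modprobe_options(line):
    """Split an 'options <module> key=value ...' line into (module, params).

    Values are kept as strings, so comma-separated settings such as
    port_type_array=1,1 are left untouched.
    """
    fields = line.split()
    if len(fields) < 2 or fields[0] != "options":
        raise ValueError("not a modprobe 'options' line: %r" % line)
    module = fields[1]
    params = {}
    for field in fields[2:]:
        key, _, value = field.partition("=")
        params[key] = value
    return module, params


# The options line quoted above.
module, params = parse_modprobe_options(
    "options mlx4_core num_vfs=4 port_type_array=1,1 probe_vf=1"
)
print(module, params)
```

Two such dictionaries (for example, one per attempt with a different `probe_vf`) can then be compared key by key.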

       

      Thanks for any suggestions!

       

      Cheers