3 Replies Latest reply on Aug 10, 2017 1:29 AM by march

    Mellanox card disappeared from PCI bus

    mplaneta

      Hello,

       

      I have two computers with Mellanox ConnectX-3 InfiniBand cards connected directly to each other. I configured several VMs on each node with SR-IOV passthrough of the InfiniBand cards. When I was mostly done, I also tried to configure IB so that it would be usable on the host. I rebooted the hosts and saw that the IB cards had completely disappeared from the PCI bus. I rebooted the systems several more times and one of the IB cards reappeared, but the other one is still missing. I disconnected every cable from that host and even removed and reseated the card, but this had no effect.

       

      An important detail: when I boot either node, one of the first screens during the boot process shows a message from the IB firmware. From there I can enter a menu to enable or disable SR-IOV, set the maximum number of physical functions, and change some other settings. When the IB card is missing from lspci, this firmware boot screen does not appear.

       

      I will now describe my system and outline the steps I took to configure IB passthrough. The host runs Debian 9 with the IB drivers from the Debian repository. The guests run CentOS 7.3 with the Mellanox OFED distribution for CentOS 7.3. For virtualization I use QEMU/KVM with libvirt.

       

      My card shows on the host as:

      05:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
      05:00.1 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
      05:00.2 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
      05:00.3 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
      05:00.4 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
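
      The listing above comes from lspci. For repeated checks after each reboot, the same question ("is any Mellanox function visible on the PCI bus?") can be asked of sysfs directly. The following is only a minimal sketch: it assumes a Linux host where /sys/bus/pci/devices is populated, and the function name find_mellanox_functions is made up for illustration. 0x15b3 is the PCI vendor ID of Mellanox Technologies.

```python
from pathlib import Path

# PCI vendor ID of Mellanox Technologies, as reported in sysfs "vendor" files.
MELLANOX_VENDOR_ID = "0x15b3"


def find_mellanox_functions(sysfs_root="/sys/bus/pci/devices"):
    """Return the sorted PCI addresses whose vendor ID is Mellanox.

    Hypothetical helper: it only reads sysfs and changes nothing.
    An empty list means no Mellanox function is enumerated on the bus.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []

    found = []
    for dev in root.iterdir():
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
        except OSError:
            # Entry without a readable vendor file: skip it.
            continue
        if vendor == MELLANOX_VENDOR_ID:
            found.append(dev.name)
    return sorted(found)


if __name__ == "__main__":
    functions = find_mellanox_functions()
    if functions:
        print("Mellanox PCI functions:", ", ".join(functions))
    else:
        print("No Mellanox device is visible on the PCI bus")
```

      On the working host this should list the physical function and the virtual functions shown above; on the broken host it would return an empty list, matching what lspci shows.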

       

      Both host and guest used the mlx4_core driver. Here are some of the modules loaded on the host:

      Module                  Size  Used by
      mlx4_ib               163840  0
      mlx4_en               114688  0
      mlx4_core             303104  2 mlx4_en,mlx4_ib
      kvm_intel             192512  0
      kvm                   589824  1 kvm_intel
      irqbypass              16384  1 kvm
      ib_umad                24576  0
      ib_core               208896  2 ib_umad,mlx4_ib

      I was also loading ib_ipoib on both the host and the guest, but on the guest it crashed the kernel.

       

      Another suspicious thing happened when I attached virtual functions to the guest systems (sudo virsh attach-device ...). The following messages appeared in the kernel log:

       

      Jul  6 16:07:04 ib1 kernel: [  281.707448] vfio-pci 0000:05:00.4: enabling device (0000 -> 0002)
      Jul  6 16:07:06 ib1 kernel: [  283.475412] virbr1: port 5(vnet3) entered learning state
      Jul  6 16:07:08 ib1 kernel: [  285.491419] virbr1: port 5(vnet3) entered forwarding state
      Jul  6 16:07:08 ib1 kernel: [  285.491424] virbr1: topology change detected, propagating
      Jul  6 16:07:13 ib1 kernel: [  290.895918] kvm [2264]: vcpu0, guest rIP: 0xffffffff81060d78 disabled perfctr wrmsr: 0xc2 data 0xffff
      Jul  6 16:07:13 ib1 kernel: [  290.933587] kvm: zapping shadow pages for mmio generation wraparound
      Jul  6 16:07:13 ib1 kernel: [  290.939149] kvm: zapping shadow pages for mmio generation wraparound
      Jul  6 16:07:14 ib1 kernel: [  291.721929] mlx4_core 0000:05:00.0: Received reset from slave:4
      Jul  6 16:07:14 ib1 kernel: [  291.767436] mlx4_core 0000:05:00.0: Unknown command:0x55 accepted from slave:4
      Jul  7 07:52:13 ib1 kernel: [56990.799006] mlx4_core 0000:05:00.0: mlx4_eq_int: slave:2, srq_no:0x41, event: 14(00)
      Jul  7 07:52:13 ib1 kernel: [56990.799009] mlx4_core 0000:05:00.0: mlx4_eq_int: sending event 14(00) to slave:2
      Jul  7 08:39:31 ib1 kernel: [59828.975516] mlx4_core 0000:05:00.0: Received reset from slave:4
      Jul  7 08:39:31 ib1 kernel: [59829.044683] virbr1: port 5(vnet3) entered disabled state
      Jul  7 08:39:31 ib1 kernel: [59829.044752] device vnet3 left promiscuous mode

       

      Note the line with "Unknown command".
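
      To find every occurrence of that message in a longer log, something like the sketch below could be used. It is only an illustration: the regular expressions are written against the exact message format shown in the excerpt above, and the function name unknown_commands is hypothetical. Other kernel versions may word the message differently.

```python
import re

# Matches the driver prefix and PCI address, e.g. "mlx4_core 0000:05:00.0: <message>".
MLX4_LINE = re.compile(
    r"mlx4_core (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): (?P<msg>.*)$"
)
# Matches the message text as it appears in the log excerpt above.
UNKNOWN_CMD = re.compile(
    r"Unknown command:(?P<cmd>0x[0-9a-fA-F]+) accepted from slave:(?P<slave>\d+)"
)


def unknown_commands(log_lines):
    """Return (pci_address, command, slave) for each 'Unknown command' line."""
    hits = []
    for line in log_lines:
        driver_match = MLX4_LINE.search(line)
        if driver_match is None:
            continue  # not an mlx4_core message
        cmd_match = UNKNOWN_CMD.search(driver_match.group("msg"))
        if cmd_match is not None:
            hits.append(
                (
                    driver_match.group("bdf"),
                    cmd_match.group("cmd"),
                    int(cmd_match.group("slave")),
                )
            )
    return hits


if __name__ == "__main__":
    sample = [
        "Jul  6 16:07:14 ib1 kernel: [  291.721929] mlx4_core 0000:05:00.0: Received reset from slave:4",
        "Jul  6 16:07:14 ib1 kernel: [  291.767436] mlx4_core 0000:05:00.0: Unknown command:0x55 accepted from slave:4",
    ]
    for bdf, cmd, slave in unknown_commands(sample):
        print(f"{bdf}: command {cmd} from slave {slave}")
```

      Run against the excerpt above, this reports a single hit: command 0x55 from slave 4 on 0000:05:00.0.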

       

      I did not update the firmware, at least not recently.

       

      ibstat on the working system reports the following:

       

      CA 'mlx4_0'
           CA type: MT4099
           Number of ports: 2
           Firmware version: 2.34.5000
           Hardware version: 0
           Node GUID: 0xf45214030010a4a0
           System image GUID: 0xf45214030010a4a3
           Port 1:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x0250486a
                Port GUID: 0xf45214030010a4a1
                Link layer: InfiniBand
           Port 2:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x0250486a
                Port GUID: 0xf45214030010a4a2
                Link layer: InfiniBand
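
      For comparing port state between reboots, the ibstat text above can be turned into a small dictionary. This is a rough sketch, not an official parser: it assumes output for a single CA in the layout shown above, and parse_ibstat_ports is a hypothetical name.

```python
import re

# A port header line looks like "Port 1:"; "Port GUID: ..." must not match.
PORT_HEADER = re.compile(r"^Port (\d+):$")


def parse_ibstat_ports(text):
    """Map port number -> {field name: value} from ibstat-style text.

    Assumes a single CA; fields that appear before the first
    "Port N:" header (CA type, firmware version, ...) are ignored.
    """
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = PORT_HEADER.match(line)
        if header:
            current = int(header.group(1))
            ports[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports


if __name__ == "__main__":
    sample = """CA 'mlx4_0'
    CA type: MT4099
    Port 1:
        State: Down
        Physical state: Polling
    """
    for number, fields in parse_ibstat_ports(sample).items():
        print(number, fields["State"], fields["Physical state"])
```

      With the output above, both ports come back with State "Down" and Physical state "Polling".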

       

      Could you help me to get my card back?