4 Replies Latest reply on May 2, 2013 3:44 AM by coiter

    Failed loading HCA driver and Access Layer

      I'm sorry to ask again. I'm new to InfiniBand, so I don't know all the tricks and how to make it work just yet :)

       

      I managed to install the software as described in the previous article, and rebooted the blade node.

       

      When I start it again, I now get this error:

       

      Loading HCA driver and Access Layer = Failed

      Please open an issue in the http://bugs.openfabrics.org and attach /tmp/ib_debug_info_log.

       

      The debug file is a copy of dmesg, and it has the following lines:

       

      mlx4_ib           80171  0
      ib_mad            40497  5 ib_cm,ib_sa,ib_umad,mlx4_ib,ib_mthca
      ib_core           69979  9 ib_cm,ib_sa,ib_uverbs,ib_umad,iw_nes,iw_cxgb3,mlx4_ib,ib_mthca,ib_mad
      mlx4_en           97664  0
      mlx4_core        185193  2 mlx4_ib,mlx4_en

      mlx4_core: Mellanox ConnectX core driver v1.0-mlnx_ofed1.5.3 (November 3, 2011)

      mlx4_core: Initializing 0000:03:00.0

      mlx4_core 0000:03:00.0: PCI INT A -> GSI 48 (level, low) -> IRQ 48

      mlx4_core 0000:03:00.0: setting latency timer to 64

      mlx4_core 0000:03:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.
      mlx4_core 0000:03:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.

      mlx4_en: Mellanox ConnectX HCA Ethernet driver v1.5.8.3 (June 2012)

      mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0-mlnx_ofed1.5.3 (November 3, 2011)

      Apr 28 12:32:48 dpn01 modprobe: FATAL: Error inserting ib_ipoib (/lib/modules/2.6.32-279.el6.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko): Unknown symbol in module, or unknown parameter (see dmesg)
      Apr 28 12:44:44 dpn01 modprobe: FATAL: Error inserting ib_ipoib (/lib/modules/2.6.32-279.el6.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko): Unknown symbol in module, or unknown parameter (see dmesg)

      Apr 28 12:48:06 dpn01 root[4494]: Set node_desc for mlx4_0: dpn01 HCA-1

      root 15022  0 12:29 ?   00:00:00 [mlx4]
      root 15042  0 12:29 ?   00:00:00 [mlx4_opreq]
      root 15682  0 12:29 ?   00:00:00 [mlx4_sense]
      root 15772  0 12:29 ?   00:00:00 [mlx4_en]
      root 26512  0 12:29 ?   00:00:00 [mlx4_ib]

       

      It looks like it's loading the HCA Ethernet driver, so why does the other one fail? Is it because of the firmware lines above?

       

      Any help appreciated.
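
The modprobe message above only says "see dmesg", so a reasonable next step is to pull the names of the unresolved symbols out of the kernel log. The following is a minimal sketch, not a definitive procedure: the helper name `unknown_symbols` is hypothetical, and it assumes the kernel log reports the failure with the usual "Unknown symbol" wording.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print each
# distinct line that mentions an unknown symbol.
unknown_symbols() {
    grep -i 'unknown symbol' | sort -u
}

# Typical use on the affected node (needs the real system, so left as comments):
#   dmesg | unknown_symbols
#   modinfo -F depends ib_ipoib     # modules ib_ipoib expects to be loaded first
#   modinfo -F vermagic ib_ipoib    # kernel version the module was built against
#   uname -r                        # running kernel, to compare with vermagic
```

One common cause of unknown symbols is a module built for a different kernel or a different copy of the InfiniBand core modules than the one currently loaded, which is why the sketch compares `vermagic` with `uname -r`. Whether that is the cause here cannot be told from the log excerpt alone.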

        • Re: Failed loading HCA driver and Access Layer
          1. What OS are you using?
          2. Is it possible to get the serial number, firmware and/or PSID?
          3. Where did you download this OFED version?
            • Re: Failed loading HCA driver and Access Layer
              justinclift

              coiter - As a thought, since you're running CentOS 6.x, you might find it easier to start with the CentOS provided drivers (instead of Mellanox OFED).

               

              From a fresh CentOS install (without Mellanox OFED), you then do:

               

              $ sudo yum groupinstall "Infiniband Support"

               

              In theory that should install working drivers and things should "just work".

               

              It's apparently not as optimised as the Mellanox OFED stuff.  But it's a pretty useful way to get up and running at first with minimal hassles.

               

              It's also pretty easy to remove those packages afterwards though if you want to try a different approach (ie Mellanox OFED):

               

              $ sudo yum groupremove "Infiniband Support"

               

              Hope that's helpful.
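
To check whether the drivers actually came up after either install route, one option is to compare the `lsmod` output against the modules you expect. This is a minimal sketch under stated assumptions: the helper name `missing_ib_modules` is hypothetical, and the module list (`mlx4_core`, `mlx4_ib`, `ib_ipoib`) is an assumption for a ConnectX card that should run IPoIB.

```shell
#!/bin/sh
# Hypothetical helper: read `lsmod` output on stdin and print every
# expected module that is NOT loaded.
# The expected list is an assumption for a ConnectX HCA using IPoIB.
missing_ib_modules() {
    loaded=$(awk '{ print $1 }')          # first column is the module name
    for m in mlx4_core mlx4_ib ib_ipoib; do
        # -x matches the whole line, so "mlx4_ib" does not match "mlx4_ib_foo"
        printf '%s\n' "$loaded" | grep -qx "$m" || echo "$m"
    done
}

# Typical use (needs the real system, so left as a comment):
#   lsmod | missing_ib_modules
```

Run against the module list in the original post, this prints `ib_ipoib`, which matches the modprobe failure reported there. An empty result only means the modules are loaded, not that the link is up.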

               

              (note - edited for typo fixes)

                • Re: Re: Failed loading HCA driver and Access Layer
                  justinclift

                  As an extra thought, when diagnosing new setups on RHEL and CentOS, the output from "lspci" is often helpful.

                   

                  For example, on a test box here:

                   

                  $ sudo lspci

                  00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)

                  00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)

                  00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)

                  00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)

                  00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)

                  00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)

                  00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)

                  00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)

                  00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)

                  00:1c.5 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c4)

                  00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)

                  00:1f.0 ISA bridge: Intel Corporation Z77 Express Chipset LPC Controller (rev 04)

                  00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)

                  00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)

                  01:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)

                  03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 09)

                  04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 03)


                  This helps us figure out what card the box is seeing, and where it's located on the PCI bus (01:00.0 in the example above).


                  With the CentOS provided "mstflint" package installed (or the Mellanox OFED "flint" equivalent), you can use that PCI address to check the firmware revision of the card(s):


                  $ sudo yum install mstflint

                  $ sudo mstflint -d 01:00.0 query

                  Image type:      ConnectX

                  FW Version:      2.9.1000

                  Device ID:       25418

                  Description:     Node             Port1            Port2            Sys image

                  GUIDs:           0003ba000100edb8 0003ba000100edb9 0003ba000100edba 0003ba000100edbb

                  MACs:                                 0003ba00edb9     0003ba00edba

                  Board ID:         (MT_04A0120002)

                  VSD:            

                  PSID:            MT_04A0120002

                   

                  The firmware of the card above is version "2.9.1000", which is actually useful to know.

                   

                  (Note: the mstflint "query" parameter can be abbreviated to just "q".  I used "query" above because it's easier for new users to follow.)

                   

                  (Note - edited to add mstflint yum command)
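
The two steps above (find the card with lspci, then query it with mstflint) can be chained so that every Mellanox device in the box is queried in one go. This is a minimal sketch: the helper names `mellanox_addrs` and `fw_version` are hypothetical, and it assumes the output formats shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci` output on stdin and print the PCI
# address (first field) of every line that mentions Mellanox.
mellanox_addrs() {
    awk '/Mellanox/ { print $1 }'
}

# Hypothetical helper: read `mstflint ... query` output on stdin and
# print only the firmware version (last field of the "FW Version:" line).
fw_version() {
    awk '/FW Version:/ { print $NF }'
}

# Typical use (needs root and a real card, so left as comments):
#   for addr in $(lspci | mellanox_addrs); do
#       printf '%s ' "$addr"
#       sudo mstflint -d "$addr" q | fw_version
#   done
```

For the test box above this would report 01:00.0 with firmware 2.9.1000.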

                    • Re: Failed loading HCA driver and Access Layer

                      It seems to have somewhat resolved itself :)

                       

                      As for the other info, it's:

                      [root@dpn08]# lspci |grep Mel
                      03:00.0 Network controller [0207]: Mellanox Technologies MT27500 Family [ConnectX-3]

                       

                      [root@dpn08]# mstflint -d 03:00.0 query
                      Image type:      ConnectX
                      FW Version:      2.11.550
                      Rom Info:        type=PXE  version=3.4.0 devid=4099 proto=VPI
                      Device ID:       4099
                      Description:     Node             Port1            Port2            Sys image
                      GUIDs:           0002c90300389f60 0002c90300389f61 0002c90300389f62 0002c90300389f63
                      MACs:                                 000000000000     000000000000
                      VSD:
                      PSID:            DEL0A20210018

                       

                      The only other issue I have now is that, testing with v2 OFED and CentOS 6.4, yum update fails because ibutils-libs is missing, and strange things happen when I try to install it.

                       

                      It's part of the packages available for CentOS 6.4, but yum search says nothing is available.

                       

                      Will a yum update with ibutils break the OFED package as well?