3 Replies Latest reply on Nov 24, 2014 12:33 AM by ferbs

    InfiniHost III Ex - Suspend/Resume not working on Debian Linux

    melon_x

      Hello,

       

      I'm using an InfiniHost III Ex / MT25208 on Debian/Jessie. After running 'pm-suspend' and then resuming, my network stops responding. Running ibstat or ibstatus then hangs, and eventually an error message related to ib_mthca appears:

       

      ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16)

       

      Here is a list of modules loaded on startup:

       

      mlx4_ib

      ib_umad

      ib_ipoib

       

      I've also tried unloading the modules before suspending like this:

       

      /etc/init.d/opensm stop

      modprobe -r ib_ipoib

      modprobe -r ib_umad

      modprobe -r mlx4_ib

      modprobe -r ib_mthca

       

      But when I reload the modules my ib1 interface does not appear. This happens even if I don't suspend.

       

      Btw, I've attempted to update the firmware but I can't get anything to work. Examples:

       

      lspci -d 15:b3 = nothing

      ibv_devinfo | grep hca_id = Failed to get IB devices list: Function not implemented.

      mstflint -d 02:00.0 q = -E- Cannot open Device: 02:00.0. File exists MFE_OLD_DEVICE_TYPE

       

      plain lspci =

      02:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
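A note on the empty `lspci -d` result above: `lspci -d` takes `<vendor>:<device>`, and Mellanox's PCI vendor ID is 15b3, so `15:b3` is read as vendor 15, device b3 and matches nothing. A minimal sketch of a vendor filter follows; the helper name `mellanox_devices` is hypothetical, and it assumes the numeric `[vendor:device]` suffix that `lspci -nn` prints.

```shell
# Keep only lines of `lspci -nn` output (read from stdin) whose
# [vendor:device] field has the Mellanox vendor ID 15b3.
mellanox_devices() {
    grep -i '\[15b3:[0-9a-f]\{4\}]'
}

# Typical use on a live system (not run here):
#   lspci -nn | mellanox_devices
```

On a live system `lspci -d 15b3:` should select the same devices directly.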

        • Re: InfiniHost III Ex - Suspend/Resume not working on Debian Linux
          ferbs

          Hi,

           

          It's a bit hard to understand what actually happened without looking at the full kernel log, but the first error looks like a memory issue with QP registrations, most likely caused by an earlier problem. The most common causes are the firmware getting stuck, a PCI issue, etc. I would swap the HCA with another one to see whether the issue follows the card.

           

          As for upgrading, this is a really old HCA, so newer MFT versions will most likely not work with it. Are you still in that state even after the server is rebooted? What does "mst status" show?

            • Re: InfiniHost III Ex - Suspend/Resume not working on Debian Linux
              melon_x

              Here's a more complete log output:

               

              Nov 18 14:28:29 alin kernel: [    9.977168] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)

              Nov 18 14:28:29 alin kernel: [    9.977170] ib_mthca: Initializing 0000:02:00.0

              Nov 18 14:28:29 alin kernel: [   11.374057] ib_mthca 0000:02:00.0: HCA FW version 5.1.000 is old (5.3.000 is current).

              Nov 18 14:28:29 alin kernel: [   11.374059] ib_mthca 0000:02:00.0: If you have problems, try updating your HCA FW.

              Nov 18 14:29:10 alin kernel: [   59.296536] ib1: ib_dealloc_pd failed

              Nov 18 14:31:22 alin kernel: [  167.880313] ib_mthca 0000:02:00.0: SW2HW_MPT failed (-16)

              Nov 18 14:33:16 alin kernel: [  281.265414] ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16)

              Nov 18 14:33:22 alin kernel: [  287.885556] ib_mthca 0000:02:00.0: SW2HW_MPT failed (-16)

              Nov 18 14:34:16 alin kernel: [  341.266202] ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16)

              Nov 18 14:34:22 alin kernel: [  347.886276] mthca0: ib_query_port 1 failed

               

              The driver suggests a firmware update, and more errors follow.
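For a longer log it can help to condense the ib_mthca command failures into counts. The sketch below assumes the message format seen in the excerpt above (`ib_mthca <pci-address>: <COMMAND> failed (<errno>)`); the function name `mthca_failures` is hypothetical. On Linux, errno 16 is EBUSY, so `(-16)` means the command came back with a busy error.

```shell
# Summarise ib_mthca command failures from kernel log text on stdin.
# Prints one line per distinct failure: "<count> <command> <errno>".
mthca_failures() {
    sed -n 's/.*ib_mthca [0-9a-f:.]*: \([A-Z0-9_]*\) failed (\(-[0-9]*\)).*/\1 \2/p' |
        LC_ALL=C sort | uniq -c | awk '{ print $1, $2, $3 }'
}

# Typical use on a live system (not run here):
#   dmesg | mthca_failures
```

Lines that do not match the failure pattern, such as the firmware version notice, are ignored.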

               

              I don't have the 'mst' command. I installed the debian package mstflint:

               

              mstflint - Mellanox firmware burning application

               

              Which comes with: mstconfig    mstflint     mstmcra      mstmread     mstmtserver  mstmwrite    mstregdump   mstvpd

               

              Rebooting does solve the problem.

               

              I should mention, if I don't put an IP address on the card and connect to the network, I can unload the modules in this order (unlike my example above):

               

              modprobe -r ib_ipoib

              modprobe -r ib_umad

              modprobe -r mlx4_ib

               

              Nevertheless, if I load the modules once again in the correct order, I don't get an ib0 or ib1 interface, and ibstatus shows:

               

              Fatal error:  device '*': sys files not found (/sys/class/infiniband/*/ports)

              /usr/sbin/ibstatus: 21: exit: Illegal number: -1

               

              Note: this is all without suspend/resume being involved. So basically, I can only load the modules once and have connectivity; subsequent reloads render the card unresponsive, and nothing shows up in the log files or dmesg. If I can solve that problem, then I could probably get suspend/resume to work.
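The reload test described above can be written down as a repeatable sequence. This is only a sketch under assumptions: the kernel log in this thread shows ib_mthca initializing 0000:02:00.0, so ib_mthca is treated as the HCA driver here and mlx4_ib is left out; the helper names `ib_run` and `reload_ib_stack` are hypothetical; and nothing here is claimed to bring the interface back. By default it only prints the commands.

```shell
# Dry-run by default: commands are printed. Set RUN=1 (as root) to execute.
ib_run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

reload_ib_stack() {
    # Remove the users of the HCA first, then the HCA driver itself.
    for m in ib_ipoib ib_umad ib_mthca; do
        ib_run modprobe -r "$m" || return 1
    done
    # Load in the opposite order: HCA driver first, upper layers afterwards.
    for m in ib_mthca ib_umad ib_ipoib; do
        ib_run modprobe "$m" || return 1
    done
}
```

Calling `reload_ib_stack` prints the six modprobe commands in order; `RUN=1 reload_ib_stack` would execute them and stop at the first failure. If opensm is running, it would need to be stopped beforehand, as in the first post.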

                • Re: InfiniHost III Ex - Suspend/Resume not working on Debian Linux
                  ferbs

                  Hi,

                   

                  Thanks for the explanation.

                   

                  I'm not totally sure how this old HCA FW handles a state where the modules are shut down by pm-suspend.

                  I would start by rebooting the server and going back to step 1: making sure you have the latest OFED for your Debian OS, and the latest FW, before attempting these kinds of tests.

                   

                  If you can list exactly what you have, we may be able to locate the necessary drivers (although they're antiques).