It's a bit hard to understand what actually happened without looking at the full kernel log, but the first issue looks like a memory issue with QP registrations, most likely caused by an earlier problem. The most common causes would be the firmware getting stuck, a PCI issue, etc. I would swap the HCA with another one to see whether the issue follows the card.
As for upgrading, this is a really old HCA, so newer MFT versions will most likely not work with it. Are you still in that state even after the server is rebooted? What does "mst status" show?
Here's a more complete log output:
Nov 18 14:28:29 alin kernel: [ 9.977168] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
Nov 18 14:28:29 alin kernel: [ 9.977170] ib_mthca: Initializing 0000:02:00.0
Nov 18 14:28:29 alin kernel: [ 11.374057] ib_mthca 0000:02:00.0: HCA FW version 5.1.000 is old (5.3.000 is current).
Nov 18 14:28:29 alin kernel: [ 11.374059] ib_mthca 0000:02:00.0: If you have problems, try updating your HCA FW.
Nov 18 14:29:10 alin kernel: [ 59.296536] ib1: ib_dealloc_pd failed
Nov 18 14:31:22 alin kernel: [ 167.880313] ib_mthca 0000:02:00.0: SW2HW_MPT failed (-16)
Nov 18 14:33:16 alin kernel: [ 281.265414] ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16)
Nov 18 14:33:22 alin kernel: [ 287.885556] ib_mthca 0000:02:00.0: SW2HW_MPT failed (-16)
Nov 18 14:34:16 alin kernel: [ 341.266202] ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16)
Nov 18 14:34:22 alin kernel: [ 347.886276] mthca0: ib_query_port 1 failed
The driver suggests a firmware update (the HCA FW is 5.1.000 and 5.3.000 is current), and more errors follow.
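The "(-16)" in the SW2HW_MPT and HW2SW_MPT lines of the log is a negated errno value, and errno 16 is EBUSY on Linux. As a minimal sketch, the lines can be filtered and decoded as below. The regular expression and the function name decode_failures are my own for illustration, and what exactly the driver means by EBUSY here is not established in this thread.

```python
import errno
import os
import re

# Matches lines such as:
#   "... ib_mthca 0000:02:00.0: SW2HW_MPT failed (-16)"
FAILED_CMD = re.compile(
    r"ib_mthca (?P<dev>\S+): (?P<cmd>\w+) failed \((?P<code>-?\d+)\)"
)


def decode_failures(lines):
    """Return (device, command, errno name, message) for each failed command."""
    results = []
    for line in lines:
        match = FAILED_CMD.search(line)
        if not match:
            continue  # e.g. "ib_query_port 1 failed" carries no error code
        code = abs(int(match.group("code")))
        name = errno.errorcode.get(code, "UNKNOWN")
        results.append(
            (match.group("dev"), match.group("cmd"), name, os.strerror(code))
        )
    return results


# Lines copied from the log above.
sample = [
    "Nov 18 14:31:22 alin kernel: [ 167.880313] ib_mthca 0000:02:00.0: SW2HW_MPT failed (-16)",
    "Nov 18 14:33:16 alin kernel: [ 281.265414] ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16)",
    "Nov 18 14:34:22 alin kernel: [ 347.886276] mthca0: ib_query_port 1 failed",
]

for row in decode_failures(sample):
    print(row)
```

The same function can be fed the output of dmesg or a syslog file, one line per list element.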
I don't have the 'mst' command. I installed the debian package mstflint:
mstflint - Mellanox firmware burning application
Which comes with: mstconfig mstflint mstmcra mstmread mstmtserver mstmwrite mstregdump mstvpd
Rebooting does solve the problem.
I should mention, if I don't put an IP address on the card and connect to the network, I can unload the modules in this order (unlike my example above):
modprobe -r ib_ipoib
modprobe -r ib_umad
modprobe -r mlx4_ib
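Whether a module can be removed depends on its reference count, which /proc/modules reports in the third column, with the modules using it listed in the fourth. A rough sketch for checking that before running the commands above follows. The helper names and the sample text are made up for illustration (the sizes, counts and addresses are not from this machine), so on a real system read /proc/modules instead.

```python
def parse_proc_modules(text):
    """Parse /proc/modules text into {name: (refcount, [users])}.

    Each line looks like: "name size refcount used_by state address".
    refcount is None when the kernel prints "-" for it.
    """
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        name, _size, refcount, used_by = fields[:4]
        count = int(refcount) if refcount.isdigit() else None
        users = [] if used_by == "-" else [u for u in used_by.split(",") if u]
        modules[name] = (count, users)
    return modules


def still_in_use(modules, names):
    """Return the subset of names that are loaded with a non-zero refcount."""
    return {
        n: modules[n]
        for n in names
        if n in modules and modules[n][0] not in (0, None)
    }


# Hypothetical sample, NOT taken from the machine in this thread.
SAMPLE = """\
ib_ipoib 100000 0 - Live 0x0000000000000000
ib_umad 20000 0 - Live 0x0000000000000000
mlx4_ib 150000 1 - Live 0x0000000000000000
ib_core 200000 3 ib_ipoib,ib_umad,mlx4_ib, Live 0x0000000000000000
"""

if __name__ == "__main__":
    mods = parse_proc_modules(SAMPLE)
    print(still_in_use(mods, ["ib_ipoib", "ib_umad", "mlx4_ib"]))
```

On a live system, replace SAMPLE with open("/proc/modules").read(). A module reported as still in use is one that modprobe -r would be expected to refuse.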
Nevertheless, if I load the modules once again in the correct order, I don't get an IB0 or IB1 interface, and ibstatus shows:
Fatal error: device '*': sys files not found (/sys/class/infiniband/*/ports)
/usr/sbin/ibstatus: 21: exit: Illegal number: -1
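The ibstatus error says nothing matched /sys/class/infiniband/*/ports, so it can help to look at that directory directly after each module reload. The following is only a sketch: the function name list_ib_ports is mine, and it assumes the sysfs layout named in the error message.

```python
import os


def list_ib_ports(root="/sys/class/infiniband"):
    """Map each device directory under root to its sorted port names.

    Returns an empty dict when root does not exist, which corresponds to
    the "sys files not found" case reported by ibstatus.
    """
    if not os.path.isdir(root):
        return {}
    devices = {}
    for dev in sorted(os.listdir(root)):
        ports_dir = os.path.join(root, dev, "ports")
        if os.path.isdir(ports_dir):
            devices[dev] = sorted(os.listdir(ports_dir))
        else:
            devices[dev] = []  # device registered, but no ports directory
    return devices


if __name__ == "__main__":
    found = list_ib_ports()
    if not found:
        print("no InfiniBand devices registered in sysfs")
    for dev, ports in found.items():
        print(dev, ports)
```

Running it once after the first (working) load and again after a reload would show whether the driver registers a device at all the second time.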
Note: this is all without suspend/resume being involved. So basically, I can load the modules only once and have connectivity; subsequent reloads render the card unresponsive, and nothing shows up in the log files or dmesg. If I can solve that problem, then I could probably get suspend/resume to work.
Thanks for the explanation.
I'm not totally sure how this old HCA FW handles a state where the modules are shut down by pm-suspend.
I would start by rebooting this server and going back to step 1: making sure that you have the latest OFED for your Debian OS and the latest FW before attempting these kinds of tests.
If you can list exactly what you have, we may be able to locate the necessary drivers (although they're antiques).