Getting "mlx4_core 0000:02:00.0: command 0x24 failed: fw status = 0x30" in dmesg on the IB nodes.
Does anyone know what "command 0x24" is referring to?
Command 0x24 is a MAD (Management Datagram) operation, which is handled by the firmware. The related "command 0x24 (go bit not cleared)" message means this MAD operation timed out.
This "go bit" error indicates that the firmware is trying to execute a MAD command while its queue is full or hung. Upgrading the NIC's firmware to the latest compatible version usually resolves the issue.
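If you see many of these lines across nodes, a small parser that maps the opcode and firmware status to symbolic names can help triage them. This is a minimal sketch. The lookup tables are a partial subset that we believe matches the Linux mlx4 driver sources: opcodes such as MLX4_CMD_MAD_IFC come from include/linux/mlx4/cmd.h, and status codes such as CMD_STAT_BAD_PKT come from drivers/net/ethernet/mellanox/mlx4/cmd.c. Check both tables against your own kernel tree. `decode_mlx4_line` is a hypothetical helper, not part of any tool.

```python
import re

# Partial opcode table (assumed to match MLX4_CMD_* in include/linux/mlx4/cmd.h).
# Verify against your kernel source before relying on it.
MLX4_OPCODES = {
    0x04: "QUERY_FW",
    0x23: "CONF_SPECIAL_QP",
    0x24: "MAD_IFC",
    0x43: "QUERY_PORT",
}

# Partial firmware status table (assumed to match CMD_STAT_* in mlx4/cmd.c).
MLX4_FW_STATUS = {
    0x00: "OK",
    0x01: "INTERNAL_ERR",
    0x02: "BAD_OP",
    0x03: "BAD_PARAM",
    0x06: "RESOURCE_BUSY",
    0x30: "BAD_PKT (bad management packet, silently discarded)",
}

# Matches lines such as:
#   mlx4_core 0000:02:00.0: command 0x24 failed: fw status = 0x30
# search() is used so a leading dmesg timestamp does not matter.
LINE_RE = re.compile(
    r"mlx4_core (?P<dev>\S+): command (?P<op>0x[0-9a-fA-F]+) failed: "
    r"fw status = (?P<status>0x[0-9a-fA-F]+)"
)


def decode_mlx4_line(line):
    """Return a dict describing an mlx4 command failure, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    op = int(m.group("op"), 16)
    status = int(m.group("status"), 16)
    return {
        "device": m.group("dev"),  # PCI address of the HCA
        "opcode": op,
        "opcode_name": MLX4_OPCODES.get(op, "unknown"),
        "status": status,
        "status_name": MLX4_FW_STATUS.get(status, "unknown"),
    }


if __name__ == "__main__":
    sample = "mlx4_core 0000:02:00.0: command 0x24 failed: fw status = 0x30"
    print(decode_mlx4_line(sample))
```

You could feed it the output of `dmesg` line by line and count failures per device and opcode. Lines that do not match the pattern, including the "go bit not cleared" timeout variant, return None.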