2 Replies Latest reply on Jun 9, 2018 10:57 AM by martijn@mellanox.com

    PCIe Bus Errors with ConnectX-3 Pro and ESC8000 G3

    abdullin

      Hello.

       

      We are having problems with MCX312B adapters in an ASUS ESC8000 G3 server platform.

       

      Information about server:

      Driver: 4.2-1.0.1
      OS: Ubuntu 14.04, kernel 4.4.0-116-generic
      2 x MCX312B
      8 x Nvidia 1080G GPU

       

      We saw the following error for both network cards: "AER: Uncorrected (Non-Fatal) error received: id=0010". After that, the network cards were reset.

      This error occurs randomly.
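The id value in the AER message is a PCIe Requester ID, which packs bus, device and function into 16 bits (8 bits of bus, 5 bits of device, 3 bits of function). As a minimal sketch, assuming that standard layout, it can be decoded like this. The helper name `decode_requester_id` is hypothetical and used only for illustration.

```python
def decode_requester_id(rid_hex: str) -> str:
    """Convert a 16-bit PCIe Requester ID (hex string) to bus:device.function."""
    rid = int(rid_hex, 16)
    bus = (rid >> 8) & 0xFF      # upper 8 bits: bus number
    device = (rid >> 3) & 0x1F   # next 5 bits: device number
    function = rid & 0x7         # lowest 3 bits: function number
    return f"{bus:02x}:{device:02x}.{function}"


if __name__ == "__main__":
    # id=0010 is the value from the AER message in this post
    print(decode_requester_id("0010"))
```

For id=0010 this gives 00:02.0, which matches the pcieport device 0000:00:02.0 named in the log. So the ID identifies the root port that reported the error, not the network cards at 04:00.0 and 05:00.0.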

       

      Log:

      May 28 06:15:04  [4196877.274044] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010

      May 28 06:15:04  [4196877.274829] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)

      May 28 06:15:04  [4196877.276009] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00004000/00000000

      May 28 06:15:04  [4196877.276607] pcieport 0000:00:02.0:    [14] Completion Timeout     (First)

      May 28 06:15:04  [4196877.277172] pcieport 0000:00:02.0: broadcast error_detected message

      May 28 06:15:04  [4196877.277719] mlx4_core 0000:04:00.0: mlx4_pci_err_detected was called

      May 28 06:15:04  [4196877.278251] mlx4_core 0000:04:00.0: device is going to be reset

      May 28 06:15:04  [4196877.278763] mlx4_core 0000:04:00.0: crdump: Dump was already collected, skipping

      May 28 06:15:05  [4196878.280748] mlx4_core 0000:04:00.0: device was reset successfully

      May 28 06:15:05  [4196878.281297] mlx4_en 0000:04:00.0: Internal error detected, restarting device

      May 28 06:15:05  [4196878.281301] mlx4_core 0000:04:00.0: Could not post command 0x49: ret=-5, in_param=0x0, in_mod=0x2, op_mod=0x0

      May 28 06:15:05  [4196878.281310] mlx4_core 0000:04:00.0: Could not post command 0x43: ret=-5, in_param=0x0, in_mod=0x2, op_mod=0x0

      May 28 06:15:05  [4196878.282838] <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error was started

      May 28 06:15:05  [4196878.283377] <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error ended

      May 28 06:15:05  [4196878.284084] mlx4_en: eth2: Close port called

      May 28 06:15:05  [4196878.300391] mlx4_core 0000:04:00.0: Fail to set mac in port 1 during unregister

      May 28 06:15:06  [4196878.342788] bond2: Releasing active interface eth2

      May 28 06:15:06  [4196878.347538] bond2: the permanent HWaddr of eth2 - ec:0d:9a:17:64:00 - is still in use by bond2 - set the HWaddr of eth2 to a different address to avoid conflicts

      May 28 06:15:06  [4196878.348620] bond2: first active interface up!

      May 28 06:15:06  [4196878.368454] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister

      May 28 06:15:06  [4196878.369261] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister

      May 28 06:15:06  [4196878.369824] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister

      May 28 06:15:06  [4196878.370381] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister

      May 28 06:15:06  [4196878.370937] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister

      May 28 06:15:06  [4196878.371501] device eth2 left promiscuous mode

      May 28 06:15:06  [4196878.372066] mlx4_en: eth2: Failed to pass user MAC(ec:0d:9a:17:64:00) to Firmware for port 1, with error -5

      May 28 06:15:06  [4196878.456492] mlx4_en 0000:04:00.0: removed PHC

      May 28 06:15:06  [4196878.457538] mlx4_en: eth3: Close port called

      May 28 06:15:06  [4196878.472373] mlx4_core 0000:04:00.0: Fail to set mac in port 2 during unregister

      May 28 06:15:06  [4196878.509546] bond1: Releasing active interface eth3

      May 28 06:15:06  [4196878.514276] bond1: the permanent HWaddr of eth3 - ec:0d:9a:17:64:01 - is still in use by bond1 - set the HWaddr of eth3 to a different address to avoid conflicts

      May 28 06:15:06  [4196878.515467] bond1: first active interface up!

      May 28 06:15:06  [4196878.532430] mlx4_core 0000:04:00.0: Fail to set vlan in port 2 during unregister

      May 28 06:15:06  [4196878.533167] mlx4_core 0000:04:00.0: Fail to set vlan in port 2 during unregister

      May 28 06:15:06  [4196878.533770] mlx4_core 0000:04:00.0: Fail to set vlan in port 2 during unregister

      May 28 06:15:06  [4196878.534345] mlx4_core 0000:04:00.0: Fail to set vlan in port 2 during unregister

      May 28 06:15:06  [4196878.534907] mlx4_en: eth3: Failed to pass user MAC(ec:0d:9a:17:64:01) to Firmware for port 2, with error -5

      May 28 06:15:07  [4196879.660429] mlx4_core 0000:05:00.0: mlx4_pci_err_detected was called

      May 28 06:15:07  [4196879.661211] mlx4_core 0000:05:00.0: device is going to be reset

      May 28 06:15:07  [4196879.661837] mlx4_core 0000:05:00.0: crdump: Dump was already collected, skipping

      May 28 06:15:08  [4196880.665907] mlx4_core 0000:05:00.0: device was reset successfully

      May 28 06:15:08  [4196880.666581] mlx4_en 0000:05:00.0: Internal error detected, restarting device

      May 28 06:15:08  [4196880.666584] mlx4_core 0000:05:00.0: Could not post command 0x49: ret=-5, in_param=0x0, in_mod=0x2, op_mod=0x0

      May 28 06:15:08  [4196880.667885] <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error was started

      May 28 06:15:08  [4196880.668564] <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error ended

      May 28 06:15:08  [4196880.669483] mlx4_en: eth4: Close port called

      May 28 06:15:08  [4196880.684153] mlx4_core 0000:05:00.0: Fail to set mac in port 1 during unregister

      May 28 06:15:08  [4196880.724077] bond2: Removing an active aggregator

      May 28 06:15:08  [4196880.728869] bond2: Releasing active interface eth4

      May 28 06:15:08  [4196880.748641] mlx4_core 0000:05:00.0: Fail to set vlan in port 1 during unregister

      May 28 06:15:08  [4196880.749286] mlx4_core 0000:05:00.0: Fail to set vlan in port 1 during unregister

      May 28 06:15:08  [4196880.749912] mlx4_core 0000:05:00.0: Fail to set vlan in port 1 during unregister

      May 28 06:15:08  [4196880.750529] mlx4_core 0000:05:00.0: Fail to set vlan in port 1 during unregister

      May 28 06:15:08  [4196880.751139] mlx4_core 0000:05:00.0: Fail to set vlan in port 1 during unregister

      May 28 06:15:08  [4196880.751733] device eth4 left promiscuous mode

      May 28 06:15:08  [4196880.752338] mlx4_en: eth4: Failed to pass user MAC(ec:0d:9a:17:63:e0) to Firmware for port 1, with error -5

      May 28 06:15:08  [4196881.108237] mlx4_en 0000:05:00.0: removed PHC

      May 28 06:15:08  [4196881.109386] mlx4_en: eth5: Close port called

      May 28 06:15:08  [4196881.124131] mlx4_core 0000:05:00.0: Fail to set mac in port 2 during unregister

      May 28 06:15:08  [4196881.167114] bond1: Removing an active aggregator

      May 28 06:15:08  [4196881.171891] bond1: Releasing active interface eth5

      May 28 06:15:08  [4196881.184362] mlx4_core 0000:05:00.0: Fail to set vlan in port 2 during unregister

      May 28 06:15:08  [4196881.184982] mlx4_core 0000:05:00.0: Fail to set vlan in port 2 during unregister

      May 28 06:15:08  [4196881.185573] mlx4_core 0000:05:00.0: Fail to set vlan in port 2 during unregister

      May 28 06:15:08  [4196881.186138] mlx4_core 0000:05:00.0: Fail to set vlan in port 2 during unregister

      May 28 06:15:08  [4196881.186696] mlx4_en: eth5: Failed to pass user MAC(ec:0d:9a:17:63:e1) to Firmware for port 2, with error -5

      May 28 06:15:10  [4196882.368172] pcieport 0000:00:02.0: broadcast slot_reset message

      May 28 06:15:10  [4196882.369063] mlx4_core 0000:04:00.0: mlx4_pci_slot_reset was called

      May 28 06:15:10  [4196882.371798] mlx4_core 0000:05:00.0: mlx4_pci_slot_reset was called

      May 28 06:15:10  [4196882.377890] pcieport 0000:00:02.0: broadcast resume message

      May 28 06:15:10  [4196882.378505] mlx4_core 0000:04:00.0: mlx4_pci_resume was called

      May 28 06:15:15  [4196887.953110] mlx4_core: device is working in RoCE mode: Roce V1

      May 28 06:15:15  [4196887.953742] mlx4_core: UD QP Gid type is: V1

      May 28 06:15:17  [4196889.766596] mlx4_core 0000:04:00.0: DMFS high rate steer mode is: performance optimized for limited rule configuration (static)

      May 28 06:15:17  [4196889.768097] mlx4_core 0000:04:00.0: PCIe BW is different than device's capability

      May 28 06:15:27  [4196900.129419] mlx4_core 0000:05:00.0: PCIe BW is different than device's capability

      May 28 06:15:27  [4196900.129961] mlx4_core 0000:05:00.0: PCIe link speed is 5.0GT/s, device supports 8.0GT/s

      May 28 06:15:27  [4196900.130522] mlx4_core 0000:05:00.0: PCIe link width is x8, device supports x8

      May 28 06:15:28  [4196900.891587] mlx4_en 0000:05:00.0: Activating port:1

      May 28 06:15:28  [4196900.911848] mlx4_en: 0000:05:00.0: Port 1: Using 32 TX rings

      May 28 06:15:28  [4196900.912628] mlx4_en: 0000:05:00.0: Port 1: Using 16 RX rings

      May 28 06:15:28  [4196900.913632] mlx4_en: 0000:05:00.0: Port 1: Initializing port

      May 28 06:15:28  [4196900.916585] mlx4_en 0000:05:00.0: registered PHC clock

      May 28 06:15:28  [4196900.917588] mlx4_en 0000:05:00.0: Activating port:2

      May 28 06:15:28  [4196900.921897] mlx4_en: 0000:05:00.0: Port 2: Using 32 TX rings

      May 28 06:15:28  [4196900.922442] mlx4_en: 0000:05:00.0: Port 2: Using 16 RX rings

      May 28 06:15:28  [4196900.924792] mlx4_en: 0000:05:00.0: Port 2: Initializing port

      May 28 06:15:28  [4196900.943370] <mlx4_ib> mlx4_ib_add: counter index 2 for port 1 allocated 1

      May 28 06:15:28  [4196900.943892] <mlx4_ib> mlx4_ib_add: counter index 3 for port 2 allocated 1

      May 28 06:15:28  [4196900.963892] pcieport 0000:00:02.0: AER: Device recovery successful

      May 28 06:15:28  [4196900.982030] mlx4_en: eth4: Link Up

      May 28 06:15:28  [4196900.982576] mlx4_en: eth5: Link Up
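The "error status/mask=00004000/00000000" line in the log carries the AER Uncorrectable Error Status register, where each set bit names one error type. The sketch below decodes that value, assuming the bit positions from the PCI Express AER capability as I understand them. The table is limited to the commonly listed bits, and the names `UNCORRECTABLE_BITS` and `decode_status` are hypothetical.

```python
import re

# Assumed bit positions of the AER Uncorrectable Error Status register.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_status(line: str) -> list:
    """Return (bit, name) pairs for every bit set in the status field of a log line."""
    match = STATUS_RE.search(line)
    if match is None:
        return []
    status = int(match.group(1), 16)
    return [
        (bit, UNCORRECTABLE_BITS.get(bit, "not in this table"))
        for bit in range(32)
        if status & (1 << bit)
    ]


if __name__ == "__main__":
    sample = "pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00004000/00000000"
    for bit, name in decode_status(sample):
        print(f"[{bit}] {name}")
```

For the value in this log, 0x00004000, only bit 14 is set, which agrees with the "[14] Completion Timeout" line the kernel printed right after it.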

       

      Is it correct behavior for the network cards to be reset after this error?

      Has anybody experienced a similar issue?

       

      Please share any suggestions about how to fix this.