4 Replies Latest reply on Feb 25, 2016 1:43 PM by pregier@penguincomputing.com

    PCI-E Bus Errors with ConnectX-3 and Asus X99-E WS

    holografika

      Hi,

       

      I am experiencing several problems when using a ConnectX-3 40GbE adapter (MCX313A-BCBT) in an Asus X99-E WS motherboard.

       

      First, the card makes system startup unstable. In roughly 2 out of 10 attempts, the system halts before POST and shows error code 94 on the mainboard's 7-segment display (meaning PCI Enumeration Error).

      When the system does boot, it emits PCI bus errors during driver initialization. The setup is a fresh install of Fedora 21 x86_64 (a supported OS) with the latest Linux driver (mlnx-en-3.0-1.0.1.tgz), the latest firmware, and a single NVidia GPU installed besides the HCA. Sometimes the card is disabled completely; sometimes it starts working after a 1 to 1.5 minute wait during boot. When such errors occur, they look like:

       

      [   10.743067] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010

      [   10.743077] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)

      [   10.743142] pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00004000/00000000

      [   10.743187] pcieport 0000:00:02.0:    [14] Completion Timeout     (First)

      [   10.743225] pcieport 0000:00:02.0: broadcast error_detected message

      [   16.852525] mlx4_core 0000:0a:00.0: command 0xff6 timed out (go bit not cleared)

      [   16.852527] mlx4_core 0000:0a:00.0: RUN_FW command failed, aborting

      [   16.855670] mlx4_core 0000:0a:00.0: mlx4_cmd_post:cmd_pending failed

      [   16.855702] mlx4_core 0000:0a:00.0: Failed to start FW, aborting

      [   17.858368] mlx4_core: probe of 0000:0a:00.0 failed with error -110

      [   17.858638] pcieport 0000:00:02.0: AER: Device recovery failed

      [   17.858643] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010

      [   17.858652] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)

      [   17.858735] pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00004000/00000000

      [   17.858787] pcieport 0000:00:02.0:    [14] Completion Timeout     (First)

      [   17.858832] pcieport 0000:00:02.0: broadcast error_detected message

      [   17.858836] pcieport 0000:00:02.0: AER: Device recovery failed

      ...

      [   61.820905] mlx4_core: device is working in RoCE mode: Roce V1

      [   61.820907] mlx4_core: gid_type 1 for UD QPs is not supported by the devicegid_type 0 was chosen instead

      [   61.820908] mlx4_core: UD QP Gid type is: V1

      [  101.351233] mlx4_core 0000:0a:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s

      [  101.351235] mlx4_core 0000:0a:00.0: PCIe link width is x8, device supports x8

      [  101.354441] mlx4_core 0000:0a:00.0: irq 62 for MSI/MSI-X

      [  101.354445] mlx4_core 0000:0a:00.0: irq 63 for MSI/MSI-X

      [  101.354448] mlx4_core 0000:0a:00.0: irq 64 for MSI/MSI-X

      [  101.354451] mlx4_core 0000:0a:00.0: irq 65 for MSI/MSI-X

      [  101.354453] mlx4_core 0000:0a:00.0: irq 66 for MSI/MSI-X

      [  101.354456] mlx4_core 0000:0a:00.0: irq 67 for MSI/MSI-X

      [  101.354459] mlx4_core 0000:0a:00.0: irq 68 for MSI/MSI-X

      [  101.354462] mlx4_core 0000:0a:00.0: irq 69 for MSI/MSI-X

      [  101.354464] mlx4_core 0000:0a:00.0: irq 70 for MSI/MSI-X

      [  101.354466] mlx4_core 0000:0a:00.0: irq 71 for MSI/MSI-X

      [  101.354469] mlx4_core 0000:0a:00.0: irq 72 for MSI/MSI-X

      [  101.354471] mlx4_core 0000:0a:00.0: irq 73 for MSI/MSI-X

      [  101.354474] mlx4_core 0000:0a:00.0: irq 74 for MSI/MSI-X

      [  102.097189] mlx4_core 0000:0a:00.0: mlx4_pci_err_detected was called

      [  102.097198] mlx4_core 0000:0a:00.0: device is going to be reset

      [  102.125455] mlx4_en: Mellanox ConnectX HCA Ethernet driver v3.0-1.0.1 (Feb 2014)

      [  103.138702] mlx4_core 0000:0a:00.0: device was reset successfully

      [  103.138717] mlx4_core 0000:0a:00.0: Could not post command 0xd: ret=-5, in_param=0x65ae56000, in_mod=0x100, op_mod=0x0

      [  103.138721] mlx4_core 0000:0a:00.0: SW2HW_MPT failed (-5)

      [  103.138724] mlx4_en 0000:0a:00.0: Failed enabling memory region

      [  104.151519] pcieport 0000:00:02.0: AER: Device recovery failed

      [  104.151526] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010

      [  104.151536] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)

      [  104.151540] pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00004000/00000000

      [  104.151543] pcieport 0000:00:02.0:    [14] Completion Timeout     (First)

      [  104.151548] pcieport 0000:00:02.0: broadcast error_detected message

      [  104.151553] mlx4_core 0000:0a:00.0: mlx4_pci_err_detected was called

      [  104.151556] ------------[ cut here ]------------

      [  104.151565] WARNING: CPU: 0 PID: 165 at drivers/pci/pci.c:1535 pci_disable_device+0x99/0xb0()

      [  104.151567] mlx4_core 0000:0a:00.0: disabling already-disabled device

      [  104.151569] Modules linked in:

      [  104.151571]  mlx5_core(OE) mlx4_ib(OE) mlx4_en(OE) vxlan udp_tunnel nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT xt_conntrack cfg80211 ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw snd_hda_codec_hdmi vfat x86_pkg_temp_thermal fat coretemp kvm crct10dif_pclmul crc32_pclmul snd_hda_intel crc32c_intel eeepc_wmi asus_wmi snd_hda_controller sparse_keymap rfkill snd_hda_codec iTCO_wdt iTCO_vendor_support ghash_clmulni_intel snd_hwdep snd_seq snd_seq_device snd_pcm sb_edac snd_timer serio_raw edac_core snd soundcore

      [  104.151619]  mlx4_core(OE) mlx_compat(OE) mei_me i2c_i801 lpc_ich mei mfd_core shpchp tpm_infineon tpm_tis tpm nouveau video mxm_wmi igb drm_kms_helper ttm e1000e drm dca ata_generic ptp i2c_algo_bit pata_acpi pps_core wmi [last unloaded: mlx4_core]

      [  104.151642] CPU: 0 PID: 165 Comm: kworker/0:2 Tainted: G           OE  3.17.4-301.fc21.x86_64 #1

      [  104.151644] Hardware name: ASUS All Series/X99-E WS, BIOS 1102 04/28/2015

      [  104.151650] Workqueue: events aer_isr

      [  104.151653]  0000000000000000 0000000017f53b38 ffff880659a8bbe8 ffffffff8173f929

      [  104.151657]  ffff880659a8bc30 ffff880659a8bc20 ffffffff810970ad ffff88065ccbc000

      [  104.151661]  ffff88065cc60510 0000000000000001 ffff880658ecfb10 ffff88065cc85800

      [  104.151665] Call Trace:

      [  104.151671]  [<ffffffff8173f929>] dump_stack+0x45/0x56

      [  104.151678]  [<ffffffff810970ad>] warn_slowpath_common+0x7d/0xa0

      [  104.151683]  [<ffffffff8109712c>] warn_slowpath_fmt+0x5c/0x80

      [  104.151696]  [<ffffffffa0354938>] ? mlx4_enter_error_state.part.7+0x188/0x350 [mlx4_core]

      [  104.151704]  [<ffffffff813c3d09>] pci_disable_device+0x99/0xb0

      [  104.151720]  [<ffffffffa036b117>] mlx4_pci_err_detected+0x77/0xa0 [mlx4_core]

      [  104.151725]  [<ffffffff813d71e0>] report_error_detected+0x50/0x100

      [  104.151730]  [<ffffffff813d7190>] ? find_source_device+0x80/0x80

      [  104.151734]  [<ffffffff813bc7a9>] pci_walk_bus+0x79/0xa0

      [  104.151738]  [<ffffffff813d7190>] ? find_source_device+0x80/0x80

      [  104.151742]  [<ffffffff813d6a4c>] broadcast_error_message+0xdc/0x100

      [  104.151746]  [<ffffffff813d6ab3>] do_recovery+0x43/0x280

      [  104.151750]  [<ffffffff813d67a9>] ? get_device_error_info+0xd9/0x1b0

      [  104.151754]  [<ffffffff813d769a>] aer_isr+0x36a/0x450

      [  104.151761]  [<ffffffff810af88d>] process_one_work+0x14d/0x400

      [  104.151765]  [<ffffffff810b021b>] worker_thread+0x6b/0x4a0

      [  104.151770]  [<ffffffff810b01b0>] ? rescuer_thread+0x2a0/0x2a0

      [  104.151773]  [<ffffffff810b52fa>] kthread+0xea/0x100

      [  104.151777]  [<ffffffff810b5210>] ? kthread_create_on_node+0x1a0/0x1a0

      [  104.151783]  [<ffffffff81746a3c>] ret_from_fork+0x7c/0xb0

      [  104.151787]  [<ffffffff810b5210>] ? kthread_create_on_node+0x1a0/0x1a0

      [  104.151789] ---[ end trace 858d8c660747219b ]---

      [  104.151793] pcieport 0000:00:02.0: AER: Device recovery failed

      [  104.151796] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010

      [  104.151803] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)

      [  104.151807] pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00004000/00000000

      [  104.151810] pcieport 0000:00:02.0:    [14] Completion Timeout     (First)

      ...
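For reference, the two numeric fields in the pcieport lines can be decoded by hand. The following is a minimal sketch, assuming the standard bit layout of the PCIe AER Uncorrectable Error Status register and the usual bus/device/function packing of a 16-bit requester ID; the function names `decode_aer_status` and `decode_requester_id` are hypothetical helpers, not part of any driver or tool.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer_status(status, mask=0):
    """Return the names of the error bits set in status and not masked."""
    active = status & ~mask
    return [
        name
        for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
        if active & (1 << bit)
    ]


def decode_requester_id(req_id):
    """Split a 16-bit requester ID into (bus, device, function)."""
    return (req_id >> 8) & 0xFF, (req_id >> 3) & 0x1F, req_id & 0x7


# Values taken from the log: status/mask=00004000/00000000, id=0010
print(decode_aer_status(0x00004000, 0x00000000))
bus, dev, fn = decode_requester_id(0x0010)
print(f"{bus:02x}:{dev:02x}.{fn}")
```

Under these assumptions, the status word has only bit 14 (Completion Timeout) set, and id=0010 decodes to 00:02.0, the root port itself. That would suggest the root port issued a request downstream that was never completed, rather than the adapter reporting an error of its own.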

       

      When the same Mellanox card is used in a different mainboard (for example, a Gigabyte GA-Z97X-UD3H), it boots and initializes flawlessly with the exact same OS.

      We have a cluster built from the X99-E WS boards, and all of them show the same issue intermittently, so this is not a defect of a single mainboard; it looks like an incompatibility between the board and the adapter.
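To compare nodes, the AER events and mlx4 probe failures in each node's kernel log can be tallied. This is a minimal sketch, assuming the log line format shown in the excerpt above; `summarize_dmesg` and the two regular expressions are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the log lines in this post (an assumption about format).
AER_RE = re.compile(
    r"pcieport (\S+): AER: (.+?) error received: id=([0-9a-f]{4})"
)
PROBE_FAIL_RE = re.compile(
    r"mlx4_core: probe of (\S+) failed with error (-?\d+)"
)


def summarize_dmesg(text):
    """Count AER events per (port, severity) and list mlx4 probe failures."""
    aer = Counter()
    probe_failures = []
    for line in text.splitlines():
        m = AER_RE.search(line)
        if m:
            aer[(m.group(1), m.group(2))] += 1
            continue
        m = PROBE_FAIL_RE.search(line)
        if m:
            probe_failures.append((m.group(1), int(m.group(2))))
    return aer, probe_failures


sample = """\
[   10.743067] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[   17.858368] mlx4_core: probe of 0000:0a:00.0 failed with error -110
[   17.858643] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
"""

aer, failures = summarize_dmesg(sample)
print(aer)
print(failures)
```

Running this over the saved `dmesg` output of every node gives a per-boot count that can be compared across the cluster.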

       

      Has anybody experienced a similar issue?

      Please share any suggestions about how to stabilize this.

       

      Thanks,

      Peter