2 Replies Latest reply on Dec 7, 2018 2:16 AM by jonas

    How to debug "hangs" related to InfiniBand adapter

    jonas

      Hi,

       

      On some of our machines we are seeing hangs (sporadic, but still "too often") that seem to be related to the InfiniBand adapters.

      Meaning: if I issue one of the following commands, it never returns to the shell and does not react to CTRL+C:

      • ibstat  (this seems particularly bad)
      • ping $OTHER_HOSTS_IP_ASSIGNED_TO_INFINIBAND_DEVICE
      • ls /nfs/mount/on/server/connected/via/infiniband
      • sudo ifdown ib0 && sudo ifup ib0

      Of course, after a reboot everything is fine again :-/
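
A command that ignores CTRL+C is often blocked inside the kernel in uninterruptible sleep (state D in ps), where signals are not acted on until the kernel call returns. The following is a minimal sketch, not a definitive procedure, for listing such processes from a second terminal while the first one hangs. The function names filter_d_state and list_d_state are made up for illustration, and the wchan:32 column width assumes the procps version of ps shipped with most Linux distributions.

```shell
#!/bin/sh
# Keep the header line plus every process whose state starts with "D"
# (uninterruptible sleep). Reads ps-style output on stdin.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Show PID, state, kernel wait channel and command name for the live system.
list_d_state() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}

# Usage (from a second shell while the first one hangs):
#   list_d_state
```

If ibstat or ls shows up in that list, reading /proc/<pid>/stack as root (on kernels that expose it) may show which kernel function the process is waiting in, which is worth including when reporting the problem.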

       

      I am very new to this, so my troubleshooting skills are weak. I have listed some basic information below and would be grateful for further guidance on how to debug this issue.

       

      Thanks,

      Jonas

       

       

       

       

      $ grep ib0 /var/log/syslog   # this is around the time when the problem happened

      Dec  5 10:13:29 heinzel60 kernel: [70988.535625] ib0: ipoib_cm_tx_destroy_rss: 7 not completed for QP: 0x257 force cleanup.

      Dec  5 11:55:50 heinzel60 kernel: [77129.295892] ib0: timing out; 7 sends not completed

      Dec  5 11:55:55 heinzel60 kernel: [77134.300207] ib0: timing out; 7 sends not completed

      Dec  5 11:56:00 heinzel60 kernel: [77139.304520] ib0: timing out; 7 sends not completed

      Dec  5 11:56:05 heinzel60 kernel: [77144.308840] ib0: timing out; 7 sends not completed

      Dec  5 11:56:10 heinzel60 kernel: [77149.313159] ib0: timing out; 7 sends not completed

      Dec  5 11:56:10 heinzel60 kernel: [77149.313795] ib0: ipoib_cm_tx_destroy_rss: 7 not completed for QP: 0x265 force cleanup.

      Dec  5 12:08:40 heinzel60 kernel: [77899.392921] ib0: timing out; 7 sends not completed

      Dec  5 12:08:45 heinzel60 kernel: [77904.397237] ib0: timing out; 7 sends not completed

      Dec  5 12:08:50 heinzel60 kernel: [77909.401558] ib0: timing out; 7 sends not completed

      Dec  5 12:08:55 heinzel60 kernel: [77914.405874] ib0: timing out; 7 sends not completed

      Dec  5 12:09:00 heinzel60 kernel: [77919.410197] ib0: timing out; 7 sends not completed

      [...]
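
To get a quick overview of how often this happens, the two message types in the excerpt above can be counted. This is only a sketch: the function name summarize_ipoib is hypothetical, and the patterns are taken directly from the log lines shown here, so other kernel or driver versions may word the messages differently.

```shell
#!/bin/sh
# Count IPoIB send timeouts and forced QP cleanups in syslog lines read
# from stdin. Patterns match the kernel messages quoted in this post.
summarize_ipoib() {
    awk '
        /timing out; [0-9]+ sends not completed/ { timeouts++ }
        /ipoib_cm_tx_destroy_rss: .* force cleanup/ { cleanups++ }
        END { printf "timeouts=%d cleanups=%d\n", timeouts, cleanups }
    '
}

# Usage:
#   grep ib0 /var/log/syslog | summarize_ipoib
```

Running it over each affected host's log makes it easier to compare machines and to see whether the timeouts cluster around particular times.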

       

      $ lspci | grep Mellanox

      01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

       

      $ ibstat  # as said above, this worked only after reboot

      CA 'mlx4_0'
          CA type: MT4099
          Number of ports: 1
          Firmware version: 2.42.5000
          Hardware version: 1
          Node GUID: 0xec0d9a0300062a80
          System image GUID: 0xec0d9a0300062a83
          Port 1:
              State: Active
              Physical state: LinkUp
              Rate: 56
              Base lid: 52
              LMC: 0
              SM lid: 25
              Capability mask: 0x02514868
              Port GUID: 0xec0d9a0300062a81
              Link layer: InfiniBand

       

      $ lsmod  | egrep 'ib|mlx'

      ib_ucm                 20480  0

      ib_uverbs             106496  2 ib_ucm,rdma_ucm

      mlx5_fpga_tools        16384  0

      mlx5_ib               266240  0

      mlx5_core             782336  2 mlx5_ib,mlx5_fpga_tools

      mlxfw                  20480  1 mlx5_core

      ib_iser                49152  0

      rdma_cm                61440  2 ib_iser,rdma_ucm

      libiscsi_tcp           24576  1 iscsi_tcp

      libiscsi               53248  3 libiscsi_tcp,iscsi_tcp,ib_iser

      scsi_transport_iscsi    98304  4 iscsi_tcp,ib_iser,libiscsi

      ib_ipoib              163840  0

      ib_cm                  53248  3 rdma_cm,ib_ucm,ib_ipoib

      ib_umad                24576  0

      mlx4_ib               208896  0

      ib_core               282624  11 rdma_cm,ib_cm,iw_cm,mlx4_ib,mlx5_ib,ib_ucm,ib_iser,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib

      libcrc32c              16384  1 raid456

      mlx4_en               135168  0

      vxlan                  49152  2 mlx4_en,mlx5_core

      ptp                    20480  3 igb,mlx4_en,mlx5_core

      libahci                32768  1 ahci

      mlx4_core             348160  2 mlx4_en,mlx4_ib

      mlx_compat             24576  16 rdma_cm,ib_cm,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_ucm,ib_core,ib_iser,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
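
The module list shows both mlx4_* and mlx5_* drivers loaded together with mlx_compat, while ibstat reports the adapter as mlx4_0. Since the exact kernel and driver versions usually matter when asking for help with this kind of problem, here is a rough sketch for collecting them. The helper names run_if_present and collect_ib_versions are hypothetical, and ofed_info is assumed to exist only where a Mellanox OFED package is installed; the script skips any tool that is missing.

```shell
#!/bin/sh
# Run a command only if it is installed; otherwise say so and move on.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "$1: not found, skipped"
    fi
}

# Print kernel, mlx4_core driver and (if available) OFED version strings.
collect_ib_versions() {
    echo "kernel: $(uname -r)"
    echo "mlx4_core: $(run_if_present modinfo -F version mlx4_core 2>&1)"
    echo "ofed: $(run_if_present ofed_info -s 2>&1)"
}

# Usage:
#   collect_ib_versions
```

Together with the firmware version from ibstat, this output gives a compact description of the software stack on each affected machine.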