Within the last 8 months or so, I upgraded our cluster with ConnectX-3 MCX354A-FCBT cards and a 36-port SwitchX MSX6025T-1SFS unmanaged switch, all brand new in box. Two of the nodes have been crashing/rebooting, and I'm wondering whether it's related to the InfiniBand. Currently the cluster only uses the fabric for NFSoRDMA.
There are 9 nodes in total, connected at 40Gb FDR10. OpenSM is running on two of the nodes. The nodes are running CentOS 6.9 with the 2.6.32-696.1.1.el6.x86_64 kernel and the CentOS nfs-rdma package/kernel modules. The HCAs all have recent firmware, 2.40.7000.
One compute node has been returned to the vendor twice, and they are now replacing all the hardware except the HCA, since I purchased/installed that separately. (We tried swapping HCAs with the other nodes, and the same node still crashed/rebooted.)
Now our NFS server node has crashed twice in 24 hrs. I have disconnected it from the InfiniBand until this can be resolved, and we are using the 1Gb port instead. Both machines are less than 6 months old, and the NFS server has been connected to the IB fabric for less than a month.
The NFS server uses an Asus Z10PE-D16 board with a single Intel Xeon E5-2620v4 CPU.
I attached the latest boot.log from the NFS server.
I'm wondering if heavy load on the NFS server can cause this? At the time of the last crash the load avg of the server was about 70% with 8 nfsd processes busy. But we have run similar jobs in the past month that didn't cause a reboot.
Could it be lack of RAM? The NFS server has 64GB but I would think the OS would manage it accordingly...
I just noticed the cables the vendor sold us are MC2210130 40Gb Ethernet cables. Could that be it? Should they be IB FDR10 cables, MC2206130 for example?
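In case it helps with the cable question, here is a minimal sketch of how the negotiated link rate and a few port error counters could be collected on each node. It assumes a Python interpreter is available and that the HCA ports are exposed under the kernel's usual `/sys/class/infiniband` tree; the function names and the choice of counters are my own for illustration, and I have not yet compared the output across our nodes.

```python
import os

# Counters that tend to rise with a marginal cable or port.
# The selection is an assumption; more files live under ports/<n>/counters/.
ERROR_COUNTERS = ("symbol_error", "link_error_recovery",
                  "link_downed", "port_rcv_errors")


def read_text(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except (IOError, OSError):
        return None


def port_report(root="/sys/class/infiniband"):
    """Collect state, rate and error counters for every HCA port under root."""
    report = []
    if not os.path.isdir(root):
        return report
    for device in sorted(os.listdir(root)):
        ports_dir = os.path.join(root, device, "ports")
        if not os.path.isdir(ports_dir):
            continue
        for port in sorted(os.listdir(ports_dir)):
            base = os.path.join(ports_dir, port)
            entry = {
                "device": device,
                "port": port,
                "state": read_text(os.path.join(base, "state")),
                "rate": read_text(os.path.join(base, "rate")),
                "errors": {},
            }
            for name in ERROR_COUNTERS:
                value = read_text(os.path.join(base, "counters", name))
                # Skip counters that are missing or not plain integers.
                if value is not None and value.isdigit():
                    entry["errors"][name] = int(value)
            report.append(entry)
    return report


if __name__ == "__main__":
    for entry in port_report():
        print("%s port %s: state=%s rate=%s errors=%s" % (
            entry["device"], entry["port"], entry["state"],
            entry["rate"], entry["errors"]))
```

Running this on every node and comparing the reported rate and counters between the nodes that crash and the ones that don't should show whether the links behave differently.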
Would appreciate your insights so I can figure out why our server keeps crashing. Let me know if you'd like more details about anything.
boot.log.zip 16.5 KB