We face a similar problem. The only way we have found to solve it is to run a pre-prod cluster, install the updates there, and test them (this takes a few hours).
We use HP C7000 blade enclosures with HP BL460 and BL465 G7 servers using ConnectX-2 mezzanine cards. The firmware version is HP 2.7 (the latest from Mellanox is 2.9, but we can't get that to work on the HP mezzanine cards... yet).
The pre-prod cluster is 2 nodes of older BL460 G6 servers with minimal RAM and 1 CPU. It is also on a separate IB fabric. On top of that there are 2 more nodes with ESX and CentOS initiators (SRP).
Our target servers are on CentOS 6.3 at the moment (kernel 2.6.32), using OFED 1.5.3 with SCST 2.2. It's fairly stable. However, please note that our kernel is custom due to SCST requirements, which means it's not easy to upgrade: we would have to rebuild another kernel for each individual machine.
We are considering trying the OFED 2 driver with SCST 3.0 on Ubuntu 12.10.
2 good resources are HOWTO: Infiniband SRP Target on CentOS 6 incl RPM SPEC | Andy's Tech Blog
Each shows what steps are taken to rebuild the kernel. We are adopting the approach from the 1st blog, from Andy, as he rebuilds the RPMs. That way we can easily test them in pre-prod and package up the RPMs to be installed on multiple nodes quickly.
Hope this helps.
Thanks! Your links are helpful. The idea of using a non-cluster system to build the OFED RPMs was especially useful. The comments appended to the first link have some interesting opinions regarding OFED and maintenance. Like one of the commenters, I too would be thrilled if the Linux kernel shipped with IB drivers for our hardware.
Starting with a RHEL 6.2 x86_64 VM, I've run 'yum update kernel' to get the latest kernel. After rebooting, I ran mlnx_add_kernel_support.sh with no reported errors. This built an ISO including kernel-ib, kernel-mft, and knem RPMs for the new kernel. This gives me some confidence that OFED will build OK on my cluster if/when I update the kernel over there.
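For what it's worth, a quick sanity check after a rebuild like this is to confirm that each generated RPM really targets the kernel you intend to boot. Below is a minimal Python sketch of that idea. It assumes the kernel release string is embedded in the RPM file name with '-' replaced by '_' (RPM version and release fields cannot contain hyphens); the function names and the file names in the usage note are hypothetical, not taken from actual OFED output.

```python
import platform


def rpm_matches_kernel(rpm_filename, kernel_release):
    """Return True if the kernel release appears in the RPM file name.

    Assumption: the packager embeds the kernel release in the file name,
    usually with '-' rewritten as '_'. Both spellings are checked.
    """
    normalized = kernel_release.replace("-", "_")
    return kernel_release in rpm_filename or normalized in rpm_filename


def unmatched_rpms(rpm_filenames, kernel_release=None):
    """List the RPM file names that do not mention the given kernel release.

    If no release is given, fall back to the running kernel
    (the same string that `uname -r` prints).
    """
    release = kernel_release or platform.release()
    return [name for name in rpm_filenames
            if not rpm_matches_kernel(name, release)]
```

For example, calling unmatched_rpms() with made-up names such as "kernel-ib-1.5.3-2.6.32_279.el6.x86_64.rpm" and "knem-1.0-2.6.32_220.el6.x86_64.rpm" against release "2.6.32-279.el6.x86_64" would flag only the knem package. An empty result suggests every package was built for that kernel, though it says nothing about whether the drivers actually load.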
Testing-wise, I don't have any spare IB NICs lying around so my least worst option will be to adopt a couple of compute nodes for a few hours. Next, I'd either have to plow ahead with upgrading the rest of the nodes or roll back the updated nodes from an HP CMU image. Testing on a production cluster adds to the challenge, I suppose.