Hi all - I have a question about InfiniBand and the subnet manager, using the Mellanox OFED open source packages.
It's the classic "it worked before the upgrade, but doesn't now" problem.
Here's the "before" configuration :
The hardware is an HP C7000 blade enclosure with 16 servers/blades installed. These 16 servers are identical ProLiant BL460c Gen8 blades, each with an InfiniBand card installed (lspci shows "21:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]" for the card). The blade chassis has an integrated InfiniBand switch, an "HP BLc 4X QDR IB Switch". This switch does not run a subnet manager.
Each of the 16 servers was running CentOS 5.9 with the Mellanox OFED version 1.5.3 InfiniBand packages, which include OpenSM version 3.3.13. The first blade (we'll call it "01") was running the opensm daemon. Everything worked.
Then I began upgrading the servers to CentOS 6.9. This change forced me to upgrade the Mellanox OFED software to version 4.0.2, which includes OpenSM version 4.9.0. I did a few servers at a time, but left server "01" for last, since it was running the subnet manager. That brings me to today's state: node "01" is still at CentOS 5.9, still running the "old" subnet manager, and all the other nodes have been updated to CentOS 6.9 and the newer OFED packages. There was no change to any of the hardware in the blade chassis. All is still working well.
Now, before I update the final "01" node, I need to make sure that the new opensm will work. So I stopped the opensm on node "01", and started it on one of the updated nodes - "02". opensm starts, and runs, but does not seem to be working - that is, none of the other nodes' infiniband ports go to "Active" (as viewed by the ibstat command) - they stay in the "Init" state, and there is no connectivity between them.
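In case it helps anyone reproduce the check, the port-state test above can be scripted. This is only a minimal sketch: it assumes the usual ibstat text layout (a "CA '<name>'" header, then "Port N:" blocks of "Key: value" lines), and the function names parse_ibstat, stuck_ports and local_ports are made up for illustration, not part of any OFED tool.

```python
import re
import subprocess


def parse_ibstat(text):
    """Parse ibstat text output into a list of dicts, one per port.

    Assumes the usual layout: a "CA '<name>'" header followed by
    "Port N:" blocks whose lines look like "Key: value".
    """
    ports = []
    ca = None
    port = None
    for line in text.splitlines():
        s = line.strip()
        m = re.match(r"CA '([^']+)'", s)
        if m:
            ca = m.group(1)
            port = None  # CA-level fields follow until the next "Port N:"
            continue
        m = re.match(r"Port (\d+):$", s)
        if m:
            port = {"ca": ca, "port": int(m.group(1))}
            ports.append(port)
            continue
        if port is not None and ":" in s:
            key, _, value = s.partition(":")
            port[key.strip().lower()] = value.strip()
    return ports


def stuck_ports(ports):
    """Return the ports whose logical state is anything other than Active."""
    return [p for p in ports if p.get("state") != "Active"]


def local_ports():
    """Run ibstat on this node and parse its output (needs ibstat installed)."""
    out = subprocess.check_output(["ibstat"], universal_newlines=True)
    return parse_ibstat(out)
```

Running stuck_ports(local_ports()) on each node after switching subnet managers would list any port that has not reached "Active". The parsed dict also keeps the "sm lid" field, which may help show whether a port has registered any subnet manager at all.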
If I stop opensm on node "02" and restart the old one on node "01", all the other nodes' ports go right to the "Active" state and all is well.
I'm not an InfiniBand expert by any means - I use it in our HPC cluster and I know the basics, but it usually just works and I've not had to dive into the depths of it before. I've tried a few things based on some Google searches, but no luck so far. So I need some ideas on what to check or try. Maybe this particular version is a lemon? Maybe there's a config option that needs to be set with the newer versions? Maybe it's a driver issue? (Although I'm not sure how that could be, since InfiniBand works fine on the nodes with the newer OS and OFED installs.)
All suggestions welcome!