Please check that you have the latest firmware version installed (16.22.1002).
ibv_devinfo will give you your fw version.
mst status -v (to see your current device)
mlxconfig -d /dev/mst/<your mst device> set HOST_CHAINING_MODE=1
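If you have several hosts to check, the firmware comparison can be scripted. Below is a minimal sketch that parses text in the usual `hca_id:` / `fw_ver:` layout printed by ibv_devinfo and lists devices older than 16.22.1002. The helper names and the sample text are made up for illustration; the sample was not captured from a real host.

```python
import re

MIN_FW = (16, 22, 1002)  # firmware version named in this thread


def parse_fw_versions(devinfo_text):
    """Return {hca_id: (major, minor, sub)} from ibv_devinfo-style text."""
    versions = {}
    current = None
    for line in devinfo_text.splitlines():
        m = re.match(r"\s*hca_id:\s*(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"\s*fw_ver:\s*(\d+)\.(\d+)\.(\d+)", line)
        if m and current is not None:
            versions[current] = tuple(int(part) for part in m.groups())
    return versions


def outdated(versions, minimum=MIN_FW):
    """Return the sorted device names whose firmware is older than `minimum`."""
    return sorted(dev for dev, ver in versions.items() if ver < minimum)


# Illustrative sample only (hypothetical devices and versions).
SAMPLE = """\
hca_id: mlx5_0
    fw_ver:             16.21.2010
hca_id: mlx5_1
    fw_ver:             16.22.1002
"""

print(outdated(parse_fw_versions(SAMPLE)))
```

On a real host you would feed it the output of ibv_devinfo, for example via `subprocess.run(["ibv_devinfo"], capture_output=True, text=True).stdout`.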
See the release notes of the FW here:
Thanks for your answer! The document you refer to states that
Both ports should be configured to Ethernet when host chaining is enabled.
Is there any way to connect the four nodes without a switch using native InfiniBand?
Only Ethernet is supported.
Question: do you want to use it with Storage Spaces Direct in Windows Server 2016? That is at least my problem.
Cheers Carsten Rachfahl
Microsoft Cloud & Datacenter Management MVP
I'm putting this out there because we ran into so many complications getting host chaining to work, and something Google will pick up is infinitely better than nothing.
We wanted a setup with redundancy. With a switched configuration, we'd have to buy two switches and a lot more cables, which is very expensive.
HOST_CHAINING_MODE was a great fit: switchless, fewer cables, and less expense.
You do NOT need a subnet manager for this to work!
In order to get it working:
Aside: There is no solid documentation on this process as of this writing.
1. What Marc said was accurate, set HOST_CHAINING_MODE=1 via the mlxconfig utility.
Aside: Both the VPI and EN type cards will work with host chaining. The VPI type does require you to put it into Ethernet mode.
2. Restart the servers to set the mode.
3. Put all of the ports on the same subnet, e.g. 172.19.50.0/24. Restart the networking stack as required.
4. From there, all ports should be pingable from all other ports.
5. Set the MTU up to 9000. (see caveats for bug; lower to 8000 if 9k doesn't work)
Aside: The MTU could be higher; I have been unable to test higher due to a bug in the firmware. Around these forums, I've seen 9k floated about, and it seems like a good standard number.
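A quick way to verify step 5 end to end is a ping with fragmentation prohibited, sized so the packet exactly fills the MTU. The sketch below only computes the payload size and builds the command line; it assumes IPv4 without options and the Linux iputils ping (`-M do` sets don't-fragment). The address 172.19.50.2 is a hypothetical host in the example subnet above.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(host, mtu, count=3):
    """Build a Linux ping command that fails if the path cannot carry `mtu`."""
    return ["ping", "-M", "do", "-c", str(count), "-s", str(ping_payload(mtu)), host]


# 172.19.50.2 is a placeholder address, not one from the original setup.
for mtu in (9000, 8000):
    print(mtu, " ".join(ping_command("172.19.50.2", mtu)))
```

If the 9000 variant fails but the 8000 variant succeeds, you are probably hitting the firmware MTU limit described in the caveats.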
If you aren't getting the throughput you're expecting, do ALL of the tuning, both BIOS (Performance Tuning for Mellanox Adapters, BIOS Performance Tuning Example) and software (Understanding PCIe Configuration for Maximum Performance, Linux sysctl Tuning), on all servers. It does make a difference. On our small (under-powered) test boxes, we gained 20 Gbit/s over our starting benchmark.
Also make sure you have enough PCIe bandwidth to support line rate; if you do not, get the Socket Direct cards.
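As a rough sanity check on the PCIe point, you can estimate what a slot can carry from its generation and lane count. This is a back-of-the-envelope sketch using the standard per-lane signalling rates and line encodings; it ignores TLP and protocol overhead, so real throughput is lower. The 100 Gbit/s link speed is only an example value, not something stated in this thread.

```python
# (transfer rate in GT/s per lane, encoding efficiency) per PCIe generation
PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def pcie_gbps(generation, lanes):
    """Approximate one-direction PCIe data rate in Gbit/s, before protocol overhead."""
    rate, efficiency = PCIE_GENERATIONS[generation]
    return rate * efficiency * lanes


def supports_line_rate(generation, lanes, link_gbps):
    """True if the slot's raw data rate is at least the network link speed."""
    return pcie_gbps(generation, lanes) >= link_gbps


# Example link speed of 100 Gbit/s is an assumption for illustration.
for lanes in (8, 16):
    print(f"Gen3 x{lanes}: {pcie_gbps(3, lanes):.1f} Gbit/s,",
          "enough for 100G" if supports_line_rate(3, lanes, 100) else "not enough for 100G")
```

Keep in mind that with host chaining a card may be forwarding traffic on one port while also serving its own host, so leave headroom.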
There are a lot of caveats.
- Full link speed IS achievable, but only between two directly connected nodes. From our tests, there is a small dip in performance on each hop, and each hop also limits your maximum theoretical throughput.
- FW version 16.22.1002 had a few bugs related to host chaining; one of those was that the maximum supported MTU was 8150. A higher MTU means less IP overhead.
- The 'ring' topology is a little funny. It runs in only one direction. If there is a cable cut scenario, it will NOT route around properly for certain hosts.
Aside: A cable cut is different than a cable disconnect. The transceiver itself registers whether there is a cable attached or not. When there is no cable present on one side, but there is on the other, the above scenario is true (not properly routing). When both sides of the cable are removed, the ring outright stops and does not work at all. I don't have any data on an actual cable cut.
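To put a number on the "higher MTU, less overhead" caveat, here is a small sketch that estimates the share of wire time carrying TCP payload at a given MTU. It assumes plain IPv4 and TCP headers without options and standard Ethernet framing with no VLAN tag; these are textbook figures, not measurements from our setup.

```python
ETH_OVERHEAD = 14 + 4 + 8 + 12  # header, FCS, preamble + SFD, inter-frame gap (bytes)
IP_TCP_HEADERS = 20 + 20        # IPv4 + TCP, assuming no options


def tcp_efficiency(mtu):
    """Fraction of on-the-wire bytes that are TCP payload for full-size frames."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


for mtu in (1500, 8000, 8150, 9000):
    print(f"MTU {mtu}: {tcp_efficiency(mtu) * 100:.2f}% payload")
```

The step from 1500 to jumbo frames matters much more than the step from 8000 to 9000, so falling back to 8000 because of the firmware bug costs very little in framing overhead.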
The ring works as described in the (scant) documentation; the firmware release notes describe it as follows:
- Received packets from the wire with DMAC equal to the host MAC are forwarded to the local host
- Received traffic from the physical port with DMAC different than the current MAC are forwarded to the other port:
  - Traffic can be transmitted by the other physical port
  - Traffic can reach functions on the port's Physical Function
- Device allows hosts to transmit traffic only with its permanent MAC
- To prevent loops, the received traffic from the wire with SMAC equal to the port permanent MAC is dropped (the packet cannot start a new loop)
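The rules above can be sketched as a toy model to see why a one-direction ring does not route around a broken link. This is a simplified simulation, not how the firmware is implemented: host names stand in for permanent MACs, and it assumes every host always transmits toward the next host in the list. The host names and the `broken_links` parameter are made up for illustration.

```python
def send(ring, src, dst, broken_links=frozenset()):
    """Walk a frame around a one-direction ring.

    Returns the list of hosts the frame visits (source first), or None if
    the frame is dropped or cannot be delivered.
    """
    n = len(ring)
    i = ring.index(src)
    path = [src]
    while True:
        nxt = (i + 1) % n
        if (ring[i], ring[nxt]) in broken_links:
            return None  # no fallback in the other direction in this model
        i = nxt
        host = ring[i]
        if host == src:
            return None  # SMAC equals own MAC: dropped, cannot start a new loop
        path.append(host)
        if host == dst:
            return path  # DMAC equals host MAC: delivered to the local host
        # otherwise: DMAC differs, forward out of the other port


ring = ["a", "b", "c", "d"]
print(send(ring, "a", "c"))                              # two hops via b
print(send(ring, "a", "c", broken_links={("b", "c")}))   # not delivered
print(send(ring, "c", "a", broken_links={("b", "c")}))   # still delivered via d
```

In this model a single bad link cuts off some host pairs while others keep working, which matches the "will NOT route around properly for certain hosts" behavior we saw.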
If you run into problems, tcpdump is your friend, and ping is a great little tool to check your sanity.
Hope any of this helps anyone in the future,