Please check that you have the latest firmware version (16.22.1002) installed.
ibv_devinfo will show your firmware version.
mst status -v (to see your current device)
mlxconfig -d /dev/mst/<your mst device> set HOST_CHAINING_MODE=1
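For reference, the full sequence looks roughly like this. The mst device name below is an example (a ConnectX-5 typically shows up as mt4119_pciconf0); take the real name from the output of mst status -v:

```shell
# Check the running firmware version
ibv_devinfo | grep fw_ver
# Start the Mellanox Software Tools service and find your device
mst start
mst status -v
# Enable host chaining (device name is an example; use yours)
mlxconfig -d /dev/mst/mt4119_pciconf0 set HOST_CHAINING_MODE=1
# Confirm the setting; it takes effect after a reboot
mlxconfig -d /dev/mst/mt4119_pciconf0 query | grep HOST_CHAINING
```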
See the release notes of the FW here:
Thanks for your answer! The document you refer to states that both ports should be configured to Ethernet when host chaining is enabled.
Is there any way to connect the four nodes without a switch using native InfiniBand?
Only Ethernet is supported.
Question: do you want to use Storage Spaces Direct in Windows Server 2016 with it? That is at least my problem.
Cheers Carsten Rachfahl
Microsoft Cloud & Datacenter Management MVP
Putting this out there since we had so many complications getting host chaining to work, and something Google will pick up is infinitely better than nothing.
The idea was that we wanted something with redundancy. With a switch configuration, we'd have to get two switches and a lot more cables; very expensive.
HOST_CHAINING_MODE was a great idea: switchless, fewer cables, and less expense.
You do NOT need a subnet manager for this to work!
In order to get it working:
Aside: There is no solid documentation on this process as of this writing
1. What Marc said was accurate, set HOST_CHAINING_MODE=1 via the mlxconfig utility.
Aside: Both the VPI and EN type cards will work with host chaining. The VPI type does require you to put it into Ethernet mode.
2. Restart the servers to set the mode.
3. Put all of the ports on the same subnet, e.g. 172.19.50.0/24. Restart the networking stack as required.
4. From there, all ports should be pingable from all other ports.
5. Set the MTU up to 9000. (See caveats for a firmware bug; lower it to 8000 if 9000 doesn't work.)
Aside: The MTU could be higher; I have been unable to test higher due to a bug in the firmware. Around these forums, I've seen 9k floated about, and it seems like a good standard number.
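Putting steps 3–5 together, one node's setup looks roughly like this. The interface names and addresses are assumptions (this reads step 3 as one address per port; adapt to however you lay out your subnet):

```shell
# Step 3: put both ports of this node on the shared subnet
ip addr add 172.19.50.11/24 dev enp65s0f0
ip addr add 172.19.50.12/24 dev enp65s0f1
# Step 5: raise the MTU and bring the ports up
ip link set enp65s0f0 mtu 9000 up
ip link set enp65s0f1 mtu 9000 up
# Step 4: every port on every other node should now answer
ping -c 3 172.19.50.21
```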
If you aren't getting the throughput you're expecting, do ALL of the tuning from BIOS (Performance Tuning for Mellanox Adapters, BIOS Performance Tuning Example) and software (Understanding PCIe Configuration for Maximum Performance, Linux sysctl Tuning) on all servers. It does make a difference. On our small (under-powered) test boxes, we gained 20 Gbit/s over our starting benchmark.
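As one example of the software side, these are common Linux sysctl knobs for high-throughput links. The values are illustrative, not tuned recommendations; the tuning guides above cover what's right for your hardware:

```shell
# Raise the maximum socket buffer sizes (values are examples)
sysctl -w net.core.rmem_max=268435456
sysctl -w net.core.wmem_max=268435456
# Widen the TCP autotuning ranges (min / default / max, in bytes)
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"
```

Put the same settings in /etc/sysctl.conf (or a drop-in under /etc/sysctl.d/) to make them survive a reboot.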
Another thing to check is that you have enough PCIe bandwidth to support line rate; get the Socket Direct cards if you do not.
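A quick way to check the negotiated PCIe link. The bus address below is an example; take the real one from the first command:

```shell
# Find the adapter's PCI address
lspci | grep -i mellanox
# Check the negotiated link speed and width against what line rate needs
lspci -s 41:00.0 -vv | grep -i 'LnkSta:'
# As a rough guide, 100 Gbit/s wants PCIe 3.0 x16 (Speed 8GT/s, Width x16);
# a narrower or slower negotiated link will cap your throughput.
```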
There are a lot of caveats.
- Full link speed IS possible, but only between two directly connected nodes. From our tests, there is a small dip in performance on each hop, and each hop also lowers your maximum theoretical throughput.
- FW version 16.22.1002 had a few bugs related to host chaining; one was that the maximum MTU supported was 8150. Higher MTU means less IP overhead.
- The 'ring' topology is a little funny: traffic flows in only one direction. In a cable-cut scenario, it will NOT route around properly for certain hosts.
Aside: A cable cut is different from a cable disconnect. The transceiver itself registers whether a cable is attached. When no cable is present on one side but there is on the other, the above scenario applies (no proper routing). When both ends of the cable are removed, the ring outright stops and does not work at all. I don't have any data on an actual cable cut.
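To check what MTU actually makes it across the ring end to end, a don't-fragment ping works. The peer address is an example:

```shell
# A 9000-byte MTU leaves 8972 bytes of ICMP payload
# (9000 - 20 bytes IP header - 8 bytes ICMP header)
ping -M do -s 8972 -c 3 172.19.50.21
# If the firmware caps you at 8150, try the matching payload instead:
ping -M do -s 8122 -c 3 172.19.50.21
```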
The ring works as described in the (scant) documentation; from the firmware release notes:
- Received packets from the wire with DMAC equal to the host MAC are forwarded to the local host
- Received traffic from the physical port with DMAC different than the current MAC are forwarded to the other port:
- Traffic can be transmitted by the other physical port
- Traffic can reach functions on the port's Physical Function
- Device allows hosts to transmit traffic only with its permanent MAC
- To prevent loops, the received traffic from the wire with SMAC equal to the port permanent MAC is dropped (the packet cannot start a new loop)
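Those rules can be sketched as a tiny decision function — plain shell, no hardware involved, and the MAC names are made up. It shows why a frame circles the ring until it hits its destination, and why the SMAC check stops it from looping forever:

```shell
# Toy model of the per-port forwarding decision quoted above.
# usage: forward SMAC DMAC HOST_MAC -> prints drop | local | forward
forward() {
  local smac=$1 dmac=$2 host=$3
  if [ "$smac" = "$host" ]; then
    echo drop       # SMAC equals this port's permanent MAC: loop prevention
  elif [ "$dmac" = "$host" ]; then
    echo local      # DMAC equals the host MAC: deliver to the local host
  else
    echo forward    # anything else: pass out the other physical port
  fi
}

# Ring A -> B -> C -> A; A sends a frame addressed to C
forward MAC_A MAC_C MAC_B   # B is not the destination -> forward
forward MAC_A MAC_C MAC_C   # C is the destination     -> local
forward MAC_A MAC_C MAC_A   # came back to the sender  -> drop
```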
If you run into problems, tcpdump is your friend, and ping is a great little tool to check your sanity.
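For example, watching ARP on a port while pinging from a neighbor quickly shows whether frames make it around the ring. The interface name and address are examples:

```shell
# Watch ARP requests/replies arriving on one chained port
tcpdump -i enp65s0f0 -n arp
# Or watch all traffic to/from one neighbor
tcpdump -i enp65s0f0 -n host 172.19.50.21
```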
Hope any of this helps anyone in the future,
I wanted to thank you for these directions; they were very helpful. I successfully linked three nodes together, all running Ubuntu 18.04, and got ~96 Gbit/s between all the hosts using iperf2. I then took one of the boxes, loaded ESXi 6.7, and configured the same IP addresses on the two interfaces I had before. The VMware box cannot communicate with the others now, while the other Ubuntu boxes can still communicate through the NIC. When I run a tcpdump on the ESXi box I see the ARP requests being sent, but get no response. Do you have any idea why the chaining feature does not seem to work with ESXi?
I'm glad I helped someone after all the headache I went through for it.
I have no hard experience with VMware, so take all of this with a grain of salt.
First thought is VLAN tags; I was told that VMware tags by default.
From my (limited) understanding and thoughts, host chaining inside VMware is not a good idea.
If you set up a virtual switch (on the VMware side), put both ports of the card on that switch, and give the switch an IP, that would allow vMotion and the like over the link at close to line speed, letting the switch (analogous to Open vSwitch) do all of the routing and fast-pathing.
Thoughts, if host chaining were enabled:
VMware still sees both ports (we can't assign IPs to raw port interfaces to start with).
It doesn't really know which port to send out of, so traffic could take an extra hop before reaching the destination.
With three nodes, traffic going from A -> B might take the path A -> C -> B.
Where I can speak from experience is non-chaining speed.
We did try Open vSwitch on these cards with chaining off. As long as STP was turned on, we got nearly line speed.
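For reference, that non-chaining Open vSwitch setup looks roughly like this. The bridge name, interface names, and address are assumptions:

```shell
# Bridge both ports of the card, with STP on to break the loop
ovs-vsctl add-br br0
ovs-vsctl add-port br0 enp65s0f0
ovs-vsctl add-port br0 enp65s0f1
ovs-vsctl set Bridge br0 stp_enable=true
# Give the bridge (not the raw ports) the node's address
ip addr add 172.19.50.11/24 dev br0
ip link set br0 up
```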
We opened a support ticket for our MTU problems. It took a while, but we found the problem.
They have a nice little utility (sysinfo-snapshot) for dumping the card internals and OS config options, which helped us track it down.