Please check that you have the latest firmware version installed (16.22.1002).
ibv_devinfo will show your firmware version.
mst status -v (to see your current device)
mlxconfig -d /dev/mst/<your mst device> set HOST_CHAINING_MODE=1
See the release notes of the FW here:
Thanks for your answer! The document you refer to states that
"Both ports should be configured to Ethernet when host chaining is enabled."
Is there any way to connect the four nodes without a switch using native InfiniBand?
Only Ethernet is supported.
Question: do you want to use Storage Spaces Direct in Windows Server 2016 with it? That is at least my problem.
Cheers Carsten Rachfahl
Microsoft Cloud & Datacenter Management MVP
Putting this out there since we had so many complications getting host chaining to work, and something Google will pick up is infinitely better than nothing.
The idea was that we wanted something with redundancy. With a switch configuration, we'd have to get two switches and a lot more cables; very expensive.
HOST_CHAINING_MODE was a great idea: switchless, fewer cables, and less expense.
You do NOT need a subnet manager for this to work!
In order to get it working:
Aside: There is no solid documentation on this process as of this writing
1. What Marc said was accurate, set HOST_CHAINING_MODE=1 via the mlxconfig utility.
Aside: Both the VPI and EN type cards will work with host chaining. The VPI type does require you to put it into Ethernet mode.
2. Restart the servers to set the mode.
3. Put all of the ports on the same subnet, e.g. 172.19.50.0/24. Restart the networking stack as required.
4. From there, all ports should be pingable from all other ports.
5. Set the MTU up to 9000. (see caveats for bug; lower to 8000 if 9k doesn't work)
Aside: The MTU could be higher; I have been unable to test higher due to a bug in the firmware. Around these forums, I've seen 9k floated about, and it seems like a good standard number.
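To put rough numbers on step 5, here is a small sketch of how much of each frame is TCP payload at a few MTU values. It assumes plain IPv4 and TCP headers without options (20 bytes each) and standard Ethernet framing (14-byte header, 4-byte FCS, 8-byte preamble, 12-byte inter-frame gap). The function names are made up for illustration, and nothing here was measured on the cards.

```python
# Back-of-the-envelope MTU arithmetic; all names here are illustrative.
IP_HEADER = 20        # IPv4 header without options
TCP_HEADER = 20       # TCP header without options
ICMP_HEADER = 8       # ICMP echo header
ETH_OVERHEAD = 14 + 4 + 8 + 12  # header + FCS + preamble/SFD + inter-frame gap


def tcp_efficiency(mtu: int) -> float:
    """Fraction of on-wire bytes that are TCP payload for a full-size frame."""
    payload = mtu - IP_HEADER - TCP_HEADER
    return payload / (mtu + ETH_OVERHEAD)


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


if __name__ == "__main__":
    for mtu in (1500, 8000, 9000):
        print(f"MTU {mtu}: {tcp_efficiency(mtu):.2%} payload, "
              f"largest unfragmented ping payload {max_ping_payload(mtu)} bytes")
```

The ping payload size is handy with Linux `ping -M do -s <size> <address>`, which forbids fragmentation, so an MTU problem somewhere in the ring shows up as a failed ping instead of being hidden by fragmentation.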
If you aren't getting the throughput you're expecting, do ALL of the tuning from BIOS (Performance Tuning for Mellanox Adapters, BIOS Performance Tuning Example) and software (Understanding PCIe Configuration for Maximum Performance, Linux sysctl Tuning) for all servers. It does make a difference. On our small (under-powered) test boxes, we gained 20 Gbit/s over our starting benchmark.
Another thing to check is that you have enough PCIe bandwidth to support line rate; get the Socket Direct cards if you do not.
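As a rough way to check the PCIe point, the sketch below compares raw slot bandwidth with port speed. The per-lane rates and encodings are the standard PCIe figures; the 100 Gbit/s port speed is an assumption (it is not stated in this thread, though ~96 Gb/s iperf results are reported further down), and the function names are invented for the example. Real throughput is lower than these raw numbers because of PCIe packet overhead.

```python
# Rough PCIe link-bandwidth arithmetic; names are illustrative.
# Per-lane transfer rate (GT/s) and line-encoding efficiency by generation.
PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def pcie_gbps(generation: int, lanes: int) -> float:
    """Raw link bandwidth in Gbit/s per direction, before protocol overhead."""
    rate, encoding = PCIE_GENERATIONS[generation]
    return rate * encoding * lanes


def can_feed(generation: int, lanes: int, port_gbps: float, ports: int = 1) -> bool:
    """True if the slot's raw bandwidth is at least the combined port rate."""
    return pcie_gbps(generation, lanes) >= port_gbps * ports


if __name__ == "__main__":
    for gen, lanes in ((3, 8), (3, 16), (4, 16)):
        print(f"Gen{gen} x{lanes}: {pcie_gbps(gen, lanes):.1f} Gbit/s raw, "
              f"one 100G port: {can_feed(gen, lanes, 100)}, "
              f"two 100G ports: {can_feed(gen, lanes, 100, ports=2)}")
```

Under these assumptions a Gen3 x8 slot cannot keep up with even one 100 Gbit/s port, which is the kind of mismatch worth ruling out before blaming host chaining.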
There are a lot of caveats.
- Full link speed is possible only between two directly connected nodes. From our tests, there is a small dip in performance on each hop; and each hop also limits your max theoretical throughput.
- FW version 16.22.1002 had a few bugs related to host chaining; one of those was the max MTU supported was 8150. Higher MTU, less IP overhead.
- The 'ring' topology is a little funny. It is only one direction. If there is a cable cut scenario, it will NOT route around properly for certain hosts.
Aside: A cable cut is different than a cable disconnect. The transceiver itself registers whether there is a cable attached or not. When there is no cable present on one side, but is on the other, the above scenario is true (not properly routing.) When both sides of the cable are removed, the ring outright stops and does not work at all. I don't have any data to support an actual cable cut.
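To illustrate the "only one direction" caveat, here is a minimal sketch that counts hops in a ring where frames only ever travel one way. That one-way model is an assumption based on the observation above, not something taken from vendor documentation, and the node names are placeholders.

```python
# Hop counts in a ring that only forwards in one direction (assumed model).

def hops_one_way(n_nodes: int, src: int, dst: int) -> int:
    """Links crossed from src to dst when traffic only flows i -> i + 1."""
    return (dst - src) % n_nodes


def hop_table(names):
    """Map each (source, destination) pair to its one-way hop count."""
    n = len(names)
    return {
        (names[s], names[d]): hops_one_way(n, s, d)
        for s in range(n)
        for d in range(n)
        if s != d
    }


if __name__ == "__main__":
    for (src, dst), hops in hop_table(["A", "B", "C", "D"]).items():
        print(f"{src} -> {dst}: {hops} hop(s)")
```

In this model A reaches its downstream neighbour B in one hop, but B needs three hops to reach A in a four-node ring, so the two directions of a single conversation can see different numbers of hops.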
The ring works as described in the (scant) documentation. From the firmware release notes:
- Received packets from the wire with DMAC equal to the host MAC are forwarded to the local host
- Received traffic from the physical port with DMAC different than the current MAC are forwarded to the other port:
  - Traffic can be transmitted by the other physical port
  - Traffic can reach functions on the port's Physical Function
- Device allows hosts to transmit traffic only with its permanent MAC
- To prevent loops, the received traffic from the wire with SMAC equal to the port permanent MAC is dropped (the packet cannot start a new loop)
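The rules above can be walked through with a toy model. This is only a sketch of the quoted rules for unicast frames, not how the firmware is implemented: it assumes frames travel in a single direction, gives each node one MAC (real cards have one per port), and ignores broadcast traffic such as ARP. `Node` and `send` are names invented for the example.

```python
# Toy model of the release-note forwarding rules; not the firmware's implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    mac: str  # one permanent MAC per node (simplification: real cards have one per port)


def send(ring, src_index, dmac):
    """Send a unicast frame from ring[src_index] and follow it around the ring.

    Returns (outcome, path), where path lists the node names the frame visits.
    Assumes frames travel in one direction only: index i -> i + 1.
    """
    smac = ring[src_index].mac  # hosts transmit only with their permanent MAC
    path = [ring[src_index].name]
    i = src_index
    while True:
        i = (i + 1) % len(ring)
        node = ring[i]
        path.append(node.name)
        if smac == node.mac:
            # Frame came back to its sender: dropped so it cannot start a new loop.
            return "dropped", path
        if dmac == node.mac:
            # DMAC matches this host: handed to the local host.
            return "delivered", path
        # Otherwise the card forwards the frame out of its other port.


if __name__ == "__main__":
    ring = [Node("A", "aa"), Node("B", "bb"), Node("C", "cc"), Node("D", "dd")]
    print(send(ring, 0, "cc"))  # frame for C, sent by A
    print(send(ring, 0, "ff"))  # frame for a MAC nobody owns
```

A frame addressed to a MAC that no node owns goes all the way around and is dropped when it returns to its sender, which is the loop-prevention rule in action.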
If you run into problems, tcpdump is your friend, and ping is a great little tool to check your sanity.
Hope any of this helps anyone in the future,
I wanted to thank you for these directions; they were very helpful. I was successful in linking three nodes together, all running Ubuntu 18.04. I was able to get ~96 Gb/s between all the hosts using iperf2. I then took one of the boxes and loaded ESXi 6.7, and configured the same IP addresses on the two interfaces that I had before. The VMware box cannot communicate with the others now. I can still communicate between the other Ubuntu boxes through the NIC. When I run a tcpdump on the ESXi host I see the ARP request being created, but get no response. I am wondering if you have any idea why the chaining feature does not seem to work with ESXi?
I'm glad I helped someone after all the headache I went through for it.
I have no hard experience with VMWare, and so take all of this with a grain of salt.
First thought is VLAN tags. I was told that VMware tags by default.
From my (limited) understanding and thoughts, host chaining inside VMware is not a good idea.
If you set up a virtual switch (on the VMware side), put both ports of the card on it, and give that switch an IP, that would allow for vMotion and such over the link at close to line speed, letting the switch (analogous to openvswitch) do all of the routing and fast pathing.
Thoughts - If there was host chaining:
Vmware still sees both ports (we can't assign IPs to raw port interfaces to start with.)
It doesn't really know which port to send out, so it could take the extra hop before it gets to the destination.
Three node, desired going from A -> B might take the path of A -> C -> B
What I can speak to is non-chaining speed.
We did try using Open vSwitch and the cards with chaining off. So long as STP is turned on, we got nearly line speed.
We opened a support ticket for our problems with MTU. It took a while, but we found the problem.
They have a nice little utility (sysinfo-snapshot) for seeing the card internals and OS config options which helped us (by looking through it.)
See my post below. Host_chaining is not supported on ESXi at this time.
Just some due diligence here.
We put our ConnectX-5 cards in our 3-host VMware 6.5 stack, and did not get it to work with host_chaining. We ended up contacting support about it, and the reply we got wasn't optimistic.
"Host-chaining is currently not supported as it is not GA for ESXi."
So my previous post should be taken with a grain of salt, and is marked accordingly.
I have yet to see *any* documentation on host_chaining specifically, which is really sad, since as far as I know, my post above is the best available.
I went through all those steps, but still the HOST_CHAINING isn't working for me. Any additional ideas I can go for?
What I noticed is this: when sending a ping from A to B, the ICMP request is sent correctly from A to B, but B's ARP request (issued before it sends the ICMP reply) travels down the line from B to C, and C discards it.
To me it looks like the host chain is still not working. At the same time, I have no clue what to do next.
From what I gather, you might not have host_chaining enabled on C, or you might be using VMware.
Host chaining is all done on-card, and so the host kernels are not aware of it.
Since chaining works based on the destination MAC, if C doesn't have chaining on, C will see that the packet wasn't meant for it and not bother replying to, rejecting, or forwarding it.
With chaining on; the ASIC on the card for C will forward it without sending it to the kernel. The host won't even know that there was a packet to start with.
Something else that I might look at is the arp tables. Could it be possible that with other tests, the table is poisoned? I haven't seen it, but host_chaining is something else...
No, it's turned on and I'm not running ESXi, I'm running Debian 9.5. Here's my setup:
PVE1:
Port1: 172.31.31.11/24 - connected to PVE2 Port2
Port2: 172.31.31.21/24 - connected to PVE4 Port1
root@pve1:~# mlxconfig q | grep HOST_C
PVE2:
Port1: 172.31.31.12/24 - connected to PVE3 Port2
Port2: 172.31.31.22/24 - connected to PVE1 Port1
root@pve2:~# mlxconfig q | grep HOST_C
PVE3:
Port1: 172.31.31.13/24 - connected to PVE4 Port2
Port2: 172.31.31.23/24 - connected to PVE2 Port1
root@pve3:~# mlxconfig q | grep HOST_C
PVE4:
Port1: 172.31.31.14/24 - connected to PVE1 Port2
Port2: 172.31.31.24/24 - connected to PVE3 Port1
root@pve4:~# mlxconfig q | grep HOST_C
Any ideas what I can look into?
Ah; that diagram looks right, all on the same subnet, and all connected in a correct ring.
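The "same subnet, correct ring" check can also be done mechanically. The sketch below encodes the wiring and addresses from the post above; the host names are inferred from the shell prompts, and the helper names are invented for the example. It only checks the description of the cabling, not what the cards actually do.

```python
# Sanity check of the posted wiring and addressing; host names are inferred
# from the shell prompts in the post.
import ipaddress

# (host, port) -> (peer host, peer port), as listed in the post.
LINKS = {
    ("pve1", 1): ("pve2", 2), ("pve1", 2): ("pve4", 1),
    ("pve2", 1): ("pve3", 2), ("pve2", 2): ("pve1", 1),
    ("pve3", 1): ("pve4", 2), ("pve3", 2): ("pve2", 1),
    ("pve4", 1): ("pve1", 2), ("pve4", 2): ("pve3", 1),
}

ADDRESSES = [
    "172.31.31.11/24", "172.31.31.21/24", "172.31.31.12/24", "172.31.31.22/24",
    "172.31.31.13/24", "172.31.31.23/24", "172.31.31.14/24", "172.31.31.24/24",
]


def is_symmetric(links):
    """Every cable must be described identically from both ends."""
    return all(links.get(peer) == end for end, peer in links.items())


def forms_single_ring(links):
    """Follow Port1 from host to host; a ring visits every host once and returns."""
    hosts = {host for host, _ in links}
    start = min(hosts)
    seen = []
    host = start
    while host not in seen:
        seen.append(host)
        host = links[(host, 1)][0]
    return host == start and set(seen) == hosts


def same_subnet(addresses):
    """True if all interface addresses fall in one IP network."""
    return len({ipaddress.ip_interface(a).network for a in addresses}) == 1


if __name__ == "__main__":
    print("cabling symmetric:", is_symmetric(LINKS))
    print("single ring:", forms_single_ring(LINKS))
    print("one subnet:", same_subnet(ADDRESSES))
```

All three checks pass for the posted setup, which supports looking at MTU or firmware next instead of at the cabling.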
If I had to guess, I'd lower the MTU back to 1500 on all the nodes (both interfaces): `ifconfig ib0 mtu 1500 ; ifconfig ib1 mtu 1500`
We had issues with high MTU throwing host_chaining into a weird packet drop situation, which looks like what might be happening here. They said that it was fixed in a newer FW, but I wasn't able to fully test and make sure it was fixed.
If that doesn't work, I'm out of ideas. Support will give you a script to run on all the nodes, and that would be my next action. There is a lot of useful information in that report, so it is worth a look before you send it off.
I've been disappointed with Mellanox regarding documentation on *any* part of this feature.
Disappointment also on my side :-(
But thank you so much for your help.
I have a problem pinging between the NICs. This is my configuration:
SERVER 1: PORT1: 192.168.10.10 PORT2: 192.168.10.11
SERVER 2: PORT1: 192.168.10.12 PORT2: 192.168.10.13
SERVER 3: PORT1: 192.168.10.14 PORT2: 192.168.10.15
mlxconfig -d mt4119_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2
mlxconfig -d mt4119_pciconf0 set HOST_CHAINING_MODE=1
mlxfwreset --device mt4119_pciconf0 reset
All commands work perfectly, but I can only ping between directly interconnected ports; I need to be able to ping all ports.
Is my configuration correct?
That config looks correct. I'm being that guy... I'd be tempted to do a full machine restart.
Make sure you've issued those commands to the other servers, and done a restart to solidify the config.
I haven't used the mlxfwreset command, but looking at the docs, without the level argument, it is only doing the lowest level of what the adapter supports.
A physical 'shutdown -r now' has always worked for me.
It still does not work.
What drivers are you using?