I have a problem with an HA storage solution that has been sitting in 'development' for a very long time. We have set up a 2-node OmniOS system with 3 disk racks and 2 ZFS pools, one running from each node. RSF-1 has been set up on the system. We are using 2 IS5022 switches, with the subnet manager running on a dedicated server.
The storage is used to supply 2 Windows 2012 R2 servers (SQL Server cluster) and an ESXi 5.5 cluster. We have tested NFS (IPoIB), iSCSI (IPoIB), and SRP for ESXi, and iSCSI (IPoIB) for Windows.
Connections come up and the system runs, but the problem comes when a pool fails over from one node to the other: the link doesn't always come back up. If the pool is then failed back, data starts flowing again as soon as it is imported and the interfaces are configured.
The failover takes approximately 30 seconds. If there are no VMs running on the datastore, the likelihood of the datastore coming back online appears to be greater. With Windows, whether it comes back up is also random. I have increased the timeouts on all the systems to ensure that they are not dropping the LUN (all paths down).
Is there a reason this could be happening? Could it be that the subnet manager needs prompting to update the links (I don't know much about what the subnet manager really does, or when)? Could it be a driver issue where the ESXi server isn't updating its equivalent of an ARP table, or a switch problem?
From what I remember (I haven't done much work on the system in a while), even after rebooting the ESXi servers the datastores aren't guaranteed to be available, so it doesn't appear to be an ESXi or timeout problem on the clients. Any tests needed can be done; as I have said, this is in development at the moment.
I am going to test the system using just gigabit Ethernet to see how that goes. If that works, the problem is down to InfiniBand.
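One way to compare the Ethernet and InfiniBand runs would be to log, from a client, when the storage address stops and resumes accepting TCP connections during a failover. Below is a minimal sketch in Python. The address 192.0.2.10 is a placeholder, the function names are made up for illustration, and port 3260 is the standard iSCSI port (2049 would be the NFS equivalent). It only covers the IP-based paths (IPoIB or Ethernet); SRP does not run over TCP, so it would not be visible to this kind of probe.

```python
import socket
import time
from typing import Iterable, List, Optional, Tuple


def port_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outages(
    samples: Iterable[Tuple[float, bool]],
) -> List[Tuple[float, Optional[float]]]:
    """Collapse (timestamp, reachable) samples into (down_at, up_at) intervals.

    up_at is None if the target was still unreachable at the last sample.
    """
    result: List[Tuple[float, Optional[float]]] = []
    down_at: Optional[float] = None
    for ts, ok in samples:
        if not ok and down_at is None:
            down_at = ts                    # start of an outage
        elif ok and down_at is not None:
            result.append((down_at, ts))    # outage ended at this sample
            down_at = None
    if down_at is not None:
        result.append((down_at, None))      # never came back while sampling
    return result


def watch(host: str, port: int, duration: float = 120.0, interval: float = 1.0):
    """Probe host:port once per interval for duration seconds."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), port_reachable(host, port)))
        time.sleep(interval)
    return outages(samples)


if __name__ == "__main__":
    # Placeholder address: replace with the floating storage IP under test.
    for down, up in watch("192.0.2.10", 3260):
        if up is None:
            print(f"down at {down:.1f}s and did not return")
        else:
            print(f"down for {up - down:.1f}s")
```

Starting it just before triggering a failover and running it once over Ethernet and once over IPoIB would show whether the address itself stays unreachable, or whether it returns at the TCP level while the datastore still fails to reconnect.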
Any help on getting this resolved would be greatly appreciated.
Any info that you need I can provide.