0 Replies Latest reply on Sep 20, 2018 9:50 AM by daveb

    ConnectX-4 works at FDR but not FDR10?

    daveb

      I have two sets of servers, each with a tower workstation as its head node.  One set has 6 servers and the other has 8.  The two towers are identical hardware-wise, and all machines use the same dual-port ConnectX-4 cards, with one port in use on each card.  Each server set is connected to its own Mellanox SX6018 switch with QSFP+ cables, and the head node towers are connected to an SX6036G, also with QSFP+ cables.

      This configuration has been in use for over a year, and we've switched between high-speed Ethernet with RDMA/RoCE and an FDR10 InfiniBand fabric multiple times with no issues.  We recently switched to FDR InfiniBand for testing and everything worked fine, but when we switched back to FDR10 the head node towers would no longer pass data (MPI) to the servers.

      We can ping from tower to server over InfiniBand, and ib_send_bw runs successfully between them at 38 Gb/s, but MPI can't establish a connection from tower to server.  MPI works fine from server to server.  The MPI software has not changed since it last worked at FDR10, and the same configuration works flawlessly at FDR; it fails only at the slower FDR10 setting.  The switches are set to auto-negotiate the fabric speed, and rebooting the switches has not helped.

      Our customer requires FDR10, so we need to get this working at FDR10 again.  Any suggestions?
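
For what it's worth, here is a rough sanity check on the 38 Gb/s figure. It is only a sketch, and it assumes the standard per-lane signalling rates (10.3125 Gb/s for FDR10, 14.0625 Gb/s for FDR), 64b/66b encoding, and a 4x link on each QSFP+ port. The function names are made up for illustration and do not come from any Mellanox tool.

```python
# Assumed per-lane signalling rates in Gb/s (standard FDR10 and FDR values).
LANE_RATE_GBPS = {"FDR10": 10.3125, "FDR": 14.0625}
ENCODING_EFFICIENCY = 64 / 66  # both speeds use 64b/66b encoding
LANES = 4                      # assumption: a 4x link per QSFP+ port


def data_rate_gbps(speed):
    """Theoretical data rate of a 4x link at the given speed, in Gb/s."""
    return LANE_RATE_GBPS[speed] * LANES * ENCODING_EFFICIENCY


def likely_speed(measured_gbps):
    """Slowest link speed whose data rate can carry the measured bandwidth.

    Returns None if the measurement exceeds every speed listed above.
    """
    candidates = [s for s in LANE_RATE_GBPS if data_rate_gbps(s) >= measured_gbps]
    return min(candidates, key=data_rate_gbps) if candidates else None


if __name__ == "__main__":
    for speed in LANE_RATE_GBPS:
        print(f"{speed}: {data_rate_gbps(speed):.2f} Gb/s")
    print("38 Gb/s fits:", likely_speed(38.0))
```

By this estimate a 4x FDR10 link tops out at 40 Gb/s, so 38 Gb/s from ib_send_bw is consistent with the tower-to-server link running at FDR10.  The link itself looks healthy, and the failure seems specific to MPI connection setup.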