1 Reply Latest reply on Mar 2, 2017 8:26 AM by eddie.notz

    Not getting connection anymore between 4036 switch and hosts

    delaplj

      Dear Support,

      We're running a 35-node cluster that had been working perfectly for several years. Recently, one of our admins ran the HP upgrade tool, which updated the firmware of the servers' InfiniBand cards.

      Since then, most of the links fail to come up and remain in the Polling state. However, one of the nodes seems able to connect even with the new firmware.

      We're struggling to diagnose the issue (is it really the firmware upgrade, can we roll back, should we upgrade the switches, etc.) and to work out how to address it without changing the whole setup (drivers, OS, firmware, ...).

      Rebooting the switches has no effect. Swapping cables lets the server connect properly through the other port, so the problem seems tied to the nodes themselves.

       

      Any hints would be greatly appreciated.

       

      -jd

       

      Working node with old firmware:

      [root@***s29 ~]# ibstat
      CA 'mlx4_0'
              CA type: MT4099
              Number of ports: 2
              Firmware version: 2.10.2350
              Hardware version: 0
              Node GUID: **
              System image GUID: ***
              Port 1:
                      State: Active
                      Physical state: LinkUp
                      Rate: 40
                      Base lid: 35
                      LMC: 0
                      SM lid: 2
                      Capability mask: **
                      Port GUID: ***
                      Link layer: InfiniBand

       

      We're using the following cards (from HP)

      [root@****~]# lspci | grep Mell
      07:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

      [root@***s03 ~]# ibstat
      CA 'mlx4_0'
              CA type: MT4099
              Number of ports: 2
              Firmware version: 2.36.5000
              Hardware version: 0
              Node GUID: **
              System image GUID: **
              Port 1:
                      State: Down
                      Physical state: Polling
                      Rate: 40
                      Base lid: 0
                      LMC: 0
                      SM lid: 0
                      Capability mask: **
                      Port GUID: **
                      Link layer: InfiniBand
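
      To compare firmware version and link state across all nodes rather than by eye, something like the following Python sketch might help. It assumes the ibstat output has the same layout as the two outputs quoted in this post. The names parse_ibstat and link_is_up are made up for illustration, the sample strings are trimmed copies of the outputs above, and "s29"/"s03" are just labels taken from the shell prompts.

```python
import re

# Trimmed copies of the two ibstat outputs quoted in this post
# (redacted GUID and capability fields left out).
WORKING = """\
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.10.2350
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 35
                SM lid: 2
"""

FAILING = """\
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.36.5000
        Port 1:
                State: Down
                Physical state: Polling
                Rate: 40
                Base lid: 0
                SM lid: 0
"""


def parse_ibstat(text):
    """Return one dict per port found in ibstat-style text (hypothetical helper)."""
    ports = []
    ca = firmware = port = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = re.match(r"CA '(.+)'$", line)
        if m:
            # A new adapter section starts: reset per-adapter state.
            ca, firmware, port = m.group(1), None, None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            port = {"ca": ca, "firmware": firmware, "port": int(m.group(1))}
            ports.append(port)
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if port is not None:
            port[key] = value          # field belongs to the current port
        elif key == "Firmware version":
            firmware = value           # adapter-level field we care about
    return ports


def link_is_up(port):
    """A port counts as healthy only if it is Active and physically LinkUp."""
    return port.get("State") == "Active" and port.get("Physical state") == "LinkUp"


for name, output in (("s29", WORKING), ("s03", FAILING)):
    for p in parse_ibstat(output):
        status = "OK" if link_is_up(p) else "PROBLEM"
        print(f"{name} {p['ca']} port {p['port']} fw {p['firmware']}: "
              f"{p.get('State')}/{p.get('Physical state')} -> {status}")
```

      In practice the sample strings would be replaced by the ibstat output collected from each node, which makes it easier to see whether every port stuck in Polling is on 2.36.5000 and which firmware the one working upgraded node reports.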

       

      ***# module-firmware show

      Module No.      Type            Node GUID             LID   FW Version  SW Version
      ----------      ----            ---------             ---   ----------  ----------
      4036/2036                                                               3.6.2-872