1 Reply Latest reply on Dec 17, 2015 4:14 AM by gertux

    Infiniband SX6036G/SX6018F and QLogic HP BLc 4X QDR IB Switch

    gertux

      Hi all,

       

      I'm really new to IB, and I'm having trouble connecting my existing IB network (SX6036G gateways and SX6018F switches) to a new HP enclosure that has a QLogic HP BLc 4X QDR IB Switch, with a QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 02) mezzanine adapter on each of the blades. Here's my topology:

       

      # ibswitches

      Switch    : 0x0002c902004b0918 ports 32 "Infiniscale-IV Mellanox Technologies" base port 0 lid 29 lmc 0     --> QLogic HP BLc 4X QDR IB Switch

      Switch    : 0xe41d2d030031e9c1 ports 37 "MF0;GWIB01:SX6036G/U1" enhanced port 0 lid 24 lmc 0

      Switch    : 0xf45214030073f500 ports 18 "MF0;SWIB02:SX6018/U1" enhanced port 0 lid 1 lmc 0

      Switch    : 0xe41d2d030031eb41 ports 37 "MF0;GWIB02:SX6036G/U1" enhanced port 0 lid 23 lmc 0

      Switch    : 0xe41d2d0300097630 ports 18 "MF0;SWIB01:SX6018/U1" enhanced port 0 lid 2 lmc 0

       

      The SM is running on switch SWIB01 with priority 8.

       

      The problem appears when I try to configure the blades. They run Ubuntu 14.04.3 LTS with the following modules loaded:

       

      ib_ucm

      ib_uverbs

      ib_ipoib

      ib_cm

      ib_sa

      ib_umad

      ib_mthca

      ib_qib

      ib_mad

      ib_core

      ib_addr

      dca

       

      If I run "ibstat" on one of the blades, I get:

       

      root@ubuntu:~# ibstat

      CA 'qib0'

          CA type: InfiniPath_QMH7342

          Number of ports: 2

          Firmware version:

          Hardware version: 2

          Node GUID: 0x0011750000791fec

          System image GUID: 0x0011750000791fec

          Port 1:

              State: Down

              Physical state: Polling

              Rate: 40

              Base lid: 30

              LMC: 0

              SM lid: 2

              Capability mask: 0x0761086a

              Port GUID: 0x0011750000791fec

              Link layer: InfiniBand

          Port 2:

              State: Down

              Physical state: Polling

              Rate: 40

              Base lid: 65535

              LMC: 0

              SM lid: 65535

              Capability mask: 0x0761086a

              Port GUID: 0x0011750000791fed

              Link layer: InfiniBand

       

      Now, if I go to a host that is already on the IB network and run the following commands, I can bring the port to Active, but only for a while:

       

      # ibportstate -L 29 28 disable

      # ibportstate -L 29 28 speed 4

      # ibportstate -L 29 28 espeed 4

      # ibportstate -L 29 28 smlid 2

      # ibportstate -L 29 28 enable

       

      # ibportstate -L 29 28

      Switch PortInfo:

      # Port info: Lid 29 port 28

      LinkState:.......................Active

      PhysLinkState:...................LinkUp

      Lid:.............................75

      SMLid:...........................2328

      LMC:.............................0

      LinkWidthSupported:..............1X or 4X

      LinkWidthEnabled:................1X or 4X

      LinkWidthActive:.................4X

      LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps

      LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps

      LinkSpeedActive:.................10.0 Gbps

      Peer PortInfo:

      # Port info: Lid 29 DR path slid 4; dlid 65535; 0,28 port 1

      LinkState:.......................Active

      PhysLinkState:...................LinkUp

      Lid:.............................30

      SMLid:...........................2

      LMC:.............................0

      LinkWidthSupported:..............1X or 4X

      LinkWidthEnabled:................1X or 4X

      LinkWidthActive:.................4X

      LinkSpeedSupported:..............10.0 Gbps (IBA extension)

      LinkSpeedEnabled:................10.0 Gbps (IBA extension)

      LinkSpeedActive:.................10.0 Gbps

      Mkey:............................<not displayed>

      MkeyLeasePeriod:.................0

      ProtectBits:.....................0

       

       

      On the Blade host:

       

      root@ubuntu:~# ibstat

      CA 'qib0'

          CA type: InfiniPath_QMH7342

          Number of ports: 2

          Firmware version:

          Hardware version: 2

          Node GUID: 0x0011750000791fec

          System image GUID: 0x0011750000791fec

          Port 1:

              State: Active

              Physical state: LinkUp

              Rate: 40

              Base lid: 30

              LMC: 0

              SM lid: 2

              Capability mask: 0x0761086a

              Port GUID: 0x0011750000791fec

              Link layer: InfiniBand

          Port 2:

              State: Down

              Physical state: Polling

              Rate: 40

              Base lid: 65535

              LMC: 0

              SM lid: 65535

              Capability mask: 0x0761086a

              Port GUID: 0x0011750000791fed

              Link layer: InfiniBand

       

       

      But then, at some random moment, the port goes Down again and the blade loses connectivity.
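To pin down when the port drops, it can help to log state transitions with a timestamp. The following is only a minimal sketch: it assumes `ibstat` is on the PATH of the blade and prints output in the layout shown above. The function names (`parse_ibstat`, `state_changes`, `watch`) are made up for this example.

```python
import re
import subprocess
import time


def parse_ibstat(text):
    """Parse ibstat output into {(ca_name, port_number): {field: value}}."""
    ports = {}
    ca = port = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'", line)
        if m:
            ca, port = m.group(1), None
            continue
        m = re.match(r"Port (\d+):$", line)  # "Port GUID: ..." does not match
        if m:
            port = int(m.group(1))
            ports[(ca, port)] = {}
            continue
        if port is not None and ": " in line:
            key, value = line.split(": ", 1)
            ports[(ca, port)][key] = value
    return ports


def state_changes(before, after):
    """Return [((ca, port), old_state, new_state)] for ports whose State changed."""
    changes = []
    for key in sorted(after):
        old = before.get(key, {}).get("State")
        new = after[key].get("State")
        if old != new:
            changes.append((key, old, new))
    return changes


def watch(interval_s=5):
    """Poll ibstat forever and print every State transition with a timestamp."""
    previous = {}
    while True:
        out = subprocess.run(["ibstat"], capture_output=True,
                             text=True, check=True).stdout
        current = parse_ibstat(out)
        # On the first pass every port is reported once (old state is None).
        for (ca, port), old, new in state_changes(previous, current):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} {ca} port {port}: {old} -> {new}")
        previous = current
        time.sleep(interval_s)
```

Calling `watch()` on the blade and comparing its timestamps with the switch and SM logs may show whether the drops line up with anything else happening on the fabric.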

       

      If I run "ibqueryerrors" on a host that works fine, I get the following:

       

      # ibqueryerrors

      Errors for "Intel Infiniband HCA ubuntu"

         GUID 0x11750000791fec port 1: [LinkErrorRecoveryCounter == 255] [LinkDownedCounter == 132] [PortRcvErrors == 8]

      Errors for 0x2c902004b0918 "Infiniscale-IV Mellanox Technologies"

         GUID 0x2c902004b0918 port ALL: [SymbolErrorCounter == 65535] [PortRcvErrors == 65535] [PortRcvSwitchRelayErrors == 4] [PortXmitDiscards == 1]

         GUID 0x2c902004b0918 port 1: [PortXmitDiscards == 1]

         GUID 0x2c902004b0918 port 2: [LinkErrorRecoveryCounter == 1] [LinkDownedCounter == 1]

         GUID 0x2c902004b0918 port 28: [SymbolErrorCounter == 65535] [LinkErrorRecoveryCounter == 255] [LinkDownedCounter == 255] [PortRcvErrors == 65535] [PortRcvSwitchRelayErrors == 4]

      Errors for 0xe41d2d030031e9c1 "MF0;GWIB01:SX6036G/U1"

         GUID 0xe41d2d030031e9c1 port ALL: [LinkDownedCounter == 7] [PortRcvRemotePhysicalErrors == 1485] [PortXmitWait == 87808]

         GUID 0xe41d2d030031e9c1 port 0: [PortXmitWait == 87808]

         GUID 0xe41d2d030031e9c1 port 9: [SymbolErrorCounter == 1] [LinkDownedCounter == 2] [PortRcvRemotePhysicalErrors == 1485]

         GUID 0xe41d2d030031e9c1 port 10: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]

         GUID 0xe41d2d030031e9c1 port 33: [LinkDownedCounter == 1]

         GUID 0xe41d2d030031e9c1 port 34: [LinkDownedCounter == 1]

         GUID 0xe41d2d030031e9c1 port 35: [LinkDownedCounter == 1]

         GUID 0xe41d2d030031e9c1 port 36: [LinkDownedCounter == 1]

      Errors for 0xf45214030073f500 "MF0;SWIB02:SX6018/U1"

         GUID 0xf45214030073f500 port ALL: [LinkDownedCounter == 2] [PortXmitWait == 6380344]

         GUID 0xf45214030073f500 port 0: [PortXmitWait == 14354]

         GUID 0xf45214030073f500 port 4: [PortXmitWait == 1514987]

         GUID 0xf45214030073f500 port 5: [PortXmitWait == 1569766]

         GUID 0xf45214030073f500 port 6: [PortXmitWait == 1620863]

         GUID 0xf45214030073f500 port 7: [PortXmitWait == 1660374]

         GUID 0xf45214030073f500 port 16: [LinkDownedCounter == 1]

         GUID 0xf45214030073f500 port 18: [LinkDownedCounter == 1]

      Errors for 0xe41d2d030031eb41 "MF0;GWIB02:SX6036G/U1"

         GUID 0xe41d2d030031eb41 port ALL: [LinkDownedCounter == 7] [PortRcvRemotePhysicalErrors == 2047] [PortXmitWait == 103260]

         GUID 0xe41d2d030031eb41 port 0: [PortXmitWait == 103260]

         GUID 0xe41d2d030031eb41 port 9: [LinkDownedCounter == 3] [PortRcvRemotePhysicalErrors == 2047]

         GUID 0xe41d2d030031eb41 port 33: [LinkDownedCounter == 1]

         GUID 0xe41d2d030031eb41 port 34: [LinkDownedCounter == 1]

         GUID 0xe41d2d030031eb41 port 35: [LinkDownedCounter == 1]

         GUID 0xe41d2d030031eb41 port 36: [LinkDownedCounter == 1]

      Errors for "cibosd08 HCA-1"

         GUID 0xe41d2d03007b77c1 port 1: [PortXmitWait == 3387]

         GUID 0xe41d2d03007b77c2 port 2: [PortXmitWait == 3351]

      Errors for "cibosd07 HCA-1"

         GUID 0xe41d2d03007b67c1 port 1: [PortXmitWait == 3165]

         GUID 0xe41d2d03007b67c2 port 2: [PortXmitWait == 3364]

      Errors for "cibosd06 HCA-1"

         GUID 0xe41d2d03007b77b1 port 1: [PortXmitWait == 2962]

         GUID 0xe41d2d03007b77b2 port 2: [PortXmitWait == 3259]

      Errors for "cibosd05 HCA-1"

         GUID 0xe41d2d0300d95191 port 1: [PortXmitWait == 3213]

         GUID 0xe41d2d0300d95192 port 2: [PortXmitWait == 4189]

      Errors for "cibosd04 HCA-1"

         GUID 0xf45214030095a6f1 port 1: [PortRcvRemotePhysicalErrors == 595] [PortXmitWait == 1861]

         GUID 0xf45214030095a6f2 port 2: [PortXmitWait == 698289]

      Errors for "cibosd03 HCA-1"

         GUID 0xf45214030095ad91 port 1: [PortRcvRemotePhysicalErrors == 501] [PortXmitWait == 2317]

         GUID 0xf45214030095ad92 port 2: [PortXmitWait == 734853]

      Errors for "cibosd01 HCA-1"

         GUID 0xf45214030095a701 port 1: [PortRcvRemotePhysicalErrors == 860] [PortXmitWait == 1975]

         GUID 0xf45214030095a702 port 2: [PortXmitWait == 1459727]

      Errors for "cibosd02 HCA-1"

         GUID 0xf45214030095a6c1 port 1: [PortRcvRemotePhysicalErrors == 540] [PortXmitWait == 2282]

         GUID 0xf45214030095a6c2 port 2: [PortXmitWait == 1080397]

      Errors for "cibmon03 HCA-1"

         GUID 0xe41d2d0300163631 port 1: [PortXmitWait == 219]

      Errors for "cibmon02 HCA-1"

         GUID 0xe41d2d0300163a61 port 1: [PortXmitWait == 24887]

      Errors for 0xe41d2d0300097630 "MF0;SWIB01:SX6018/U1"

         GUID 0xe41d2d0300097630 port ALL: [LinkDownedCounter == 2] [PortRcvRemotePhysicalErrors == 2912] [PortRcvSwitchRelayErrors == 248] [PortXmitWait == 62134]

         GUID 0xe41d2d0300097630 port 0: [PortXmitWait == 27162]

         GUID 0xe41d2d0300097630 port 1: [PortRcvSwitchRelayErrors == 16]

         GUID 0xe41d2d0300097630 port 2: [PortRcvSwitchRelayErrors == 23]

         GUID 0xe41d2d0300097630 port 3: [PortRcvSwitchRelayErrors == 21] [PortXmitWait == 34972]

         GUID 0xe41d2d0300097630 port 4: [PortRcvSwitchRelayErrors == 53]

         GUID 0xe41d2d0300097630 port 5: [PortRcvSwitchRelayErrors == 76]

         GUID 0xe41d2d0300097630 port 6: [PortRcvSwitchRelayErrors == 30]

         GUID 0xe41d2d0300097630 port 7: [PortRcvSwitchRelayErrors == 29]

         GUID 0xe41d2d0300097630 port 16: [LinkDownedCounter == 1] [PortRcvRemotePhysicalErrors == 1673]

         GUID 0xe41d2d0300097630 port 17: [LinkDownedCounter == 1]

         GUID 0xe41d2d0300097630 port 18: [PortRcvRemotePhysicalErrors == 1239]

      Errors for "cibmon01 HCA-1"

         GUID 0xe41d2d0300163651 port 1: [PortXmitWait == 4071]

       

      ## Summary: 19 nodes checked, 17 bad nodes found

      ##          171 ports checked, 54 ports have errors beyond threshold

      ##

      ## Suppressed:
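With this much output it is easy to miss which ports have counters that have hit their ceiling. Below is a rough sketch that parses text in the format shown above and lists the ports with pegged counters. The `SATURATED` table is an assumption based on the maximum values visible in this output (65535 and 255), and the function names are invented for the example.

```python
import re

COUNTER_RE = re.compile(r"\[(\w+) == (\d+)\]")
PORT_RE = re.compile(r"GUID (0x[0-9a-fA-F]+) port (\d+|ALL):")

# Assumed ceilings, taken from the pegged values seen in the output above.
SATURATED = {
    "SymbolErrorCounter": 65535,
    "PortRcvErrors": 65535,
    "LinkErrorRecoveryCounter": 255,
    "LinkDownedCounter": 255,
}


def parse_ibqueryerrors(text):
    """Return one dict per port line, skipping the aggregated 'port ALL' rows."""
    rows = []
    node = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Errors for "):
            node = line[len("Errors for "):]
            continue
        m = PORT_RE.match(line)
        if not m or m.group(2) == "ALL":
            continue
        counters = {name: int(val) for name, val in COUNTER_RE.findall(line)}
        rows.append({"node": node, "guid": m.group(1),
                     "port": int(m.group(2)), "counters": counters})
    return rows


def saturated_ports(rows):
    """Return [(node, port, [counter names])] for ports with pegged counters."""
    flagged = []
    for row in rows:
        pegged = sorted(name for name, value in row["counters"].items()
                        if SATURATED.get(name) == value)
        if pegged:
            flagged.append((row["node"], row["port"], pegged))
    return flagged
```

Run against the output above, this should flag port 1 of the blade HCA, port 28 of the Infiniscale-IV switch, and port 10 of GWIB01. Pegged counters only say that the true count is at least that high, so clearing them and re-reading after a fixed interval gives a better picture of the error rate.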

       

      Any ideas? I've already tried setting the port speed to "7", but with no luck at all: the port does not come up at that setting, only with speed "4".
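For reference, the speed value passed to ibportstate is a bitmask of enabled link speeds. The sketch below decodes it, assuming the usual PortInfo LinkSpeedEnabled encoding (bit 0 = 2.5 Gbps, bit 1 = 5.0 Gbps, bit 2 = 10.0 Gbps), which matches the labels ibportstate prints above. The helper names are made up for illustration.

```python
# Assumed LinkSpeedEnabled bit encoding (per-lane signaling rate).
SPEED_BITS = {1: "2.5 Gbps", 2: "5.0 Gbps", 4: "10.0 Gbps"}


def decode_speed_mask(mask):
    """Return the list of per-lane speeds enabled by a LinkSpeedEnabled mask."""
    return [label for bit, label in sorted(SPEED_BITS.items()) if mask & bit]


def describe_speed_mask(mask):
    """Format the mask in the same 'A or B or C' style ibportstate uses."""
    return " or ".join(decode_speed_mask(mask)) or "none"


print(describe_speed_mask(7))  # all three speeds offered for negotiation
print(describe_speed_mask(4))  # only 10.0 Gbps (QDR) offered
```

Under that reading, "7" lets the two ends negotiate among SDR, DDR, and QDR, while "4" forces QDR only. A link that trains only with "4" suggests the speed negotiation between the two ports is what fails.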

       

      Thanks in advance,

       

      Cheers,

       

      German

        • Re: Infiniband SX6036G/SX6018F and QLogic HP BLc 4X QDR IB Switch
          gertux

          Well, after some time I spoke with the HP people, and they replaced the mezzanine cards (QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 02)) with Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0) cards, and now it works. It is not running at 3.2Gb/s (the throughput that I think QDR may get), but both ports come up and they are running fine. After replacing the mezzanine cards, I also upgraded the firmware to the latest version available on the HP support site.

           

          $ ibstat

          CA 'mlx4_0'

              CA type: MT26428

              Number of ports: 2

              Firmware version: 2.9.1530

              Hardware version: b0

              Node GUID: 0xf452140300dd3294

              System image GUID: 0xf452140300dd3297

              Port 1:

                  State: Active

                  Physical state: LinkUp

                  Rate: 40

                  Base lid: 32

                  LMC: 0

                  SM lid: 2

                  Capability mask: 0x02510868

                  Port GUID: 0xf452140300dd3295

                  Link layer: InfiniBand

              Port 2:

                  State: Active

                  Physical state: LinkUp

                  Rate: 40

                  Base lid: 33

                  LMC: 0

                  SM lid: 2

                  Capability mask: 0x02510868

                  Port GUID: 0xf452140300dd3296

                  Link layer: InfiniBand
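As a rough guide to what "Rate: 40" means for throughput, here is a back-of-the-envelope calculation. It assumes a 4X QDR link (4 lanes at 10 Gbit/s signaling with 8b/10b encoding) and, as a further assumption not confirmed in this thread, that the mezzanine card sits on a PCIe 2.0 x8 link (5 GT/s per lane, also 8b/10b).

```python
def data_rate_gbit(lanes, lane_signal_gbit, payload_bits=8, coded_bits=10):
    """Usable bit rate of a multi-lane serial link after line encoding."""
    return lanes * lane_signal_gbit * payload_bits / coded_bits


qdr_signal = 4 * 10                # 40 Gbit/s, the "Rate: 40" shown by ibstat
qdr_data = data_rate_gbit(4, 10)   # 32 Gbit/s after 8b/10b encoding
pcie2_x8 = data_rate_gbit(8, 5)    # 32 Gbit/s (x8 width is an assumption)

print(f"QDR 4X signaling: {qdr_signal} Gbit/s")
print(f"QDR 4X data rate: {qdr_data:.0f} Gbit/s = {qdr_data / 8:.1f} GB/s")
print(f"PCIe 2.0 x8 raw : {pcie2_x8:.0f} Gbit/s = {pcie2_x8 / 8:.1f} GB/s")
```

Both figures come out at 32 Gbit/s, or 4 GB/s, before any protocol overhead, so measured throughput will be somewhat below that, and a benchmark result should be compared against this ceiling rather than against 40 Gbit/s.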