22 Replies Latest reply on Sep 9, 2013 7:39 AM by michael.blanchard

    New to infiniband, can't get a working connection.

      I have 4x Mellanox InfiniHost III cards, 1x Mellanox ConnectX card that was new out of a sealed box, and a Cisco Topspin 120.  I cannot get a single InfiniBand link light on the switch to light up.  In the Topspin interface all the ports show as enabled, but none shows a link light when a card is plugged in.  If I plug the cards into each other, there are no link lights either.  I have 8x SDR cables and I get the same result with each of them.  I also had a QLogic SilverStorm switch before that showed the same problem, so now I'm forced to consider one or more of the following:

      1. I have had 2 bad switches in a row.

      2. I have 8x bad infiniband cables

      3. I have 5x bad mellanox cards

      4. I'm a moron and should switch to fibre channel.

       

      Is there anything I might be missing?  I'm sick of sending back switches that might not be bad.

        • Re: New to infiniband, can't get a working connection.

          I, too, am new to infiniband.

          So this might sound silly, but did you run OpenSM on any of the connected hosts?

          Without OpenSM (or another subnet manager; I do not know if the InfiniBand switches you tried have one built in), no port on the fabric can become active, so the adapters cannot use the connection.

           

          OpenSM is included in the OFED package that you can download from Mellanox or from OpenFabrics Alliance.

          Many linux distributions have an OFED package that can be installed as well.

          • Re: New to infiniband, can't get a working connection.

            I'll try that.  The Topspin has a subnet manager built in, but I'll try OpenSM as well.  It's very annoying that I can't get any lights at all, even when plugging the cards into each other.  So I either have 8x bad cables, 5x bad adapters or 2x bad switches.

              • Re: New to infiniband, can't get a working connection.
                justinclift

                Hi Michael,

                 

                Are you still having problems with this?

                 

                Asking because I used to use (a few years ago) pretty much the same kit.  If you're ok with Linux (RHEL or CentOS), I can give you close to step by step instructions to get it working.  Might take a bit of communication back and forth, but you're in good hands.

                 

                 

                 

                (note - minor edits for clarity)

                  • Re: New to infiniband, can't get a working connection.

                    Yes, still having problems.  Since the original post, I'm now on my 3rd switch (the second Topspin 120), and I'm having the exact same issue.  While I can plug two of the InfiniHost IIIs together and get a link light, when I plug them into the switch I get no link light.  I also cannot plug the ConnectX card into either and get it to work, but I'm starting to suspect that it's just a bad card.  I just can't believe one person can have this much trouble with this stuff.

                      • Re: Re: New to infiniband, can't get a working connection.
                        justinclift

                        No worries.  Which OS are you using?

                         

                        Is there any chance you could do stuff on CentOS/RHEL 6.4?

                         

                        Asking that because it's what I'm super familiar with.

                         

                        If you're ok with that, please install the CentOS/RHEL provided IB software, and also pciutils:

                         

                        $ sudo yum groupinstall "Infiniband Support"

                        $ sudo yum install mstflint pciutils

                        $ sudo chkconfig rdma on

                        $ sudo service rdma start

                         

                        Then let's do some basic info gathering so we know what we're dealing with.

                         

                        • Run lspci -Qvvs on the ConnectX card, and on at least one of the InfiniHost IIIs, then post the results here
                        • Also query the firmware of both using mstflint

                         

                        Example from a ConnectX card here.  First I find out its PCI address in the box:

                         

                        $ sudo lspci |grep Mell

                        01:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)

                         

                        Then use lspci -Qvvs on that address, to retrieve all of the potentially useful info:


                        $ sudo lspci -Qvvs 01:00.0

                        01:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)

                            Subsystem: Mellanox Technologies Device 0006

                            Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+

                            Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

                            Latency: 0, Cache Line Size: 64 bytes

                            Interrupt: pin A routed to IRQ 16

                            Region 0: Memory at f7c00000 (64-bit, non-prefetchable) [size=1M]

                            Region 2: Memory at f0000000 (64-bit, prefetchable) [size=8M]

                            Capabilities: [40] Power Management version 3

                                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)

                                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-

                            Capabilities: [48] Vital Product Data

                                Product Name: Eagle DDR

                                Read-only fields:

                                    [PN] Part number: 375-3549-01         

                                    [EC] Engineering changes: 51

                                    [SN] Serial number: 1388FMH-0905400010     

                                    [V0] Vendor specific: PCIe x8        

                                    [RV] Reserved: checksum good, 0 byte(s) reserved

                                Read/write fields:

                                    [V1] Vendor specific: N/A  

                                    [YA] Asset tag: N/A                            

                                    [RW] Read-write area: 111 byte(s) free

                                End

                            Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-

                                Vector table: BAR=0 offset=0007c000

                                PBA: BAR=0 offset=0007d000

                            Capabilities: [60] Express (v2) Endpoint, MSI 00

                                DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited

                                    ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+

                                DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-

                                    RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-

                                    MaxPayload 256 bytes, MaxReadReq 512 bytes

                                DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-

                                LnkCap:    Port #8, Speed 2.5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited

                                    ClockPM- Surprise- LLActRep- BwNot-

                                LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-

                                    ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

                                LnkSta:    Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-

                                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported

                                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled

                                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-

                                     Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-

                                     Compliance De-emphasis: -6dB

                                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-

                                     EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-

                            Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)

                                ARICap:    MFVC- ACS-, Next Function: 1

                                ARICtl:    MFVC- ACS-, Function Group: 0

                            Kernel driver in use: mlx4_core

                            Kernel modules: mlx4_core

                         

                        Note the Vital Product Data part number and the LnkSta link width in that output.  For ConnectX cards this stuff is useful.  For my card, it's showing a Sun part number, as it was originally a Sun badged card (now reflashed to stock firmware).  The PCI link is in x8 state too, which is useful (if it wasn't, it would indicate a problem).
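The PCI link check mentioned above can be scripted. What follows is a minimal sketch, assuming `lspci -vv` output for a single device is piped in on stdin. `check_pcie_link` is a made-up helper name, not part of pciutils, and a mismatch may only mean the slot is slower or narrower than the card rather than a real fault.

```shell
# check_pcie_link: hypothetical helper (not part of pciutils).
# Reads `lspci -vv` output for ONE device on stdin and compares the
# negotiated link (LnkSta) with what the card advertises (LnkCap).
check_pcie_link() {
    awk '
        # Return the value that follows "key " on the line, up to the next comma.
        function field(line, key,    rest) {
            rest = substr(line, index(line, key " ") + length(key) + 1)
            sub(/,.*/, "", rest)       # drop everything after the value
            sub(/ *\(.*/, "", rest)    # drop notes such as "(ok)" from newer lspci
            return rest
        }
        /LnkCap:/ { cap_speed = field($0, "Speed"); cap_width = field($0, "Width") }
        /LnkSta:/ { sta_speed = field($0, "Speed"); sta_width = field($0, "Width") }
        END {
            if (cap_speed == "" || sta_speed == "") {
                print "no PCIe link info found"
                exit 2
            }
            if (cap_speed == sta_speed && cap_width == sta_width) {
                print "ok: " sta_speed " " sta_width
            } else {
                print "degraded: capable of " cap_speed " " cap_width ", running at " sta_speed " " sta_width
                exit 1
            }
        }
    '
}

# Usage (root is needed to read the capability list):
#   sudo lspci -vvs 01:00.0 | check_pcie_link
```

The function exits non-zero when the negotiated link is below the advertised one, so it can be used directly in an `if`.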

                         

                        And the mstflint output example:

                         

                        $ sudo mstflint -d 01:00.0 q

                        Image type:      ConnectX

                        FW Version:      2.9.1000

                        Device ID:       25418

                        Description:     Node             Port1            Port2            Sys image

                        GUIDs:           0003ba000100edb8 0003ba000100edb9 0003ba000100edba 0003ba000100edbb

                        MACs:                                 0003ba00edb9     0003ba00edba

                        Board ID:         (MT_04A0120002)

                        VSD:            

                        PSID:            MT_04A0120002

                         

                        That tells us the firmware version on the card.  Useful to know, as it might need upgrading (very easy to do).
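When several cards need checking, the two fields that matter most in that output (firmware version and PSID) can be pulled out with a small filter. This is only a sketch: `fw_summary` is a hypothetical helper name, and it assumes the `mstflint ... q` output keeps the `FW Version:` and `PSID:` labels shown above.

```shell
# fw_summary: hypothetical helper (not part of mstflint).
# Reads the output of `mstflint -d <device> q` on stdin and prints the
# firmware version and PSID on one line.
fw_summary() {
    awk -F': *' '
        /^[ \t]*FW Version:/ { fw = $2 }
        /^[ \t]*PSID:/       { psid = $2 }
        END {
            # Trim trailing blanks left over from the column layout.
            sub(/[ \t]+$/, "", fw)
            sub(/[ \t]+$/, "", psid)
            if (fw == "") { print "no firmware version found"; exit 1 }
            print "fw=" fw " psid=" (psid == "" ? "unknown" : psid)
        }
    '
}

# Usage:
#   sudo mstflint -d 01:00.0 q | fw_summary
```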

                         

                        After you've pasted that info here, we can start figuring out if there's anything wrong with the basics first and fix them.  Then we can move onto the next stuff.

                         

                        (note - edited for typo fixes)

                          • Re: Re: New to infiniband, can't get a working connection.

                            Sorry for the lateness, I've been busy.  I'm running 1 VMware host, 2 Windows hosts and a Linux SAN host.  The Linux box is running software called "ESOS", which is BusyBox Linux with all the SCST software pre-installed.  It's frustrating because the software shows the HCA no problem.  Plug it into another card: lights up, no problem.  Plug it into a Cisco Topspin: no light.  I have to check with the software vendor, because ESOS runs BusyBox and I don't have mstflint or lspci on it; this is what's taking me some time.

                              • Re: Re: New to infiniband, can't get a working connection.
                                justinclift

                                No worries.  I was wondering what happened.

                                 

                                I'm kind of wondering if the subnet manager in the Cisco/Topspin box is too ancient to properly recognise your cards.

                                 

                                Are you able to get another Linux box up and running temporarily?  If so, it would be interesting to see what happens if you turn off the subnet manager in the Cisco/Topspin box, instead running your own (OpenSM) on another box connected to the switch.

                                 

                                No idea if that's practical for you to try out though...

                              • Re: New to infiniband, can't get a working connection.

                                [root@localhost ~]# lspci -Qvvs 01:00.0

                                01:00.0 InfiniBand: Mellanox Technologies MT26418 [ConnectX VPI PCIe 2.0 5GT/s - IB DDR / 10GigE] (rev a0)

                                        Subsystem: Mellanox Technologies MT26418 [ConnectX VPI PCIe 2.0 5GT/s - IB DDR / 10GigE]

                                        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+

                                        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

                                        Latency: 0, Cache Line Size: 32 bytes

                                        Interrupt: pin A routed to IRQ 24

                                        Region 0: Memory at feb00000 (64-bit, non-prefetchable) [size=1M]

                                        Region 2: Memory at f9000000 (64-bit, prefetchable) [size=8M]

                                        Capabilities: [40] Power Management version 3

                                                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)

                                                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-

                                        Capabilities: [48] Vital Product Data

                                                Product Name: Eagle DDR

                                                Read-only fields:

                                                        [PN] Part number: 46M2220             

                                                        [EC] Engineering changes: A1

                                                        [SN] Serial number: YK5020000771           

                                                        [V0] Vendor specific: PCIe Gen2 x8   

                                                        [RV] Reserved: checksum good, 0 byte(s) reserved

                                                Read/write fields:

                                                        [V1] Vendor specific: N/A  

                                                        [YA] Asset tag: N/A                            

                                                        [RW] Read-write area: 111 byte(s) free

                                                End

                                        Capabilities: [9c] MSI-X: Enable+ Count=256 Masked-

                                                Vector table: BAR=0 offset=0007c000

                                                PBA: BAR=0 offset=0007d000

                                        Capabilities: [60] Express (v2) Endpoint, MSI 00

                                                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited

                                                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-

                                                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-

                                                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-

                                                        MaxPayload 128 bytes, MaxReadReq 512 bytes

                                                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-

                                                LnkCap: Port #8, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited

                                                        ClockPM- Surprise- LLActRep- BwNot-

                                                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-

                                                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

                                                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-

                                                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported

                                                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled

                                                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-

                                                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-

                                                         Compliance De-emphasis: -6dB

                                                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-

                                                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-

                                        Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)

                                                ARICap: MFVC- ACS-, Next Function: 1

                                                ARICtl: MFVC- ACS-, Function Group: 0

                                        Kernel driver in use: mlx4_core

                                  • Re: New to infiniband, can't get a working connection.

                                    I updated the firmware and tried both the Cisco subnet manager and OpenSM; still no link.

                                     

                                    [root@localhost ~]# ibstat

                                    CA 'mlx4_0'

                                            CA type: MT26418

                                            Number of ports: 2

                                            Firmware version: 2.6.648

                                            Hardware version: a0

                                            Node GUID: 0x0002c90300074bd0

                                            System image GUID: 0x0002c90300074bd3

                                            Port 1:

                                                    State: Down

                                                    Physical state: Polling

                                                    Rate: 10

                                                    Base lid: 0

                                                    LMC: 0

                                                    SM lid: 0

                                                    Capability mask: 0x0251086a

                                                    Port GUID: 0x0002c90300074bd1

                                                    Link layer: InfiniBand

                                            Port 2:

                                                    State: Down

                                                    Physical state: Polling

                                                    Rate: 10

                                                    Base lid: 0

                                                    LMC: 0

                                                    SM lid: 0

                                                    Capability mask: 0x02510868

                                                    Port GUID: 0x0002c90300074bd2

                                                    Link layer: InfiniBand

                                      • Re: New to infiniband, can't get a working connection.
                                        yairi

                                        Both ports showing Down/Polling.

                                        You need to check your physical wires/connections.  You might be using the wrong combination of HCA/cable/switch.

                                         

                                        what do you have there for cables?

                                         

                                        try the following:

                                         

                                        1) run opensm on your switch and take a cable - loopback between two ports, see if the link comes up (if not, then the cable is no good)

                                        2) run opensm daemon on the server and loopback between the HCA port 1 and 2 - same as the above.
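To re-check the port state after each loopback test without reading the full ibstat dump, the State and Physical state lines can be reduced to one line per port. A minimal sketch, assuming ibstat output in the layout shown earlier in the thread; `port_states` is a hypothetical helper, not part of the InfiniBand diagnostic tools.

```shell
# port_states: hypothetical helper (not part of infiniband-diags).
# Reads `ibstat` output on stdin and prints one line per port:
#   <CA name> port <n>: <State>/<Physical state>
port_states() {
    awk -v q="'" '
        $1 == "CA" && $2 != "type:"        { ca = $2; gsub(q, "", ca) }
        $1 == "Port" && $2 ~ /^[0-9]+:$/   { port = $2; sub(/:$/, "", port) }
        $1 == "State:"                     { state = $2 }
        $1 == "Physical" && $2 == "state:" { print ca " port " port ": " state "/" $3 }
    '
}

# Usage:
#   ibstat | port_states
```

For the ibstat output posted above this prints `Down/Polling` for both ports of `mlx4_0`.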

                                          • Re: New to infiniband, can't get a working connection.

                                            Cables: I have really heavy-duty Intel InfiniBand CX4 cables.

                                            With the subnet manager enabled on the Cisco switch, I see the same state as in the ibstat output above.  That output was captured with OpenSM running, both ports plugged into the switch, and the Cisco subnet manager disabled.

                                            1.  I enabled the Cisco subnet manager and plugged the same cable into two switch ports: no connection lights.

                                            2.  I took the same cable and ran it between both ports on the HCA, and the connection lights came on.

                                             

                                            It does appear that I'm now stuck with a dead InfiniBand switch, or some subnet manager weirdness?

                                              • Re: New to infiniband, can't get a working connection.
                                                justinclift

                                                Interesting.  That shows you've tested the cables, and they're ok, so it's not that.

                                                 

                                                I wonder if the switch(es) themselves have been configured by their last owner to some non-standard settings?

                                                Maybe some kind of weird/unusual port speed combination or similar.

                                                 

                                                Three thoughts at this point:

                                                 

                                                • Would you be ok to paste the output of mstflint query command, so we can see any PSID and/or firmware info for one of the cards?

                                                • If you haven't already, look for a way to reset the switch itself to "factory defaults".  May or may not exist.  No idea, but worth a shot just in case it really is "strange settings have been applied".

                                                 

                                                • Try forcing the cards to very slow port speeds.  Something equivalent to 10Gb/s, and see if that works.  Actually, I'd probably try manually forcing the cards to all of the port speeds, one at a time, and see if any of them link up through the switch.  Again, just in case the switch isn't negotiating properly for some reason.

                                                 

                                                Side note - I'm on holiday this week, so likely only limited help from me at this time.

                                • Re: New to infiniband, can't get a working connection.
                                  yairi

                                  With all your issues, we need to sort things out in order: get the basics working first, then move on to the others.

                                  You need to start with:

                                  A) Get physical links working (green LED)

                                  B) Get a subnet manager going and make sure logical links are also working (amber LED)

                                  C) Upgrade the firmware of all components to the latest available (switches and HCAs)

                                   

                                  Once those are done, move on to other things (protocols, ibping, VMware, your app, etc.)
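Steps A and B map onto the two state fields that ibstat reports for each port: "Physical state" for the physical link and "State" for the logical link. The sketch below turns one State / Physical state pair into a hint about which step is still open. `link_stage` is a hypothetical helper, and the wording of the messages is an assumption for illustration only.

```shell
# link_stage STATE PHYS_STATE
# Hypothetical helper: maps the "State" and "Physical state" values printed by
# ibstat onto steps A (physical link) and B (subnet manager / logical link).
link_stage() {
    state=$1
    phys=$2
    case "$phys" in
        LinkUp)
            case "$state" in
                Active)       echo "A and B done: port is active" ;;
                Initializing) echo "A done, B pending: no subnet manager has configured the port yet" ;;
                *)            echo "A done, B in progress: logical state is $state" ;;
            esac
            ;;
        Polling)
            echo "A pending: no physical link detected" ;;
        Disabled)
            echo "A pending: port is disabled" ;;
        *)
            echo "A pending: physical state is $phys" ;;
    esac
}

# Usage, with the values from the ibstat output earlier in the thread:
#   link_stage Down Polling
```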

                                    • Re: New to infiniband, can't get a working connection.

                                      OK...so I've made some more progress.

                                      A. I can get connections with the Linux hosts with all cables; with IP addresses assigned I can ping other computers over the IB fabric.  Still can't get ibping to work; I think I'm using the wrong GUID or something.  I also found some of my issues were due to a faulty card.

                                      B. Subnet manager is working.  Previous issues with the subnet manager disappeared after a reboot, possibly related to the faulty card from A.

                                      C. My ConnectX card was upgraded to new firmware; I will try it today to at least get two machines on the fabric.  The main issue now seems to be that I bought Mellanox InfiniHost III cards, which apparently aren't supported anymore under Server 2012.  I've got new ConnectX cards coming that I will hopefully have better luck with.  So now I'm dead in the water until those arrive.
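On the ibping GUID question: ibping is a client/server tool, so a responder (`ibping -S`) has to be running on the target host, and the client then addresses that host's port, for example by GUID with `-G`. A common slip is passing the Node GUID or the local GUID instead of the remote Port GUID. The sketch below only extracts the Port GUID from ibstat output; `port_guid` is a hypothetical helper, and the GUID in the usage comment is the example value from the ibstat output posted earlier.

```shell
# port_guid PORT
# Hypothetical helper: reads `ibstat` output on stdin and prints the
# Port GUID of the given port number.
port_guid() {
    awk -v want="$1" '
        $1 == "Port" && $2 ~ /^[0-9]+:$/             { cur = $2; sub(/:$/, "", cur) }
        $1 == "Port" && $2 == "GUID:" && cur == want { print $3 }
    '
}

# Usage sketch (assumes a subnet manager is running so the GUID can be resolved):
#   On the host to be pinged:
#     ibstat | port_guid 1      # note the GUID it prints
#     ibping -S                 # start the ibping responder
#   On the other host, using the GUID printed on the first host:
#     ibping -G 0x0002c90300074bd1
```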

                                    • Re: New to infiniband, can't get a working connection.
                                      yairi

                                      Regarding the combination of Windows Server 2012 + HCA firmware: there is a dependency there.  You have to be on the latest firmware for ConnectX in order for the link to come up.

                                      For ConnectX/ConnectX-2 you need to be at around 2.9.1000 or higher.  For ConnectX-3 you should be on 2.11.XXXX or 2.30.XXXX.
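Firmware versions like these need a field-by-field numeric comparison, since a plain string comparison would rank 2.11 below 2.9. A minimal sketch of such a check follows; `fw_at_least` is a hypothetical helper, and the 2.9.1000 threshold is simply the figure quoted above.

```shell
# fw_at_least CURRENT MINIMUM
# Hypothetical helper: succeeds (exit 0) if dotted version CURRENT is greater
# than or equal to MINIMUM, comparing each dot-separated field numerically.
fw_at_least() {
    awk -v cur="$1" -v min="$2" 'BEGIN {
        n = split(cur, c, "[.]")
        m = split(min, r, "[.]")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            a = c[i] + 0        # missing fields count as 0
            b = r[i] + 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0                  # all fields equal
    }'
}

# Usage, with the firmware version reported by ibstat earlier in the thread:
#   if fw_at_least 2.6.648 2.9.1000; then echo "firmware ok"; else echo "firmware too old"; fi
```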