3 Replies Latest reply on May 16, 2018 12:04 PM by halr

    Can't get opensm to work

    cjf001

      Hi All - have a question about infiniband and subnet manager, using the Mellanox OFED open source packages.

       

      It's the classic "it worked before the upgrade, but doesn't now" problem.

       

      Here's the "before" configuration :

       

      The hardware is an HP C7000 blade enclosure, with 16 servers/blades installed. These 16 servers are identical ProLiant BL460c Gen8 blades, each with an infiniband card installed (lspci shows "21:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]" for the infiniband card). The blade chassis has an integrated infiniband switch "HP BLc 4X QDR IB Switch". This switch does not run a subnet manager.

       

      Each of the 16 servers was running CentOS 5.9, with the Mellanox OFED version 1.5.3 infiniband packages, which include OpenSM version 3.3.13. The first blade (we'll call it "01") was running the opensm daemon. All was well, all worked, all was good.

       

      Then, I began upgrading the servers to CentOS 6.9. This change forced me to "upgrade" the Mellanox OFED software to version 4.0.2, which includes OpenSM version 4.9.0. I did a few servers at a time, but left server "01" for last, since it was running the subnet manager. Finally, I'm at today's state, which is node "01" still at CentOS 5.9, still running the "old" subnet manager, and all the other nodes updated to CentOS 6.9 and the newer OFED packages. There was no change to any of the hardware in the blade chassis at all. All is still working well.

       

      Now, before I update the final "01" node, I need to make sure that the new opensm will work. So I stopped the opensm on node "01", and started it on one of the updated nodes - "02". opensm starts, and runs, but does not seem to be working - that is, none of the other nodes' infiniband ports go to "Active" (as viewed by the ibstat command) - they stay in the "Init" state, and there is no connectivity between them.
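One way to check this on each node is to boil the ibstat output down to one line per port. This is only a minimal sketch, assuming the usual ibstat layout (a "CA 'name'" header, "Port N:" sub-headers, and one "State:" line per port); the function name port_states is made up for illustration.

```shell
# Reduce ibstat output (read from stdin) to "<CA> <port> <state>" lines.
# Assumes the usual ibstat layout; "Physical state:" lines are ignored
# because they do not start with "State:".
port_states() {
  tr -d "'" | awk '
    /^CA /                      { ca = $2 }
    /^[[:space:]]*Port [0-9]+:/ { port = $2; sub(/:/, "", port) }
    /^[[:space:]]*State:/       { print ca, port, $2 }
  '
}

# Typical use on a node:
#   ibstat | port_states
```

A port that the subnet manager has not brought up yet would show a state other than Active in the last column.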

       

      If I stop the opensm on node "02" and restart the old one on node "01", all the other nodes' ports go right to "Active" state and all is well.

       

      I'm not an infiniband expert by any means - I use it in our HPC cluster and I know the basics, but it usually just works and I've not had to dive into the depths of it before. I've tried a few things based on some Google searches I've done, but no luck so far. So, I need some ideas on what to check or try - maybe this particular version is a lemon ? Maybe there's a config that needs to be set with the newer versions ? Maybe it's a driver issue ? (although I'm not sure how that could be, since the infiniband works fine on the newer OS and OFED installs).

       

      All suggestions welcome !

       

           Thanks,

       

                John

        • Re: Can't get opensm to work
          halr

          Hi John,

           

          There are quite a few differences between OpenSM 3.3.13 (upstream) and 4.9.0 (MLNX) but the newer one should work. I'll try to help and have a few basic questions to get started on figuring out what is going wrong:

           

          Is the same opensm.conf file being used for these 2 versions ? What routing engine is being used ? Were any edits to the 3.3.13 one made that would need to be applied to the 4.9.0 one ?

           

          Are there any errors in the log file for OpenSM 4.9.0 ? What are they ?

           

          Thanks.

           

          -- Hal

            • Re: Can't get opensm to work
              cjf001

              Hi Hal - thanks for the reply, sorry my response is delayed so much, had to be out last week.....

               

              Anyway, there was no opensm.conf file for either of the versions when I started. I did use the "-c" option to create a config file on the old opensm system, and then I copied that to the new opensm system, but that didn't help.

               

              I don't know what a "routing engine" is in the infiniband context - I don't remember ever having to set up anything like that.

               

              As far as errors in the opensm.log file, there are some, but I can't vouch for what the configs were at those times, so I'll probably have to set up for more testing now that I'm back, and do some "controlled" tests again. I have to wait until there are no jobs running on this blade system to make my tests, since the infiniband net kind of goes down when I stop the old opensm process.

               

              Thanks, and let me know if you have any other thoughts or suggestions -

               

                     John

                • Re: Can't get opensm to work
                  halr

                  The routing engine is configured in opensm.conf with the following lines:

                   

                  # Routing engine
                  # Multiple routing engines can be specified separated by
                  # commas so that specific ordering of routing algorithms will
                  # be tried if earlier routing engines fail.
                  # Supported engines: minhop, updn, dnup, file, ftree, lash,
                  #    dor, torus-2QoS, nue, dfsssp, sssp
                  routing_engine (null)

                   

                  null would cause minhop to be used.

                   

                  There is also a command line option (-R) that overrides this.
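To see which engine a given conf file asks for, something like the following could be used. It is a rough sketch: the function name routing_engine_of is hypothetical, and it assumes that a missing routing_engine line behaves like (null), i.e. falls back to minhop. It does not account for a -R given on the command line.

```shell
# Print the routing engine requested by an opensm.conf file.
# "(null)" means minhop; a missing setting is ASSUMED to mean the same.
routing_engine_of() {
  awk '
    $1 == "routing_engine" { engine = $2 }   # comment lines have $1 == "#"
    END {
      if (engine == "" || engine == "(null)") engine = "minhop"
      print engine
    }
  ' "$1"
}

# Typical use (the path is an assumption; adjust to where your conf lives):
#   routing_engine_of /etc/opensm/opensm.conf
```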

                   

                   

                  You might want to try generating a new opensm.conf on the MLNX OpenSM system which would be different from the one generated by the old opensm system and see if that makes any difference. There are many new options available. While the old conf file should be compatible and work, I just want to eliminate that possibility.
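Once both conf files exist, comparing only their active settings (ignoring comments and blank lines) makes the differences easier to spot. A minimal sketch under stated assumptions: the helper names conf_settings and conf_diff and the file names in the usage comment are made up for illustration, and opensm -c FILE is the same "-c" option mentioned earlier in this thread.

```shell
# Print only the active "option value" lines of a conf file, sorted.
conf_settings() {
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" | sort
}

# Show the settings that differ between two conf files.
# Lines starting with "<" come from the first file, ">" from the second.
conf_diff() {
  old_tmp=$(mktemp) || return 2
  new_tmp=$(mktemp) || { rm -f "$old_tmp"; return 2; }
  conf_settings "$1" > "$old_tmp"
  conf_settings "$2" > "$new_tmp"
  diff "$old_tmp" "$new_tmp"
  status=$?
  rm -f "$old_tmp" "$new_tmp"
  return "$status"
}

# Typical use (hypothetical file names):
#   opensm -c /tmp/opensm-new.conf      # on the node with the newer OpenSM
#   conf_diff /tmp/opensm-old.conf /tmp/opensm-new.conf
```

Options that exist only in the newer file show up as ">" lines, which gives a short list of new settings to look at first.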

                   

                  Can you provide examples of some of the error messages in the opensm log ?
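If the log is long, a per-code summary can help pick representative messages to post. This is only a sketch: it assumes errors are tagged in the log as "ERR" followed by a code, and that the log path in the usage comment matches your setup; opensm_error_counts is a made-up helper name.

```shell
# Count how often each "ERR <code>" tag appears in an OpenSM log file,
# most frequent first.
opensm_error_counts() {
  grep -o 'ERR [0-9A-Fa-f][0-9A-Fa-f]*' "$1" | sort | uniq -c | sort -rn
}

# Typical use (log path is an assumption):
#   opensm_error_counts /var/log/opensm.log
#   grep 'ERR' /var/log/opensm.log | tail -n 20   # full text of recent errors
```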

                   

                  Thanks.

                   

                  -- Hal