
    Proper Configuration for IB-FDR and RoCE

      Hello,

       

      We have a few large clusters which came with Mellanox dual-port HCAs (QDR+10GigE). Initially the clusters were set up as RoCE clusters, but we have now acquired, and continue to acquire, IB FDR fabric infrastructure.

       

      On the cluster with the dual-port QDR+10GigE HCAs, some MPI stacks (OpenMPI 1.6.5 or 1.7.2 and Intel MPI 4.1.1) started getting confused, with communication at times stalling completely.

       

      When I run ibstatus I get:

       

      $ ibstatus

      Infiniband device 'mlx4_0' port 1 status:

              default gid:     fe80:0000:0000:0000:78e7:d103:0023:91ad

              base lid:        0x1

              sm lid:          0x21

              state:           4: ACTIVE

              phys state:      5: LinkUp

              rate:            40 Gb/sec (4X QDR)

              link_layer:      InfiniBand

       

       

      Infiniband device 'mlx4_0' port 2 status:

              default gid:     fe80:0000:0000:0000:7ae7:d1ff:fe23:91ad

              base lid:        0x0

              sm lid:          0x0

              state:           4: ACTIVE

              phys state:      5: LinkUp

              rate:            10 Gb/sec (1X QDR)

              link_layer:      Ethernet

       

      When both ports are configured, is there any special setting so that the 10GigE/RoCE and IB parts work without interfering with each other? Do I need to set up opensm, which manages the IB part, to use only the IB port for IB fabric management? Can you please suggest any guidelines for this situation with both ports configured?
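
      (For example, would it be enough to bind opensm to the IB port's GUID, along the lines of the sketch below? The GUID here is just the lower 64 bits of port 1's default gid above; I am not sure this is the whole story.)

      $ ibstat mlx4_0 1
      $ opensm -g 0x78e7d103002391ad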

       

      Is there any adverse effect from having BOTH RoCE and IB operating on a cluster at the same time?

       

      The systems run RHEL 6.3, using the stock OFED and opensm that came with the distribution.

       

      uname -a :

      Linux host 2.6.32-279.25.2.el6.x86_64 #1 SMP Tue May 14 16:19:07 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

       

       

       

      thanks ....

       

      Michael

        • Re: Proper Configuration for IB-FDR and RoCE

          Hello Michael,

           

          From what I see, the configuration you are describing should be workable. You said that the connections became confused and stalled at times, and I take it you are implying that the issue comes from using both InfiniBand and Ethernet on one card.

           

          Could you elaborate on this error?

          Is there a specific output you are seeing?

          What is the traffic like on these ports during this error condition?

          Does it happen when both ports are carrying egress traffic, or ingress, or a mix?

           

          I noticed you are using the RHEL 6.3 community OFED; have you had any success with our own driver (Mellanox OFED)?

          Could you provide the ibv_devinfo output for your machines?  I would like to see the PSID of your cards.
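
          In case it helps, the PSID usually shows up as the board_id field in ibv_devinfo, so a quick way to collect it from each node is something along these lines (a sketch; mlx4_0 is just the device name from your ibstatus output):

          $ ibv_devinfo -d mlx4_0 | grep board_id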

            • Re: Proper Configuration for IB-FDR and RoCE

              Hi Lui,

               

              thanks for the reply!

               

              Here is the ibstatus and ibv_devinfo output:

               

              $ ibstatus

              Infiniband device 'mlx4_0' port 1 status:

                      default gid:     fe80:0000:0000:0000:24be:05ff:ff91:fee1

                      base lid:        0x1c

                      sm lid:          0x12

                      state:           4: ACTIVE

                      phys state:      5: LinkUp

                      rate:            56 Gb/sec (4X FDR)

                      link_layer:      InfiniBand

               

              Infiniband device 'mlx4_0' port 2 status:

                      default gid:     fe80:0000:0000:0000:26be:05ff:fe91:fee2

                      base lid:        0x0

                      sm lid:          0x0

                      state:           4: ACTIVE

                      phys state:      5: LinkUp

                      rate:            10 Gb/sec (1X QDR)

                      link_layer:      Ethernet

               

               

              $ ibv_devinfo

              hca_id: mlx4_0

                      transport:                      InfiniBand (0)

                      fw_ver:                         2.11.1008

                      node_guid:                      24be:05ff:ff91:fee0

                      sys_image_guid:                 24be:05ff:ff91:fee3

                      vendor_id:                      0x02c9

                      vendor_part_id:                 4099

                      hw_ver:                         0x0

                      board_id:                       HP_0230240019

                      phys_port_cnt:                  2

                              port:   1

                                      state:                  PORT_ACTIVE (4)

                                      max_mtu:                2048 (4)

                                      active_mtu:             2048 (4)

                                      sm_lid:                 18

                                      port_lid:               28

                                      port_lmc:               0x00

                                      link_layer:             InfiniBand

               

                              port:   2

                                      state:                  PORT_ACTIVE (4)

                                      max_mtu:                4096 (5)

                                      active_mtu:             1024 (3)

                                      sm_lid:                 0

                                      port_lid:               0

                                      port_lmc:               0x00

                                      link_layer:             Ethernet

               

               

              Investigation led to OpenMPI recommendations to avoid using the same default GID prefix for both the IB and Ethernet ports.

               

              Is this something you recommend? Or how should I go about making a clean configuration where both 10GigE and IB are properly set up?

               

              One other twist: some nodes have RoCE enabled. How do I configure IB so that it won't interfere with RoCE, and vice versa?

               

              Thanks!

               

              Michael

            • Re: Proper Configuration for IB-FDR and RoCE
              eddie.notz

              Hi Michael,

               

              Open MPI by default tries to run over all RDMA-capable ports in the system.

              Since our HCA and driver support RoCE, it tries to run over the 10GbE port as well.

               

              To include/exclude the desired HCA/port for OpenMPI, use the mca parameter, for example:

              %mpirun -mca btl_openib_if_include "mlx4_0:1,mlx4_1:1" <…other mpirun parameters…>

               

              In your case it should be:

               

              %mpirun -mca btl_openib_if_include "mlx4_0:1" <…other mpirun parameters…>
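
              Alternatively, if you would rather name the port to avoid than the ports to use, the openib BTL also has an exclude parameter (do not combine include and exclude in the same run). A sketch, assuming port 2 of mlx4_0 is the Ethernet/RoCE port as in your ibstatus output:

              %mpirun -mca btl_openib_if_exclude "mlx4_0:2" <…other mpirun parameters…>

              For Intel MPI I believe the corresponding knobs for the OFA fabric are the I_MPI_OFA_ADAPTER_NAME and I_MPI_OFA_NUM_PORTS environment variables, but please double-check those names against the Intel MPI 4.1 reference manual.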

                • Re: Proper Configuration for IB-FDR and RoCE

                  Hi Eddie,

                   

                  thanks for the reply!

                   

                  Yes, we also notice that if we explicitly include the right IB interface, MPI communication proceeds smoothly.

                   

                  Is there something I can do with (say) opensm, or somewhere else, to ensure that MPI stacks use a specific interface? It so happens that some groups may like to use RoCE independently of MPI (as in using GASNet from UPC). Can we have RoCE and IB coexist without interfering with each other, whether MPI is used or not?

                   

                  I saw some recommendations on the OpenMPI site to give the 10GigE interface a different default GID prefix from that of the IB interface.

                   

                  Thanks...

                   

                  Michael

                    • Re: Proper Configuration for IB-FDR and RoCE
                      eddie.notz

                      Hi Michael,

                       

                      If you are referring to:

                      http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

                      then it actually talks about the case where the node has 2 IB ports connected to different IB fabrics.

                       

                      Unfortunately, you have to use the -mca btl_openib_if_include parameter; otherwise the traffic will automatically be load-balanced across the InfiniBand and Ethernet ports.
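
                      If you do not want every user to remember the flag, one option (a sketch using standard Open MPI mechanisms, nothing specific to our driver) is to set the parameter once per node, either in the Open MPI installation's etc/openmpi-mca-params.conf:

                      btl_openib_if_include = mlx4_0:1

                      or as an environment variable in the users' shell profiles:

                      export OMPI_MCA_btl_openib_if_include=mlx4_0:1

                      mpirun then picks this up without the -mca option on the command line, and it does not affect GASNet/UPC jobs that use RoCE directly, since it only changes what the Open MPI openib BTL selects.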

                        • Re: Proper Configuration for IB-FDR and RoCE

                          Thanks, that's a good point.... I guess the OpenMPI stack thinks that it has 2 physical transports and, in trying to load-share, runs into connectivity problems.

                           

                          Do you think we can have both RoCE and IB active on the same set of hosts? Some groups here would like to use RoCE, but of course MPI/IB is the communication of choice.

                           

                          Actually, here is a question about the selection of routes among end-points: with the fat-tree topology we have multiple alternative paths connecting each pair of end-points (X, Y). Who determines the specific route that communication between two specific end-points (A, B) will take? Is it the MPI stack itself at IB connection establishment time, or does it consult the SM? And can re-routing (or selection of an alternative to the initial route) take place at the request of, say, the MPI stack, or does the SM have to be consulted or adjust its own routing tables?

                           

                          Finally, we are using the OFED that came with RHEL 6.3 (I think 1.5.4?) for various, mostly non-technical, reasons. Do you have any concrete argument in favor of deploying Mellanox's own latest OFED for that Linux distribution?

                           

                          Thanks!

                          Michael