1 Reply Latest reply on Nov 1, 2014 6:25 AM by halr

    Testing sub-optimal shapes in torus-2QoS

      We are testing a torus topology in our environment and ran into an issue.  Two configurations are currently under test: a 4x2x1, which works, and a 7x2x1, which does not.  The 7x2x1 shows credit loops when validated with ibdmchk.

       

      (ofed 1.5.3.4.0.12)

       

      Using ibsim, I reproduced the credit loop problem on a virtual subnet with the same number of CAs and switches.  I then reconfigured it as a 4x3x1 with the 7th cabinet disconnected, and I think we can work with that.  I thought the problem might stem from the Y axis being only 2 switches tall, so that the "ym_link" and "yp_link" had the same source and destination.

       

       

      Out of curiosity, I created a 36x3x1 fabric with 36-port switches and 16 CAs per switch.  The topology, which is created by a Perl script I wrote, and the torus-2QoS.conf file are shown below.  I used Switch73 as the seed at coordinates 0,0,0.  This still causes credit loops and I have NO IDEA why.  Any thoughts?  I'm using ibsim 0.5 for the tests.

       

      Switch1 => Switch2 => Switch3 => ...

      Switch37 => Switch38 => Switch39 => ...

      Switch73 => Switch74 => Switch75 => ...

       

      torus 36t 3t 1

      xp_link 0x200048 0x200049

      xm_link 0x200048 0x20006b

      yp_link 0x200048 0x200000

      ym_link 0x200048 0x200024

      portgroup_max_ports 20
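
      For reference, the four link lines above can be re-derived from the switch layout.  This is only a sketch: the mapping GUID = 0x200000 + (switch number - 1) is inferred from the values posted here (Switch73 is 0x200048), and the helper names are made up for illustration.

```python
# Sketch: derive the seed switch's torus-2QoS.conf link lines from the layout.
# Assumption (inferred from the posted GUIDs): switch N has GUID 0x200000 + N - 1.
GUID_BASE = 0x200000


def guid(switch_number):
    """Hypothetical helper: map 'SwitchN' to its simulated GUID."""
    return GUID_BASE + (switch_number - 1)


def seed_links(row_starts, x_radix, seed_row):
    """Return the xp/xm/yp/ym lines for the first switch of the seed row.

    row_starts[i] is the number of the first switch in ring i; each ring
    holds x_radix consecutively numbered switches.
    """
    y_radix = len(row_starts)
    seed = row_starts[seed_row]
    neighbours = {
        "xp_link": seed + 1,                               # next switch in the ring
        "xm_link": seed + x_radix - 1,                     # wraps to the ring's last switch
        "yp_link": row_starts[(seed_row + 1) % y_radix],   # ring "above" the seed
        "ym_link": row_starts[(seed_row - 1) % y_radix],   # ring "below" the seed
    }
    return [f"{name} {guid(seed):#x} {guid(other):#x}"
            for name, other in neighbours.items()]


# Rings start at Switch1, Switch37 and Switch73; Switch73 is the seed.
conf_lines = seed_links(row_starts=[1, 37, 73], x_radix=36, seed_row=2)
print("\n".join(conf_lines))
```

      With those inputs the script reproduces the four xp/xm/yp/ym lines in the file above, so the file at least agrees with the intended layout.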

       

      Here is the output from opensm.log upon starting it up:

       

      [root@cinhpcdev4 torus-test2]# cat opensm.log

      Nov 07 12:10:32 997544 [9E1C3780] 0x43 -> OpenSM 3.3.13.MLNX_20130110_cd124d3

      Nov 07 12:10:32 999452 [9E1C3780] 0x80 -> OpenSM 3.3.13.MLNX_20130110_cd124d3

      Nov 07 12:10:33 275154 [9E1C3780] 0x02 -> osm_vendor_init: 100 pending umads specified

      Nov 07 12:10:33 294725 [9E1C3780] 0x80 -> Entering DISCOVERING state

      Nov 07 12:10:33 331269 [9E1C3780] 0x02 -> osm_vendor_bind: Binding to port 0x200000

      Nov 07 12:10:33 432034 [9E1C3780] 0x02 -> osm_vendor_bind: Binding to port 0x200000

      Nov 07 12:10:33 449985 [9E1C3780] 0x02 -> osm_vendor_bind: Binding to port 0x200000

      Nov 07 12:10:33 468021 [9E1C3780] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0000000000200000

      Nov 07 12:10:34 637631 [CF49E940] 0x80 -> Entering MASTER state

      Nov 07 12:10:34 637692 [CF49E940] 0x01 -> osm_prtn_make_partitions: Partition configuration ./partitions.conf is not accessible (No such file or directory)

      Nov 07 12:10:37 256120 [CF49E940] 0x02 -> torus_build_lfts: Found fabric w/ 2592 links, 108 switches, 1728 CA ports, minimum 8 data VLs

      Nov 07 12:10:37 256148 [CF49E940] 0x02 -> torus_build_lfts: Looking for 36 x 3 x 1 torus

      Nov 07 12:10:37 256165 [CF49E940] 0x02 -> build_torus: Using torus seed configured as default (seed sw 0,0,0 GUID 0x200048).

      Nov 07 12:10:37 257829 [CF49E940] 0x02 -> torus_build_lfts: Built 36 x 3 x 1 torus w/ 2592 links, 108 switches, 1728 CA ports

      Nov 07 12:10:37 309800 [CF49E940] 0x02 -> osm_ucast_mgr_process: torus-2QoS tables configured on all switches

      Nov 07 12:10:37 309874 [CF49E940] 0x01 -> osm_qos_parse_policy_file: ERR AC01: Failed opening QoS policy file ./qos-policy.conf - No such file or directory

      Nov 07 12:10:45 156308 [CF49E940] 0x02 -> SUBNET UP

      Nov 07 12:10:45 163642 [9E1C3780] 0x80 -> Exiting SM
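
      As a quick sanity check, the counts in the torus_build_lfts lines match the intended fabric.  The split of the 2592 links into CA links and switch-to-switch links is an assumption (one link per CA port), not something the log states.

```python
# Sketch: compare the counts reported by torus_build_lfts with the intended fabric.
x_radix, y_radix, z_radix = 36, 3, 1
cas_per_switch = 16

switches = x_radix * y_radix * z_radix      # log reports 108 switches
ca_ports = switches * cas_per_switch        # log reports 1728 CA ports

total_links = 2592                          # taken from the log line
# Assumption: every CA port accounts for exactly one link, so the remainder
# would be switch-to-switch links.
inter_switch_links = total_links - ca_ports

print(switches, ca_ports, inter_switch_links)
```

      So OpenSM found every switch and CA that the script created, and it built the 36 x 3 x 1 torus it was asked to look for.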

       

       

      And here is the relevant output from ibdmchk -s ./opensm-subnet.lst -f ./opensm.fdbs -m ./opensm.mcfdbs -d ./opensm-sl2vl.dump:

       

      -I- Scanning all multicast groups for loops and connectivity...

      ---------------------------------------------------------------------------

       

      -I- Using full credit loop check.

      -I- Analyzing Fabric for Credit Loops 1 SLs, 8 VLs used.

      -I- Traced 2984256 unicast paths

      -E- Credit loop found on the following path:

          S0000000000100000/N0000000000100000/P1 VL: 0 on path from lid: 0x0002 to lid: 0x03a8

          S0000000000200000/N0000000000200000/P21 VL: 0 on path from lid: 0x0002 to lid: 0x03a8

          S0000000000200023/N0000000000200023/P21 VL: 0 on path from lid: 0x0734 to lid: 0x0196

          S0000000000200022/N0000000000200022/P21 VL: 0 on path from lid: 0x061a to lid: 0x0156

      ..... <snippet>
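
      One more check on the numbers: the path count reported by ibdmchk equals the number of ordered pairs of distinct CA ports.  Reading that as "every CA-to-CA path was traced" is my inference from the arithmetic, not something ibdmchk prints.

```python
# Sketch: relate the ibdmchk path count to the number of CA ports in the fabric.
ca_ports = 1728                                # from the OpenSM log
ordered_pairs = ca_ports * (ca_ports - 1)      # each source paired with every other destination

traced_paths = 2984256                         # from "-I- Traced 2984256 unicast paths"
print(ordered_pairs == traced_paths)
```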

        • Re: Testing sub-optimal shapes in torus-2QoS
          halr

          I don't know whether this is still an issue, but here are some comments:

           

          I have not tested any of the topologies you mention:

          4x2x1

          7x2x1

          36x3x1

          The largest torus I've verified is 10x10x10.

           

          These are all 2D rather than 3D tori. Note that a 2D torus must be configured with either the x or y radix
          as 1 (i.e. configured as either a 1 x m x n or an m x 1 x n torus). The shapes you list all put the radix of 1 in z.
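
          As an untested sketch of the m x 1 x n form for the 36x3 fabric, the posted torus-2QoS.conf could be rewritten by moving the radix-3 dimension from y to z. The function name is made up for illustration, and whether this alone removes the credit loops is not something I have verified.

```python
# Sketch: rewrite an "m x n x 1" torus-2QoS.conf as "m x 1 x n" by renaming y to z.
def move_y_to_z(conf_text):
    """Swap the y and z radices and rename y*_link keywords to z*_link."""
    out = []
    for line in conf_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "torus":
            x, y, z = fields[1:4]
            fields[1:4] = [x, z, y]              # the radix of 1 moves into y
        elif fields and fields[0] in ("yp_link", "ym_link"):
            fields[0] = "z" + fields[0][1:]      # yp_link -> zp_link, ym_link -> zm_link
        out.append(" ".join(fields))
    return "\n".join(out)


posted = """torus 36t 3t 1
xp_link 0x200048 0x200049
xm_link 0x200048 0x20006b
yp_link 0x200048 0x200000
ym_link 0x200048 0x200024
portgroup_max_ports 20"""

rewritten = move_y_to_z(posted)
print(rewritten)
```

          The result starts with "torus 36t 1 3t" and uses zp_link/zm_link where the original used yp_link/ym_link; the x links and portgroup_max_ports are unchanged.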

           

          Also, the shapes ending in 2x1 (4x2x1 and 7x2x1) have limited tolerance to faults (link or switch failures) in the dimension with only 2 switches, but this has nothing to do with credit loops in the non-faulted case.

           

          It looks like you are using the MLNX OFED OpenSM. There have been a number of fixes and improvements to torus-2QoS since the version you are running. If this is still of interest and still a problem, I would recommend updating to the most recent version (either MLNX OFED or upstream, where the latest release is 3.3.18) and retrying. If it is still a problem, would you post your ibnetdiscover output and the OpenSM configuration?