3 Replies Latest reply on Sep 16, 2017 11:25 PM by march

    How to test RDMA traffic congestion


      Hi. We're trying to debug issues we see periodically with Lustre Networking on top of CX-3 and CX-4 based RoCE(v1) fabrics using SR-IOV for connections from Lustre clients running as KVM guests (servers are bare-metal). When we hit these errors we see drop/error counters going up on the hosts.


      So far all simple ib tests between host-pairs look ok, now we want to test congestion scenarios, e.g., 2 hosts sending to 1 host. However we've discovered that whilst e.g. ib_write_bw has an option to specify more than one QP, it actually doesn't support it! Is there a simple way to engineer such a test or are we going to have to write something or move to an MPI based test suite...?

        • Re: How to test RDMA traffic congestion

          Here are a few steps to try to analyse your congestion problem:



          What is IB congestion?

          • IB congestion is a situation where nodes fail to send data or their send rate decreases
          • In most cases, when an IB network is experiencing congestion there are no packet drops, just slowness
          • Usually IB congestion is caused by a slow receiver node.

          It can also be caused by the network itself, in cases where the network is blocking by design or due to an issue

          How to identify congestion situation:

          • The network is slow. The packet rate of all or some of the nodes decreases dramatically
          • No packet drops in the fabric. If the network drops packets, it is probably not real congestion but a physical problem that should be identified and fixed locally


          Suspect #1: Physical Layer Issues

          • ibdiagnet diagnostic

          Physical layer issues can cause degraded performance of the fabric. In order to eliminate any impact on the fabric by physical layer issues, fabric cleanup is required.

          Information on fabric status and ports’ counters can be collected using the ibdiagnet tool (from the UFM server where we have the ibdiagnet2 version installed):

          ibdiagnet -r -pc -P all=1 --pm_pause_time 600 -o <output_dir>

          • It is recommended to specify the output directory so that files do not get overwritten
          • Output files can be used in other sections of this technical guide

          In the ibdiagnet2.log file, look for ports reporting one or more of the following physical layer issues:

          • link_down_counter – ignore scheduled server reboots


          -E- lid=0x0143 dev=51000 xxxxxxxx/U1/P36

          Performance Monitor counter : Value

          link_down_counter : 3 (threshold=0)


          • Degraded link speed and width – links with reduced capability will be reported in the “Speed / Width checks” section


          Speed / Width checks

          -I- Link Speed Check (Compare to supported link speed)

          -E- Links Speed Check finished with errors

          -E- Link: S0002c902004213d3/N0002c902004213d0(Infiniscale-IV Mellanox Technologies)/P24<-->switch-1137be:IS5030/U1/P32 - Unexpected actual link speed 2.5


          -I- Link Width Check (Expected value given = 4x)

          -E- Links Width Check finished with errors

          -E- Link: S0002c902004213d3/N0002c902004213d0(Infiniscale-IV Mellanox Technologies)/P24<-->switch-1137be:IS5030/U1/P32 - Unexpected width, actual link width is 1x


          • link_error_recovery_counter


          -E- lid=0x0009 dev=51000 xxx/U1/P32

          Performance Monitor counter : Value

          link_error_recovery_counter : 255 (overflow)


          • max_retransmission_rate – check for increments during the test run. Look for anything greater than the threshold of 500 (the threshold of 1 shown in the example below is set by the ibdiagnet test flag “-P all=1”)


          -E- Ports counters Difference Check (during run) finished with errors

          -E- Sf4521403004d20a0/r xxx/P6 - "max_retransmission_rate" increased during the run (difference value=1,difference allowed threshold=1)


          • symbol_error_counter – relevant only for non FDR/FDR10 links


          -E- lid=0x016e dev=23131 S0008f1040040c018/N0008f10500650e4e/P30

          Performance Monitor counter : Value    

          symbol_error_counter : 65535      (overflow)
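
          The counter excerpts above share a regular shape: an "-E-" line naming the port, followed by "counter : value" lines. That makes it possible to scan a log mechanically. The following is a minimal Python sketch, assuming the log layout matches the excerpts quoted in this post; the function name parse_counter_errors is made up for illustration, and real ibdiagnet2.log files may differ between versions.

```python
import re

# Assumption: port header lines look like the excerpts quoted in this post,
# e.g. "-E- lid=0x0143 dev=51000 xxxxxxxx/U1/P36".
PORT_RE = re.compile(r"^-E-\s+lid=(0x[0-9a-fA-F]+)\s+dev=(\d+)\s+(.+?)\s*$")
# Matches e.g. "link_down_counter : 3 (threshold=0)" or
# "symbol_error_counter : 65535 (overflow)".
COUNTER_RE = re.compile(r"^(\w+)\s*:\s*(\d+)\s*(?:\((.*?)\))?\s*$")


def parse_counter_errors(log_text):
    """Return one dict per counter value reported under an -E- port line."""
    results = []
    current_port = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        port = PORT_RE.match(line)
        if port:
            current_port = {
                "lid": int(port.group(1), 16),
                "dev": int(port.group(2)),
                "port": port.group(3),
            }
            continue
        if line.startswith("-"):
            # Any other -E-/-I- line starts a different report section.
            current_port = None
            continue
        counter = COUNTER_RE.match(line)
        if counter and current_port is not None:
            entry = dict(current_port)
            entry["counter"] = counter.group(1)
            entry["value"] = int(counter.group(2))
            entry["note"] = counter.group(3) or ""
            results.append(entry)
    return results
```

          Passing the text of the log file to parse_counter_errors gives a list that can be filtered by counter name, for example to list only ports with a non-zero link_down_counter.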


          • UFM Port Counters CSV diagnostic

          Configure UFM to collect PortCounters CSV files in the gv.cfg configuration file:


          max_files= 5

          write_interval= 30

          ext_ports_only= no

          Output files will be saved in this location on the UFM server: /opt/ufm/files/csv/.

          1. Extract the latest file and open with Excel
          2. Form a table
          3. Relevant columns for physical layer issues:
            1. E: Width – look for any port without 4x width
            2. T: SymErr – SymbolError. Relevant for non FDR/FDR10 links
            3. U: LinkRecovers
            4. V: LinkDowned
            5. AY: Speed – look for any degraded rate
            6. AZ: Status – look for anything not OK

          Device name and port can be found in columns P and B respectively.
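
          The same column checks can be done without Excel. The following is a rough Python sketch, assuming the CSV header row uses the column labels quoted above (Width, SymErr, LinkRecovers, LinkDowned, Status). The header names "Device" and "Port" for columns P and B are hypothetical and need to be replaced with whatever the real file uses. The Speed check is left out because the expected rate depends on the fabric.

```python
import csv
import io

# Assumption: header names follow the column labels quoted in this post.
# "Device" and "Port" are hypothetical names for columns P and B.
ERROR_COLUMNS = ("SymErr", "LinkRecovers", "LinkDowned")


def find_physical_issues(csv_text, expected_width="4x"):
    """Return (device, port, reasons) for every row that looks degraded."""
    issues = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        reasons = []
        # Any port that is not running at the expected width.
        if row["Width"].strip() != expected_width:
            reasons.append("Width=" + row["Width"].strip())
        # Any non-zero physical layer error counter.
        for column in ERROR_COLUMNS:
            if int(row[column] or 0) > 0:
                reasons.append(column + "=" + row[column].strip())
        # Any status other than OK.
        if row["Status"].strip().upper() != "OK":
            reasons.append("Status=" + row["Status"].strip())
        if reasons:
            issues.append((row["Device"], row["Port"], reasons))
    return issues
```

          The function takes the file contents as a string, so it can be fed with open(path).read() for the latest file under /opt/ufm/files/csv/.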




          Suspect #2: Unresponsive node/s issue 

          Look for nodes that are unresponsive to fabric MADs. Nodes can get into this state if there is an issue with the OS, driver or card firmware. Once identified, it is recommended that unresponsive nodes do not participate in any job in the fabric.

          If there are any unresponsive nodes in the fabric, we can find them by invoking one of the direct path commands such as iblinkinfo, ibnetdiscover, ibswitches, ibhosts, ibnodes, etc.

          1. Run one of the direct path commands: iblinkinfo/ibnetdiscover/ibswitches/ibhosts/ibnodes
          2. If there are unresponsive nodes in the fabric, you will get one “Connection timed out” line per unresponsive node at the start of the command output, with the specific direct path to the node



          root # ibnetdiscover

          src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,18 Attr 0xff90:2) bad status 110; Connection timed out

          src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,17 Attr 0xff90:2) bad status 110; Connection timed out


          # Topology file: generated on Mon Mar  2 17:19:19 2016


          # Initiated from node f4521403008b9a30 port f4521403008b9a31

          3. Identify the unresponsive node/s:
            1. From the same node where the direct path command was invoked, run:

          smpquery nd -D <direct_path_without_last_number>


          Example: for direct path "0,1,18" invoke: "smpquery nd -D 0,1"


            2. The unresponsive device is connected to the device reported in the previous step, on the port whose number is the last number in the direct path


          Example: for direct path "0,1,18", the unresponsive device will be connected to port 18
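
          When there are many timed-out paths, splitting them by hand gets tedious. The following is a small Python sketch, assuming the error lines have the same format as the ibnetdiscover output quoted above; the function names are made up for illustration.

```python
import re

# Assumption: timeout lines look like the ones quoted in this post, e.g.
# "... umad (DR path slid 0; dlid 0; 0,1,18 Attr 0xff90:2) bad status 110; ..."
DR_PATH_RE = re.compile(r"DR path slid \d+; dlid \d+; ([\d,]+) Attr")


def split_direct_path(path):
    """Split "0,1,18" into the parent path "0,1" and the port number 18."""
    hops = [int(hop) for hop in path.split(",")]
    if len(hops) < 2:
        raise ValueError("direct path needs at least two hops: %r" % path)
    parent = ",".join(str(hop) for hop in hops[:-1])
    return parent, hops[-1]


def unresponsive_paths(command_output):
    """Return (parent_path, port) for every timed-out direct path found."""
    return [split_direct_path(p) for p in DR_PATH_RE.findall(command_output)]
```

          Each returned parent path is what goes after "smpquery nd -D", and the port number is the port on that device to which the unresponsive node is connected.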



          Suspect #3: Slow Receivers

          • Nodes that push back on data because they can’t process it fast enough
          • A slow node will not give the switch credits to send traffic. The backpressure then spreads to other connected switches as buffer space is allocated for the delayed traffic


          Congested links:

          • An indication of a congested link is a link that sends or receives a high amount of data (high XmitPackets/RcvPackets) and also has a high rate of XmitWait
          • We get a clear indication of congestion if: XmitWait / XmitPackets > 10

          (the ratio between XmitWait and XmitPackets is greater than 10)


          Possible causes for slow receiver:

          • Server resources
            • CPU speed – it is recommended to work with CPU in max performance mode
            • Memory – a bad memory DIMM or memory section can decrease server performance. This can only be detected with low-level memory testing utilities
          • PCI connection – degraded Gen (speed) and/or width


          More information can be found in the Performance Tuning Guide document.


          • Detecting slow receivers using PortCounters CSV file

          To use this method, the reset counters policy should be reset_every_poll (only data counters will be reset).


          1. Extract the 2 latest CSV files (by naming convention)
          2. Open the 2 files in Excel and format them as tables
          3. Copy the XmitWait column from the older file into the newer file, right next to its XmitWait column
          4. Insert a new column (NEW_XmitWait) and calculate the delta between the 2 XmitWait values (we want the number of ticks counted between the 2 files)
          5. In column D (NodeType) select only Switch
          6. In column AR (PeerPlatform) select only Computer
          7. Insert a new column, Congestion Ratio, with the formula: NEW_XmitWait/XmitPkts
          8. Sort the Congestion Ratio column from largest to smallest
          9. Starting from the top, investigate any transmitting port reporting a ratio greater than 10

          • Detecting slow receivers using ibdiagnet2

          With this method, manual mapping between GUIDs and hostname is required.

          This can be done using the Excel vlookup function and any parsed hostname <-> GUIDs list.
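
          The vlookup can equally be replaced by a dictionary lookup. The following is a short Python sketch, assuming the hostname list is plain text with one "hostname guid" pair per line separated by whitespace; that file format and the function names are assumptions for illustration, not something produced by ibdiagnet.

```python
def normalize_guid(guid):
    """Lower-case a GUID, drop an optional 0x prefix and pad to 16 hex digits."""
    text = guid.strip().lower()
    if text.startswith("0x"):
        text = text[2:]
    return text.zfill(16)


def build_guid_map(listing):
    """Build {guid: hostname} from lines of the assumed form "hostname guid"."""
    mapping = {}
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        hostname, guid = parts
        mapping[normalize_guid(guid)] = hostname
    return mapping


def hostname_for(guid, mapping):
    """Return the hostname for a GUID, or the GUID itself if it is unknown."""
    return mapping.get(normalize_guid(guid), guid)
```

          Normalizing both sides avoids misses caused by upper/lower case or a missing 0x prefix.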


          1. Copy the “PM_INFO” data from the ibdiagnet2.db_csv file to an Excel sheet and form a table


          Example – all other columns are hidden:


          2. Calculate the Congestion index = XmitWait / XmitPkt

          Use the 32/64-bit counters. 64-bit counters require additional translation from hex to decimal


          3. Complete the data and analyze the results

          Congestion index: Normalized XmitWait [ticks] = ∆XmitWait  / ∆XmitPackets

          • Average number of ticks a packet waits at the head of the queue

          Ports with Congestion index >= 10 should be treated as congested
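
          The hex-to-decimal translation and the delta formula can be combined in a few lines. The following is a minimal Python sketch, assuming hex counter values carry a "0x" prefix and that two samples of the same port are available; the function names are made up for illustration.

```python
def parse_counter(value):
    """Convert a counter string to int; values with a 0x prefix are hex."""
    text = str(value).strip()
    if text.lower().startswith("0x"):
        return int(text, 16)
    return int(text)


def congestion_index(old_sample, new_sample):
    """Return delta XmitWait / delta XmitPackets for one port.

    Each sample is an (XmitWait, XmitPackets) pair taken from one run.
    Returns None when no packets were sent between the samples.
    """
    delta_wait = parse_counter(new_sample[0]) - parse_counter(old_sample[0])
    delta_pkts = parse_counter(new_sample[1]) - parse_counter(old_sample[1])
    if delta_pkts <= 0:
        return None
    return delta_wait / delta_pkts
```

          A result of 10 or more for a port marks it as congested according to the rule above.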






          Suspect #4: Network issues

          • Routing issue

          Routing issues can be investigated by Mellanox support using the following information:

          • ibdiagnet output files 
          • Opensm log
          • Opensm configuration files (/opt/ufm/files/conf/opensm/)
          • ibnetdiscover
          • partitions.conf
          • /opt/ufm/files/log/opensm-sa.dump
          • Root GUIDs file


          • Topology change

          Using MSTK:

          Missing links or devices can cause degradation in performance.

          You can use the /opt/ufm/support/MSTK5.5/Linux/Host-Tools/ib-topology-viewer.sh script on the UFM server for backing up reference topology summary and comparing to any new collected topology summary.


            [root@xxx Host-Tools]# ./ib-topology-viewer.sh


          ib-topology-viewer.sh Version 5.5


          MF0;xxx:SX6036/U1(0x0002c903004693c1)                                                                          1 HCA ports and 2  switch ports.

          SwitchIB Mellanox Technologies(0x7cfe9003009ea930)                                      2  HCA ports and 3  switch ports.

          SwitchIB Mellanox Technologies(0x7cfe900300bf8530)                                      1  HCA ports and 1  switch ports.

          Using ibnetdiscover:

          1. Cache ibnetdiscover data – this will be the reference data:

          ibnetdiscover --cache <file>

          2. Compare any new ibnetdiscover to the cached data:

          ibnetdiscover --diff <cache_file>


          The output will contain the changes between the cached data and the new ibnetdiscover output.





