3 Replies Latest reply on Nov 6, 2018 11:18 PM by samerka

    Error counters and thresholds

    mgil

      Hi all.

      Running "ibcheckerrors" on a node, I see several error counters reported above their thresholds.

       

      I assume these thresholds should be related to time: in a fabric where ibclearerrors and ibclearcounters have not been run for 2 months, 50 Symbol Errors on a port may not be important, even though the number is above the threshold (10).

       

      My question is: where can I check what a normal or expected value is for each error counter, and what is the real threshold beyond which I need to worry?

       

      I can't find any document or link that explains which errors are bad, what threshold over what period of time should be considered bad, or anything like that.

       

      I saw the post "How to test RDMA traffic congestion", which has some details about tests and expected errors. I'm looking for something like that, but covering all the error messages that ibdiagnet reports, with the threshold to worry about for a given time frame.

       

      Thanks.

          • Re: Error counters and thresholds
            mgil

            Hi Samer.

            Thanks for your reply, but I'm looking for threshold information. As I said in my previous message, you can get 50 symbol errors in an hour or in a week, and I want to know when those 50 errors matter. The same goes for all the port counters of an InfiniBand link. I need to debug the situation, and I don't know where to look for information about which values of the InfiniBand counters are dangerous.

            Thanks again.

            Bye...

              • Re: Error counters and thresholds
                samerka

                Hi Manuel,

                 

                To monitor such counters, I suggest using the ibdiagnet tool.

                ibdiagnet scans the fabric, extracts the available connectivity and device information, and performs quality and health checks.

                An ibdiagnet run performs the following:

                • Fabric discovery

                • Duplicated GUIDs detection

                • Duplicate Node Description detection

                • Alias GUIDs check

                • LIDs check

                • Links in INIT state and unresponsive links detection

                • Counters fetch

                • Error counters check

                • Counter increments during run detection

                • BER test

                • Routing checks

                • Link width and speed checks

                • Topology matching

                • Partition checks

                 

                Example:

                ./ibdiagnet -P all=1 --pc --pm_pause_time 1200 --get_cable_info -r -o /tmp/$(date +%Y%m%d)

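                By default the output logs are under “/var/tmp/ibdiagnet2/”; with the -o option, as in the example above, they are written to the given directory instead.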
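
                Since the question is about counts per unit of time, here is a rough sketch of how a counter increase between two samples can be turned into a rate and compared with a limit. It is only an illustration, not how ibdiagnet computes its BER test. The function names, the 56 Gb/s link rate, and the 1e-12 limit (the bit error rate commonly cited as the InfiniBand link requirement) are assumptions, and symbol errors are treated as if they were bit errors, which is an approximation.

```python
# Sketch: turn a raw error-counter increase into a time-normalised rate.
# All names and constants below are illustrative assumptions.

BER_LIMIT = 1e-12  # assumed acceptable bit error rate


def error_rate(count_before, count_after, seconds, link_gbps):
    """Approximate errors per transferred bit between two counter samples.

    Assumes the link ran at link_gbps for the whole interval and that the
    counter neither wrapped nor was cleared between the two samples.
    """
    if seconds <= 0 or link_gbps <= 0:
        raise ValueError("seconds and link_gbps must be positive")
    delta = count_after - count_before
    if delta < 0:
        raise ValueError("counter decreased; it was probably cleared")
    bits_transferred = link_gbps * 1e9 * seconds
    return delta / bits_transferred


def is_worrying(rate, limit=BER_LIMIT):
    """True when the observed rate is above the assumed limit."""
    return rate > limit


# The same 50 symbol errors over three very different intervals.
TWO_MONTHS = 60 * 24 * 3600  # seconds
ONE_HOUR = 3600
ONE_MINUTE = 60

rate_two_months = error_rate(0, 50, TWO_MONTHS, 56.0)
rate_one_hour = error_rate(0, 50, ONE_HOUR, 56.0)
rate_one_minute = error_rate(0, 50, ONE_MINUTE, 56.0)

for label, rate in [("2 months", rate_two_months),
                    ("1 hour", rate_one_hour),
                    ("1 minute", rate_one_minute)]:
    print(f"50 errors in {label}: rate={rate:.3e} worrying={is_worrying(rate)}")
```

                Under these assumptions the same 50 errors are far below the limit when spread over 2 months and above it when they all arrive within a minute, which is why a count only means something together with the interval over which it accumulated.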

                 

                Thanks,

                Samer