Thanks for your reply, but I'm looking for threshold information. As I said in my previous message, you can have 50 symbol errors in an hour or in a week. I want to know when these 50 errors matter. The same goes for all the port counters of an InfiniBand link. I need to debug the situation, and I don't know where to look for information about what counts as a dangerous value for InfiniBand counters.
To monitor such counters, I suggest using the ibdiagnet tool.
ibdiagnet performs quality and health checks, scans the fabric, and extracts connectivity and device information.
An ibdiagnet run performs the following:
• Fabric discovery
• Duplicated GUIDs detection
• Duplicate Node Description detection
• Alias GUIDs check
• LIDs check
• Links in INIT state and unresponsive links detection
• Counters fetch
• Error counters check
• Counter increments during run detection
• BER test
• Routing checks
• Link width and speed checks
• Topology matching
• Partition checks
./ibdiagnet -P all=1 --pc --pm_pause_time 1200 --get_cable_info -r -o /tmp/$(date +%Y%m%d)
By default the output logs are written under "/var/tmp/ibdiagnet2/"; with the -o option used above, they go to the given directory instead.
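On the threshold question itself: what matters is the rate at which a counter grows, not its absolute value. A rough way to judge 50 symbol errors is to compare the observed rate against the number of errors a link would be expected to see at a target bit error rate. Below is a minimal sketch of that arithmetic, not an official rule. The 1e-12 BER target, the 40 Gb/s link rate, and the function names are all assumptions chosen for illustration, and it treats one symbol error as roughly one bit error and ignores encoding overhead.

```python
# Rough error-budget arithmetic for an InfiniBand link.
# ASSUMPTIONS (not from ibdiagnet): BER target of 1e-12, 40 Gb/s link rate,
# and one symbol error counted as roughly one bit error.

HOUR = 3600                # seconds
WEEK = 7 * 24 * 3600       # seconds


def expected_bit_errors(link_rate_bps, seconds, ber=1e-12):
    """Errors expected over `seconds` on a link running exactly at `ber`."""
    return link_rate_bps * seconds * ber


def errors_per_hour(count_start, count_end, seconds):
    """Normalize a counter increment over `seconds` to errors per hour."""
    return (count_end - count_start) * 3600.0 / seconds


if __name__ == "__main__":
    link_rate_bps = 40e9   # assumed link rate; use your link's real rate
    budget = expected_bit_errors(link_rate_bps, HOUR)
    print(f"budget at BER 1e-12: {budget:.1f} errors/hour")

    # The same 50 errors, accumulated over an hour versus over a week.
    for label, window in (("hour", HOUR), ("week", WEEK)):
        rate = errors_per_hour(0, 50, window)
        share = rate / budget
        print(f"50 errors in a {label}: {rate:.2f} errors/hour "
              f"({share:.1%} of the budget)")
```

Take the two counter readings from two samples a known time apart (for example the two samples ibdiagnet takes around --pm_pause_time) and use the actual rate of your link. A counter that keeps climbing between samples deserves more attention than a large but static value.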