ConnectX-3 Performance Diagnostic Counters for Windows 2012

Version 3

    This post describes how to use the Mellanox diagnostic counters to analyze and diagnose networking issues for ConnectX-3 adapters in a Windows 2012 environment.

    This post is intended for IT administrators, advanced users, and developers.

     


     

    ConnectX-3 diagnostic counters are a set of counters representing transport operations. The diagnostic counters can be used for debugging local and system issues by understanding the different error flows activated by the adapter engines.

     

    You can find the Mellanox Adapter Diagnostic Counters in the Perfmon tool. For dual-port adapters, the counters are per port; each port can be selected through its respective device in the Perfmon Instances list.
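    To track these counters outside the Perfmon UI, one option is to export samples as CSV (for example with a Perfmon data collector set or the Windows typeperf utility) and post-process them. The following is a minimal sketch, assuming the PDH-CSV layout in which the first column is a timestamp and every other column header is a counter path of the form \\HOST\Object(Instance)\Counter. The sample text, host name, and instance names are made up for illustration, and the exact counter set name on a given system may differ.

```python
import csv
import io
import re

# Matches a PDH counter path such as \\HOST\Object(Instance)\Counter.
_PATH = re.compile(
    r"^\\\\[^\\]+\\(?P<obj>[^\\(]+)\((?P<inst>.*)\)\\(?P<counter>.+)$"
)

def latest_counters(csv_text):
    """Return {(instance, counter): value} from the last row of a PDH-CSV export."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, last = rows[0], rows[-1]
    result = {}
    for path, raw in zip(header[1:], last[1:]):  # column 0 is the timestamp
        match = _PATH.match(path)
        if match and raw.strip():
            result[(match.group("inst"), match.group("counter"))] = float(raw)
    return result

# Hypothetical sample; real exports will have different hosts and instances.
SAMPLE = (
    '"(PDH-CSV 4.0)",'
    '"\\\\HOST1\\Mellanox Adapter Diagnostic Counters(Port 1)\\CQ Overflow",'
    '"\\\\HOST1\\Mellanox Adapter Diagnostic Counters(Port 1)\\Requester RNR NAK"\n'
    '"01/01/2024 10:00:00.000","0","12"\n'
    '"01/01/2024 10:00:01.000","0","17"\n'
)

print(latest_counters(SAMPLE))
```

    Keying the result by (instance, counter) keeps the two ports of a dual-port adapter separate, mirroring the Perfmon Instances list.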

     

     

    Note: With the exception of the CQ Overflow counter, every counter belongs to either the send (requester) side or the receive (responder) side.

     

    Diagnostic Counters Table

     

    Counter Description Troubleshooting
    CQ Overflow

    This counter is incremented when the number of un-polled completions exceeds the maximum number of entries in the completion queue (CQ). These completions are generated by all completed or flushed work requests created by the queue pairs (QPs) associated with that CQ.

    A CQ overflow is an indication of a software design issue. Software should choose its CQ size in a way that all completions reported to a CQ are written and never lost due to CQ overflows. It should be noted that CQ overflow conditions lead to transitioning all QPs affiliated with that CQ to the error state. This is necessary because it is not possible to determine which CQ entry was lost due to the overflow.

    Use this counter as an indication of a problematic software design, where the total number of entries in all QPs associated with a CQ exceeds the number of entries of that CQ.
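    The sizing rule above can be checked before any QP is created. The sketch below assumes the worst case, in which every send and receive queue entry of every attached QP produces a completion before the application polls the CQ. The function name and the sizes are hypothetical.

```python
def cq_can_overflow(cq_entries, qp_depths):
    """Return True if the attached QPs could generate more completions
    than the CQ can hold.

    qp_depths is a list of (send_queue_depth, receive_queue_depth) pairs,
    one per QP associated with the CQ.
    """
    worst_case = sum(sq + rq for sq, rq in qp_depths)
    return worst_case > cq_entries

# Hypothetical sizes: 4 QPs, each with 128 send and 128 receive entries.
qps = [(128, 128)] * 4
print(cq_can_overflow(512, qps))   # CQ smaller than the worst case
print(cq_can_overflow(1024, qps))  # CQ sized for the worst case
```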
    Requester CQE Errors

    This counter counts each completion generated with an error syndrome. In other words, it represents all work requests posted to the requester side of an adapter that did not complete successfully.

    Not every increase of this counter indicates a system error. Certain requests might also complete in error under normal operating conditions: for example, a peer node was brought down by the system administrator, an application was brought down while transactions were pending, or a cable was unplugged while the port was sending reliable data. This counter should therefore be taken in context and not as the sole indication of a problem.

    When interpreting this counter, consider the transport type:

    • For a reliable transport, such as RC, it is an indication that a local or remote condition impeded the completion of a request. A remote condition means that the responder reported back to the requester that the request is invalid. It could also indicate a timeout error, where the remote node didn't send a reply to a request expecting an acknowledgment. The counter can also indicate a local operation error, for example, accessing an invalid memory address, using the wrong memory keys, and so on.
    • For an unreliable transport, such as UD or RAW Ethernet, it can only be an indication of a local operation error, because these are unconnected transports and there is no responder associated with the requester.
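    One way to take the counter in context is to compare two snapshots taken around a known event, rather than reading an absolute value. The sketch below assumes the counters only ever increase and that snapshots are plain dictionaries; the helper name and the sample values are hypothetical.

```python
def counter_deltas(before, after):
    """Return {counter: increase} for every counter that grew between two snapshots."""
    deltas = {}
    for name, value in after.items():
        increase = value - before.get(name, 0)
        if increase > 0:
            deltas[name] = increase
    return deltas

# Hypothetical snapshots taken before and after a planned peer shutdown.
before = {"Requester CQE Errors": 3, "Requester Timeout Received": 10}
after = {"Requester CQE Errors": 5, "Requester Timeout Received": 10}
print(counter_deltas(before, after))
```

    If the only growth coincides with an expected event such as the planned shutdown in this sample, it is probably not a sign of a system error.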
    Requester Invalid Request Errors

    This counter counts the number of packets sent by a responder indicating the responder's unwillingness or inability to perform a request.

    The following are typical scenarios in which a requester might receive an invalid request response from a responder:

    • Invalid Opcode Transaction: the requester sent a request that was not supported by the responder. For example, an RDMA operation request for a responder that does not support RDMA.
    • Packet Loss: messages have a packet sequence (first, middle…middle, last). Packet loss can lead a valid sequence to break, creating an invalid request.
    • Receive buffer size too small: if a SEND packet is received into a buffer that is too small for its size, the responder will reply with an invalid request response.

    Any of the following are probable causes for increases of this counter:

    • Packet Loss: indicates possible drops in the fabric, and is generally accompanied by increases in the "Requester Out of order sequence NAK" counter. This might be caused by a misconfiguration of flow control through the fabric.
    • Misconfiguration in transactions supported by the responder, for example, no RDMA support configured. This is most likely caused by software programming errors.
    • Buffer Allocation Error: software on the requester side is creating messages that are bigger than the available space on the receiver side.
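    The correlation described in the causes above can be turned into a first-pass triage rule. This is only a heuristic sketch: it assumes per-interval increases are available as a dictionary keyed by the counter names used in this post, and the function name and returned strings are hypothetical.

```python
def classify_invalid_request(deltas):
    """Suggest a likely cause for invalid request errors seen in one interval."""
    invalid = deltas.get("Requester Invalid Request Errors", 0)
    out_of_order = deltas.get("Requester Out of order sequence NAK", 0)
    if invalid == 0:
        return "no invalid request errors in this interval"
    if out_of_order > 0:
        # Both counters grew together, which points at drops in the fabric.
        return "likely packet loss: check flow control configuration in the fabric"
    # No sign of packet loss, so the remaining causes are software related.
    return "likely software issue: unsupported operation or undersized receive buffer"

print(classify_invalid_request({
    "Requester Invalid Request Errors": 4,
    "Requester Out of order sequence NAK": 9,
}))
```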
    Requester Length Errors

    This counter is increased when the amount of data received for an RDMA READ response is different from the requested amount. This indicates a protocol violation and should not occur unless the peer node violates the protocol.

    This counter points to a buggy responder implementation.
    Requester Out of order sequence NAK

    In a reliable transport, each packet has a sequence number assigned to it, which is increased in each subsequent packet. The responder reports to the requester if it detects gaps in the sequence of packets received from the requester. The report is done through a Negative Acknowledge packet (NAK).

    The "Requester Out of order sequence NAK" counter counts the number of out-of-order sequence NAKs sent from the responder to the requester.

    This usually occurs when packets are lost in the fabric, for example, as a result of link over-subscription and a lack of proper flow control configuration. Use this counter as an indication of packet loss in the fabric or drops at the receiving adapter.
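    The gap detection described above can be illustrated with a deliberately simplified model. It is not the adapter's actual logic: it assumes the responder sends one NAK per detected gap, silently drops further unexpected packets until the expected one arrives, and ignores sequence-number wrap-around. The function name and the packet sequences are hypothetical.

```python
def count_sequence_naks(received_psns, first_expected=0):
    """Count out-of-order sequence NAKs a responder would send in this model."""
    expected = first_expected
    nak_pending = False  # True after a NAK, until the expected packet arrives
    naks = 0
    for psn in received_psns:
        if psn == expected:
            expected += 1
            nak_pending = False
        elif not nak_pending:
            # First unexpected packet after in-order traffic: report the gap once.
            naks += 1
            nak_pending = True
    return naks

# Packet 2 is lost once, then retransmitted by the requester after the NAK.
print(count_sequence_naks([0, 1, 3, 4, 2, 3, 4, 5]))
```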

    Requester Protection Errors

    A requester protection error on the sender side represents a failure to locally access memory for the execution of a work request. The adapter accesses memory locally when it reads data in order to build an RDMA WRITE or SEND message, or when it writes data as the responses of an RDMA READ arrive from the responder.

    This counter can be an indication of a software issue if the request descriptor contains an invalid segment field: memory address, memory key, or size.

    It can also indicate a hardware issue if a PCI abort transaction occurs while data is being read from or written to memory.

    Requester QP Operations Errors

    This counter is increased when an unexpected error occurs while accessing a QP to perform a requester operation.

    These errors imply that the adapter is in an invalid state or that the PCI is not able to read a QP or its resources.
    Requester Remote Access Errors

    When executing an RDMA transaction, if the requester attempts to execute a remote operation not granted by the responder, the responder replies with a remote access error NAK, causing this counter to increment. The most common reasons for a remote access NAK to be sent are:

    • Invalid address for the given memory key
    • Invalid memory key
    • Size and offset exceed memory region boundaries

    This counter indicates a programming error in which any of the above parameters might be invalid.
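    The three checks listed above can be sketched as a small validation routine, which may help when reviewing the parameters an application passes in an RDMA request. This is an illustrative model only; it omits access permissions, and the MemoryRegion class, the function name, and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MemoryRegion:
    rkey: int    # remote key advertised to the peer
    base: int    # start address of the registered region
    length: int  # size of the region in bytes

def check_remote_access(regions, rkey, addr, size):
    """Return None if the access is valid in this model, else the failure reason."""
    region = regions.get(rkey)
    if region is None:
        return "invalid memory key"
    end = region.base + region.length
    if addr < region.base or addr >= end:
        return "invalid address for the given memory key"
    if addr + size > end:
        return "size and offset exceed memory region boundaries"
    return None

# Hypothetical region: 4 KiB registered at address 0x1000 with rkey 0x42.
regions = {0x42: MemoryRegion(rkey=0x42, base=0x1000, length=4096)}
print(check_remote_access(regions, 0x42, 0x1F00, 512))
```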
    Requester Remote Operations Errors

    This is an indication that the responder failed to execute the request due to problems local to the responder, for example, a failure to access its own memory for read/write or any of the local resources needed to accomplish the remote request.

    Use this error as an indication that the responder side is failing due to software or hardware issues that might obstruct executing the remote request.
    Requester RNR NAK

    An RNR (Receiver Not Ready) NAK is generated on a reliable transport connection when a requester sends a SEND message to a responder while the responder doesn't have an available buffer to handle the message. This is a nondestructive event, meaning that the responder and requester keep processing packets normally even after the event occurs.

    The counter is incremented each time such a NAK packet is received by the requester.

    The counter is an indication of the existence of a slow receiver - a receiver that does not post buffers fast enough compared to the message rate generated by the sender.

    Solving this issue might require increasing the number of available buffers on the receiver side at an application level, or optimizing the rate of buffer posting at the receiver by making sure the posting application gets enough CPU to post buffers on time.

    Requester RNR NAK Retries Exceed Errors

    Once the number of RNR NAKs received by the requester exceeds a per-connection threshold, the connection is moved to the error state and all pending transactions are flushed in error.

    That condition is represented by this counter and indicates an extremely slow receiver.

    The counter is an indication of the existence of a slow receiver - a receiver that does not post buffers fast enough compared to the message rate generated by the sender.

    Solving this issue might require increasing the number of available buffers on the receiver side at an application level, or optimizing the rate of buffer posting at the receiver by making sure the posting application gets enough CPU to post buffers on time.
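    The relationship between "Requester RNR NAK" and "Requester RNR NAK Retries Exceed Errors" can be illustrated with a small model of a single SEND. It assumes each failed attempt yields exactly one RNR NAK and that the connection enters the error state once the NAK count exceeds the threshold, as described above. The function name, the result fields, and the threshold value are hypothetical.

```python
def simulate_send(buffer_available_per_attempt, rnr_retry_limit):
    """Model one SEND that is retried until the receiver has a buffer posted.

    buffer_available_per_attempt: booleans, one per delivery attempt.
    rnr_retry_limit: per-connection threshold of tolerated RNR NAKs.
    """
    rnr_naks = 0
    for available in buffer_available_per_attempt:
        if available:
            return {"status": "completed", "rnr_naks": rnr_naks, "retries_exceeded": 0}
        rnr_naks += 1  # receiver had no buffer: one RNR NAK comes back
        if rnr_naks > rnr_retry_limit:
            # Threshold exceeded: the connection moves to the error state.
            return {"status": "error", "rnr_naks": rnr_naks, "retries_exceeded": 1}
    return {"status": "pending", "rnr_naks": rnr_naks, "retries_exceeded": 0}

# A briefly slow receiver recovers; an extremely slow one does not.
print(simulate_send([False, False, True], rnr_retry_limit=3))
print(simulate_send([False] * 5, rnr_retry_limit=3))
```

    In this model, posting buffers sooner on the receiver turns the second case into the first, which matches the remedy suggested above.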

    Requester Timeout Received

    In a reliable connected transport, each packet or sequence of packets from requester to responder has to be acknowledged within a per-requester-configurable amount of time.

    If a request is not acknowledged within that time, this counter will be incremented.

    Increments of this counter can occur when the software configures the connection timeout field to a value too low for the round trip time of the fabric. They might also be an indication that the fabric suffers from congestion caused by hot spots or other components slowing the flow of data.

    It is worth mentioning that timeouts of this type do not cause the tear-down of connections. The hardware automatically retries sending data when timeouts occur.
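    As a worked example of choosing a timeout that is not too low: in the InfiniBand verbs convention, the QP local ACK timeout is a 5-bit exponent, and the resulting timeout is 4.096 µs × 2^field (a field value of 0 disables the timeout). This post does not state that formula, so treat it as an assumption about how the connection timeout field mentioned above is encoded. The 2x safety margin and the function names are arbitrary choices for illustration.

```python
BASE_US = 4.096  # microseconds, base unit of the local ACK timeout

def ack_timeout_us(field):
    """Timeout in microseconds for a non-zero timeout field value (1..31)."""
    if not 1 <= field <= 31:
        raise ValueError("timeout field must be in the range 1..31")
    return BASE_US * (2 ** field)

def smallest_field_for(rtt_us, margin=2.0):
    """Smallest field value whose timeout covers rtt_us times a safety margin."""
    target = rtt_us * margin
    for field in range(1, 32):
        if ack_timeout_us(field) >= target:
            return field
    raise ValueError("round trip time too large for the timeout field range")

# Hypothetical fabric with a 100 microsecond round trip time.
field = smallest_field_for(100.0)
print(field, ack_timeout_us(field))
```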

    Requester Transport Retries Exceeded Errors

    When a per-connection, pre-configured number of consecutive timeouts occurs, as described in "Requester Timeout Received", the adapter tears down the connection and increments this counter.

    This indicates that the fabric suffers from large round trip times or that the configured timeout values are too low.

    It is recommended to check the per-priority pause counters of the fabric switches and adapters to identify possible hot spots.

    It is worth mentioning that each increase of this counter corresponds to the tear-down of the affected connection, as the adapter gives up trying to successfully complete requests on it.

    Responder CQE Errors

    Number of local CQEs with errors when the local machine receives inbound traffic.

    Responder Duplicate Request Received

    Number of duplicate requests received when the local machine receives inbound traffic.

    Responder Invalid Request Errors

    Number of remote invalid request errors when the local machine receives inbound traffic.

    Responder Length Errors

    Number of local length errors when the local machine receives inbound traffic.

    Responder Out-of-order Sequence received

    Number of out-of-sequence packets received when the local machine receives inbound traffic, that is, the number of times the local machine receives messages that are not consecutive.

    Responder Protection Errors

    Number of local protection errors when the local machine receives inbound traffic.

    Responder QP Operations Errors

    Number of local QP operation errors when the local machine receives inbound traffic.

    Responder Remote Access Errors

    Number of remote access errors when the local machine receives inbound traffic, that is, the local machine received an RDMA request with a wrong rkey.

    Responder RNR NAK

    Number of RNR (Receiver Not Ready) NAKs sent when the local machine receives inbound traffic.