MLNX_OFED ConnectX-3 RDMA Diagnostic Counters

Version 7

    Availability

     

    The following RDMA transport diagnostic counters are available in the Mellanox OFED distribution release for any of the ConnectX-3 family adapters.

     

    Location

     

    The counters are exposed as files under /sys/class/infiniband/mlx4_<n>/diag_counters/, where <n> is the device index (for example, mlx4_0).

     

    # ls /sys/class/infiniband/mlx4_0/diag_counters/

    clear_diag  num_cqovf  rq_num_lae  rq_num_lpe    rq_num_mce  rq_num_rae   rq_num_rnr      rq_num_udsdprd  sq_num_bre  sq_num_lpe    sq_num_mwbe  sq_num_rae   sq_num_rnr  sq_num_rree  sq_num_wrfe

    num_baddb   num_eqovf  rq_num_lle  rq_num_lqpoe  rq_num_oos  rq_num_rire  rq_num_ucsdprd  rq_num_wrfe     sq_num_lle  sq_num_lqpoe  sq_num_oos   sq_num_rire  sq_num_roe  sq_num_tree
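
    Because each counter is a file in this directory, the values can be collected with a short script. The following Python sketch is one way to do that and is not part of MLNX_OFED. It assumes that each file holds a single integer in text form and that clear_diag is a control file rather than a counter; neither is stated in this document. The function name read_diag_counters is hypothetical.

```python
from pathlib import Path


def read_diag_counters(directory):
    """Return {counter_name: value} for every readable counter file in `directory`."""
    counters = {}
    for entry in sorted(Path(directory).iterdir()):
        # Assumption: clear_diag is a control file, not a counter, so it is skipped.
        if entry.name == "clear_diag" or not entry.is_file():
            continue
        try:
            text = entry.read_text().strip()
            # Assumption: one integer per file; base 0 accepts decimal or 0x-prefixed hex.
            counters[entry.name] = int(text, 0)
        except (OSError, ValueError):
            # Unreadable or non-numeric files are left out rather than guessed at.
            continue
    return counters


if __name__ == "__main__":
    path = "/sys/class/infiniband/mlx4_0/diag_counters"
    if Path(path).is_dir():
        for name, value in read_diag_counters(path).items():
            print(f"{name:16} {value}")
```

    Taking the directory as a parameter keeps the function usable for any mlx4_<n> device and lets it be exercised against an ordinary directory of test files.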

     

    Description

     

    Counter         Description
    num_cqovf       Number of completion entries overflowing a completion queue
    rq_num_lle      Responder - number of local length errors
    rq_num_lpe      Responder - number of local protection errors
    rq_num_lqpoe    Responder - number of local QP operation errors
    rq_num_oos      Responder - number of out-of-sequence requests received
    rq_num_rae

    Responder - number of remote access errors.

    RKey Violation: the Responder detected an RKey violation while executing an RDMA request. NAK may or may not be sent.

    rq_num_rire

    Responder - number of Remote Invalid request errors. NAK may or may not be sent.

    1. QP Async Affiliated Error: Unsupported or Reserved OpCode (RC only): Inbound request OpCode was either reserved or was for a function not supported by this QP (for example, RDMA or ATOMIC on a QP not set up for it).

    2. Misaligned ATOMIC: VA does not point to an aligned address on an Atomic operation.

    3. Too many RDMA Read or ATOMIC Requests: There were more requests received and not ACKed than allowed for the connection.

    4. Out of Sequence OpCode, current packet is "First" or "Only": The Responder detected an error in the sequence of OpCodes; a "Last" packet is missing.

    5. Out of Sequence OpCode, current packet is not "First" or "Only": The Responder detected an error in the sequence of OpCodes; a "First" packet is missing.

    6. Local Length Error: Inbound "Send" request message exceeded the responder’s available buffer space.

    7. Length error: RDMA Write request message contained too much or too little payload data compared to the DMA length advertised in the first or only packet.

    8. Length error: Payload length was not consistent with the opcode:

    a: 0 bytes <= "only" <= PMTU bytes

    b: ("first" or "middle") == PMTU bytes

    c: 1 byte <= "last" <= PMTU bytes

    9. Length error: Inbound message exceeded the size supported by the CA port.

    rq_num_rnr      Responder - number of RNR NAKs sent
    rq_num_wrfe

    Responder - number of CQEs with error. Incremented each time a CQE with error is generated

    sq_num_bre      Requester - number of bad response errors
    sq_num_lle      Requester - number of local length errors
    sq_num_lpe      Requester - number of local protection errors
    sq_num_lqpoe    Requester - number of local QP operation errors
    sq_num_mwbe     Requester - number of Memory Window bind errors
    sq_num_oos      Requester - number of Out of Sequence NAKs received
    sq_num_rae

    Requester - number of remote access errors. NAK-Remote Access Error on R_Key Violation: the Responder detected an invalid RKey while executing an RDMA request.

    sq_num_rire

    Requester - number of Remote Invalid request errors. NAK-Invalid Request on:

    1. Unsupported OpCode: Responder detected an unsupported OpCode.

    2. Unexpected OpCode: Responder detected an error in the sequence of OpCodes, such as a missing "Last" packet.

    Note: there is no PSN error, thus this does not indicate a dropped packet.

    sq_num_rnr      Requester - number of RNR NAKs received
    sq_num_roe

    Requester - number of remote operation errors. NAK-Remote Operation Error: the Responder encountered an error, local to the Responder, that prevented it from completing the request.

    sq_num_rree     Requester - number of RNR NAK retries exceeded errors
    sq_num_tree     Requester - number of transport retries exceeded errors
    sq_num_wrfe

    Requester - number of CQEs with error. Incremented each time a CQE with error is generated
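
    The payload length rule (item 8 under rq_num_rire) can be expressed as a small predicate. This is an illustrative Python sketch of the three conditions as listed, not the adapter's implementation; the function name payload_length_ok and the position labels passed as strings are assumptions made for the example.

```python
def payload_length_ok(position, length, pmtu):
    """Check a packet payload length against rule 8 for its position in a message.

    position: "only", "first", "middle", or "last".
    length and pmtu are both in bytes.
    """
    if position == "only":
        # 8a: an "only" packet may carry anything from 0 bytes up to the PMTU.
        return 0 <= length <= pmtu
    if position in ("first", "middle"):
        # 8b: "first" and "middle" packets must be exactly PMTU bytes.
        return length == pmtu
    if position == "last":
        # 8c: a "last" packet must carry at least 1 byte and at most the PMTU.
        return 1 <= length <= pmtu
    raise ValueError(f"unknown packet position: {position!r}")
```

    For example, with a 4096-byte PMTU (a value chosen only for illustration), a 1024-byte "middle" packet fails the check, while a 1024-byte "last" packet passes.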

     

    Note: The following counters are obsolete: num_baddb, num_eqovf, rq_num_lae, rq_num_leeoe, rq_num_mce, rq_num_rsync, rq_num_ucsdprd, rq_num_udsdprd, sq_num_ieecne, sq_num_ieecse, sq_num_leeoe, sq_num_rabrte, sq_num_rsync.
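
    When scanning counter values, it can help to drop the obsolete names and show only the counters that have fired. A minimal Python sketch, assuming the counters have already been read into a dict of name to integer value; the function name active_nonzero and the sample values are made up for illustration.

```python
# Obsolete counter names, copied from the note above.
OBSOLETE_COUNTERS = frozenset({
    "num_baddb", "num_eqovf", "rq_num_lae", "rq_num_leeoe", "rq_num_mce",
    "rq_num_rsync", "rq_num_ucsdprd", "rq_num_udsdprd", "sq_num_ieecne",
    "sq_num_ieecse", "sq_num_leeoe", "sq_num_rabrte", "sq_num_rsync",
})


def active_nonzero(counters):
    """Return the non-obsolete counters whose value is non-zero, sorted by name."""
    return {
        name: value
        for name, value in sorted(counters.items())
        if name not in OBSOLETE_COUNTERS and value != 0
    }


# Hypothetical sample values, not taken from a real adapter.
sample = {"rq_num_rnr": 3, "num_baddb": 7, "sq_num_tree": 0, "sq_num_oos": 1}
print(active_nonzero(sample))  # {'rq_num_rnr': 3, 'sq_num_oos': 1}
```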