The following RDMA transport diagnostic counters are available in the Mellanox OFED distribution release for any of the ConnectX-3 family adapters.
The counters are listed under /sys/class/infiniband/mlx4_<n>/diag_counters/, where <n> is the device index:
# ls /sys/class/infiniband/mlx4_0/diag_counters/
clear_diag num_cqovf rq_num_lae rq_num_lpe rq_num_mce rq_num_rae rq_num_rnr rq_num_udsdprd sq_num_bre sq_num_lpe sq_num_mwbe sq_num_rae sq_num_rnr sq_num_rree sq_num_wrfe
num_baddb num_eqovf rq_num_lle rq_num_lqpoe rq_num_oos rq_num_rire rq_num_ucsdprd rq_num_wrfe sq_num_lle sq_num_lqpoe sq_num_oos sq_num_rire sq_num_roe sq_num_tree
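The listing above can also be collected programmatically. The following is a minimal Python sketch, not part of the Mellanox OFED tooling. It assumes each counter file holds a single decimal integer, and that entries that cannot be read or parsed as an integer (assumed here to include control files such as clear_diag) can be skipped. The function name read_diag_counters and the device name mlx4_0 are illustrative.

```python
import os


def read_diag_counters(directory):
    """Return {counter_name: int_value} for every readable integer file in directory."""
    counters = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                counters[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Assumption: unreadable or non-integer entries (for example a
            # control file such as clear_diag) are not counters; skip them.
            continue
    return counters


if __name__ == "__main__":
    # Assumed device name; substitute the mlx4_<n> present on the host.
    path = "/sys/class/infiniband/mlx4_0/diag_counters"
    if os.path.isdir(path):
        for name, value in read_diag_counters(path).items():
            print(f"{name:16} {value}")
```

Taking the directory as a parameter keeps the function usable for any mlx4_<n> device and lets it be exercised against a scratch directory.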
|num_cqovf||Number of completion entries that overflowed a completion queue|
|rq_num_lle||Responder - number of local length errors|
|rq_num_lpe||Responder - number of local protection errors|
|rq_num_lqpoe||Responder - number of local QP operation errors|
|rq_num_oos||Responder - number of out of sequence requests received|
|rq_num_rae||Responder - number of remote access errors. R_Key violation: the responder detected an R_Key violation while executing an RDMA request. NAK may or may not be sent.|
|rq_num_rire||Responder - number of remote invalid request errors. NAK may or may not be sent. Possible causes:|
1. QP Async Affiliated Error: Unsupported or Reserved OpCode (RC only): Inbound request OpCode was either reserved, or was for a function not supported by this QP (e.g., RDMA or ATOMIC on a QP not set up for it).
2. Misaligned ATOMIC: VA does not point to an aligned address on an Atomic operation.
3. Too many RDMA Read or ATOMIC Requests: There were more requests received and not ACKed than allowed for the connection.
4. Out of Sequence OpCode, current packet is "First" or "Only": The Responder detected an error in the sequence of OpCodes; a missing "Last" packet.
5. Out of Sequence OpCode, current packet is not "First" or "Only": The Responder detected an error in the sequence of OpCodes; a missing "First" packet.
6. Local Length Error: Inbound "Send" request message exceeded the responder’s available buffer space.
7. Length error: RDMA Write request message contained too much or too little payload data compared to the DMA length advertised in the first or only packet.
8. Length error: Payload length was not consistent with the opcode:
a: 0 bytes <= "only" <= PMTU bytes
b: ("first" or "middle") == PMTU bytes
c: 1 byte <= "last" <= PMTU bytes
9. Length error: Inbound message exceeded the size supported by the CA port.
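The payload-length rules in item 8 can be expressed as a small check. This is only an illustrative Python sketch of those three rules, not the adapter's implementation. The function name payload_length_ok and the string labels for the packet position are hypothetical.

```python
def payload_length_ok(position, payload_len, pmtu):
    """Check a packet's payload length against the rule for its opcode position.

    position is one of 'only', 'first', 'middle', 'last' (hypothetical labels).
    """
    if position == "only":
        # Rule a: 0 bytes <= "only" <= PMTU bytes
        return 0 <= payload_len <= pmtu
    if position in ("first", "middle"):
        # Rule b: "first" and "middle" packets must carry exactly PMTU bytes
        return payload_len == pmtu
    if position == "last":
        # Rule c: 1 byte <= "last" <= PMTU bytes
        return 1 <= payload_len <= pmtu
    raise ValueError(f"unknown packet position: {position!r}")
```

For example, with a PMTU of 2048 bytes, a zero-length "only" packet passes, while a zero-length "last" packet or a 100-byte "middle" packet fails.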
|rq_num_rnr||Responder - the number of RNR NAKs sent|
|rq_num_wrfe||Responder - number of CQEs with error. Incremented each time a CQE with error is generated.|
|sq_num_bre||Requester - number of bad response errors|
|sq_num_lle||Requester - number of local length errors|
|sq_num_lpe||Requester - number of local protection errors|
|sq_num_lqpoe||Requester - number of local QP operation errors|
|sq_num_mwbe||Requester - number of Memory Window bind errors|
|sq_num_oos||Requester - number of Out of Sequence NAKs received|
|sq_num_rae||Requester - number of remote access errors. NAK-Remote Access Error on R_Key violation: the responder detected an invalid R_Key while executing an RDMA request.|
|sq_num_rire||Requester - number of remote invalid request errors. NAK-Invalid Request on:|
1. Unsupported OpCode: Responder detected an unsupported OpCode.
2. Unexpected OpCode: Responder detected an error in the sequence of OpCodes, such as a missing "Last" packet.
Note: there is no PSN error, thus this does not indicate a dropped packet.
|sq_num_rnr||Requester - the number of RNR NAKs received|
|sq_num_roe||Requester - number of remote operation errors. NAK-Remote Operation Error: the responder encountered an error, local to the responder, which prevented it from completing the request.|
|sq_num_rree||Requester - number of RNR NAK retries exceeded errors|
|sq_num_tree||Requester - number of transport retries exceeded errors|
|sq_num_wrfe||Requester - number of CQEs with error. Incremented each time a CQE with error is generated.|
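In the table above, counters prefixed rq_ are described as responder-side and counters prefixed sq_ as requester-side. As a rough sketch, a set of counter values can be grouped on that basis. The function names and the "other" label for unprefixed counters such as num_cqovf are illustrative choices, not vendor terminology.

```python
from collections import defaultdict


def side_of(name):
    """Map a counter name to 'responder', 'requester', or 'other' by its prefix."""
    if name.startswith("rq_"):
        return "responder"
    if name.startswith("sq_"):
        return "requester"
    # Hypothetical label for counters without an rq_/sq_ prefix.
    return "other"


def group_by_side(counters):
    """Group a {name: value} mapping into {side: {name: value}}."""
    groups = defaultdict(dict)
    for name, value in counters.items():
        groups[side_of(name)][name] = value
    return dict(groups)
```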
Note: The following counters are obsolete: num_baddb, num_eqovf, rq_num_lae, rq_num_leeoe, rq_num_mce, rq_num_rsync, rq_num_ucsdprd, rq_num_udsdprd,
sq_num_ieecne, sq_num_ieecse, sq_num_leeoe, sq_num_rabrte, sq_num_rsync.
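When comparing counter values taken at two points in time, the obsolete counters listed in the note can be left out. The sketch below assumes the two snapshots are plain {name: value} dictionaries obtained by any means. The names OBSOLETE and changed_counters are hypothetical.

```python
# Counter names taken from the obsolete list in the note above.
OBSOLETE = frozenset({
    "num_baddb", "num_eqovf", "rq_num_lae", "rq_num_leeoe", "rq_num_mce",
    "rq_num_rsync", "rq_num_ucsdprd", "rq_num_udsdprd", "sq_num_ieecne",
    "sq_num_ieecse", "sq_num_leeoe", "sq_num_rabrte", "sq_num_rsync",
})


def changed_counters(before, after):
    """Return {name: after - before} for non-obsolete counters whose value changed."""
    deltas = {}
    for name, value in after.items():
        if name in OBSOLETE:
            continue
        # A counter missing from the earlier snapshot is treated as 0.
        delta = value - before.get(name, 0)
        if delta != 0:
            deltas[name] = delta
    return deltas
```

Reporting only non-zero deltas keeps attention on the counters that moved during the interval of interest.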