Understanding mlx4 Linux Counters

Version 1

    This post describes the Linux counters that the mlx4 driver exposes under the /sys/class/infiniband/ path.

    The list of counters is aligned with MLNX_OFED 4.0.

     

    Note: Starting from MLNX_OFED 4.0, the counters previously listed under counters_ext were merged into the port counters, and the counters_ext folder was removed.

     


    Counter Groups

    There are two sets of counters:

    1. Port counters, under the counters folder

    2. HW counters, under the hw_counters folder
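To check which of the two folders a given port exposes, a minimal shell sketch such as the following can walk the sysfs tree. The function name list_counter_groups and its optional base-directory argument are illustrative assumptions, not part of MLNX_OFED; the argument only exists so the function can be pointed at a different tree.

```shell
# List the counter folders (counters, hw_counters) found for every device and port.
# $1 (optional): base directory, defaults to /sys/class/infiniband.
list_counter_groups() {
    base=${1:-/sys/class/infiniband}
    for port_dir in "$base"/*/ports/*; do
        [ -d "$port_dir" ] || continue      # the glob matched nothing
        for group in counters hw_counters; do
            if [ -d "$port_dir/$group" ]; then
                echo "$port_dir/$group"
            fi
        done
    done
}

list_counter_groups
```

On a host without InfiniBand devices the call prints nothing.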

     

    Port Counters

    # ll  /sys/class/infiniband/mlx4_0/ports/1/counters/

    total 0

    -r--r--r-- 1 root root 4096 Jan 19 17:06 excessive_buffer_overrun_errors

    -r--r--r-- 1 root root 4096 Jan 19 17:06 link_downed

    -r--r--r-- 1 root root 4096 Jan 19 17:06 link_error_recovery

    -r--r--r-- 1 root root 4096 Jan 19 17:06 local_link_integrity_errors

    -r--r--r-- 1 root root 4096 Jan 19 17:06 multicast_rcv_packets

    -r--r--r-- 1 root root 4096 Jan 19 17:06 multicast_xmit_packets

    -r--r--r-- 1 root root 4096 Jan 19 17:06 port_rcv_constraint_errors

    -r--r--r-- 1 root root 4096 Jan 19 17:06 port_rcv_data

    -r--r--r-- 1 root root 4096 Jan 19 17:06 port_rcv_errors

    -r--r--r-- 1 root root 4096 Jan 19 17:06 port_rcv_packets

    -r--r--r-- 1 root root 4096 Jan 19 17:06 port_rcv_remote_physical_errors

    -r--r--r-- 1 root root 4096 Jan 19 17:06 port_rcv_switch_relay_errors

    -r--r--r-- 1 root root 4096 Jan 19 17:06 port_xmit_constraint_errors

    -r--r--r-- 1 root root 4096 Jan 19 17:06 port_xmit_data

    -r--r--r-- 1 root root 4096 Jan 19 17:06 port_xmit_discards

    -r--r--r-- 1 root root 4096 Jan 19 17:06 port_xmit_packets

    -r--r--r-- 1 root root 4096 Jan 19 17:06 port_xmit_wait

    -r--r--r-- 1 root root 4096 Jan 19 17:06 symbol_error

    -r--r--r-- 1 root root 4096 Jan 19 17:06 unicast_rcv_packets

    -r--r--r-- 1 root root 4096 Jan 19 17:06 unicast_xmit_packets

    -r--r--r-- 1 root root 4096 Jan 19 17:06 VL15_dropped
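Each file in this folder holds a single value, so the whole set can be read with a short loop. The sketch below is one possible way to do it; dump_counters is a hypothetical helper name, and the default path assumes the mlx4_0 device and port 1 shown in the listing above.

```shell
# Print "name value" for every readable file in a counters directory.
# $1 (optional): counters directory, defaults to mlx4_0 port 1.
dump_counters() {
    dir=${1:-/sys/class/infiniband/mlx4_0/ports/1/counters}
    for f in "$dir"/*; do
        [ -f "$f" ] && [ -r "$f" ] || continue   # skip if missing or unreadable
        printf '%s %s\n' "${f##*/}" "$(cat "$f")"
    done
}

dump_counters
```

Running the function twice and comparing the output is a simple way to see which counters are moving.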

     

    HW Counters (RDMA diagnostics)

    # ll  /sys/class/infiniband/mlx4_0/ports/1/hw_counters/

    total 0

    -rw-r--r-- 1 root root 4096 Jan 19 17:02 lifespan

    -r--r--r-- 1 root root 4096 Jan 19 17:02 rq_num_dup

    -r--r--r-- 1 root root 4096 Jan 19 17:02 rq_num_lle

    -r--r--r-- 1 root root 4096 Jan 19 17:02 rq_num_lpe

    -r--r--r-- 1 root root 4096 Jan 19 17:02 rq_num_lqpoe

    -r--r--r-- 1 root root 4096 Jan 19 17:02 rq_num_oos

    -r--r--r-- 1 root root 4096 Jan 19 17:02 rq_num_rae

    -r--r--r-- 1 root root 4096 Jan 19 17:02 rq_num_rire

    -r--r--r-- 1 root root 4096 Jan 19 17:02 rq_num_rnr

    -r--r--r-- 1 root root 4096 Jan 19 17:02 rq_num_wrfe

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_bre

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_lle

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_lpe

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_lqpoe

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_mwbe

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_oos

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_rae

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_rire

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_rnr

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_roe

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_rree

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_to

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_tree

    -r--r--r-- 1 root root 4096 Jan 19 17:02 sq_num_wrfe
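Most of these diagnostic counters are expected to stay at zero on a healthy port, so it can be convenient to print only the non-zero ones. The following is a sketch under assumptions: nonzero_hw_counters is a hypothetical helper name, and lifespan is skipped because it is the only writable file in the listing and is treated here as a setting rather than an event counter.

```shell
# Print "name value" for every hw_counters entry whose value is not zero.
# $1 (optional): hw_counters directory, defaults to mlx4_0 port 1.
nonzero_hw_counters() {
    dir=${1:-/sys/class/infiniband/mlx4_0/ports/1/hw_counters}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        [ "$name" = lifespan ] && continue   # assumption: a setting, not a counter
        value=$(cat "$f")
        # Non-numeric content makes the test fail and is silently skipped.
        if [ "$value" -ne 0 ] 2>/dev/null; then
            echo "$name $value"
        fi
    done
    return 0
}

nonzero_hw_counters
```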

     

    Counter Description

    Port Counters Description

    Counter: port_rcv_data
    Description: Total number of data octets, divided by 4 (lanes), received on all VLs. This is a 64-bit counter.
    InfiniBand Spec Name: PortRcvData
    Group: Informative

    Counter: port_rcv_packets
    Description: Total number of packets received (this may include packets containing errors). This is a 64-bit counter.
    InfiniBand Spec Name: PortRcvPkts
    Group: Informative

    Counter: multicast_rcv_packets
    Description: Total number of multicast packets, including multicast packets containing errors.
    InfiniBand Spec Name: PortMultiCastRcvPkts
    Group: Informative

    Counter: unicast_rcv_packets
    Description: Total number of unicast packets, including unicast packets containing errors.
    InfiniBand Spec Name: PortUnicastRcvPkts
    Group: Informative

    Counter: port_xmit_data
    Description: Total number of data octets, divided by 4 (lanes), transmitted on all VLs. This is a 64-bit counter.
    InfiniBand Spec Name: PortXmitData
    Group: Informative

    Counter: port_xmit_packets
    Description: Total number of packets transmitted on all VLs from this port. This may include packets with errors. This is a 64-bit counter.
    InfiniBand Spec Name: PortXmitPkts
    Group: Informative

    Counter: port_rcv_switch_relay_errors
    Description: Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay.
    InfiniBand Spec Name: PortRcvSwitchRelayErrors
    Group: Error

    Counter: port_rcv_errors
    Description: Total number of packets containing an error that were received on the port.
    InfiniBand Spec Name: PortRcvErrors
    Group: Informative

    Counter: port_rcv_constraint_errors
    Description: Total number of packets received on the switch physical port that are discarded.
    InfiniBand Spec Name: PortRcvConstraintErrors
    Group: Error

    Counter: local_link_integrity_errors
    Description: The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors.
    InfiniBand Spec Name: LocalLinkIntegrityErrors
    Group: Error

    Counter: port_xmit_wait
    Description: The number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration).
    InfiniBand Spec Name: PortXmitWait
    Group: Informative

    Counter: multicast_xmit_packets
    Description: Total number of multicast packets transmitted on all VLs from the port. This may include multicast packets with errors.
    InfiniBand Spec Name: PortMultiCastXmitPkts
    Group: Informative

    Counter: unicast_xmit_packets
    Description: Total number of unicast packets transmitted on all VLs from the port. This may include unicast packets with errors.
    InfiniBand Spec Name: PortUnicastXmitPkts
    Group: Informative

    Counter: port_xmit_discards
    Description: Total number of outbound packets discarded by the port because the port is down or congested.
    InfiniBand Spec Name: PortXmitDiscards
    Group: Error

    Counter: port_xmit_constraint_errors
    Description: Total number of packets not transmitted from the switch physical port.
    InfiniBand Spec Name: PortXmitConstraintErrors
    Group: Error

    Counter: port_rcv_remote_physical_errors
    Description: Total number of packets marked with the EBP delimiter received on the port.
    InfiniBand Spec Name: PortRcvRemotePhysicalErrors
    Group: Error

    Counter: symbol_error
    Description: Total number of minor link errors detected on one or more physical lanes.
    InfiniBand Spec Name: SymbolErrorCounter
    Group: Error

    Counter: VL15_dropped
    Description: Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) of the port.
    InfiniBand Spec Name: VL15Dropped
    Group: Error

    Counter: link_error_recovery
    Description: Total number of times the Port Training state machine has successfully completed the link error recovery process.
    InfiniBand Spec Name: LinkErrorRecoveryCounter
    Group: Error

    Counter: link_downed
    Description: Total number of times the Port Training state machine has failed the link error recovery process and downed the link.
    InfiniBand Spec Name: LinkDownedCounter
    Group: Error
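Two practical consequences of the descriptions above can be sketched in shell. Because port_rcv_data and port_xmit_data count octets divided by 4, the raw value has to be multiplied by 4 to obtain bytes, and the counters marked with the Error group are the ones worth flagging when they are non-zero. The helper names data_bytes and report_errors and the ERROR_COUNTERS variable are illustrative assumptions, and the list of names simply mirrors the Error entries described above.

```shell
# Convert a data counter (unit: 4 octets) to bytes.
# $1: counters directory, $2: port_rcv_data or port_xmit_data
data_bytes() {
    echo $(( $(cat "$1/$2") * 4 ))
}

# Counters that belong to the Error group in the descriptions above.
ERROR_COUNTERS="port_rcv_switch_relay_errors port_rcv_constraint_errors
local_link_integrity_errors port_xmit_discards port_xmit_constraint_errors
port_rcv_remote_physical_errors symbol_error VL15_dropped
link_error_recovery link_downed"

# Print "name value" for every Error-group counter greater than zero.
# $1: counters directory
report_errors() {
    for name in $ERROR_COUNTERS; do
        [ -r "$1/$name" ] || continue
        value=$(cat "$1/$name")
        [ "$value" -gt 0 ] 2>/dev/null && echo "$name $value"
    done
    return 0
}

# Example run, only attempted when the device from this post is present.
counters_dir=/sys/class/infiniband/mlx4_0/ports/1/counters
if [ -d "$counters_dir" ]; then
    echo "received bytes: $(data_bytes "$counters_dir" port_rcv_data)"
    report_errors "$counters_dir"
fi
```

Sampling data_bytes twice and dividing the difference by the elapsed seconds gives an approximate throughput in bytes per second.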