This post will introduce you to the Performance Tests (perftest) package for OFED.
The perftest package is a collection of tests written over uverbs, intended for use as performance micro-benchmarks. The tests may be used for tuning as well as for functional testing.
The perftest package contains a set of bandwidth and latency benchmarks, such as:
InfiniBand / RoCE
Set the Service Level (SL) of the benchmark. If used for Ethernet, it sets the egress L2 priority of the packet.
There are various examples in the community; one of them is "How To Configure RoCE over a Lossless Fabric (PFC + ECN) End-to-End Using ConnectX-4 and Spectrum (Trust L2)".
The package's tests count the entire frame, including the Ethernet header and the CRC field (in the case of Ethernet). The preamble and the inter-packet gap (IPG) are not counted toward the reported bandwidth.
Testing Methodology and Notes
- The benchmarks read the CPU cycle counter to get timestamps, which avoids a context switch.
- The latency benchmarks measure round-trip time but report half of that as one-way latency. This means that the results may not be accurate for asymmetrical configurations.
- On all unidirectional bandwidth benchmarks, the client measures the bandwidth. On bidirectional bandwidth benchmarks, each side measures the bandwidth of the traffic it initiates. At the end of the measurement period, the server reports the result to the client, and the client adds it to its measurement.
- Latency tests report minimum, median and maximum latency results. The median is typically less sensitive to occasional high-latency outliers than the average. Typically, the first value measured is the maximum value, due to warm-up effects.
- Long sampling periods have very limited impact on measurement accuracy. The default of 1000 iterations is usually sufficient. Note that the program keeps data structures whose memory footprint is proportional to the number of iterations. Setting a very high number of iterations may have a negative impact on the measured performance, because contributions unrelated to the devices under test can skew the results. If a high number of iterations is strictly necessary, it is recommended to use the -N flag (No Peak).
- Bandwidth benchmarks may be run for a number of iterations, or for a fixed duration. Use the -D flag to instruct the test to run for the specified number of seconds. The --run_infinitely flag instructs the program to run until interrupted by the user, and to print the measured bandwidth every 5 seconds.
- The -H option in latency benchmarks dumps a histogram of the results.
- When the post_list feature (-l, --post_list=<list size>) is used, each QP prepares <list size> ibv_send_wr work requests (instead of 1) and chains them together. Chaining in this context means allocating an array of <list size> elements and setting the 'next' pointer of each ibv_send_wr to the following element in the array; the 'next' pointer of the last ibv_send_wr is NULL. Posting the first ibv_send_wr in the list therefore posts all of the chained WQEs, so each post_send call submits <list size> messages.
- RDMA Connection Manager (CM): You can add the -R flag to all tests to connect the QPs on each side with the rdma_cm library.
- Multicast support in ib_send_lat and in ib_send_bw: The send tests have built-in support for testing multicast performance at the verbs level. You can use -g to specify the number of QPs to attach to the multicast group. The -M flag lets you choose the multicast group address.
Note: Different versions of perftest may not be compatible with each other. Please use the same perftest version on both sides to ensure consistency of benchmark results.