This post provides an overview on MPI Tag Matching, a general work flow, and description of the Verbs API. Information in this post is intended for application developers who need to apply advanced principles of traffic management.
- Understanding MPI Tag Matching and Rendezvous Offloads (ConnectX-5)
- Tag Matching Verbs API and Implementation Example
The Message Passing Interface (MPI) standard defines a set of rules, known as tag matching, for matching source send operations to destination receives.
The following parameters must match a sender and its corresponding receive functions.
- The communicator
- The user tag, including a wild card specified by the receiver
- The source rank, including a wild card specified by the receiver
- The destination rank
The ordering rules require that when more than one pair of send and receive message envelopes match, the pair that includes the earliest posted-send and the earliest posted-receive is the pair that must be used to satisfy the matching operation. However, this does not imply that tags are consumed in the order they are created, for example a tag generated later might be consumed, if earlier tags can not be used to satisfy the matching rules.
When a message is sent from the sender to the receiver, the communication library might attempt to process the operation either after or before the corresponding matching receive message is posted. If a matching receive message is posted, this is defined as an "expected message". If they do not match, they are referred to as an "unexpected message". Implementations frequently use different matching schemes for these two different matching instances.
To keep MPI library memory footprint smaller, MPI implementations typically use only two different protocols for this purpose:
1. The Eager protocol. The complete message is sent when the send message is processed by the sender. A completion send is received in the send_cq notifying that the buffer is freed up and can be reused.
2. The Rendezvous Protocol. The sender sends the tag-matching header, and sometimes a portion of data when it first notifies the receiver that a message is coming. When the corresponding buffer is posted, the responder uses the information from the header to initiate an RDMA READ operation directly to the matching buffer. A fin (final) message must be received in order for the buffer to be cleared for use.
How the two protocols work is illustrated below:
Tag Matching Implementation
There are two types of matching objects used in a tag matching implementation: the posted receive list and the unexpected message list. The application posts receive buffers through calls to the MPI receive routines in the posted receive list and posts send messages using the MPI send routines. The head of the posted receive list can be maintained by the hardware, with the software expected to shadow this list. When a send message is initiated and arrives at the receive side can there is no pre-posted receive message for this arriving message, the send message is passed to the software and placed in the unexpected message list. If there is a set of matching messages, the match is processed, including Rendezvous processing, if appropriate. Rendezvous processing delivers the data to the specified receive buffer, which allows overlapping receive-side MPI tag matching with computation.
When a receive-message is posted, the communication library first checks the software's unexpected message list for a matching receive message. If a match is found, data is delivered to the user buffer, using a software-controlled protocol. The UCX implementation uses either an Eager or Rendezvous protocol, depending on data size. If a match is not found, the entire pre-posted receive list is maintained by the hardware, which frees up space to add one more pre-posted receive to this list. This receive message is passed to the hardware. The software is expected to shadow this list, to help with processing MPI cancellation operations. Because hardware and software are not expected to be tightly synchronized with respect to the tag-matching operation, this shadow list is used to detect when a pre-posted receive is passed to the hardware, as the matching unexpected message is being passed from the hardware to the software.
All tag matching happens on the receive side, since the use of wildcards for the source rank makes the process of tag matching on the send side very expensive.
Tag matching is being triggered either when an application posts a new receive message or when a new message arrives on the network.
This matching processed in one of three ways:
- Only in software
- Only in hardware
- Split between multiple address.
If the process is split, access to the two lists must be handled in a critical region.
The next figure shows an incorrect error flow example of a tag being matched twice when the tag matching process is split between two address spaces and when there is no critical region for updating to the two lists.
Verbs API Overview
This section presents the support for distribution of matching lists across multiple address spaces. The interface is kept general so that any implementation can use the offered functions. For this reason, the interface to the communication library (the user) will be responsible with handling all memory registration and allocation of temporary buffers as well as with the implementation of a specific tag matching algorithm. The Verbs interface will provide functions for communication and synchronization between address spaces.
From a user interface perspective, the verbs offer the following interface capabilities. They:
- Pass posted receives to the external context
- Receive posted receives from the external context
- Pass unexpected messages back to the external context
- Receive unexpected messages from the external context
- Revoke posted receives from the external context
- Revoke unexpected messages from the external context
- Sync contexts
- Receive match notification from the external context
- Configure and replenish the external resources
The following diagram presents the workflow for a hardware/software tag matching architecture using the verbs interface capabilities, described above.
The diagram presents the tag matching workflow, following the application from the point when a remote process posts a send message that specifies the tag, communicator, source, destination, and provides a pointer to the buffer with the data. The communication library is responsible for creating the QPs, allocating space for the temporary buffers and registering memory. Afterwards, the communication library uses verbs functions to create the Tag Matching Header (TMH). After it merges this header with the payload, it sends the data over the network to the receiver side.
The communication library handles all unexpected tag matching. The expected receive tag matching logic can be split between the communication library and the tag matching hardware. If the communication library uses external context (hardware) tag matching, the head of that posted-receive list must be "owned" by the external context. Further, the communication library must track the external context, so that it can handle operations (such as a MPI receive cancel) correctly.
On the receive side, the application posts a receive specifying the tag, communicator and the source rank as well as the buffer that will be used to store the data. If the message is not matched against unexpected messages, the communication library uses verb functions to hand off the post receive to the external context (hardware) and to synchronize between contexts.
If the send arrives at the receiver before the post receive is being called by the application, the hardware is unable to find a match in the head of the posted-receive list, so the message is handed off to the software, which ensures the corresponding receive was not posted to the hardware in the meantime, matches against the tail of the posted receive list, and if needed it will be added it to the unexpected message list.
When using the rendezvous protocol, the communication library on the send side will register the user buffer memory instead of allocating a new buffer. If the hardware matches the incoming message, it is responsible with completing rendezvous protocol. If the communication library completes the match, it is responsible for completing the rendezvous protocol.
The verbs interface is only responsible with the communication between the communication library and hardware, everything else falls within the communication library’s responsibility, in coordination with the external context.
Tag matching workflow diagram legend
- Blue – Application
- Green – Communication library
- Orange – Verbs
- Grey – Hardware
For Tag Matching Verbs API and Code example, refer to Tag Matching Verbs API and Implementation Example.