Understanding MPI Tag Matching and Rendezvous Offloads (ConnectX-5)

Version 8

    Tag Matching and Rendezvous Offloads is a Mellanox technology that offloads the processing of MPI messages from the host onto the network adapter. It enables zero-copy delivery of MPI messages: incoming data is scattered directly into the user's buffer, with no intermediate buffering or copying. It also lets the adapter carry the rendezvous protocol through to completion on its own. This overlap lets the CPU run the application's computational tasks while the adapter gathers the remote data.

    This feature is available in ConnectX-5 IC.

     


    In MPI, send/receive operations are identified by an envelope, typically composed of the tag, the communicator, and the source. The envelope is used to match an incoming message to its corresponding user buffer. The full list of receive buffers posted by a process is called the matching list, and the process of finding the buffer in the matching list that corresponds to a given message is called Tag Matching.

     

    int MPI_Send(const void *data, int count, MPI_Datatype datatype, int destination, int tag, MPI_Comm communicator);

    int MPI_Recv(void *data, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm communicator, MPI_Status *status);
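
    The matching step described above can be sketched in plain C, without any MPI dependency. This is only an illustration of the idea, not how any particular MPI library implements it: the type and function names (envelope_t, posted_recv_t, match_and_remove) and the wildcard constants are hypothetical, with ANY_SOURCE and ANY_TAG standing in for MPI_ANY_SOURCE and MPI_ANY_TAG.

```c
#include <stddef.h>

#define ANY_SOURCE (-1)  /* hypothetical stand-in for MPI_ANY_SOURCE */
#define ANY_TAG    (-1)  /* hypothetical stand-in for MPI_ANY_TAG */

/* Envelope: the fields used to pair a message with a posted buffer. */
typedef struct {
    int comm_id;  /* communicator identifier (assumed to be an int here) */
    int source;   /* rank of the sender */
    int tag;      /* user-supplied tag */
} envelope_t;

/* One entry of the matching list: a receive posted by the process. */
typedef struct posted_recv {
    envelope_t want;            /* envelope the receiver asked for */
    void *user_buf;             /* destination buffer */
    struct posted_recv *next;   /* next entry, in posting order */
} posted_recv_t;

/* Returns 1 when the posted receive accepts the incoming envelope. */
static int envelope_matches(const envelope_t *want, const envelope_t *msg)
{
    return want->comm_id == msg->comm_id
        && (want->source == ANY_SOURCE || want->source == msg->source)
        && (want->tag == ANY_TAG || want->tag == msg->tag);
}

/* Tag matching: walk the list in posting order, unlink and return the
 * first entry that matches, or NULL when no posted buffer matches. */
static posted_recv_t *match_and_remove(posted_recv_t **head,
                                       const envelope_t *msg)
{
    for (posted_recv_t **link = head; *link != NULL; link = &(*link)->next) {
        if (envelope_matches(&(*link)->want, msg)) {
            posted_recv_t *hit = *link;
            *link = hit->next;   /* remove the entry from the list */
            hit->next = NULL;
            return hit;
        }
    }
    return NULL;
}
```

    A NULL result corresponds to a message that arrives before its receive has been posted. The list is walked in posting order so that the earliest matching receive wins.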

     

    There are two common protocols for message passing:

    • Eager – the message, with all of its data, is sent directly to the target. The Eager protocol is mostly used for small to medium-sized messages.
    • Rendezvous – the initiator of the transaction sends a small descriptor announcing its intention to send data. The target fetches the data from the initiator once it has a matching buffer. The Rendezvous protocol is mostly used for large messages.
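
    The choice between the two protocols is usually driven by message size. The following is a minimal sketch under stated assumptions: the 8192-byte cutoff, the names choose_protocol and rndv_descriptor_t, and the descriptor fields are all hypothetical and are not taken from any specific MPI library or from the ConnectX-5 wire format.

```c
#include <stddef.h>

typedef enum { PROTO_EAGER, PROTO_RENDEZVOUS } protocol_t;

/* Hypothetical cutoff in bytes; the value is an assumption made
 * for this illustration only. */
#define EAGER_LIMIT_BYTES ((size_t)8192)

/* Small and medium messages go eagerly; larger ones use rendezvous. */
static protocol_t choose_protocol(size_t msg_bytes)
{
    return msg_bytes <= EAGER_LIMIT_BYTES ? PROTO_EAGER : PROTO_RENDEZVOUS;
}

/* Hypothetical rendezvous descriptor: what the initiator sends in
 * place of the payload. It carries the envelope plus enough
 * information for the target to fetch the data later. */
typedef struct {
    int comm_id;              /* communicator identifier */
    int source;               /* rank of the initiator */
    int tag;                  /* user-supplied tag */
    const void *remote_addr;  /* where the payload lives on the initiator */
    size_t length;            /* payload size in bytes */
} rndv_descriptor_t;
```

    For example, choose_protocol(64) selects the eager path, while choose_protocol(1 << 20) selects rendezvous, in which case only a descriptor like the one above would travel ahead of the data.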

     

    The Tag Matching and Rendezvous Offloads are defined for IB/RoCE transports.

     

    Tag Matching Offload lets the process push the head of the matching list to the NIC. The adapter then processes incoming MPI messages and performs Tag Matching against those entries.

    • If a matching buffer is found, the message will be scattered directly to the user's buffer.
    • If no matching buffer is found, the message is scattered to a generic buffer and passed to software, which completes the Tag Matching on the rest of the matching list.
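
    The two outcomes above can be modelled with a small host-side simulation. This is a sketch of the control flow only, not of the adapter's actual behavior or of any driver API: HW_DEPTH (how many list entries are pushed to the NIC), the deliver function, and all type names are hypothetical, and the caller is assumed to guarantee that the payload fits in every buffer.

```c
#include <stddef.h>
#include <string.h>

#define HW_DEPTH 2  /* hypothetical number of list entries held by the NIC */

typedef struct { int source; int tag; } envelope_t;

typedef struct {
    envelope_t want;  /* envelope the receiver asked for */
    char *user_buf;   /* destination buffer */
    int done;         /* set once a message has been delivered */
} posted_recv_t;

typedef enum { MATCHED_BY_NIC, MATCHED_BY_SW, UNEXPECTED } outcome_t;

static int matches(const posted_recv_t *r, const envelope_t *e)
{
    return !r->done && r->want.source == e->source && r->want.tag == e->tag;
}

/* list[0 .. HW_DEPTH-1] models the head of the matching list pushed to
 * the NIC; the remaining entries stay in software. */
static outcome_t deliver(posted_recv_t *list, size_t n, const envelope_t *env,
                         const char *payload, size_t len, char *generic_buf)
{
    size_t hw_end = n < HW_DEPTH ? n : HW_DEPTH;

    /* Adapter side: a match means the data lands directly in the user
     * buffer, with no intermediate copy. */
    for (size_t i = 0; i < hw_end; i++) {
        if (matches(&list[i], env)) {
            memcpy(list[i].user_buf, payload, len);
            list[i].done = 1;
            return MATCHED_BY_NIC;
        }
    }

    /* No match on the adapter: the data lands in a generic buffer and
     * software searches the rest of the list, paying an extra copy. */
    memcpy(generic_buf, payload, len);
    for (size_t i = hw_end; i < n; i++) {
        if (matches(&list[i], env)) {
            memcpy(list[i].user_buf, generic_buf, len);
            list[i].done = 1;
            return MATCHED_BY_SW;
        }
    }
    return UNEXPECTED;  /* stays in the generic buffer until a receive is posted */
}
```

    The extra memcpy on the software path is the cost that the offload avoids when the matching entry is already on the adapter.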

     

    To summarize:

    • Tag Matching offload, as implemented in software, is designed to pay off when the matching buffer is posted before the message arrives.
    • Rendezvous offload extends the Tag Matching capabilities. With it, ConnectX-5 can identify Rendezvous protocol messages, gather the remote data, and scatter it to the matching buffer without any software intervention. In a software implementation of Rendezvous, the remote data can be gathered only when the software explicitly calls the MPI library, which creates a dependency between the initiator and the target for the data transfer. When the adapter progresses the rendezvous itself, the data transfer is one-sided, freeing valuable CPU cycles for the application's computational tasks.

     

    To learn more about Tag Matching, refer to: