Tag Matching Verbs API and Implementation Example

Version 11

    This post presents the tag matching (TM) Verbs API. Verbs provide an abstract description of the functionality of a network adapter. By using the verbs, users can create and manage the objects needed to use RDMA for data transfer.

    In tag matching, the software holds a list of matching entries called a “matching list”. The matching list is used to steer arriving messages to a specific buffer according to the message tag.

    This post is intended for use by developers.





    Tag Matching Generic Verbs API


    Query Device APIs


    ibv_query_device_ex() retrieves the various attributes used in tag matching.

    int query_device_ex(struct ibv_context *context,

                                          const struct ibv_query_device_ex_input *input,

                                          struct ibv_device_attr_ex *attr,

                                          size_t attr_size);




    The ibv_device_attr_ex structure includes information about the tag matching capabilities:

    struct ibv_device_attr_ex {

                   struct ibv_device_attr      orig_attr;

                   uint32_t                    comp_mask;

                   struct ibv_odp_caps         odp_caps;

                   uint64_t                    completion_timestamp_mask;

                   uint64_t                    hca_core_clock;

                   uint64_t                    device_cap_flags_ex;

                   struct ibv_tso_caps         tso_caps;

                   struct ibv_rss_caps         rss_caps;

                   uint32_t                    max_wq_type_rq;

                   struct ibv_tm_caps          tm_caps; /* explained below */





    ibv_tm_caps is used to get information about tag matching capabilities. The ibv_tm_caps structure is a member of ibv_device_attr_ex.

    struct ibv_tm_caps {

                   uint32_t max_tag_size;           /* Characteristics of the receive mask (given in bits) */

                   uint32_t max_header_size_eager;  /* The maximum size for the TM header when using Eager */

                   uint32_t max_header_size_rndv;   /* The maximum size for the TM header when using Rendezvous */

                   uint32_t max_header_size_notag;  /* The maximum size for the TM header when not using tags */

                   uint32_t max_priv_size;          /* The size of the application context field in the XRQ context*/

                   uint32_t max_rndv_priv_size;     /* Max size of the information passed after the RNDV header */

                   uint32_t max_num_tags;           /* Posted receive maximum list size */

                   uint32_t capability_flags;       /* TM capabilities mask - enumerated below in ibv_tm_flags */

                   uint32_t max_tag_ops;            /* Max number of outstanding operations */




    1. If tm_caps.max_num_tags is 0, there is no tag matching support.

    2. The API assumes there is default support for the Eager protocol and expected cases.




    ibv_tm_flags identifies flags used with ibv_tm_caps.


    enum ibv_tm_flags{

                   IBV_TM_CAP_NO_TAG = 1,            /* The HW supports messages without tag sent on QPs attached to a SRQ */

                   IBV_TM_CAP_RNDV = 1 << 1,        /* The HW supports tag matching for rendezvous messages when the send arrives after the corresponding receive */
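
    The capability rules above can be sketched as a small check. This is only an illustration: the demo_ types are local stand-ins for the fields described in this post, and the helper names are hypothetical rather than part of the Verbs API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring the fields described above; a real program
 * would take these definitions from the verbs headers instead. */
enum demo_tm_flags {
    DEMO_TM_CAP_NO_TAG = 1,      /* untagged messages on QPs attached to the SRQ */
    DEMO_TM_CAP_RNDV   = 1 << 1  /* tag matching for rendezvous messages */
};

struct demo_tm_caps {
    uint32_t max_num_tags;      /* posted receive list size; 0 means no TM support */
    uint32_t capability_flags;  /* mask of demo_tm_flags */
};

/* Hypothetical helper: tag matching is usable only if the list size is non-zero. */
static bool demo_tm_supported(const struct demo_tm_caps *caps)
{
    assert(caps != NULL);
    return caps->max_num_tags != 0;
}

/* Hypothetical helper: rendezvous needs TM support plus the RNDV flag. */
static bool demo_tm_rndv_supported(const struct demo_tm_caps *caps)
{
    return demo_tm_supported(caps) &&
           (caps->capability_flags & DEMO_TM_CAP_RNDV) != 0;
}
```

    A library would typically run such a check once after querying the device and fall back to software-only matching when it fails.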



    SRQ Extensions


    The following APIs extend the Shared Receive Queue (SRQ) to represent a tag matching context. Use the extended SRQ create function to include the objects needed for tag matching. The SRQ Completion Queue (CQ) can now be used to poll for tag matching completions. An SRQ of the tag matching type is used for sending data with tags and for rendezvous (RDMA READ) operations. The SW counter, the command QP, and the hardware Mellanox Shared Receive Queue (XRQ) identifiers are hidden in the device structures.




    ibv_create_srq_ex provides extensions to ibv_create_srq, which creates a shared receive queue (SRQ). srq_attr->max_wr and srq_attr->max_sge are read to determine the requested size of the SRQ, and are set to the actual values allocated on return.


    ibv_create_srq_ex is declared as follows:

    struct ibv_srq * ibv_create_srq_ex(struct ibv_context *context, struct ibv_srq_init_attr_ex *srq_init_attr_ex)




    mlx5_srq is the device structure that stores the attributes of the Mellanox Shared Receive Queue (XRQ).

    struct mlx5_srq {

    struct verbs_srq       vsrq;

    struct mlx5_buf        buf;

    struct mlx5_spinlock   lock;

    uint64_t               *wrid;

    uint32_t               srqn;                /* Will store the XRQ number */  

    int                    max;

    int                    max_gs;

    int                    wqe_shift;

    int                    head;

    int                    tail;

    uint32_t               *db;

    uint16_t               counter;

    int                    wq_sig;

    struct ibv_srq_legacy  *ibv_srq_legacy;

    uint16_t               tm_phase_cnt;       /* Will store the SW counter */

    struct ibv_qp          cmnd_qp;            /* The QP used for commands to the posted receive list */





    ibv_srq_type enables users to define the SRQ type as basic, Mellanox SRQ, or Tag Matching.

    enum ibv_srq_type {








    ibv_srq_init_attr_mask lists the comp_mask bits that mark which optional fields of ibv_srq_init_attr_ex are in use.

    enum ibv_srq_init_attr_mask {

    IBV_SRQ_INIT_ATTR_TYPE            = 1 << 0,

    IBV_SRQ_INIT_ATTR_PD              = 1 << 1,

    IBV_SRQ_INIT_ATTR_XRCD            = 1 << 2,

    IBV_SRQ_INIT_ATTR_CQ              = 1 << 3,

    IBV_SRQ_INIT_ATTR_TM              = 1 << 4,

    IBV_SRQ_INIT_ATTR_TM_DC           = 1 << 5,

    IBV_SRQ_INIT_ATTR_RESERVED        = 1 << 6};




    ibv_srq_init_attr_ex is a structure that is used to initialize attributes.

    struct ibv_srq_init_attr_ex {

                   void                     *srq_context;   

                   struct ibv_srq_attr       attr;

                   uint32_t                  comp_mask;    

                   enum ibv_srq_type         srq_type;

                   struct ibv_pd            *pd;            

                   struct ibv_xrcd          *xrcd;        

                   struct ibv_cq            *cq;

                   struct ibv_srq_tm_caps   *tm;         

                   struct ibv_dci_init_attr *tm_dc;





    ibv_srq_tm_caps is a structure used to specify the Tag Matching SRQ capabilities.

    struct ibv_srq_tm_caps {

           uint32_t        max_num_tags;    /* TM matching list size */

           uint32_t        max_tm_ops;      /* Number of outstanding operation */





    ibv_dci_init_attr is a structure used to initialize attributes for the Rendezvous protocol’s Dynamically Connected Initiator (DCI).

    struct ibv_dci_init_attr {            /* Information required by the firmware to create the rendezvous DCIs */

           uint8_t                        min_rnr_timer;

           uint32_t                       flow_label;

           uint8_t                        hop_limit;

           uint32_t                       inline_size;

           uint32_t                       create_flags;




    1. The ibv_create_srq_ex function must create a command QP used for inserting/removing entries from the posted receive list. This command Queue Pair (QP) is hidden from users in the mlx5_srq structure.

    2. The fields in ibv_srq_tm_caps are input/output parameters. A number is requested when the create_srq function is called, and the actual allocated tags/ops are returned in the same structure.

    3. The corresponding functions that destroy, modify, and query the SRQ are unchanged.
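
    As a rough illustration of notes 1 to 3, the sketch below builds a comp_mask for a TM SRQ and mimics the input/output behaviour of ibv_srq_tm_caps. It rests on assumptions: the set of mask bits a TM SRQ needs (type, PD, CQ, TM) is a guess based on the fields listed above, and demo_grant_tm_caps is a mock that simply caps the request at the device limits, which may differ from what a real device does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the mask bits listed above (values taken from the text). */
enum demo_srq_init_attr_mask {
    DEMO_SRQ_INIT_ATTR_TYPE  = 1 << 0,
    DEMO_SRQ_INIT_ATTR_PD    = 1 << 1,
    DEMO_SRQ_INIT_ATTR_XRCD  = 1 << 2,
    DEMO_SRQ_INIT_ATTR_CQ    = 1 << 3,
    DEMO_SRQ_INIT_ATTR_TM    = 1 << 4,
    DEMO_SRQ_INIT_ATTR_TM_DC = 1 << 5
};

struct demo_srq_tm_caps {
    uint32_t max_num_tags;  /* in: requested, out: allocated */
    uint32_t max_tm_ops;    /* in: requested, out: allocated */
};

/* Assumed comp_mask for a TM SRQ; with_dc adds the DCI attributes bit. */
static uint32_t demo_tm_srq_comp_mask(int with_dc)
{
    uint32_t mask = DEMO_SRQ_INIT_ATTR_TYPE | DEMO_SRQ_INIT_ATTR_PD |
                    DEMO_SRQ_INIT_ATTR_CQ | DEMO_SRQ_INIT_ATTR_TM;

    if (with_dc)
        mask |= DEMO_SRQ_INIT_ATTR_TM_DC;
    return mask;
}

/* Mock of the in/out behaviour: grant at most the device limits. */
static void demo_grant_tm_caps(struct demo_srq_tm_caps *req,
                               uint32_t dev_max_tags, uint32_t dev_max_ops)
{
    assert(req != NULL);
    if (req->max_num_tags > dev_max_tags)
        req->max_num_tags = dev_max_tags;
    if (req->max_tm_ops > dev_max_ops)
        req->max_tm_ops = dev_max_ops;
}
```

    After the create call returns, the caller should size its own posted receive bookkeeping from the returned values rather than from what it requested.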


    Sender Side




    ibv_pack_tm_info is a helper function that creates the send tag matching header, which must be placed in front of the payload. The function returns the size of the header buffer.

    inline uint32_t ibv_pack_tm_info(void *buf, struct ibv_tm_info *tm);




    ibv_rndv_data is a structure used to locate data managed by the Rendezvous protocol.

    struct ibv_rndv_data {

            uint32_t   rkey; /* Remote memory key for the RDMA transaction */

            uint64_t   vaddr; /* Virtual address of remote data */

            uint32_t   len; /* The RDMA transaction length */





    TM_OP is an enum that lists the tag matching operations for each protocol type.

    enum TM_OP {










    ibv_tm_tag is a union used to specify tag matching tags.

    union ibv_tm_tag {

            uint64_t tag;

            char *tag_ptr;





    ibv_tm_priv is a union used to specify private tag matching data.


    union ibv_tm_priv {

            uint64_t data;

            char *data_ptr;





    ibv_tm_info is a structure used to identify tag information.

    struct ibv_tm_info {

            uint8_t               op;        /* Values from the TM_OP enum */

            uint8_t               sync;      /* Bit to identify synchronous sends */

            union ibv_tm_tag      tag;       /* Tag information */

            union ibv_tm_priv     tm_priv;   /* Application context */

            struct ibv_rndv_data  rndv;      /* Information for performing the rendezvous */

            uint32_t              comp_mask;

            struct ibv_tm_dc      *dc;       /* Fields used only for DC transport protocols */





    ibv_tm_info_mask is an enum that lists the comp_mask bits of ibv_tm_info.

    enum ibv_tm_info_mask {

            IBV_TM_INFO_DC            = 1 << 0,

            IBV_TM_INFO_RESERVED      = 1 << 1};




    ibv_tm_dc is a structure used to specify tag matching DC data.

    struct ibv_tm_dc {

            uint64_t   dc_access_key;        /* Access key to be used in rendezvous completion packet */

            uint32_t   dct_num;              /* Target for rendezvous completion */

            uint8_t    sl;                   /* sl to be used for rendezvous completion */




    1. The function returns the size of the header buffer.

    2. The user needs to allocate the header buffer for the worst case scenario (tag matching + rendezvous + DC headers).

    3. The post send uses the header buffer followed by the data (the user must create this message by placing the buffer returned by the helper function before the payload).
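
    The three notes above can be sketched as follows. This is not the real header format: demo_pack_tm_info is a hypothetical stand-in for ibv_pack_tm_info that writes a made-up 9-byte header, and DEMO_TM_HDR_MAX is an assumed worst-case size. Only the overall pattern (pack the header, then place the payload right after it) is taken from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_TM_HDR_MAX 64u  /* hypothetical worst-case header size */

/* Stand-in for ibv_pack_tm_info(): writes an opaque 9-byte header
 * (1-byte op + 8-byte tag, host byte order) and returns its size.
 * The real header layout is not defined by this sketch. */
static uint32_t demo_pack_tm_info(void *buf, uint8_t op, uint64_t tag)
{
    uint8_t *p = buf;

    p[0] = op;
    memcpy(p + 1, &tag, sizeof(tag));
    return 1u + (uint32_t)sizeof(tag);
}

/* Builds header + payload in one contiguous buffer; the caller frees it. */
static uint8_t *demo_build_message(uint8_t op, uint64_t tag,
                                   const void *payload, size_t payload_len,
                                   size_t *msg_len)
{
    uint8_t hdr[DEMO_TM_HDR_MAX];  /* sized for the worst case */
    uint32_t hdr_len = demo_pack_tm_info(hdr, op, tag);
    uint8_t *msg;

    assert(hdr_len <= DEMO_TM_HDR_MAX);
    msg = malloc(hdr_len + payload_len);
    if (msg == NULL)
        return NULL;
    memcpy(msg, hdr, hdr_len);                    /* header first */
    memcpy(msg + hdr_len, payload, payload_len);  /* payload right after */
    *msg_len = hdr_len + payload_len;
    return msg;
}
```

    Copying into one buffer keeps the sketch simple. A send work request with two scatter/gather entries (header, then payload) would likely achieve the same ordering without the copy.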


    Receiver Side


    QP/DCT Creation






    ibv_create_qp is a function used to create Queue Pairs. When the QP is linked to a TM SRQ, ibv_create_qp imposes some extra restrictions.

    struct ibv_qp *ibv_create_qp (struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr);




    ibv_create_qp_ex provides extensions to the ibv_create_qp function.

    struct ibv_qp * ibv_create_qp_ex (struct ibv_context *context, struct ibv_qp_init_attr_ex *qp_init_attr_ex)




    ibv_qp_init_attr_ex is a structure used to initialize Queue Pair attributes.

    struct ibv_qp_init_attr_ex {

            void                   *qp_context;    

            struct ibv_cq          *send_cq;     /*The send_cq will be reserved and cannot be used for the IBV_SRQT_TM type */

            struct ibv_cq          *recv_cq;     /* The recv_cq is used only for the IBV_SRQT_BASIC type*/

            struct ibv_srq         *srq;         

            struct ibv_qp_cap      cap;          

            enum ibv_qp_type       qp_type;      /* When the SRQ type is TM, the qp_type needs to be limited to DC and RC, any other values would fail the creation */      

            int                    sq_sig_all;    

            uint32_t               comp_mask;

            struct ibv_pd          *pd;

            struct ibv_xrcd        *xrcd;

            uint32_t               create_flags;





    create_dct is the function used to create a Dynamically Connected Target (DCT). It returns a pointer to an ibv_exp_dct structure.

    struct ibv_exp_dct *(*create_dct)(struct ibv_context *context, struct ibv_exp_dct_init_attr *attr);



    1. The CQ from the tag matching SRQ structure is used to report the completion of tag matching, the success or failure of receive operations, and the completion of rendezvous operations.

    2. For RC QPs, when the chosen protocol is Rendezvous (noted in the protocol field of the tag matching header), the send side of the QP is owned by the device and is used for RDMA.

    3. For DC, when a DCT is connected to a tag matching SRQ, the firmware uses dedicated DCIs (created at the time of the XRQ creation) for rendezvous completion.


    Passing List Operations to the XRQ





    ibv_post_srq_ops is a new function used to pass list operations to the XRQ. Adding a new entry (opcode ADD) returns a handle, which is a unique ID for the given receive. The user can use this ID to find or remove entries in the posted receive list. The function uses the command Queue Pair (QP) that is created internally during the creation of the SRQ and stored in the mlx5_srq structure.

    int ibv_post_srq_ops(struct ibv_srq *srq, struct ibv_ops_wr *wr, struct ibv_ops_wr **bad_wr);




    ibv_ops_wr describes a list operation work request (WR) passed to ibv_post_srq_ops.

    struct ibv_ops_wr {

            uint64_t                wr_id;     /* User defined WR ID */

            struct ibv_ops_wr       *next;     /* Pointer to next WR in list, NULL if last WR */

            struct ibv_sge          *sg_list;  /* Pointer to the s/g array */

            int                     num_sge;   /* Size of the s/g array */

            int                     opcode;

            int                     op_flags;  /* Standard send flags */

            union {

                        struct tm_data {

                                 union ibv_tm_tag tag;

                                 union ibv_tm_tag mask;

                                 uint32_t unexp_cnt; /* Number of unexpected messages handled by SW */

                                 uint32_t handle; /* Input parameter for the DEL opcode and output parameter for the ADD opcode */

                       } tm;     

            } wr;





    ibv_wr_opcode is an enum that lists the opcodes used in WRs.


    enum ibv_wr_opcode {








    1. With the ADD opcode, the function returns the index in the posted receive list in the handle field. The communication library should store this ID so that it can later call the function with the DEL opcode to remove the posted receive entry from the posted receive list.

    2. Each function updates the hardware and software counters. This information is internal and is not visible to the user. The number of matches done in software will be used to update the counters.

    3. The SRQ used for list operations must be a TM SRQ. Otherwise, the function will return an error code.
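
    The ADD/DEL flow in note 1 can be sketched with a trimmed local copy of the work request. The demo_ names and the numeric opcode values are hypothetical (this post names the opcodes but does not list their values), and the handle is filled in by hand where a real provider would report it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcode values; only the ADD/DEL names come from the text. */
enum demo_tm_opcode { DEMO_WR_TM_ADD = 1, DEMO_WR_TM_DEL = 2 };

/* Trimmed local copy of the TM part of ibv_ops_wr described above. */
struct demo_ops_wr {
    uint64_t wr_id;
    int      opcode;
    struct {
        uint64_t tag;
        uint64_t mask;
        uint32_t unexp_cnt;  /* unexpected messages handled by SW so far */
        uint32_t handle;     /* output for ADD, input for DEL */
    } tm;
};

/* Prepare a WR that adds a receive with the given tag and mask. */
static void demo_fill_add(struct demo_ops_wr *wr, uint64_t wr_id,
                          uint64_t tag, uint64_t mask, uint32_t unexp_cnt)
{
    assert(wr != NULL);
    memset(wr, 0, sizeof(*wr));
    wr->wr_id = wr_id;
    wr->opcode = DEMO_WR_TM_ADD;
    wr->tm.tag = tag;
    wr->tm.mask = mask;
    wr->tm.unexp_cnt = unexp_cnt;
}

/* Prepare a WR that removes the receive identified by a stored handle. */
static void demo_fill_del(struct demo_ops_wr *wr, uint64_t wr_id,
                          uint32_t handle, uint32_t unexp_cnt)
{
    assert(wr != NULL);
    memset(wr, 0, sizeof(*wr));
    wr->wr_id = wr_id;
    wr->opcode = DEMO_WR_TM_DEL;
    wr->tm.handle = handle;
    wr->tm.unexp_cnt = unexp_cnt;
}
```

    The communication library would keep the handle from each ADD next to its own receive descriptor so that a later DEL can name the right entry.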


    Post Buffers for Unexpected Messages

    Buffers for unexpected messages are posted normally to the SRQ.




    ibv_post_srq_recv() is used to post receive buffers (receive work requests) to the Shared Receive Queue.



    int ibv_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *recv_wr, struct ibv_recv_wr **bad_recv_wr)


    Unpacking Headers from Unexpected Messages


    The tag matching header will be extracted using a helper function. The tm data is the same as the structure used by the helper function that created the header added to the payload. The extraction function populates the fields in ibv_tm_info. The function returns the offset where the payload starts.


    ibv_unpack_tm_data ()


    ibv_unpack_tm_data () is a function used to unpack tag matching data.


    inline int ibv_unpack_tm_data (struct ibv_srq *srq, void *buffer, struct ibv_tm_info *tm);


    Poll for Completion


    These functions work with an extended version of the ibv_cq structure and enable reading the new data related to tag matching. In addition, the polling process looks for different types of Completion Queue Entries (CQEs).




    ibv_poll_cq is used to poll the completion queue.

    int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc);




    ibv_wc_tm_info is a structure that holds the tag matching information of a work completion.

    struct ibv_wc_tm_info{

            union ibv_tm_priv tm_priv; /* Application context field scattered from WRs to CQEs */

            union ibv_tm_tag sender_tag; /* 64b tag from the packet */

    } ;




    ibv_create_cq_ex is the function used to create the tag matching (TM) completion queue (CQ). The ibv_cq_ex structure is extended with a read_wc_tm_info accessor.


    struct ibv_cq_ex {


      struct ibv_wc_tm_info (*read_wc_tm_info)(struct ibv_cq_ex *current);





    ibv_create_cq_wc_flags is an enum that lists the work completion (WC) flags requested when creating a Completion Queue.


    enum ibv_create_cq_wc_flags {


      IBV_WC_EX_WITH_TM_INFO = 1<<8,






    ibv_wc_opcode is an enum that lists the work completion (WC) opcodes.

    enum ibv_wc_opcode {








            IBV_WC_RECV                     = 1 << 7,



            IBV_WC_TM_RECV,                /* Tag matched and/or msg completion CQEs in one */

            IBV_WC_TM_ADD,                 /* List operation was completed - add */

            IBV_WC_TM_DEL,                 /* List operation was completed - remove */

            IBV_WC_TM_SYNC,                /* List operation was completed - synchronization */






    ibv_wc_status is an enum that lists the work completion status values.

    enum ibv_wc_status {



           IBV_WC_RNDV_INCOMP_ERR, /* Meta-data is in user-buffer */





    ibv_wc_flags is an enum that lists the work completion flags.

    enum ibv_wc_flags {


            IBV_WC_TM_MATCH        = 1 << 4,

            IBV_WC_TM_DATA_VALID   = 1 << 5,

            IBV_WC_TM_SYNC_REQ     = 1 << 6, /* Flag used when synchronization is required (counters not equal)*/




    1. The read_wc_tm_info function can only be used if the IBV_WC_TM_MATCH flag is set.

    2. For the IBV_WC_TM_RECV opcode, the WC wc_flags can have one of the following values:

    • TM_MATCH
    • No flags means unexpected messages

    3.  For the IBV_WC_TM_RECV opcode, WC status can return:

    • ERROR
    • RNDV_INCOMPLETE   /* Meta-data is in user-buffer */
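
    A polling loop could apply notes 1 and 2 roughly as shown below. The flag values are copied from the ibv_wc_flags listing above, while the opcode values and all demo_ names are hypothetical, since the full opcode enum is not shown in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values copied from the ibv_wc_flags listing above. */
enum demo_wc_flags {
    DEMO_WC_TM_MATCH      = 1 << 4,
    DEMO_WC_TM_DATA_VALID = 1 << 5,
    DEMO_WC_TM_SYNC_REQ   = 1 << 6
};

/* Hypothetical opcode values; only the names come from the text. */
enum demo_wc_opcode {
    DEMO_WC_TM_RECV = 1,
    DEMO_WC_TM_ADD,
    DEMO_WC_TM_DEL,
    DEMO_WC_TM_SYNC
};

enum demo_event {
    DEMO_EV_EXPECTED,    /* TM_RECV with TM_MATCH set */
    DEMO_EV_UNEXPECTED,  /* TM_RECV without flags: match in software */
    DEMO_EV_LIST_OP,     /* ADD/DEL/SYNC list operation completed */
    DEMO_EV_OTHER
};

/* Map a completion's opcode and flags to the action the library takes. */
static enum demo_event demo_classify(int opcode, uint32_t wc_flags)
{
    switch (opcode) {
    case DEMO_WC_TM_RECV:
        return (wc_flags & DEMO_WC_TM_MATCH) ? DEMO_EV_EXPECTED
                                             : DEMO_EV_UNEXPECTED;
    case DEMO_WC_TM_ADD:
    case DEMO_WC_TM_DEL:
    case DEMO_WC_TM_SYNC:
        return DEMO_EV_LIST_OP;
    default:
        return DEMO_EV_OTHER;
    }
}

/* read_wc_tm_info may only be used when the TM_MATCH flag is set. */
static int demo_can_read_tm_info(uint32_t wc_flags)
{
    return (wc_flags & DEMO_WC_TM_MATCH) != 0;
}
```

    The status field still has to be checked separately, for example to detect the RNDV_INCOMPLETE case before touching the user buffer.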


    Communication Library Implementation Example


    Initiator Side

    1. Create Queue Pairs (QP).

    // create normal communication QP

    struct ibv_qp *qp = ibv_create_qp(pd, &qp_init_attr);


    2. Create the Tag Matching Header (TMH).

    // create ibv_tm_info *tm with header information (TM, RNDV, DC)

    int size = ibv_pack_tm_info(buf, tm);

    // merge buf with the payload – create work requests *wr


    3. Send.

    ibv_post_send(qp, wr, bad_wr);


    4. Poll for completion.

    ibv_poll_cq_ex(send_cq, 1, wc);

    If the protocol is RENDEZVOUS, wait for the final (fin) message.


    Target Side



    1. Begin initializations.

    // Query device to inquire about the HW capabilities of providing TM

    rc = ibv_query_device_ex(ctx, NULL, &device_attr);


    // Allocate the posted receive list (device_attr.tm_caps.max_num_tags entries)


    2. Create the TM SRQ.

    // create a CQ using ibv_create_cq_ex in order to have access to struct ibv_wc_tm_info during polling

    // create the SRQ attributes structure attr, filling in struct ibv_srq_tm_caps *tm and struct ibv_dci_init_attr *tm_dc

    attr.srq_type = IBV_SRQT_TM;

    // attach the created CQ to attr.cq

    struct ibv_srq *srq = ibv_create_srq_ex(context, &attr);


    3. Create a Queue Pair (QP).

    // create the QP, set qp_init_attr->srq to point to srq

    // qp_type is limited to DC and RC, any other values would fail the creation

    struct ibv_qp *qp = ibv_create_qp(pd, &qp_init_attr);


    4. Post buffers for the unexpected Work Queue (WQ).

    // allocate memory for the buffers, create the wr

    ibv_post_srq_recv(srq, &wr, &bad_wr);




    Case 1: The Application Posts a New Receive.

    // SW checks for a match in the unexpected message list for the newly posted receive

    // Two cases: 1. HW has placed the corresponding send to the unexpected message list; 2. The corresponding post send has not yet been received by the HW

    // Parse the unexpected message list and look for a match (stop at the first found)

    // If Msg is not in the unexpected msg list

    // Create ops_WRs *wr in struct ibv_ops_wr with information for the tm_data structure

    // set the match bits and mask and the number of matches done in SW (unexp_cnt)

    // Use opcode IBV_WR_TM_ADD

    ibv_post_srq_ops(srq, wr, bad_wr);

    // int recv_id  = handle returned in the tm_data structure

    // if necessary use the srq.cq to poll for ADD completion

    ibv_poll_cq_ex(srq->cq, 1, wc);    
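
    The software search in Case 1 could look roughly like the sketch below. The matching rule is an assumption (only the tag bits selected by the mask are compared), and the demo_ types are hypothetical library bookkeeping, not Verbs API structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of the library's unexpected message list (hypothetical layout). */
struct demo_unexp_msg {
    uint64_t tag;     /* sender tag taken from the unpacked TM header */
    int      in_use;  /* non-zero while the entry holds a message */
};

/* Assumed matching rule: only the bits selected by mask are compared. */
static int demo_tag_matches(uint64_t posted_tag, uint64_t mask,
                            uint64_t sender_tag)
{
    return ((posted_tag ^ sender_tag) & mask) == 0;
}

/* Returns the index of the first matching unexpected message, or -1 when
 * no match exists and the receive should be passed to the HW with ADD. */
static int demo_find_unexpected(const struct demo_unexp_msg *list, size_t n,
                                uint64_t tag, uint64_t mask)
{
    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (list[i].in_use && demo_tag_matches(tag, mask, list[i].tag))
            return (int)i;
    }
    return -1;
}
```

    Stopping at the first hit keeps messages with equal tags in arrival order, provided the list itself is kept in arrival order.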


    Case 2: An Unexpected Message is Received.

    // if unexpected message received, check for a match (on receive increase the unexp_cnt)

    // If there is a match, use the DEL operation to remove the posted receive entry from the hardware list

       // Create ops_WRs *wr in struct ibv_ops_wr with information for the tm_data structure

       // set the handle to correspond to the corresponding recv_id and the unexp_cnt

       // Use opcode IBV_WR_TM_DEL

       ibv_post_srq_ops(srq, wr, bad_wr);
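
    Case 2 can be sketched in the same spirit. The code below assumes the library keeps a software shadow of the receives it has passed to the hardware, together with the handle of each one. All demo_ names are hypothetical, and the masked tag comparison is an assumed matching rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software shadow of one receive that was handed to the HW with ADD. */
struct demo_posted_recv {
    uint64_t tag;
    uint64_t mask;
    uint32_t handle;  /* handle reported for the ADD operation */
    int      in_use;
};

struct demo_tm_state {
    struct demo_posted_recv *posted;
    size_t                   n_posted;
    uint32_t                 unexp_cnt;  /* unexpected messages seen by SW */
};

/* Called for every unexpected message. Returns 1 and stores the handle to
 * pass with the DEL opcode when a posted receive matches, otherwise 0. */
static int demo_on_unexpected(struct demo_tm_state *st, uint64_t sender_tag,
                              uint32_t *del_handle)
{
    assert(st != NULL && del_handle != NULL);
    st->unexp_cnt++;  /* counted on every receive, matched or not */
    for (size_t i = 0; i < st->n_posted; i++) {
        struct demo_posted_recv *r = &st->posted[i];

        if (r->in_use && ((r->tag ^ sender_tag) & r->mask) == 0) {
            r->in_use = 0;  /* consumed in software */
            *del_handle = r->handle;
            return 1;
        }
    }
    return 0;
}
```

    When the function returns 1, the caller would build a DEL work request from the returned handle and the current unexp_cnt and pass it to ibv_post_srq_ops.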