Understanding On Demand Paging (ODP)

Version 5

    On-Demand Paging (ODP) is a technique that simplifies memory registration. Applications do not need to pin the underlying physical pages of the address space or track the validity of the mappings. Instead, the HCA (Host Channel Adapter) requests the latest translations from the OS when pages are not present, and the OS invalidates translations that are no longer valid because pages have become non-present or mappings have changed.
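
    The fault-and-invalidate cycle described above can be pictured with a small toy model. This is only a sketch for intuition, not how the HCA or driver is implemented: the toy_odp structure, TOY_NUM_PAGES, and both functions are hypothetical names invented for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_PAGES 8 /* hypothetical size of the toy address space */

/* Toy stand-in for the HCA translation table plus two event counters. */
struct toy_odp {
    bool mapped[TOY_NUM_PAGES];  /* true if the "HCA" holds a valid translation */
    unsigned long page_faults;   /* translations requested from the "OS" */
    unsigned long invalidations; /* translations revoked by the "OS" */
};

/*
 * Device-side access to a page.
 * Returns 1 if a translation had to be faulted in, 0 if it was already
 * present, and -1 if the page index is out of range.
 */
static int toy_access(struct toy_odp *t, size_t page)
{
    if (page >= TOY_NUM_PAGES)
        return -1;
    if (t->mapped[page])
        return 0;
    t->page_faults++;
    t->mapped[page] = true;
    return 1;
}

/* OS-side invalidation: drop the translation if one exists. */
static void toy_invalidate(struct toy_odp *t, size_t page)
{
    if (page < TOY_NUM_PAGES && t->mapped[page]) {
        t->mapped[page] = false;
        t->invalidations++;
    }
}
```

    In this model, the first access to a page faults, a second access does not, and an access after toy_invalidate() faults again. No page is ever pinned in advance.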

    This post is intended for IB verbs developers.

     

    ODP Types

     

    ODP can be further divided into 2 subclasses: Explicit and Implicit ODP.

     

    Explicit ODP

    In Explicit ODP, applications still register memory buffers for communication, but registration is used to define access control for IO rather than to pin down the pages.

    An ODP Memory Region (MR) does not need to have valid mappings at registration time.

     

    Implicit ODP

    In Implicit ODP, applications can create a memory key that covers the entire address space of a process. This relieves the application of the burden of memory registration, because it can use a single memory key for all IO accesses.

    Query ODP Capabilities

    On-Demand Paging is available only if both the hardware and the kernel support it. To verify whether ODP is supported, call ibv_exp_query_device:

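    struct ibv_exp_device_attr dattr;
    int ret;

    /* Zero the structure (memset needs <string.h>), then select the optional fields to fill in. */
    memset(&dattr, 0, sizeof(dattr));
    dattr.comp_mask = IBV_EXP_DEVICE_ATTR_ODP | IBV_EXP_DEVICE_ATTR_EXP_CAP_FLAGS;

    ret = ibv_exp_query_device(context, &dattr);
    if (ret == 0 && (dattr.exp_device_cap_flags & IBV_EXP_DEVICE_ODP)) {
        /* On-Demand Paging is supported. */
    }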

     

    Each transport has a capability field in the dattr.odp_caps structure that indicates which operations are supported by the ODP MR:

    struct ibv_exp_odp_caps {

    uint64_t general_odp_caps;

    struct {

    uint32_t rc_odp_caps;

    uint32_t uc_odp_caps;

    uint32_t ud_odp_caps;

    uint32_t dc_odp_caps;

    uint32_t xrc_odp_caps;

    uint32_t raw_eth_odp_caps;

    } per_transport_caps;

    };

     

    To check which operations are supported for a given transport, the capability field needs to be masked with one of the following masks:

    enum ibv_odp_transport_cap_bits {

    IBV_EXP_ODP_SUPPORT_SEND = 1 << 0,

    IBV_EXP_ODP_SUPPORT_RECV = 1 << 1,

    IBV_EXP_ODP_SUPPORT_WRITE = 1 << 2,

    IBV_EXP_ODP_SUPPORT_READ = 1 << 3,

    IBV_EXP_ODP_SUPPORT_ATOMIC = 1 << 4,

    IBV_EXP_ODP_SUPPORT_SRQ_RECV = 1 << 5,

    };

     

    For example, to check whether RC supports send:

    if (dattr.odp_caps.per_transport_caps.rc_odp_caps & IBV_EXP_ODP_SUPPORT_SEND) {
        /* RC supports send operations with an ODP MR. */
    }
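
    When several operations need to be checked at once, it can be handy to decode a whole capability field in one pass. The following is a minimal sketch: the ODP_CAP_* constants are a local mirror of the mask values listed above so that it builds without the MLNX_OFED headers, and odp_caps_to_string is a hypothetical helper name, not part of the verbs API. With the real headers, the IBV_EXP_ODP_SUPPORT_* names would be used instead.

```c
#include <stdint.h>
#include <stdio.h>

/* Local mirror of the mask values listed in the article (assumption:
 * used only so this sketch compiles without the MLNX_OFED headers). */
enum odp_cap_bits {
    ODP_CAP_SEND     = 1 << 0,
    ODP_CAP_RECV     = 1 << 1,
    ODP_CAP_WRITE    = 1 << 2,
    ODP_CAP_READ     = 1 << 3,
    ODP_CAP_ATOMIC   = 1 << 4,
    ODP_CAP_SRQ_RECV = 1 << 5,
};

struct odp_cap_name {
    uint32_t bit;
    const char *name;
};

static const struct odp_cap_name odp_cap_names[] = {
    { ODP_CAP_SEND, "SEND" },     { ODP_CAP_RECV, "RECV" },
    { ODP_CAP_WRITE, "WRITE" },   { ODP_CAP_READ, "READ" },
    { ODP_CAP_ATOMIC, "ATOMIC" }, { ODP_CAP_SRQ_RECV, "SRQ_RECV" },
};

/*
 * Write a space-separated list of the operations set in caps into buf.
 * Returns the number of operation names written. Names that do not fit
 * in the buffer are dropped rather than truncated.
 */
static int odp_caps_to_string(uint32_t caps, char *buf, size_t len)
{
    size_t used = 0;
    int count = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(odp_cap_names) / sizeof(odp_cap_names[0]); i++) {
        if (!(caps & odp_cap_names[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         count ? " " : "", odp_cap_names[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0'; /* undo the partial write and stop */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

    A caller would pass one of the per-transport fields, for example rc_odp_caps, together with a buffer of a few dozen bytes, and log the resulting string.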

     

    MR Registration

    Explicit ODP MR Registration

    An Explicit ODP MR is registered like any other MR, after the necessary resources (e.g., PD, buffer) have been allocated.

    The user indicates that the requested MR is an ODP MR by setting the IBV_EXP_ACCESS_ON_DEMAND bit in ibv_exp_reg_mr_in.exp_access:

    struct ibv_exp_reg_mr_in in;

    struct ibv_mr *mr;

    in.pd = pd;

    in.addr = buf;

    in.length = size;

    in.exp_access = IBV_EXP_ACCESS_ON_DEMAND | … ;

    in.comp_mask = 0;

    mr = ibv_exp_reg_mr(&in);

     

    Note that the other exp_access flags depend on the operations the MR must support, but IBV_EXP_ACCESS_ON_DEMAND is set for all ODP MRs.

    For further information, please refer to the ibv_exp_reg_mr manual page.

     

    Implicit ODP MR Registration

    Registering an Implicit ODP MR provides you with an MR that covers the entire address space of the process.

    To register an Implicit ODP MR, set the IBV_EXP_ACCESS_ON_DEMAND access flag and, in addition, use in.addr = 0 and in.length = IBV_EXP_IMPLICIT_MR_SIZE.

    For further information, please refer to the ibv_exp_reg_mr manual page.

     

    Older versions of MLNX_OFED emulated an implicit ODP MR in software. The emulated implicit ODP MR supported local operations only and restricted scatter/gather entries to 128 MB.

    These restrictions were removed in ConnectX-4.

     

    To verify whether hardware-based Implicit ODP MR is supported, call ibv_exp_query_device as described above and check the following capability:

    if (dattr.odp_caps.general_odp_caps & IBV_EXP_ODP_SUPPORT_IMPLICIT) {
        /* Implicit MR is supported. */
    }

     

    De-Registration of ODP MR

    An ODP MR is deregistered the same way as a regular MR:

    ibv_dereg_mr(mr);

     

    Pre-fetching Verb

    The driver can prefetch a given range of pages and map them for access from the HCA. The prefetch verb applies to ODP MRs only. It is performed on a best-effort basis and may silently ignore errors.

    Example:

    struct ibv_exp_prefetch_attr prefetch_attr;

    prefetch_attr.flags = IBV_EXP_PREFETCH_WRITE_ACCESS;

    prefetch_attr.addr = addr;

    prefetch_attr.length = length;

    prefetch_attr.comp_mask = 0;

    ibv_exp_prefetch_mr(mr, &prefetch_attr);

    For further information, please refer to the ibv_exp_prefetch_mr manual page.
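
    Because prefetching works on whole pages, it can help to know how many pages a byte range touches before calling the verb, for example when comparing against the page counters described later. The following is a minimal sketch under two assumptions that this post does not state: that pages are counted at the system page size, and that the page size is a power of two. The helper name pages_spanned is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of pages touched by the byte range [addr, addr + length).
 * page_size must be a power of two (e.g. 4096).
 */
static size_t pages_spanned(uintptr_t addr, size_t length, size_t page_size)
{
    uintptr_t mask = ~((uintptr_t)page_size - 1);

    if (length == 0)
        return 0;

    uintptr_t first = addr & mask;                /* page holding the first byte */
    uintptr_t last = (addr + length - 1) & mask;  /* page holding the last byte */

    return (size_t)((last - first) / page_size) + 1;
}
```

    For instance, a 4096-byte buffer that starts in the middle of a 4096-byte page spans two pages, so an unaligned prefetch range may cover one more page than length / page_size suggests. On Linux, the page size can be obtained with sysconf(_SC_PAGESIZE).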

     

     

    ODP Statistics

    To aid debugging, performance measurement, and tuning, ODP support includes an extensive set of statistics. The statistics are divided into 2 sets: standard statistics and debug statistics.

    Both sets are maintained on a per-device basis and report the total number of events since the device was registered.

     

    The standard statistics are reported as sysfs entries with the following format:

    # ls /sys/class/infiniband_verbs/uverbs[0/1]/

    invalidations_faults_contentions

    num_invalidation_pages

    num_invalidations

    num_page_fault_pages

    num_page_faults

    num_prefetches_handled

    num_prefetch_pages

    ...

     

    Counter                           Description
    invalidations_faults_contentions  Number of times that page fault events were dropped or prefetch operations were restarted due to OS page invalidations.
    num_invalidation_pages            Total number of pages invalidated during all invalidation events.
    num_invalidations                 Number of invalidation events.
    num_page_fault_pages              Total number of pages faulted in by page fault events.
    num_page_faults                   Number of page fault events.
    num_prefetches_handled            Number of prefetch verb calls that were completed successfully.
    num_prefetch_pages                Total number of pages that were prefetched by the prefetch verb.
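
    The counters can also be read programmatically. The following is a minimal sketch, assuming each sysfs entry holds a single unsigned decimal value (this post does not specify the file format). The helper names read_counter and avg_pages_per_fault are hypothetical.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal counter from a file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_counter(const char *path, unsigned long long *value)
{
    FILE *f = fopen(path, "r");
    int parsed;

    if (!f)
        return -1;
    parsed = (fscanf(f, "%llu", value) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}

/* Average number of pages brought in per page fault event (0.0 if none). */
static double avg_pages_per_fault(unsigned long long pages,
                                  unsigned long long faults)
{
    return faults ? (double)pages / (double)faults : 0.0;
}
```

    A caller might read num_page_fault_pages and num_page_faults from /sys/class/infiniband_verbs/uverbs0/ (the device index 0 is an assumption) and pass both values to avg_pages_per_fault() to see how many pages a typical fault resolves.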

     

    The debug statistics are reported by debugfs entries with the following format:

    # ls /sys/kernel/debug/mlx5/<pci-dev-id>/odp_stats/

    num_failed_resolutions

    num_mrs_not_found

    num_odp_mr_pages

    num_odp_mrs

     

    Counter                 Description
    num_failed_resolutions  Number of page faults that could not be resolved due to non-existing mappings in the OS.
    num_mrs_not_found       Number of faults that specified a non-existing ODP MR.
    num_odp_mr_pages        Total size in pages of current ODP MRs.
    num_odp_mrs             Number of current ODP MRs.