Understanding Erasure Coding Offload

Version 9

    This post describes the Reed-Solomon Erasure Coding hardware offload feature supported on Mellanox ConnectX-4 adapters.

    This post is intended for advanced users and developers.

     

    Note: This feature is not supported on the ConnectX-3/ConnectX-3 Pro family.

     


    What is Reed-Solomon Erasure Coding?

    Erasure coding is a mathematical method of encoding data with added redundancy so that the data can be recovered after disk failures.

    Readers without a background in storage recovery are advised to watch this video and this presentation, which explain the algorithm at a high level and provide examples and illustrations.

     

    Erasure Coding Offload Programming Models

    An application can choose among several programming models to implement RAID and Erasure Coding (EC) offload. The examples in this post use 5-2 coding (5 data blocks and 2 calculated redundancy blocks).

    Erasure Coding and Decoding hardware offload is supported by Mellanox ConnectX-4 adapters.

     

    Software Calculations (No offload)

    Most applications today perform Erasure Coding calculations in software and then send the data and redundancy blocks to the relevant nodes/disks (OSDs).

    In these cases the calculations are performed by the CPU, so there is no hardware offload. CPU utilization is high in this case, as is the number of IO operations.
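
As a rough illustration of the work the CPU does in this model, the sketch below computes the two redundancy blocks of a 5-2 layout in plain Python. It assumes a RAID-6 style P/Q code over GF(2^8) with the polynomial 0x11D, which is one simple member of the Reed-Solomon family; the coding matrix used by a given library or by the adapter may differ. The names gf_mul2 and encode_5_2 are made up for this example.

```python
def gf_mul2(x: int) -> int:
    """Multiply one byte by 2 in GF(2^8), reducing by the polynomial 0x11D."""
    x <<= 1
    return (x ^ 0x11D) & 0xFF if x & 0x100 else x


def encode_5_2(data: list[bytes]) -> tuple[bytes, bytes]:
    """Return the redundancy blocks (P, Q) for five equal-sized data blocks."""
    block_size = len(data[0])
    if len(data) != 5 or any(len(block) != block_size for block in data):
        raise ValueError("expected 5 data blocks of equal size")
    p = bytearray(block_size)
    q = bytearray(block_size)
    # Horner's rule, highest block first: Q = d0 + 2*d1 + 4*d2 + 8*d3 + 16*d4,
    # with all sums and products taken in GF(2^8). P is the plain XOR parity.
    for block in reversed(data):
        for i, byte in enumerate(block):
            p[i] ^= byte
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(p), bytes(q)


data = [bytes([n]) * 4 for n in range(1, 6)]   # five 4-byte data blocks
p, q = encode_5_2(data)

# A single lost data block can be rebuilt from P and the four survivors.
lost = 2
rebuilt = bytearray(p)
for n, block in enumerate(data):
    if n != lost:
        for i, byte in enumerate(block):
            rebuilt[i] ^= byte
```

Every byte of every block passes through the inner loop, which is the CPU cost that the offload models described next aim to remove.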

    [Figure: 1.png]

     

     

     

    Hardware Offload

    Using Mellanox ConnectX-4 adapters, Erasure Coding calculations can be offloaded to the adapter's ASIC.

    There are several models of hardware offload, as described below.

     

    Synchronous Encode Calculations

    Existing applications can adopt synchronous EC calculation simply by replacing calls to a JErasure-like API with calls to the hardware-driven EC calculation.

     

    [Figure: 2.png]

     

     

    Programming Steps:

    1. Call encode_sync(data, code, block_size) API – blocking.
    2. Send the data buffers to the corresponding nodes (no ordering dependency on step 1).
    3. Once encode_sync() returns, send the code buffers to the corresponding nodes.
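
The steps above can be sketched as follows. This is only an outline under stated assumptions: encode_sync here is a software stand-in that takes the same arguments as the call named in step 1, not the adapter's real API, and send, stripe_sync and the node names are hypothetical. A thread pool plays the role of the network so that the data sends can proceed while the blocking encode runs.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_sync(data, code, block_size):
    """Software stand-in for the blocking offload call (assumed signature).

    Fills code[0] with XOR parity (P) and code[1] with a RAID-6 style Q block.
    """
    for i in range(block_size):
        p = q = 0
        for block in reversed(data):
            q = ((q << 1) ^ (0x11D if q & 0x80 else 0)) & 0xFF  # q * 2 in GF(2^8)
            p ^= block[i]
            q ^= block[i]
        code[0][i], code[1][i] = p, q


def stripe_sync(data, nodes, send):
    """Send 5 data blocks and 2 code blocks to 7 nodes using the blocking encode."""
    block_size = len(data[0])
    code = [bytearray(block_size) for _ in range(2)]
    with ThreadPoolExecutor() as pool:
        # Step 2: the data sends do not depend on the encode, so post them first.
        sends = [pool.submit(send, node, blk) for node, blk in zip(nodes, data)]
        # Step 1: blocking call; this thread waits until the code blocks are ready.
        encode_sync(data, code, block_size)
        # Step 3: only now are the code buffers valid and safe to send.
        sends += [pool.submit(send, node, bytes(blk))
                  for node, blk in zip(nodes[len(data):], code)]
        for s in sends:
            s.result()  # re-raise any send failure
    return code


sent = {}


def send(node, buf):
    """Hypothetical transport: just records what each node would receive."""
    sent[node] = bytes(buf)


nodes = [f"osd{n}" for n in range(7)]          # hypothetical node names
data = [bytes([n]) * 4 for n in range(1, 6)]   # five 4-byte data blocks
stripe_sync(data, nodes, send)
```

The calling thread still stalls for the duration of the encode, which is the limitation the asynchronous models address.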

     

    The advantages here are:

    • Easy code conversion to work with a hardware-driven EC calculation
    • CPU utilization savings as the HCA offloads the EC calculation

     

    Note: Latency, message rate, and bandwidth are less likely to improve directly with this model.

     

    Asynchronous Encode Calculations

    Asynchronous EC calculation is also possible: the application posts an EC calculation operation and receives an event notification when the calculation is done. This model makes the application more efficient, because while it waits for the EC calculation it is free to compute, execute, or service any other job.

    When the adapter completes the calculation, it notifies the application so it can continue to send the data and coding blocks to the relevant peers (OSDs).

     

    [Figure: 3.png]

     

     

    Programming Steps:

    1. Post encode_async(data, code, block_size, …).
    2. Send the data buffers to the corresponding nodes (no ordering dependency on step 1).
    3. Wait for encode_async to complete (asynchronously); the application is free to continue executing other tasks in the meantime.
    4. Send the code buffers to the corresponding nodes.

     

    Unlike the synchronous model, the call to async encoding is non-blocking and is a fast operation. The calling thread provides a done() function pointer for post-calc execution and continues with other compute operations. It may, for instance, post the sending of the data blocks at this stage. Once the calculation is completed, the done() function passed by the calling thread is executed. The done() function is expected to trigger the sending of the code blocks.
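
A minimal sketch of this callback flow is shown below, assuming a single worker thread stands in for the adapter. The encode_async signature with a trailing done argument is a guess at what the elided parameters might contain, and _encode, send and the node names are hypothetical helpers, not part of any real driver API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_adapter = ThreadPoolExecutor(max_workers=1)  # worker thread standing in for the HCA


def _encode(data, code, block_size):
    """Software stand-in: code[0] = XOR parity (P), code[1] = RAID-6 style Q."""
    for i in range(block_size):
        p = q = 0
        for block in reversed(data):
            q = ((q << 1) ^ (0x11D if q & 0x80 else 0)) & 0xFF  # q * 2 in GF(2^8)
            p ^= block[i]
            q ^= block[i]
        code[0][i], code[1][i] = p, q


def encode_async(data, code, block_size, done):
    """Hypothetical non-blocking post: returns at once, calls done(code) later."""
    def job():
        _encode(data, code, block_size)
        done(code)
    return _adapter.submit(job)


sent = {}


def send(node, buf):
    """Hypothetical transport: just records what each node would receive."""
    sent[node] = bytes(buf)


nodes = [f"osd{n}" for n in range(7)]          # hypothetical node names
data = [bytes([n]) * 4 for n in range(1, 6)]   # five 4-byte data blocks
code = [bytearray(4), bytearray(4)]
finished = threading.Event()


def done(code_blocks):
    # Runs once the calculation has completed: trigger the code-block sends.
    for node, blk in zip(nodes[5:], code_blocks):
        send(node, blk)
    finished.set()


encode_async(data, code, 4, done)        # step 1: post and return immediately
for node, blk in zip(nodes[:5], data):   # step 2: overlaps with the encode
    send(node, blk)
finished.wait(timeout=5)                 # step 3: a real caller would do other work
_adapter.shutdown()
```

In this sketch done() runs on the worker thread, so it should stay short; a real application would more likely have it post the sends than perform them inline.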

     

    Asynchronous Encode and Send Calculations

    Generally, EC is used to spread an object store across multiple nodes (OSDs) in a cluster; this operation is called striping an object. Applications using this model can post a compound job that includes the entire striping operation to the adapter and continue to execute the next task. The adapter both performs the EC calculation and sends the resulting blocks to the corresponding nodes.

    [Figure: 4.png]

     

    Programming Steps:

    1. Post a compound operation encode_send(data, code, block_size, nodes, …).
    2. The application is free to continue executing other tasks.

     

    The completion of the EC calculation is implicit. The entire transaction is considered complete once all of the individual data transfer operations have completed. The user might want to signal the corresponding SEND operations, or use any other mechanism that guarantees that the data transfer completed successfully and that the data is stored on the storage media.
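
The completion rule can be sketched as below, assuming a thread pool stands in for both the adapter and the wire. The trailing send argument of encode_send is a guess at the elided parameters, and _encode, send and the node names are hypothetical. The caller never observes the encode itself; it only checks that every transfer finished.

```python
from concurrent.futures import ThreadPoolExecutor, wait

_adapter = ThreadPoolExecutor(max_workers=4)  # stands in for the HCA and the wire


def _encode(data, code, block_size):
    """Software stand-in: code[0] = XOR parity (P), code[1] = RAID-6 style Q."""
    for i in range(block_size):
        p = q = 0
        for block in reversed(data):
            q = ((q << 1) ^ (0x11D if q & 0x80 else 0)) & 0xFF  # q * 2 in GF(2^8)
            p ^= block[i]
            q ^= block[i]
        code[0][i], code[1][i] = p, q


def encode_send(data, code, block_size, nodes, send):
    """Hypothetical compound post: the encode and all sends run without the caller.

    Returns completion handles for the data sends and the encode-then-send step.
    """
    def encode_then_send_code():
        _encode(data, code, block_size)
        for node, blk in zip(nodes[len(data):], code):
            send(node, bytes(blk))

    completions = [_adapter.submit(send, node, blk) for node, blk in zip(nodes, data)]
    completions.append(_adapter.submit(encode_then_send_code))
    return completions


sent = {}


def send(node, buf):
    """Hypothetical transport: just records what each node would receive."""
    sent[node] = bytes(buf)


nodes = [f"osd{n}" for n in range(7)]          # hypothetical node names
data = [bytes([n]) * 4 for n in range(1, 6)]   # five 4-byte data blocks
code = [bytearray(4), bytearray(4)]

completions = encode_send(data, code, 4, nodes, send)   # step 1: post the compound job
# Step 2: the application is free to run other tasks here.
wait(completions)                                       # all transfers have finished
stripe_ok = all(c.exception() is None for c in completions)
_adapter.shutdown()
```

Here stripe_ok only says that the sends returned without error; confirming that the data reached the storage media would need an acknowledgement from each node, which this sketch does not model.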

     

    The advantage of this method is reduced latency and message rates (IO operations), as the application saves the SW interrupt and cache-line bounces (due to completion context execution) that exist in the former two models. However, converting existing applications to work with a fully offloaded striping operation is less trivial.

     

    Example Code is attached below.