Bonding Considerations for RDMA Applications

Version 6

    This post discusses some of the bonding considerations for RDMA (over Ethernet or IB).

     

    • Mellanox kernel network drivers support bonding on both Ethernet (EN) and IPoIB interfaces. This is relevant for all socket-based TCP/UDP IP traffic running on top of the bonded interface.
    • Today there is no bonding support at the MLNX_OFED driver or FW level for any user-space RDMA RC QP (RoCE or IB). Each solution needs to be designed and written to take bond events and fail-over logic into consideration. In other words, running RDMA applications over a bonded interface will work, but they will probably stop working on any link failure event unless the applications know how to handle bond failures (e.g. QP or port errors). The kernel's bonding support is not relevant for any type of RDMA solution (like iSER, MPI, UDA or R4H).
    • All applications can work over a bonded interface. However, when failover occurs there are two types of applications: those that react to RDMA_CM_EVENT_ADDR_CHANGE and those that ignore it. Those that react will reconnect (meaning the current QP will be destroyed), and those that don't will fail completely.
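
The difference between the two application types can be sketched as a small simulation. This is only an illustration: it does not call librdmacm, and apart from the event name RDMA_CM_EVENT_ADDR_CHANGE every name in it (Connection, handles_addr_change, simulate_failover) is hypothetical.

```python
from dataclasses import dataclass, field

# Real librdmacm event name; everything else in this sketch is hypothetical.
ADDR_CHANGE = "RDMA_CM_EVENT_ADDR_CHANGE"


@dataclass
class Connection:
    """Toy stand-in for an RC QP connection (not a real librdmacm object)."""
    handles_addr_change: bool
    qp_generation: int = 1          # bumped each time a new QP is created
    state: str = "connected"
    log: list = field(default_factory=list)

    def on_event(self, event: str) -> None:
        if event != ADDR_CHANGE:
            return                  # other events are out of scope here
        if self.handles_addr_change:
            # Reacting application: destroy the current QP and reconnect.
            self.log.append(f"destroy QP #{self.qp_generation}")
            self.qp_generation += 1
            self.log.append(f"connect QP #{self.qp_generation}")
            self.state = "connected"
        else:
            # Ignoring application: the old QP stays on the failed port.
            self.state = "failed"


def simulate_failover(handles_addr_change: bool) -> Connection:
    """Deliver the event a bond fail-over would raise and return the result."""
    conn = Connection(handles_addr_change)
    conn.on_event(ADDR_CHANGE)
    return conn
```

In a real application the same decision would live in the loop that reads events from the RDMA CM event channel; the sketch only shows that a reconnect always means a new QP.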

     

     

    Application Examples:

    • Most MPI applications do not have the logic to handle bond failures. Upon link failure, the user will need to restart the application.

     

    • Running iSER over a bonded (active/passive) interface will work. iSER implements auto-reconnect on every failure, so once the bond driver switches to the other active port, iSER will establish a new RC QP and continue working.
      The required bonding definition is: "active/passive with fail_over_mac=1".

     

    • Running UDA (MapReduce acceleration) over a bonded (active/passive) interface will work. However, the current UDA (RDMA RC QP) does not implement any reconnect or fail-over logic to support bonding events.
      Upon a link failure the bond interface will get updated, but UDA will not follow. MapTasks will continue to work and their TCP connections will recover immediately, but only after a timeout without updates will the Hadoop framework re-spawn the ReduceTasks, at which point UDA will reconnect and continue to operate. This re-spawn of a ReduceTask (with UDA) induces a big latency penalty on the job: MapReduce is an all-to-all job, so even a single port failure will cause all tasks in the job to restart, while vanilla Hadoop (TCP sockets) will just have a small hiccup.

      The required bonding definition is: "active/passive with fail_over_mac=1".

      • Note: it is possible to tune the MapReduce re-launch timer. The parameter is called mapreduce.task.timeout and is defined as the number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. Be aware that some tasks may perform long calculations before performing any of these actions. The default Apache Hadoop value is 10 minutes (600000 milliseconds).
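
As a rough illustration of the timeout note above, the sketch below converts a timeout in minutes into the millisecond value that mapreduce.task.timeout expects and renders it as a Hadoop-style property element. The helper name task_timeout_property is hypothetical, and the assumption that the property is set in mapred-site.xml should be checked against your Hadoop distribution.

```python
DEFAULT_TASK_TIMEOUT_MS = 600_000  # Apache Hadoop default: 10 minutes


def minutes_to_ms(minutes: float) -> int:
    """Convert minutes to whole milliseconds (positive values only in this sketch)."""
    if minutes <= 0:
        raise ValueError("timeout must be positive in this sketch")
    return round(minutes * 60_000)


def task_timeout_property(minutes: float) -> str:
    """Render a <property> element (hypothetical helper, assumed mapred-site.xml layout)."""
    return (
        "<property>\n"
        "  <name>mapreduce.task.timeout</name>\n"
        f"  <value>{minutes_to_ms(minutes)}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # Example: shorten the re-launch timer from 10 minutes to 2 minutes.
    print(task_timeout_property(2))
```

A shorter value reduces the recovery delay after a port failure, but tasks that legitimately compute for longer than the timeout without reporting progress would be terminated as well.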

         

    • Running R4H (HDFS acceleration) over a bonded (active/passive) interface will work. Upon failure, R4H will report the failure to the upper application layer. If that layer supports reconnecting to HDFS, R4H will follow.
      R4H is built on top of JXIO/AccelIO, which has reconnect logic built in. The next release will enable this by default, adding auto-reconnect logic to the R4H layer.
      The required bonding definition is: "active/passive with fail_over_mac=1".

     

    • VMA supports bonded (active/passive) interfaces and implements the fail-over logic. Any socket application loaded with VMA has the same experience as if it were running over the OS sockets.
      We plan to add Active-Active (LAG) support in one of the coming releases.
      Note: VMA uses UD QP (on IB) and RAW QP (on Ethernet)
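
Several of the examples above require an "active/passive with fail_over_mac=1" bond. The sketch below shows one possible way to verify this on a Linux host. It assumes that "active/passive" corresponds to the Linux bonding driver's active-backup mode, that the driver exposes its settings under /sys/class/net/<bond>/bonding/, and that each file holds a name followed by a number (e.g. "active-backup 1"). The function names and the bond0 default are hypothetical.

```python
from pathlib import Path


def first_token(text: str) -> str:
    """Return the first whitespace-separated token, or '' if there is none."""
    parts = text.split()
    return parts[0] if parts else ""


def is_required_bond_config(mode_text: str, fail_over_mac_text: str) -> bool:
    """Check raw sysfs contents for active-backup mode with fail_over_mac=1.

    Assumes the 'name number' layout, e.g. 'active-backup 1' and 'active 1'.
    """
    mode_ok = first_token(mode_text) == "active-backup"
    # fail_over_mac=1 is reported by its name, 'active'; accept a bare '1' too.
    fom_ok = first_token(fail_over_mac_text) in ("active", "1")
    return mode_ok and fom_ok


def check_bond(bond: str = "bond0", sysfs_root: str = "/sys/class/net") -> bool:
    """Read the (assumed) sysfs files for one bond and validate them."""
    base = Path(sysfs_root) / bond / "bonding"
    return is_required_bond_config(
        (base / "mode").read_text(),
        (base / "fail_over_mac").read_text(),
    )


if __name__ == "__main__":
    try:
        ok = check_bond("bond0")
    except OSError as exc:  # no such bond, or not a Linux bonding host
        print(f"could not read bonding settings: {exc}")
    else:
        print("bond0 matches the required definition" if ok else "bond0 does NOT match")
```

Keeping the parsing separate from the file access makes the check easy to exercise with captured sysfs output from another machine.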