RDMA Improves HDFS Write CPU Utilization by 40%


    This post presents benchmark results comparing TCP versus RDMA connectivity using Apache Hadoop HDFS (Hadoop Distributed File System).

    This post assumes familiarity with RDMA and Hadoop HDFS terminology.

    The benchmark used in this case is the TestDFSIO tool (see below). It measures "MapReduce CPU time" as a function of file size while writing files of different sizes over a 10GbE network, with and without RDMA enabled.

    The RDMA enabling of HDFS was done with our R4H (RDMA for HDFS) plugin.
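    R4H plugs in beneath the standard HDFS client API, so jobs select it simply by choosing a different FileSystem implementation. The benchmark script at the end of this post does this per job with -Dfs.hdfs.impl; a sketch of the equivalent cluster-wide setting in core-site.xml (assuming the R4H jar is already on the classpath):

```xml
<!-- core-site.xml sketch: select the R4H client implementation cluster-wide.
     The per-job -Dfs.hdfs.impl override used in the benchmark script below
     achieves the same effect for a single job. -->
<property>
  <name>fs.hdfs.impl</name>
  <value>com.mellanox.r4h.DistributedFileSystem</value>
</property>
```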




    The Benchmark

    The TestDFSIO Write test reads and writes data to HDFS, and simulates stress testing according to a given configuration.


    Each run used one of the fixed file sizes below while the number of files was varied:


    • File size: 0.5GB, 1GB, 4GB
    • Number of files: 3, 6, 9, 12, 15, 18, 21, 24


    For better disk bandwidth (BW) utilization, the setup was configured to allow a maximum of 8 containers/files per node. This limits the TestDFSIO job to writing at most 23 files simultaneously: since a maximum of 24 containers can be spawned (8 x 3 DNs), one of them is taken by the application master and the remaining 23 run map tasks, one file each.
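    The container arithmetic above can be sketched as a few lines of shell, using only the numbers stated in this setup:

```shell
# Container budget for the TestDFSIO job (values from the setup above)
containers_per_node=8
datanodes=3
total_containers=$((containers_per_node * datanodes))  # 8 x 3 = 24
# One container is taken by the MapReduce Application Master;
# the rest are available for map tasks, one written file each
max_parallel_files=$((total_containers - 1))
echo "$max_parallel_files"  # prints 23
```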

    Each cycle is run both with and without R4H, the plugin that enables RDMA acceleration for HDFS.

    For each configuration, three samples were run and the average of the results is reported.



    4 hosts (3 of them datanodes), each connected via one 10GbE port to the SX1036 Ethernet switch.



    • 2 X Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, 6 cores, hyper-threading enabled
    • 4 X SATA HDD, 7.2K RPM
    • Network Adapter - Mellanox ConnectX-3
    • Ethernet Switch - Mellanox SX1036 40/56 GbE
    • Port speed - 10 GbE



    • RedHat 6.4
    • MLNX_OFED 2.4
    • CDH 5.1.2 (Cloudera)
    • R4H v1.2


    Hadoop Configuration

    The following configuration allows a maximum of 8 containers per node (all parameters were updated under YARN -> Configuration in the Cloudera UI).

    The configuration follows Hortonworks recommendations.

    • yarn.scheduler.minimum-allocation-mb=4096
    • yarn.scheduler.maximum-allocation-mb=32768
    • yarn.nodemanager.resource.memory-mb=32768
    • mapreduce.map.memory.mb=4096
    • mapreduce.map.java.opts=-Xmx3276m
    • mapreduce.reduce.memory.mb=4096
    • mapreduce.reduce.java.opts=-Xmx3276m
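    The heap sizes above appear to follow the common rule of thumb of setting -Xmx to roughly 80% of the container memory, leaving headroom for JVM overhead. A sketch of that arithmetic (an assumption inferred from the values, not stated in the original configuration notes):

```shell
# Sketch: derive the -Xmx value from the 4096MB container size,
# assuming the ~80% rule of thumb for JVM heap vs. container memory
container_mb=4096
heap_mb=$((container_mb * 80 / 100))
echo "-Xmx${heap_mb}m"  # prints -Xmx3276m, matching the config above
```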




    The chart below shows that using RDMA over Mellanox ConnectX-3 adapters with the Mellanox R4H plugin consistently improves MapReduce CPU time by 40-50%.

    The results were similar in every setup tested, meaning that you can load your production system with additional application jobs and still achieve high utilization of your systems.

    In this setup, the disk IO (4 x HDD per server) is the main bottleneck. Therefore, we mainly see CPU improvement.

    Due to this disk bottleneck, the job execution time is almost the same for R4H and vanilla HDFS (R4H is slightly better).

    The chart below shows the average job duration across all measurements.

    Here is a short script to run this benchmark:



    for conf in "org.apache.hadoop.hdfs.DistributedFileSystem" "com.mellanox.r4h.DistributedFileSystem"
    do
      for f in 3 6 9 12 15 18 21 24
      do
        for s in 500 1000 4000
        do
          echo "**** running with the following parameters: FilesNum=$f FileSize=$s hdfsplugin=$conf ****"
          hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient.jar TestDFSIO -Ddfs.replication=3 -Dfs.hdfs.impl=$conf -write -nrFiles $f -fileSize ${s}MB
        done
      done
    done