Reference Deployment Guide for RDMA over Converged Ethernet (RoCE) accelerated Apache Spark 2.2.0 over Mellanox 100GbE Network

Version 40

    In this document, we demonstrate a multi-node cluster deployment procedure for RoCE-accelerated Apache Spark 2.2.0 over a Mellanox end-to-end 100 Gb/s Ethernet solution.

    This document describes the process of installing a pre-built Spark 2.2.0 standalone cluster of 17 physical nodes running Ubuntu 16.04.3 LTS. The HDFS cluster includes 1 namenode server and 16 datanodes.

    We will also prepare the network for RoCE traffic according to our recommendations.

     


    Overview

    What is Apache Spark™?

    Apache Spark™ is an open-source, fast and general engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.

     

    Mellanox SparkRDMA Plugin

     

    Apache Spark™ replaces MapReduce

    MapReduce, as implemented in Hadoop, is a popular and widely used engine. Despite its popularity, MapReduce suffers from high latency, and its batch-mode response is painful for many applications that process and analyze data.
    Apache Spark is a general-purpose engine like MapReduce, but it is designed to run much faster and to support many more workloads.
    One of the most interesting features of Spark is its efficient use of memory, whereas MapReduce has always worked primarily with data stored on disk.

     

    Accelerating Spark Shuffle

    Shuffling is the process of redistributing data across partitions (that is, re-partitioning) between stages of computation. It is a costly process that should be avoided when possible.
    In Hadoop, the shuffle writes intermediate files to disk, and these files are pulled by the next step/stage. With the Spark shuffle, datasets are kept in memory, keeping the data within reach.
    However, when working in a cluster, network resources are required for fetching data blocks, which adds to the overall execution time.
    The SparkRDMA plugin accelerates the network fetch of data blocks using RDMA/RoCE technology, which reduces CPU usage and overall execution time.

     

     

    SparkRDMA Plugin

    The SparkRDMA plugin is a high-performance, scalable and efficient open-source ShuffleManager plugin for Apache Spark.
    It utilizes RDMA/RoCE (Remote Direct Memory Access / RDMA over Converged Ethernet) technology to reduce the CPU cycles needed for shuffle data transfers, and it reduces memory usage by reusing memory for transfers rather than copying data multiple times, as the traditional TCP stack does.
    The SparkRDMA plugin is built to provide the best performance out of the box. Additionally, it provides multiple configuration options to further tune SparkRDMA on a per-job basis.

     

    Performance

    Lab tests show a 1.53x reduction in overall execution time on TeraSort

    https://user-images.githubusercontent.com/1121987/38248890-83097350-3752-11e8-8aba-5cc8ed89a9b2.png

     

    and a 1.39x reduction on Sort

    https://user-images.githubusercontent.com/1121987/38267993-ac3089d4-3785-11e8-91f9-016d363a8fb1.png

    Setup Overview

    Before you start, make sure you are familiar with the Apache Spark multi-node cluster architecture; see Overview - Spark 2.2.0 Documentation for more info.

     

    Equipment

    In the distributed Spark/HDFS configuration described in this guide, we use the following hardware specification.

     

     

    This document does not cover the servers’ storage aspect. You should configure the servers with the storage components appropriate to your use case (data set size).

     

    Setup Logical Design

     

     

    Spark Cluster Logical Design

    Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

    Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (in this document we will use standalone and YARN), which allocate resources across applications.
    Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.
    Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

    Spark cluster components

    HDFS Architecture

    Server Wiring

     

    In our reference deployment, we wire the 1st port to an Ethernet switch and do not use the 2nd port.

    We will cover the procedure later in the Installing Mellanox OFED section.

     

    Server Block Diagram

    Network Configuration

    Each server is connected to the SN2700 switch by a 100GbE copper cable.

    The switch port connectivity in our case is as follows:

     

    • 1st port – connected to the Namenode Server
    • 2nd-17th ports – connected to the Worker Servers

     

    Server names and network configuration are provided below.

    Server type        Server name             Internal network       External network
    Master/Namenode    Node 01 (clx-mld-41)    enp1f0: 31.31.31.41    eno1: From DHCP (reserved)
    Worker             Node 02 (clx-mld-42)    enp1f0: 31.31.31.42    eno1: From DHCP (reserved)
    Worker             Node 03 (clx-mld-43)    enp1f0: 31.31.31.43    eno1: From DHCP (reserved)
    Worker             Node 04 (clx-mld-44)    enp1f0: 31.31.31.44    eno1: From DHCP (reserved)
    Worker             Node 05 (clx-mld-45)    enp1f0: 31.31.31.45    eno1: From DHCP (reserved)
    Worker             Node 06 (clx-mld-46)    enp1f0: 31.31.31.46    eno1: From DHCP (reserved)
    Worker             Node 07 (clx-mld-47)    enp1f0: 31.31.31.47    eno1: From DHCP (reserved)
    Worker             Node 08 (clx-mld-48)    enp1f0: 31.31.31.48    eno1: From DHCP (reserved)
    Worker             Node 09 (clx-mld-49)    enp1f0: 31.31.31.49    eno1: From DHCP (reserved)
    Worker             Node 10 (clx-mld-50)    enp1f0: 31.31.31.50    eno1: From DHCP (reserved)
    Worker             Node 11 (clx-mld-51)    enp1f0: 31.31.31.51    eno1: From DHCP (reserved)
    Worker             Node 12 (clx-mld-52)    enp1f0: 31.31.31.52    eno1: From DHCP (reserved)
    Worker             Node 13 (clx-mld-53)    enp1f0: 31.31.31.53    eno1: From DHCP (reserved)
    Worker             Node 14 (clx-mld-54)    enp1f0: 31.31.31.54    eno1: From DHCP (reserved)
    Worker             Node 15 (clx-mld-55)    enp1f0: 31.31.31.55    eno1: From DHCP (reserved)
    Worker             Node 16 (clx-mld-56)    enp1f0: 31.31.31.56    eno1: From DHCP (reserved)
    Worker             Node 17 (clx-mld-57)    enp1f0: 31.31.31.57    eno1: From DHCP (reserved)

     

     

    Deployment Guide

     

    Prerequisites

     

    Updating Ubuntu Software Packages on the Master and Worker Servers

    To update/upgrade Ubuntu software packages, run the commands below.

    $ sudo apt-get update            # Fetches the list of available updates

    $ sudo apt-get upgrade -y        # Strictly upgrades the current packages

    Installing General Dependencies on the Master and Worker Servers

    To install general dependencies, run the commands below or paste each line.

    $ sudo apt-get install git bc

    $ sudo apt-get install python-software-properties
    $ sudo add-apt-repository ppa:webupd8team/java
    $ sudo apt-get update
    $ sudo apt-get install oracle-java8-installer

    Adding Entries to Host files on the Master and Worker Servers

    Edit the /etc/hosts file:

    $ sudo vim /etc/hosts

    Now add entries for the namenode (master) and the worker servers:

    127.0.0.1       localhost

    127.0.1.1       clx-mld-42.local.domain      clx-mld-42

    #127.0.1.1              clx-mld-42

     

     

    # The following lines are desirable for IPv6 capable hosts

    ::1     localhost ip6-localhost ip6-loopback

    #ff02::1 ip6-allnodes

    #ff02::2 ip6-allrouters

    31.31.31.41  namenoder

    31.31.31.42  clx-mld-42-r

    31.31.31.43  clx-mld-43-r

    31.31.31.44  clx-mld-44-r

    31.31.31.45  clx-mld-45-r

    31.31.31.46  clx-mld-46-r

    31.31.31.47  clx-mld-47-r

    31.31.31.48  clx-mld-48-r

    31.31.31.49  clx-mld-49-r

    31.31.31.50  clx-mld-50-r

    31.31.31.51  clx-mld-51-r

    31.31.31.52  clx-mld-52-r

    31.31.31.53  clx-mld-53-r

    31.31.31.54  clx-mld-54-r

    31.31.31.55  clx-mld-55-r

    31.31.31.56  clx-mld-56-r

    31.31.31.57  clx-mld-57-r
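    Since the worker entries follow a fixed pattern, they can also be generated with a short loop instead of being typed by hand (a convenience sketch; adjust the subnet and hostname prefix to your own cluster):

```shell
# Print /etc/hosts entries for workers clx-mld-42 .. clx-mld-57
# (append the output to /etc/hosts on each node).
for i in $(seq 42 57); do
    echo "31.31.31.$i  clx-mld-$i-r"
done
```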

    Required Software

    Prior to installing and configuring the Apache Spark and SparkRDMA environment, the following software must be downloaded.

     

     

    Creating Network File System (NFS) Share

    Install the NFS server on the Master server. Create the directory /share/spark_rdma and export it to all Worker servers.
    Install the NFS client on all Worker servers. Mount the /share/spark_rdma export from the Master server to the local /share/spark_rdma directory (the same path as on the Master).
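    On Ubuntu 16.04 these steps might look as follows (a sketch, assuming the stock nfs-kernel-server/nfs-common packages; adapt the subnet to your network):

```shell
# On the Master: install the NFS server and export the share
sudo apt-get install -y nfs-kernel-server
sudo mkdir -p /share/spark_rdma
echo "/share/spark_rdma 31.31.31.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On each Worker: install the NFS client and mount the share at the same path
sudo apt-get install -y nfs-common
sudo mkdir -p /share/spark_rdma
sudo mount namenoder:/share/spark_rdma /share/spark_rdma
```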

     

    Configuring SSH

    1. Install Open SSH Server-Client
      $ sudo apt-get install openssh-server openssh-client
    2. Generate Key Pairs
      $ ssh-keygen -t rsa -P ""
    3. Configure passwordless SSH:
    4. Copy the content of .ssh/id_rsa.pub (on the master) to .ssh/authorized_keys (on all the slaves as well as on the master).
    5. Check by SSH to all the slaves:
      $ ssh clx-mld-41-r

      $ ssh clx-mld-42-r

      $ ssh clx-mld-43-r

      ...

      $ ssh clx-mld-57-r
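    Steps 2-4 can be condensed with ssh-copy-id (a sketch; it assumes the hostnames from the /etc/hosts section and that password login is available for the initial copy):

```shell
# On the master: generate a passphrase-less key once, then push it to every node
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
for i in $(seq 41 57); do
    ssh-copy-id "clx-mld-$i-r"
done
```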

       

    Downloading Apache Spark

    1. Go to Downloads | Apache Spark and download spark-2.2.0-bin-hadoop2.7.tgz to the /share/spark_rdma shared folder.
    2. Choose a Spark release: 2.2.0.
    3. Choose a package type: Pre-built for Apache Hadoop 2.7 and later.
    4. Download Spark: spark-2.2.0-bin-hadoop2.7.tgz
       https://www.apache.org/dyn/closer.lua/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
    5. Verify this release using the 2.2.0 signatures and checksums and the project release KEYS.

    Note: Starting version 2.0, Spark is built with Scala 2.11 by default. Scala 2.10 users should download the Spark source package and build with Scala 2.10 support.

    Downloading Mellanox SparkRDMA 2.0

    Download the SparkRDMA Release Version 2.0 and save it in the  /share/spark_rdma shared folder.

    SparkRDMA Release Version 2.0


    All-new implementation of SparkRDMA, redesigned from the ground up to further increase scalability, robustness and most importantly - performance.
    Among the new features and capabilities introduced in this release:

    • All-new Metadata (Map Output) fetching protocol - now allows scaling to the tens of thousands of partitions, with superior performance and recoverability
    • Software-level flow control in RdmaChannel - eliminates pause storms in the fabric
    • ODP (On-Demand Paging) support - improves memory efficiency

    Attached are pre-built binaries. Please follow the README page for instructions.

     

    Cloning the HiBench Suite 7.0 Repository

    To clone the latest HiBench repository, run the following command:

    $ cd /share/spark_rdma
    $ git clone https://github.com/intel-hadoop/HiBench.git

    The preceding git clone command creates a subdirectory called “HiBench”. After cloning, you may optionally check out a specific branch (such as a release branch) by invoking the following commands:

    $ cd HiBench
    $ git checkout master        # where master is the desired branch (by default)

    $ cd <Path to NFS share>

     

    Network Switch Configuration

     

    Please start with the HowTo Get Started with Mellanox Switches guide if you are not familiar with Mellanox switch software.

    For more information, please refer to the MLNX-OS User Manual located at support.mellanox.com or www.mellanox.com -> Switches.

     

    As a first step, please update your switch OS to the latest ONYX OS software, using the community guide HowTo Upgrade MLNX-OS Software on Mellanox switch systems.

    We will accelerate Spark by using the RoCE transport.
    There are several industry-standard network configurations for RoCE deployment.

    You are welcome to follow the Recommended Network Configuration Examples for RoCE Deployment post for our recommendations and instructions.

    In our deployment, we will configure the network to be lossless and will use DSCP on both the host and switch sides.

     

    Below is our switch configuration, which you can use as a reference. You can copy/paste it to your switch, but please be aware that this is a clean switch configuration, and applying it may overwrite your existing configuration.

     

    swx-mld-1-2 [standalone: master] > enable

    swx-mld-1-2 [standalone: master] # configure terminal

    swx-mld-1-2 [standalone: master] (config) # show running-config                                                                                        

    ##                                                                                                                                                     

    ## Running database "initial"                                                                                                                          

    ## Generated at 2018/03/10 09:38:38 +0000                                                                                                              

    ## Hostname: swx-mld-1-2                                                                                                                               

    ##                                                                                                                                                     

     

    ##

    ## Running-config temporary prefix mode setting

    ##                                           

    no cli default prefix-modes enable           

     

    ##

    ## License keys

    ##           

       license install LK2-RESTRICTED_CMDS_GEN2-43A1-4H83-GNA5-G423-GYJ8-8A60-E0AH-57AB

     

    ##

    ## Interface Ethernet buffer configuration

    ##

       traffic pool roce type lossless

       traffic pool roce memory percent 50.00

       traffic pool roce map switch-priority 3

     

    ##

    ## LLDP configuration

    ##

       lldp

     

    ##

    ## QoS switch configuration

    ##

       interface ethernet 1/1-1/32 qos trust L3

       interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

     

    ##

    ## DCBX ETS configuration

    ##

       interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict

     

     

    ##

    ## Other IP configuration

    ##

       hostname swx-mld-1-2

     

    ##

    ## AAA remote server configuration

    ##

    # ldap bind-password ********

    # radius-server key ********

    # tacacs-server key ********

     

    ##

    ## Network management configuration

    ##

    # web proxy auth basic password ********

     

    ##

    ## X.509 certificates configuration

    ##

    #

    # Certificate name system-self-signed, ID 108bb9eb3e99edff47fc86e71cba530b6a6b8991

    # (public-cert config omitted since private-key config is hidden)

     

    ##

    ## Persistent prefix mode setting

    ##

    cli default prefix-modes enable

     

     

    Installing MLNX_OFED for Ubuntu on the Master and Workers

    This chapter describes how to install and test the MLNX_OFED for Linux package on a host machine with a Mellanox ConnectX®-5 adapter card installed.
    For more information, see the Mellanox OFED for Linux User Manual.

     

    Downloading Mellanox OFED

    1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed.
      # lspci -v | grep Mellanox
      The following example shows a system with an installed Mellanox HCA:
    2. Download the ISO image according to your OS to your server's share folder.
      The image’s name has the format
      MLNX_OFED_LINUX-<ver>-<OS label><CPUarch>.iso. You can download it from:
      http://www.mellanox.com > Products > Software > InfiniBand/VPI Drivers > Mellanox OFED Linux (MLNX_OFED) > Download.
    3. Use the MD5SUM utility to confirm the downloaded file’s integrity. Run the following command and compare the result to the value provided on the download page.
      # md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso

    Installing Mellanox OFED

    MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:

    • Discovers the currently installed kernel
    • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
    • Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
    • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

    The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

    1. Log into the installation machine as root.
    2. Copy the downloaded ISO to /root
    3. Mount the ISO image on your machine.
      # mkdir /mnt/iso
      # mount -o loop /share/MLNX_OFED_LINUX-4.2-1.0.0.0-ubuntu16.04-x86_64.iso /mnt/iso
      # cd /mnt/iso
    4. Run the installation script.
      # ./mlnxofedinstall
    5. Reboot after the installation finishes successfully.

      # /etc/init.d/openibd restart
      # reboot


      By default both ConnectX®-5 VPI ports are initialized as InfiniBand ports.
      ConnectX®-5 ports can be individually configured to work as InfiniBand or Ethernet ports.

    6. Check that the ports’ mode is Ethernet:
      # ibv_devinfo
    7. If the ports are shown in InfiniBand mode, change the interface port type to Ethernet.

      Change the interface port type to Ethernet mode.
      Use the mlxconfig script after the driver is loaded.
      * LINK_TYPE_P1=2 selects Ethernet mode

      a. Start mst and see ports names

      # mst start
      # mst status

      b. Change the mode of port 1 to Ethernet:

      # mlxconfig -d /dev/mst/mt4121_pciconf0 s LINK_TYPE_P1=2
        Port 1 set to Ethernet mode
      # reboot
      c. Query the Ethernet devices and print the information available to use it from the userspace.
      # ibv_devinfo
    8. Run the ibdev2netdev utility to see all the associations between the Ethernet devices and the IB devices/ports.
      # ibdev2netdev
      # ifconfig enp1f0 31.31.31.41 netmask 255.255.255.0
    9. Add the lines below to the /etc/network/interfaces file, after the following lines:
      # vim /etc/network/interfaces

      auto eno0
      iface eno0 inet dhcp
      The new lines:
      auto enp1f0
      iface enp1f0 inet static
      address 31.31.31.xx
      netmask 255.255.255.0
      Example:

      # vim /etc/network/interfaces
      auto eno0

      iface eno0 inet dhcp

      auto enp1f0
      iface enp1f0 inet static
      address 31.31.31.41
      netmask 255.255.255.0

    10. Check that the network configuration is set correctly.
      # ifconfig -a
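    Because each node's static address differs only in the last octet, the stanza from step 9 can be derived from the hostname (a convenience sketch using the clx-mld-<nn> naming from this guide):

```shell
# Derive the enp1f0 stanza for this node from its hostname suffix,
# e.g. clx-mld-41 -> address 31.31.31.41
host=clx-mld-41            # in practice: host=$(hostname -s)
suffix=${host##*-}
cat <<EOF
auto enp1f0
iface enp1f0 inet static
address 31.31.31.$suffix
netmask 255.255.255.0
EOF
```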

    Lossless fabric with L3(DSCP) configuration

     

    InfiniBand networks are inherently lossless: they incorporate link-level flow control to ensure that packets are not dropped within the fabric.
    RoCE implements the InfiniBand protocol over a standard Ethernet/IP network, which can be lossy.
    Due to the performance implications of a lossy network when running RoCE, it is recommended to enable flow control within your fabric.
    Check the flow control settings for Mellanox network adapters by running: ethtool -a <mlnx interface name>

    # ethtool -a enp1f0

    Pause parameters for enp1f0:
    Autonegotiate:      off
    RX:         off
    TX:         off

    If the RX and TX settings are turned off, then they should be enabled:

    # ethtool -A enp1f0 rx on tx on
    # ethtool -a enp1f0
    Pause parameters for enp1f0:
    Autonegotiate:      off
    RX:         on
    TX:         on

    This post provides a configuration example for Mellanox devices installed with MLNX_OFED running RoCE over a lossless network, in DSCP-based QoS mode.

     

    • It is also recommended to run the mlnx_tune utility, which runs several system checks and reports any settings that may cause performance degradation.
      Running mlnx_tune requires superuser privileges.
      $ sudo mlnx_tune 
      2017-08-16 14:47:17,023 INFO Collecting node information
      2017-08-16 14:47:17,023 INFO Collecting OS information
      2017-08-16 14:47:17,026 INFO Collecting CPU information
      2017-08-16 14:47:17,104 INFO Collecting IRQ Balancer information
      2017-08-16 14:47:17,107 INFO Collecting Firewall information
      2017-08-16 14:47:17,111 INFO Collecting IP table information
      2017-08-16 14:47:17,115 INFO Collecting IPv6 table information
      2017-08-16 14:47:17,118 INFO Collecting IP forwarding information
      2017-08-16 14:47:17,122 INFO Collecting hyper threading information
      2017-08-16 14:47:17,122 INFO Collecting IOMMU information
      2017-08-16 14:47:17,124 INFO Collecting driver information
      2017-08-16 14:47:18,281 INFO Collecting Mellanox devices information

      Mellanox Technologies - System Report

      Operation System Status: RH7.2 3.10.0-327.el7.x86_64
      CPU Status: Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, OK: Frequency 2599.98MHz
      Hyper Threading Status: INACTIVE
      IRQ Balancer Status: ACTIVE
      Firewall Status: NOT PRESENT
      IP table Status: INACTIVE
      IPv6 table Status: INACTIVE
      Driver Status: OK: MLNX_OFED_LINUX-3.4-2.1.8.0 (OFED-3.4-2.1.8)
      ConnectX-4 Device Status on PCI 03:00.0: FW version 12.18.2000, OK: PCI Width x16, OK: PCI Speed 8GT/s, PCI Max Payload Size 256, PCI Max Read Request 512
      Local CPUs list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
      ens2 (Port 1) Status: Link Type eth, OK: Link status Up, Speed 100GbE, MTU 1500, OK: TX nocache copy 'off'

      2017-08-16 14:47:18,777 INFO System info file: /tmp/mlnx_tune_170816_144716.log
    • After running mlnx_tune, it is highly recommended to set the cpuList parameter (described in Configuration Properties) in the spark.conf file, using the NUMA cores associated with the Mellanox device.
      Local CPUs list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]  
      spark.shuffle.rdma.cpuList 0-15
    • More in-depth performance resources can be found in the Mellanox Community post: Performance Tuning for Mellanox Adapters
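    The NUMA-local CPU list reported by mlnx_tune can also be read directly from sysfs for the adapter's interface (assuming the enp1f0 interface name used in this setup):

```shell
# CPUs local to the NIC's NUMA node, in the range format accepted by
# spark.shuffle.rdma.cpuList (e.g. "0-15")
cat /sys/class/net/enp1f0/device/local_cpulist
```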

    Configuring the Environment

    Some actions must be taken after the installation before Apache Spark can be used.

     

    Untar the Spark and SparkRDMA Tarballs

    $ cd /share/spark_rdma
    $ tar -xzvf spark-2.2.0-bin-hadoop2.7.tgz
    $ tar -xzvf spark-rdma-2.0.tgz

    All the scripts, jars, and configuration files are available in the newly created directory “spark-2.2.0-bin-hadoop2.7”.

     

    Setup Configuration

     

    Spark

    • Update your bash file.
      $ vim ~/.bashrc
      This opens your bash file in a text editor. Scroll to the bottom and add these lines:
      export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
      export SPARK_HOME=/share/spark_rdma/spark-2.2.0-bin-hadoop2.7/
    • Once you save and close the file, return to your original terminal and type this command to reload your .bashrc file.
      $ source ~/.bashrc
    • Check that the paths have been properly modified.
      $ echo $JAVA_HOME
      $ echo $SPARK_HOME
      $ echo $LD_LIBRARY_PATH
    • Edit spark-env.sh

    Now edit the configuration file spark-env.sh (in $SPARK_HOME/conf/) and set the following parameters:
    Create a copy of the spark-env.sh template and rename it. Add the master hostname, interface and spark_tmp directory.

    $ cd /share/spark_rdma/spark-2.2.0-bin-hadoop2.7/conf
    $ cp spark-env.sh.template spark-env.sh
    $ vim spark-env.sh

    # Generic options for the daemons used in the standalone deploy mode
    # - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
    # - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
    # - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
    # - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $vuhuong)
    # - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)

    #export SPARK_MASTER_HOST=spark1-r
    export SPARK_MASTER_HOST=namenoder
    export SPARK_LOCAL_IP=`/sbin/ip addr show enp5s0f0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1`
    export SPARK_LOCAL_DIRS=/tmp/spark-tmp

     

    • Add slaves
      Create a copy of the slaves configuration file template, rename it to slaves (in $SPARK_HOME/conf/) and add the following entries:
      $ vim /share/spark_rdma/spark-2.2.0-bin-hadoop2.7/conf/slaves
      clx-mld-41-r
      clx-mld-42-r
      clx-mld-43-r
      ...
      clx-mld-57-r
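    With the tarballs in place, the SparkRDMA shuffle manager itself is enabled through Spark configuration. Per the SparkRDMA README, this amounts to putting the plugin jar on the driver and executor classpaths and selecting the RDMA ShuffleManager in $SPARK_HOME/conf/spark-defaults.conf (the exact jar file name depends on the release you extracted):

```
spark.driver.extraClassPath   /share/spark_rdma/spark-rdma-2.0-for-spark-2.2.0-jar-with-dependencies.jar
spark.executor.extraClassPath /share/spark_rdma/spark-rdma-2.0-for-spark-2.2.0-jar-with-dependencies.jar
spark.shuffle.manager         org.apache.spark.shuffle.rdma.RdmaShuffleManager
```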

     

    Hadoop

     

    • Copy conf/slaves to hadoop-2.7.4/etc/hadoop/slaves.
      $ cp /share/spark_rdma/spark-2.2.0-bin-hadoop2.7/conf/slaves /share/spark_rdma/hadoop-2.7.4/etc/hadoop/slaves

     

    • Create a distributed file system

      $ sbin/slaves.sh mkdir /data/hadoop_tmp  # on NVMe disk

     

    • Edit current_config/core-site.xml. Add both property tags within the <configuration> tags.

      $ cd /share/spark_rdma/hadoop-2.7.4
      $ vim current_config/core-site.xml
      <?xml version="1.0" encoding="UTF-8"?>

      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
      http://www.apache.org/licenses/LICENSE-2.0
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
      -->
      <!-- Put site-specific property overrides in this file. -->

       

      <configuration>

              <property>

                      <name>fs.default.name</name>

                      <value>hdfs://namenoder:9000</value>

              </property>

              <property>

                      <name>hadoop.tmp.dir</name>

                      <value>/data/hadoop_tmp</value>

              </property>

      </configuration>

     

    • Update your bash file.
      $ vim ~/.bashrc
      This opens your bash file in a text editor. Scroll to the bottom and add this line:
      export HADOOP_HOME=/share/spark_rdma/hadoop-2.7.4/
    • Once you save and close the file, return to your original terminal and type this command to reload your .bashrc file.
      $ source ~/.bashrc
    • Check that the paths have been properly modified.
      $ echo $HADOOP_HOME
    • Edit current_config/hadoop-env.sh
      Now edit the configuration file current_config/hadoop-env.sh (in $HADOOP_HOME/etc/hadoop) and set the following parameters:
      Add the HADOOP_HOME directory.

    $ vim current_config/hadoop-env.sh

    # Licensed to the Apache Software Foundation (ASF) under one
    # or more contributor license agreements.  See the NOTICE file
    # distributed with this work for additional information
    # regarding copyright ownership.  The ASF licenses this file
    # to you under the Apache License, Version 2.0 (the
    # "License"); you may not use this file except in compliance
    # with the License.  You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    # Set Hadoop-specific environment variables here.
    # The only required environment variable is JAVA_HOME.  All others are
    # optional.  When running a distributed configuration it is best to
    # set JAVA_HOME in this file, so that it is correctly defined on
    # remote nodes.
    # The java implementation to use.
    export JAVA_HOME=${JAVA_HOME}
    export HADOOP_HOME=/share/spark_rdma/hadoop-2.7.4/
    export HADOOP_PREFIX=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/

     

    HDFS

     

    • Edit current_config/hdfs-site.xml

    <!-- Put site-specific property overrides in this file. -->
    <configuration>

            <property>

                    <name>dfs.datanode.dns.interface</name>

                    <value>enp5s0f0</value>

            </property>

            <property>

                    <name>dfs.replication</name>

                    <value>1</value>

            </property>

            <property>

                    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>

                    <value>false</value>

            </property>

            <property>

                    <name>dfs.permissions</name>

                    <value>false</value>

            </property>

            <property>

                    <name>dfs.datanode.data.dir</name>

                    <value>/data/hadoop_tmp</value>

            </property>

            </configuration>

     

    YARN

     

    • Edit current_config/yarn-site.xml

      $ vim current_config/yarn-site.xml

      <?xml version="1.0"?>

      <!--

        Licensed under the Apache License, Version 2.0 (the "License");

        you may not use this file except in compliance with the License.

        You may obtain a copy of the License at

       

          http://www.apache.org/licenses/LICENSE-2.0

       

        Unless required by applicable law or agreed to in writing, software
        distributed under the License is distributed on an "AS IS" BASIS,
        WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        See the License for the specific language governing permissions and
        limitations under the License. See accompanying LICENSE file.
      -->

      <configuration>

      <!-- Site specific YARN configuration properties -->
              <property>
                      <name>yarn.nodemanager.aux-services</name>
                      <value>mapreduce_shuffle</value>
              </property>
              <property>
                      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
                      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
              </property>
              <property>
                      <name>yarn.resourcemanager.resource-tracker.address</name>
                      <value>namenoder:8025</value>
              </property>
              <property>
                      <name>yarn.resourcemanager.scheduler.address</name>
                      <value>namenoder:8030</value>
              </property>
              <property>
                      <name>yarn.resourcemanager.admin.address</name>
                      <value>namenoder:8032</value>
              </property>
              <property>
                      <name>yarn.resourcemanager.webapp.address</name>
                      <value>namenoder:8034</value>
              </property>
              <property>
                      <name>yarn.resourcemanager.address</name>
                      <value>namenoder:8101</value>
              </property>
              <property>
                      <name>yarn.nodemanager.resource.memory-mb</name>
                      <value>40960</value>
              </property>
              <property>
                      <name>yarn.scheduler.maximum-allocation-mb</name>
                      <value>40960</value>
              </property>
              <property>
                      <name>yarn.scheduler.minimum-allocation-mb</name>
                      <value>2048</value>
              </property>
              <property>
                      <name>yarn.nodemanager.resource.cpu-vcores</name>
                      <value>20</value>
              </property>
              <property>
                      <name>yarn.nodemanager.disk-health-checker.enable</name>
                      <value>false</value>
              </property>
              <property>
                      <name>yarn.nodemanager.log-dirs</name>
                      <value>/tmp/yarn_nm/</value>
              </property>
              <property>
                      <name>yarn.log-aggregation-enable</name>
                      <value>true</value>
              </property>
      </configuration>
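As a quick sanity check on the memory values above: with the default scheduler, container requests are rounded up to a multiple of yarn.scheduler.minimum-allocation-mb, so these settings allow at most 20 minimum-sized containers per NodeManager. A minimal sketch using the values from the yarn-site.xml above:

```shell
# Values copied from the yarn-site.xml above; adjust for your hardware.
NM_MEM_MB=40960     # yarn.nodemanager.resource.memory-mb
MAX_ALLOC_MB=40960  # yarn.scheduler.maximum-allocation-mb
MIN_ALLOC_MB=2048   # yarn.scheduler.minimum-allocation-mb

# Largest number of minimum-sized containers one NodeManager can host:
echo "containers per node: $((NM_MEM_MB / MIN_ALLOC_MB))"   # prints 20

# A single container request can never exceed what the node offers:
if [ "$MAX_ALLOC_MB" -gt "$NM_MEM_MB" ]; then
  echo "warning: maximum-allocation-mb exceeds nodemanager memory" >&2
fi
```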

     

    • Edit current_config/yarn-env.sh. Specify the Hadoop home directory:
      $ vim current_config/yarn-env.sh
      ...
      export HADOOP_HOME=/share/data/hadoop-2.7.4
      ...

     

    • Copy current_config/* to etc/hadoop/:
      $ cp current_config/* etc/hadoop/

     

    • Execute the following command on the NameNode host machine to format HDFS:
      $ bin/hdfs namenode -format

     

    Start Spark Standalone Cluster

    To start the Spark services, run the following commands on the Master node:

    $ cd /share/spark_rdma/spark-2.2.0-bin-hadoop2.7
    $ sbin/start-all.sh
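    sbin/start-all.sh starts a Spark Master on the local host and one Worker on every host listed in conf/slaves (one hostname per line), so that file should name all cluster nodes before you run it. A minimal sketch, with hostnames assumed to match this guide's cluster:

    ```
    namenoder
    clx-mld-42-r
    clx-mld-43-r
    ...
    clx-mld-57-r
    ```

    Whether the Master host (namenoder) should also run a Worker is a deployment choice; drop it from the list if not.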

     

    Check whether Services have been Started

    Check daemons on the Master:

    $ jps
    33970 Jps
    47928 ResourceManager
    48121 NodeManager
    47529 DataNode
    47246 NameNode

    Check daemons on a Worker node:

    $ jps
    1846 NodeManager
    16491 Jps
    1659 DataNode

     

    Check HDFS and YARN status

    $ cd /share/spark_rdma/hadoop-2.7.4

    $ bin/hdfs dfsadmin -report | grep Name
    Name: 31.31.31.43:50010 (clx-mld-43-r)
    Name: 31.31.31.47:50010 (clx-mld-47-r)
    Name: 31.31.31.48:50010 (clx-mld-48-r)
    Name: 31.31.31.45:50010 (clx-mld-45-r)
    Name: 31.31.31.42:50010 (clx-mld-42-r)
    Name: 31.31.31.41:50010 (namenoder)
    ...
    Name: 31.31.31.57:50010 (clx-mld-57-r)

    $ bin/hdfs dfsadmin -report | grep Name -c
    8

     

    $ bin/yarn node -list
    18/02/20 16:56:31 INFO client.RMProxy: Connecting to ResourceManager at namenoder/31.31.31.41:8101
    Total Nodes:8
             Node-Id      Node-State Node-Http-Address Number-of-Running-Containers
    clx-mld-42-r:34873         RUNNING clx-mld-42-r:8042                            0
    clx-mld-47-r:35045         RUNNING clx-mld-47-r:8042                            0
    clx-mld-48-r:44996         RUNNING clx-mld-48-r:8042                            0
    clx-mld-46-r:45432         RUNNING clx-mld-46-r:8042                            0
    clx-mld-45-r:41307         RUNNING clx-mld-45-r:8042                            0
    ...
    clx-mld-57-r:44311         RUNNING clx-mld-57-r:8042                            0
    namenoder:41409            RUNNING    namenoder:8042                            0

    Stop the Cluster

    To stop the Spark services, run the following commands on the Master node:

    $ cd /share/spark_rdma/spark-2.2.0-bin-hadoop2.7
    $ sbin/stop-all.sh

    Conclusion

    With Apache Spark installed on the multi-node cluster, you are now ready to work with the Spark platform. You can now run the HiBench suite benchmarks and the TCP vs. RoCE comparison.

     

    Running HiBench with SparkRDMA

    HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilizations.
    It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO, etc.

    Environment

    • Instance type and Environment: See setup overview
    • OS: Ubuntu 16.04.3 LTS
    • Apache Hadoop: 2.7.4, HDFS (1 namenode, 16 datanodes)
    • Spark: 2.2.0 standalone, 17 nodes
    • Benchmark: Setup HiBench
    • Test Date: Mar 2018

     

    SparkRDMA performance tips

    • Compression! Spark enables shuffle compression by default.
      Compression produces smaller packets to send between the nodes, at the expense of higher CPU utilization to compress and decompress the data.
      Because an RDMA network delivers high throughput with low CPU overhead, it is recommended to disable compression when using SparkRDMA. In your spark.conf file, set:
      spark.shuffle.compress          false
      spark.shuffle.spill.compress    false
      By disabling compression, you will be able to reclaim precious CPU cycles that were previously used for data compression/decompression, and will also see additional performance benefits in the RDMA data transfer speeds.
    • Disk Media! In order to see the highest and most consistent performance results possible, it is recommended to use the highest performance disk media available. Using a ramdrive or NVMe device for the spark-tmp and hadoop tmp files should be explored whenever possible.
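    The two compression settings above can be appended to the Spark configuration with a short snippet; a sketch, where /tmp/spark.conf.example is a stand-in path (point SPARK_CONF at your actual spark.conf instead):

    ```shell
    # Append the recommended SparkRDMA settings to a Spark conf file.
    # /tmp/spark.conf.example is an assumed demo path, not the real config.
    SPARK_CONF=${SPARK_CONF:-/tmp/spark.conf.example}
    : > "$SPARK_CONF"               # start from an empty demo file
    cat >> "$SPARK_CONF" <<'EOF'
    spark.shuffle.compress          false
    spark.shuffle.spill.compress    false
    EOF
    grep -c 'compress' "$SPARK_CONF"   # prints 2
    ```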

     

    Benchmarks runs

    Steps to reproduce Terasort result:

    1. Configure the Hadoop and Spark settings in the benchmark's configuration directory.
    2. In HiBench/conf/hibench.conf set:
      hibench.scale.profile bigdata
      # Mapper number in hadoop, partition number in Spark
      hibench.default.map.parallelism 1000
      # Reducer number in hadoop, shuffle partition number in Spark
      hibench.default.shuffle.parallelism 700

       

    3. Set in HiBench/conf/workloads/micro/terasort.conf:

      hibench.terasort.bigdata.datasize               1890000000

       

    4. Run HiBench/bin/workloads/micro/terasort/prepare/prepare.sh and HiBench/bin/workloads/micro/terasort/spark/run.sh

    5. Open HiBench/report/hibench.report:

      Type               Date       Time       Input_data_size   Duration(s)    Throughput(bytes/s)     Throughput/node
      ScalaSparkTerasort 2018-03-26 19:13:52   189000000000      79.931         2364539415              2364539415

       

    6. Add to HiBench/conf/spark.conf:

      spark.driver.extraClassPath /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
      spark.executor.extraClassPath /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
      spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
      spark.shuffle.compress false
      spark.shuffle.spill.compress false

       

    7. Run HiBench/bin/workloads/micro/terasort/spark/run.sh
    8. Open HiBench/report/hibench.report:
      Type               Date       Time     Input_data_size     Duration(s)     Throughput(bytes/s)      Throughput/node
      ScalaSparkTerasort 2018-03-26 19:13:52 189000000000        79.931          2364539415               2364539415
      ScalaSparkTerasort 2018-03-26 19:17:13 189000000000        52.166          3623049495               3623049495
    9. Overall improvement:

    https://user-images.githubusercontent.com/1121987/38248890-83097350-3752-11e8-8aba-5cc8ed89a9b2.png
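    The Throughput(bytes/s) column in the report is simply Input_data_size divided by Duration, and the overall improvement is the ratio of the two durations; a quick check with the numbers from the report above:

    ```shell
    # Throughput(bytes/s) = Input_data_size / Duration (values from the report):
    awk 'BEGIN { printf "throughput: %d bytes/s\n", 189000000000 / 79.931 }'
    # prints "throughput: 2364539415 bytes/s"

    # Run-time improvement with SparkRDMA (79.931 s down to 52.166 s):
    awk 'BEGIN { printf "speedup: %.2fx\n", 79.931 / 52.166 }'
    # prints "speedup: 1.53x"
    ```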

     

    Steps to reproduce the Scala Sort result:

    1. In HiBench/conf/hibench.conf set:
      hibench.scale.profile bigdata
      # Mapper number in hadoop, partition number in Spark
      hibench.default.map.parallelism 1000

      # Reducer number in hadoop, shuffle partition number in Spark
      hibench.default.shuffle.parallelism 7000
    2. Run HiBench/bin/workloads/micro/sort/prepare/prepare.sh and HiBench/bin/workloads/micro/sort/spark/run.sh
    3. Open HiBench/report/hibench.report:
      Type           Date       Time     Input_data_size    Duration(s)     Throughput(bytes/s)     Throughput/node
      ScalaSparkSort 2018-04-03 19:13:24 307962225944       37.898          8126081216              8126081216
      ScalaSparkSort 2018-04-03 21:16:56 307962098703       48.608          6335625796              6335625796
    4. Overall improvement:

    https://user-images.githubusercontent.com/1121987/38267993-ac3089d4-3785-11e8-91f9-016d363a8fb1.png
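    As with Terasort, the overall improvement is the ratio of the two Sort durations in the report above:

    ```shell
    # Ratio of the two Sort durations from the report (48.608 s vs 37.898 s):
    awk 'BEGIN { printf "speedup: %.2fx\n", 48.608 / 37.898 }'
    # prints "speedup: 1.28x"
    ```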