
EVALUATION OF HADOOP APPLICATIONS ON SINGLE & MULTI-NODE ENVIRONMENTS

BINA BHASKAR

School of Informatics and Computing, Indiana University, Bloomington, IN, 47405


1. Problem Encountered

MapReduce is a framework that allows developers to write functions that process large data sets. There are two key functions in the MapReduce framework: the Map function, which separates out the data to be processed, and the Reduce function, which performs analysis on that data. We are using a basic example of MapReduce that is commonly used to understand Hadoop; this application counts the occurrences of a particular word in a document. The “Map” function in this application produces a set of records containing all occurrences of the desired word found in the source data, and the “Reduce” function then counts the number of items produced by the Map function and returns a numeric value indicating the total number of word occurrences.
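
The standard Hadoop word-count program is sketched below. This is an illustrative reconstruction using the org.apache.hadoop.mapreduce API, not necessarily the exact wordcount.java used in the assignment: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums those counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the per-word counts produced by all mappers.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: wire the mapper and reducer into a job; args are the input and output paths.
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}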

The effectiveness of the map/reduce concept becomes apparent when the process is run on several nodes, with the load of the application distributed across multiple machines. Running the word-count application on a single node does not exploit the map/reduce paradigm. In this task we therefore focus on deploying the wordcount.java application onto multiple nodes, where the numbers of mappers and reducers can be altered, and our aim is to analyze the performance difference between running it on a single node and on multiple nodes.

2. Methodology to Tackle the Problem

For this use case, in order to obtain the results for the performance test, we deploy wordcount.java in two separate environments and compare the results of running the application on a single node and on multiple nodes in both of them. The two environments used are the CS Linux machines and Eucalyptus. The master and slave nodes can be changed by editing the respective configuration files present within the hadoop directory.

3. Environment Settings

  • CS Linux Machines

For the single-node execution we use the .SingleNodeOneClickStartUp.sh script after installing the required version of Hadoop in our home directories. After this we generate the keys required for password-less ssh for the rest of the session. We generate the public and private keys using the command “ssh-keygen -t rsa” and store these keys in a separate folder in our home directories for privacy. We copy the public key into a file named authorized_keys, which allows us to ssh without being prompted for a password each time while running Hadoop. We also have to set the ANT_HOME path in order to build the WordCount.java program. While executing the .SingleNodeOneClickStartUp.sh script we use the Java path /usr/lib/jvm/java-sun. [1]
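
A rough sketch of those steps is shown below (the key file names and the ANT_HOME location are placeholders and may differ on the CS machines):

# Generate the RSA key pair and keep it in a private folder in the home directory.
ssh-keygen -t rsa

# Trust our own public key so Hadoop can ssh between nodes without a password prompt.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# ANT_HOME is needed to build WordCount.java; the path below is only an example.
export ANT_HOME=/usr/share/ant
export PATH=$ANT_HOME/bin:$PATH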

  • Eucalyptus VMs

After gaining access to the Eucalyptus website we must download the credentials file from it. We can then ssh to india.futuregrid.org (which we are using for this project) by typing at the terminal

ssh -i <public-key for eucalyptus> <username>@india.futuregrid.org [2]

After logging into the FutureGrid home directory we must place the credentials folder within our home directory and untar it. After this we follow the instructions given in [2] and create the instances on which we can run our jobs. We can use the IP address of the VM generated when running the instances to perform our jobs. After generating a new key pair, we can ssh to that IP address and install Hadoop within that VM using the Java path /usr/lib/jvm/java-6-sun/jre. [1]
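
Roughly, the sequence on india looks like the following (a sketch assuming the standard euca2ools commands; the archive name, key pair name, instance type, and image ID are placeholders):

# Unpack the credentials archive and load the Eucalyptus environment variables.
mkdir ~/eucalyptus && cd ~/eucalyptus
tar xzf ~/euca2-credentials.tar.gz     # archive name is a placeholder
source eucarc

# Create a key pair for the VMs and start an instance (image ID is a placeholder).
euca-add-keypair mykey > mykey.private
chmod 600 mykey.private
euca-run-instances -k mykey -t m1.small emi-XXXXXXXX

# Look up the public IP of the running instance, then ssh into it and install Hadoop.
euca-describe-instances
ssh -i mykey.private root@<instance-ip>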

4. Implementation Details

  • conf/masters and conf/slaves files

Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker. These are the actual “master nodes”. The rest of the machines in the cluster act as both DataNode and TaskTracker; these are the slaves or “worker nodes”. The conf/masters file therefore defines the machines on which Hadoop will start secondary NameNodes in our multi-node cluster, while the conf/slaves file lists the hosts, one per line, on which the Hadoop slave daemons (DataNodes and TaskTrackers) will run. [4]
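
As an illustration, in a two-node cluster the two files can contain nothing more than host names (the names master and slave below are placeholders and must match the /etc/hosts entries described later; the master can also act as a worker, which is why it may appear in conf/slaves as well):

conf/masters      (the host that runs the secondary NameNode)
    master

conf/slaves       (the hosts that run DataNodes and TaskTrackers, one per line)
    master
    slave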

  • conf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml

These are the configuration files for Hadoop. The core-site.xml file specifies the NameNode (the HDFS master), after which we have to change the mapred.job.tracker variable (in conf/mapred-site.xml), which specifies the JobTracker (the MapReduce master). Finally, conf/hdfs-site.xml specifies the default block replication: it defines how many machines a single file should be replicated to before it becomes available. [4]
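
The corresponding properties are sketched below (the host names, ports and replication factor are illustrative values, not the exact ones from my configuration):

<!-- conf/core-site.xml : address of the NameNode (HDFS master) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>

<!-- conf/mapred-site.xml : address of the JobTracker (MapReduce master) -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>

<!-- conf/hdfs-site.xml : default block replication across the cluster -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>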

In my case I have modified two of my configuration files, namely conf/mapred-site.xml and conf/mapred-site_template.xml, to change the number of map tasks and reduce tasks.
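
For example, the map and reduce task counts can be set through the following properties (the values are placeholders; mapred.map.tasks is only a hint to the framework, while mapred.reduce.tasks is honored directly):

<!-- conf/mapred-site.xml : requested number of map and reduce tasks per job -->
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>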

  • nodes

For multi-node execution of tasks, the nodes file in the hadoop directory has to be modified to list the nodes that act as the master and the slaves. Both the master and the slaves must have Hadoop installed, along with the directory containing the required jar and the input file being processed.

For the CS machines environment, the Hadoop installation and the deployment of the wordcount package had to be done only once, since the home directory remains unchanged across all the connected systems.

For the Eucalyptus environment, Hadoop had to be installed on the master and slave nodes, and the wordcount package had to be deployed onto both of them as well. So we have to copy (using scp and our public key) the wordcount package from our local CS machine to the FutureGrid home directory, and from the FutureGrid home directory we have to copy (using scp and our private key) the wordcount package to both the master and slave nodes.
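
The two copy steps could look roughly like this (the package, key and host names are placeholders):

# Local CS machine -> FutureGrid home directory (public-key authentication).
scp -i ~/.ssh/id_rsa wordcount.tar.gz <username>@india.futuregrid.org:~/

# FutureGrid home directory -> master and slave VMs (Eucalyptus private key).
scp -i mykey.private wordcount.tar.gz root@<master-ip>:~/
scp -i mykey.private wordcount.tar.gz root@<slave-ip>:~/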

While configuring the master and slave nodes it is very important to edit the /etc/hosts file with the worker nodes' IP addresses and name them master and slave respectively. The same changes must go into the nodes file in the hadoop directory. This configuration determines between which nodes the wordcount task will be distributed. [1]
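
For instance, every node could carry entries like the following (the IP addresses are placeholders):

# /etc/hosts on both the master and the slave
10.0.0.1    master
10.0.0.2    slave

# nodes file in the hadoop directory
master
slave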

For the execution of the wordcount.java application in both environments we follow the same implementation used in the previous assignment. [3]

5. Performance Analysis

As part of the performance analysis, I have considered two environments for processing the jobs, one being the CS Linux machines environment and the other the Eucalyptus environment. In each environment I have recorded 7 runs of the process (the wordcount.java program), both on a single node and on multiple nodes. I have varied the number of map tasks and reduce tasks across the 7 recorded runs and taken the average time of 10 executions for each run; the results are shown below. As can be seen from the tabular data, there is a very significant speed-up in the Eucalyptus environment between running the process on a single node and on multiple nodes. This can be further observed in the graphical representations as well.

  • CS-Linux Environment

  • Eucalyptus Environment

Graphical Representations

  • Single Node Execution Time Chart

  • Multi-Node Execution Time Chart

6. FutureGrid Feedback

Performance-wise, as we can see from the tabular data, FutureGrid gives a much more distinct speed-up between running the application on a single node and on multiple nodes. Although the execution time is considerably higher than on the local CS machines setup, FutureGrid provides quick creation of VMs according to requirements, and once the instructions to set up the VMs are known it is easy to deploy the jobs.

The account setup is straightforward, and it would help to have a FutureGrid forum where people could discuss their issues and interact. It would also help to have the detailed execution process documented for users to look up and follow.

Finally, one other observation is that FutureGrid relies on public/private key pair authentication, which, although secure, can in my opinion be misused once there is any inappropriate access to the directory holding the keys. A better approach, if it were possible, would have been to integrate the username and password that we use for Eucalyptus with the FutureGrid accounts.

7. References

[1]. Documentation provided by the instructor

[2]. Futuregrid Tutorials

[3]. ReadMe.txt from the previous assignment

[4]. Hadoop Tutorial