Hadoop WordCount Project Report

Peng Chen

1. Introduction

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware; it incorporates features of the Hadoop Distributed File System (HDFS) and of MapReduce. Hadoop WordCount is the introductory program, which counts the occurrences of each word in a given set of text files. The main goal of this project is to install and run Hadoop both on a single node and on a multi-node cluster, and then run Hadoop WordCount to measure its speed up.
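
For reference, a typical invocation of the example job from the Hadoop home directory looks like the following; the jar file name and the input/output paths are illustrative and vary with the Hadoop release:

    # Copy local text files into HDFS (paths are illustrative)
    bin/hadoop dfs -copyFromLocal /tmp/input-texts input
    # Run the bundled WordCount example job
    bin/hadoop jar hadoop-examples-*.jar wordcount input output
    # Print the per-word counts produced by the reducers
    bin/hadoop dfs -cat output/part-00000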

2. Experiment Environment

We run Hadoop WordCount both on CS machines and on FutureGrid Eucalyptus.

Machine / Memory / CPU / OS
CS machine / 4GB / Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz / Red Hat Enterprise Linux Workstation release 6.1
c1.medium (Eucalyptus) / 1GB / Intel(R) Xeon(R) CPU X5570 @ 2.93GHz / Ubuntu 10.04

3. Installation & Running Script

The startup script “MultiNodesOneClickStartUp.sh” is responsible for installing Hadoop:

1) It first uses “netstat” and “grep” to find four unique available ports on the local machine. Hadoop will occupy these ports to send and receive data and messages: another application sharing one of these ports would cause lost or corrupted communications, and another application already occupying one would block Hadoop’s send/receive operations. (A sketch of steps 1)-3) appears after this list.)

2) It then creates “conf/masters” by replacing the parameters in “conf/masters_template” with the actual IP/hostname of the master, and creates “conf/slaves” by writing the actual IPs/hostnames of the slaves. The master node is designated as the NameNode and JobTracker, while the slave nodes act as DataNodes and TaskTrackers.

3) Then it sets “JAVA_HOME” in “conf/hadoop-env.sh” and copies the file to all master and slave nodes.

4) Then it creates “conf/core-site.xml” by populating the parameters in “conf/core-site_template.xml” with the actual hostname, the first available port (used as the filesystem port), and the username. It also copies the file to all slave nodes, because the slaves need the identical information for reference. (Steps 4)-6) are sketched after this list as well.)

5) Then it creates “conf/hdfs-site.xml” by populating the parameters in “conf/hdfs-site_template.xml” with the actual hostname, the second available port (used as the HDFS port), and the username, and likewise copies it to all slave nodes.

6) Then it creates “conf/mapred-site.xml” by populating the parameters in “conf/mapred-site_template.xml” with the actual hostname, the third available port (used as the mapper port), the fourth available port (used as the reducer port), and the username, and again copies it to all slave nodes. The default number of mappers and reducers is set to 2 in “conf/mapred-site_template.xml” and is carried unmodified into “conf/mapred-site.xml”.
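
A consolidated sketch of steps 1)-3), assuming illustrative variable names (MASTER_IP, SLAVE_IPS), starting port numbers, a template placeholder token (MASTER), and a JDK location; the actual script’s names and values may differ:

    # --- Step 1: probe for four unique free TCP ports with netstat/grep ---
    find_free_port() {
        port=$1                          # illustrative starting point
        while netstat -tln | grep -q ":$port "; do
            port=$((port + 1))           # port is in use; try the next one
        done
        echo $port
    }
    FS_PORT=$(find_free_port 50000)      # filesystem port (core-site.xml)
    HDFS_PORT=$(find_free_port 50100)    # hdfs port (hdfs-site.xml)
    MAP_PORT=$(find_free_port 50200)     # mapper port (mapred-site.xml)
    RED_PORT=$(find_free_port 50300)     # reducer port (mapred-site.xml)

    # --- Step 2: write the master and slave node lists ---
    sed "s/MASTER/$MASTER_IP/" conf/masters_template > conf/masters
    rm -f conf/slaves
    for ip in $SLAVE_IPS; do echo "$ip" >> conf/slaves; done

    # --- Step 3: set JAVA_HOME in hadoop-env.sh and copy it to every node ---
    sed -i "s|^.*JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-6-sun|" conf/hadoop-env.sh
    for node in $MASTER_IP $(cat conf/slaves); do
        scp conf/hadoop-env.sh $node:~/hadoop/conf/
    done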
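
Steps 4)-6) follow a single substitution pattern; a sketch, with the template placeholder tokens (HOSTNAME, PORT, MAPPORT, REDPORT, USERNAME) assumed rather than taken from the actual templates:

    # --- Steps 4-6: fill each XML template and distribute the results ---
    sed -e "s/HOSTNAME/$MASTER_IP/" -e "s/PORT/$FS_PORT/" -e "s/USERNAME/$USER/" \
        conf/core-site_template.xml > conf/core-site.xml
    sed -e "s/HOSTNAME/$MASTER_IP/" -e "s/PORT/$HDFS_PORT/" -e "s/USERNAME/$USER/" \
        conf/hdfs-site_template.xml > conf/hdfs-site.xml
    sed -e "s/HOSTNAME/$MASTER_IP/" -e "s/MAPPORT/$MAP_PORT/" -e "s/REDPORT/$RED_PORT/" \
        -e "s/USERNAME/$USER/" conf/mapred-site_template.xml > conf/mapred-site.xml
    # Copy the generated files to every slave so all nodes share one view
    for node in $(cat conf/slaves); do
        scp conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml $node:~/hadoop/conf/
    done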

The script is also responsible for running Hadoop:

1) Stop the HDFS and MapReduce daemons; otherwise, daemons left running from a previous session will block this startup.

2) Format the file system (NameNode) and clean the files under /tmp/; otherwise, errors left in the previous file system will lead to failures this time.

3) Start the HDFS and MapReduce daemons.
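
These three steps map onto the stock Hadoop control scripts; a minimal sketch, with an illustrative cleanup path:

    bin/stop-all.sh                  # 1) stop daemons left over from a previous run
    rm -rf /tmp/hadoop-$USER*        # 2a) remove stale HDFS data under /tmp/
    bin/hadoop namenode -format      # 2b) re-initialize the NameNode file system
    bin/start-all.sh                 # 3) start the HDFS and MapReduce daemons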

4. Performance Test

4.1 WordCount running on CS machines

Mode / Number of Mappers on each node / Number of Reducers on each node / Number of Nodes / Size of Input Document / Execution Time (an average of 10 runs) / Speed up
Single node mode (sequential) / 1 / 1 / 1 / 160 MB / 184.9 sec / 1
Single node mode / 2 / 2 / 1 / 160 MB / 141.8 sec / 1.3
Two nodes mode / 2 / 2 / 2 / 160 MB / 102.4 sec / 1.8

Note: To get the sequential execution time, the first group of runs has only 1 mapper and 1 reducer.
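
The speed up column is the sequential execution time divided by the corresponding parallel execution time; for example, 184.9 sec / 141.8 sec ≈ 1.3 and 184.9 sec / 102.4 sec ≈ 1.8.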

Figure 1. Execution time of Hadoop WordCount on CS machines

Figure 2. Speed Up of running Hadoop WordCount on CS machines

4.2 WordCount running on FutureGrid Eucalyptus

Mode / Number of Mappers on each node / Number of Reducers on each node / Number of Nodes / Size of Input Document / Execution Time (an average of 10 runs) / Speed up
Single node mode / 1 / 1 / 1 / 80 MB / 147.2 sec / 1
Two nodes mode / 1 / 1 / 2 / 80 MB / 89.4 sec / 1.65

Note: The number of mappers and reducers is set to 1 per node to match the number of cores in each node.

Figure 3. Execution time of Hadoop WordCount on FutureGrid Eucalyptus

Figure 4. Speed Up of running Hadoop WordCount on FutureGrid Eucalyptus

5. Feedback to FutureGrid

Overall:

Compared with my past experience using Amazon EC2 and PlanetLab, I really enjoy using FutureGrid Eucalyptus.

Pros:

Eucalyptus provides very stable VM resources; I have never seen a running instance fail. The network environment is also very stable.

CPU performance and network bandwidth are very good.

I get free VM resources with no time limitation.

Cons:

The memory and disk resources are limited (for medium instances).

My instance was shut down without notification when Eucalyptus went into maintenance.