If you have any questions regarding the assignment, please post them to the “Discussion” section on Canvas so everyone can get help faster. Thanks!

Part One: Hadoop set up for Map Reduce

Deliverables:

For Part I, you need to write a report on what you have done and provide a few typical screenshots (marked in red): the Java install, the word count output (just part of it, since it is very long), the server monitoring page, and the bin/hdfs operations. All you have to do is follow the instructions and copy and paste the commands.

This part walks you through installing and setting up the Hadoop environment.

The following instructions have been tested on a DigitalOcean Ubuntu 16.04.3 x64 image. If you have a local Linux machine, you can carry out the project there instead. You can also carry out the project on Amazon EC2, DigitalOcean, or Google Cloud.

If you do not have a DigitalOcean account, you may use this link to create an account. You are supposed to get a $10 credit, although this is not guaranteed. After this, create an Ubuntu Droplet. You can choose any size, but the faster the better (I used an 8 GB / 4 CPU machine).

It is very important that you do not connect to the instance through the browser-based console. If you are using Windows, please use PuTTY:

If you are using Mac, please refer to this post:

DigitalOcean Linux is not required for this project, but it may be easier than AWS, since some students have encountered authentication issues with AWS Linux instances (based on experience from previous classes). Attention: DON’T forget to destroy the DigitalOcean Droplet (this is what they call a Linux image) if you are not going to use it for a while. You will keep being charged as long as you have any Droplets running.

The commands and operations:

Install Java & Hadoop:

sudo apt-get update

sudo apt-get install default-jdk

java -version   (screenshot)

cd /usr/local

wget

tar -xzvf hadoop-2.8.1.tar.gz

mv hadoop-2.8.1 hadoop

sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh to edit:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/

export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

...

source /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Standalone HADOOP Operation:

cd hadoop

mkdir input
cd input
wget
wget
wget ### try more times if you encounter troubles getting the files
cd ..
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount input output
cat output/*   (screenshot) — if you see some weird strings, don’t worry, you are getting the right results
Pseudo-Distributed Operation:
nano etc/hadoop/core-site.xml to add:
….
<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://localhost:9000</value>

</property>

</configuration>

sudo nano etc/hadoop/hdfs-site.xml to add:

<configuration>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

</configuration>

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

export HADOOP_PREFIX=/usr/local/hadoop

ssh localhost

exit

nano $HOME/.bashrc to add the following to the end:

…………..

# Set Hadoop-related environment variables

export HADOOP_HOME=/usr/local/hadoop

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/

# Some convenient aliases and functions for running Hadoop-related commands

unalias fs > /dev/null

alias fs="hadoop fs"

unalias hls > /dev/null

alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and

# compress job outputs with LZOP (not covered in this tutorial):

# Conveniently inspect an LZOP compressed file from the command

# line; run via:

#

# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo

#

# Requires installed 'lzop' command.

#

lzohead () {

hadoop fs -cat $1 | lzop -dc | head -1000 | less

}

# Add Hadoop bin/ directory to PATH

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

……………

source ~/.bashrc

bin/hdfs namenode -format

sbin/start-dfs.sh

sbin/start-yarn.sh

By typing the command "jps", you should be able to see ResourceManager, DataNode, NameNode, Jps, NodeManager, and SecondaryNameNode.

Check that your server is running by opening the monitoring page in your browser (screenshot):

bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/itis6320
bin/hdfs dfs -put input/* /user/itis6320
bin/hdfs dfs -ls /user/itis6320
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount /user/itis6320 output
bin/hdfs dfs -ls output
bin/hdfs dfs -rm output/*   (screenshot)
sbin/stop-dfs.sh
mv etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
sudo nano etc/hadoop/mapred-site.xml to add:

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>

sudo nano etc/hadoop/yarn-site.xml to add:

<configuration>

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

</configuration>

sbin/start-dfs.sh
sbin/start-yarn.sh

YOU WILL BE ABLE TO RUN YOUR MAPREDUCE JOB

sbin/stop-dfs.sh
sbin/stop-yarn.sh

Part Two: Movie Similarities

For this part, we will process a large corpus of movie ratings to provide recommendations. When you're done, your program will help you decide what to watch on Netflix tonight. For each pair of movies in the data set, you will compute their statistical correlation and cosine similarity (see this blog for a discussion of these and other potential similarity metrics). Since this isn't a statistics class, the calculation of the similarity metrics will be provided for Python and Java, but you need to feed it the correct inputs.

For this section of the assignment, we have two input data sets: a small set for testing on your local machine or on DigitalOcean, and a large set for running on Amazon's cloud (attention: DigitalOcean’s disk is too small for the large set). More info about the data set can be found here.

For both data sets, you will find two input files:

  • movies.csv or movies.dat contains identification and genre information for each movie.

Lines are of the form:

MovieID,MovieTitle,Genre

0068646,The Godfather (1972),Crime|Drama

0181875,Almost Famous (2000),Drama|Music

  • ratings.csv or ratings.dat contains a series of individual user movie ratings.

Lines are of the form:

UserID,MovieID,Rating,Timestamp

120,0068646,10,1365448727

374,0181875,9,1374863640

In addition to these two input files, your program should take a few additional arguments:

  • -m [movie title]: The title of a movie for which we'd like to see similar titles. You should be able to accept multiple movie titles with more than one -m argument.
  • -k [number of items]: For each of the movies specified using -m, this specifies how many of the top matches we'd like to see. In other words, running with "-m The Godfather (1972) -k 10" would be asking for "the top ten movie recommendations if you liked The Godfather." (Default 25)
  • -l [similarity lower bound]: When computing movie similarity metrics, we'll produce a floating-point value in the range [-1, 1]. This input says to ignore any movie pairings whose similarity metric is below this value. (Default 0.4)
  • -p [minimum rating pairs]: When computing similarity metrics, ignore any pair of movies that don't have at least this many shared ratings. (Default 3)

Please don't attempt to filter down to the movies specified via -m until the final step. I want you to compute the similarities for all movies. The -m argument is there to reduce the output size and make reading (and grading) the output easier. For the other arguments (-k, -l, and -p), you may filter whenever you want.
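If you use mrjob for this, one way to accept these extra switches is through passthrough arguments. The sketch below is only a suggestion, not part of the provided materials: the class and option names are made up, and it assumes a recent mrjob version that has configure_args / add_passthru_arg (older versions call these configure_options / add_passthrough_option). If any short flag collides with one of mrjob's own switches in your version, use a long-only option instead.

from mrjob.job import MRJob

class MovieSimilarities(MRJob):   # hypothetical class name
    def configure_args(self):
        super(MovieSimilarities, self).configure_args()
        # action='append' lets -m appear more than once on the command line
        self.add_passthru_arg('-m', '--movie', action='append', default=[])
        self.add_passthru_arg('-k', '--top-k', dest='top_k', type=int, default=25)
        self.add_passthru_arg('-l', '--lower-bound', dest='lower_bound', type=float, default=0.4)
        self.add_passthru_arg('-p', '--min-pairs', dest='min_pairs', type=int, default=3)

The values are then available anywhere in the job as self.options.movie, self.options.top_k, self.options.lower_bound, and self.options.min_pairs.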

Output

Since we're computing two similarity metrics, we'll need to combine them into a single similarity value somehow. For your submission, you should blend the values together, using 50% of each. That is, your final value for a pair of movies is 0.5 * statistical correlation + 0.5 * the cosine correlation for the pair.

For each movie selected with -m, sort the candidates from largest to smallest blended similarity metric, outputting only the top K (-k) most similar movies that have at least the minimum blended similarity score (-l). For movies meeting these criteria, you should output:

  • The name of the movie for which we want similar titles (specified via -m).
  • The name of a similar movie.
  • The blended similarity metric that these two movies share.
  • The statistical correlation that these two movies share.
  • The cosine correlation that these two movies share.
  • The number of ratings this pair of movies had in common.

You may format your output however you like, as long as the values are in the correct order and I can reasonably make sense of it by looking at it briefly.

Steps

You may structure your sequence of map/reduce tasks however you like. However, I recommend the following sequence of steps:

1. Join the input files: Initially, you have two input files (ratings.dat and movies.dat). You'll get most of the important info from ratings.dat, but it only has movie IDs rather than movie names. For the first step, you can assign names to the rated movies and drop the movie ID. This way, you can refer to movies by their names going forward. You probably want your reducer's output to be key: user ID, value: (movie title, rating), which you will use as the input of the next MapReduce step. Hint: each line of the ratings file has 4 items, while each line of the movies file has 3; this difference can be used to tell the two inputs apart.

2. Produce movie rating pairs: Next, you want to organize the movies into pairs, recording the ratings of each when a user has rated both movies (i.e., for each user, you will create a pair for every two movies that user has rated). This gives you the vectors to use for the similarity metrics. For example, suppose we have three users, Alice, Bob, and Charlie:

Alice has rated Almost Famous a 10, The Godfather a 9, and Anchorman a 4.

Bob has rated Almost Famous a 7 and Anchorman a 10.

Charlie has rated The Godfather a 10 and Anchorman an 8.

You would end up with records that look like:

Key: (Almost Famous, The Godfather)

Values: (10, 9)

Key: (Almost Famous, Anchorman)

Values: (10, 4), (7, 10)

Key: (Anchorman, The Godfather)

Values: (4, 9), (8, 10)

3. Protip: You'll want to ensure that the keys you output are consistent for a pair of movies, for example, by putting them in alphabetical order. Otherwise, you run the risk of having two keys for a pair of movies, e.g., (Anchorman, The Godfather) and (The Godfather, Anchorman). Having more than one key for a pair is bad, as they will be treated independently (and probably sent to different machines for processing).
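In Python, one tiny way to enforce this is shown below; movie_a and movie_b here are just whatever two titles you are about to emit (the variable names are illustrative):

# Sort the two titles so (A, B) and (B, A) always produce the same key
pair_key = tuple(sorted((movie_a, movie_b)))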

4. Compute the similarity scores: Given keys that tell you a pair of movie names and values that contain a sequence of rating pairs, each corresponding to how one user rated that pair, you now have the information you need to compute the similarity scores for that pair of movies. You will need to organize the data before the calculation. For example:

The movies “Anchorman” and “The Godfather” have 5 values: (4, 9), (8, 6), (4, 9), (8, 4), (7, 9). The input for the calculation will be [4, 8, 4, 8, 7] and [9, 6, 9, 4, 9]. Then you put these two lists into the calculation script.

In Python:

The statistical calculation in Python is as follows; you may want to add it to your computation step.
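Since the provided script is not reproduced in this handout, here is a minimal pure-Python sketch of the two metrics plus the 50/50 blend described above. The function names are mine, and the provided script may differ in details (for example, how it handles a pair with zero variance):

from math import sqrt

def pearson(r1, r2):
    # statistical (Pearson) correlation of two equal-length rating lists
    n = len(r1)
    if n == 0:
        return 0.0
    mean1, mean2 = sum(r1) / float(n), sum(r2) / float(n)
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(r1, r2))
    var1 = sum((a - mean1) ** 2 for a in r1)
    var2 = sum((b - mean2) ** 2 for b in r2)
    if var1 == 0 or var2 == 0:
        return 0.0
    return cov / sqrt(var1 * var2)

def cosine(r1, r2):
    # cosine similarity of the two rating vectors
    dot = sum(a * b for a, b in zip(r1, r2))
    norm1 = sqrt(sum(a * a for a in r1))
    norm2 = sqrt(sum(b * b for b in r2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def blended(r1, r2):
    # the 50/50 blend required by the assignment
    return 0.5 * pearson(r1, r2) + 0.5 * cosine(r1, r2)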

In Java:

For two lists of numbers r1 and r2, you can refer here for the statistical correlation and here for the cosine correlation.

5. Filter and format the output: Filter out the movies that weren't specified with the -m flag and sort the output by similarity metric so that it conforms to the desired output format.

Running it

For the small data set, you can run it locally; it will probably take a few minutes to complete. You can run the large data set locally too, if you want, but it will take at least a few hours. Instead, let's farm it out to Amazon's cloud.

Specifically, if you are writing your MapReduce job in Python, the “mrjob” module we used in the warm-up project will be helpful. The tutorial for the package can be found here; it includes links for running on Amazon EMR or Google Cloud. To run on Amazon, you'll need to tell mrjob to use "emr" as its runner. You'll also need to give it some basic configuration information. Edit your mrjob.conf with the following contents (more information here):

runners:
  inline:
    base_tmp_dir: /local
  emr:
    core_instance_type: t2.micro   ### use a faster instance if you get education credit from Amazon (refer here for more instance info)
    num_core_instances: 4          ### use more instances if you want it to be solved faster
    aws_access_key_id: [your access key]
    aws_secret_access_key: [your secret key]
    aws_region: us-east-2          ### can be another available region

Attention: YAML doesn't allow tabs, so use only spaces in mrjob.conf. By using m3.xlarge and 10 instances, you should be able to solve the large set in a few hours.

Instead of using an access key pair, you can also use AWS’s IAM to configure your account. Now you should be able to invoke mrjob with "--conf-path mrjob.conf -r emr" to run your code in the cloud.
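As a concrete but hypothetical example, if your job lives in movie_similarities.py and accepts the passthrough switches described earlier, the invocation might look like the following; the script name and option spellings are placeholders for whatever you actually implemented:

python movie_similarities.py -r emr --conf-path mrjob.conf \
    -m "The Godfather (1972)" -m "Almost Famous (2000)" -m "Anchorman" \
    -k 10 movies.csv ratings.csv > results.txt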

Deliverables:

For part II, you need to include in your report the following information:

  • The source code that you have written
  • Run your script with at least 3 movies and provide a screenshot of the result. An example is shown below (note that you should list your results with the highest average correlations, not the lowest as shown here):

Some useful links

Install Java and Hadoop:

ssh to localhost without password:

Tips:

In the warm-up project, we only used one mapper and one reducer. In this one, you will need multiple mappers/reducers, so you need to organize them by defining steps. Here is how I define them:
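My actual definition is not reproduced here. As a rough illustration only, a steps() method in mrjob could be laid out like this; MRStep is mrjob's real class for declaring multi-step jobs, but the class and method names below are placeholders:

from mrjob.job import MRJob
from mrjob.step import MRStep

class MovieSimilarities(MRJob):   # hypothetical class name
    def steps(self):
        return [
            MRStep(mapper=self.mapper_parse_line,             # Step 1: read both input files
                   reducer=self.reducer_join_title),          #         and join titles onto ratings
            MRStep(reducer=self.reducer_pair_ratings),        # Step 2: movie-pair rating vectors
            MRStep(reducer=self.reducer_compute_similarity),  # Step 3: correlation + cosine per pair
            MRStep(reducer=self.reducer_filter_requested),    # Step 4: keep only the -m movies
            MRStep(reducer=self.reducer_sort_and_print),      # Step 5: sort and print the top -k
        ]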

Also, to help you with the computation algorithm, here are the explanations of each step:

Step 1: You need to take the movie ID, user ID, movie names, and ratings from the two input files. The key to telling the two input files apart is the number of items in each line. The output of this step should be (key, value) pairs of the form movie ID, (user ID, movie name, rating). Notice that a key or value can be a tuple.
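For example, a Step 1 mapper could look roughly like the sketch below (the method name is a placeholder; also note that titles containing commas would need the csv module rather than a plain split). The Step 1 reducer can then, for each movie ID, attach the title to every (user ID, rating) it received.

def mapper_parse_line(self, _, line):
    fields = line.strip().split(',')
    if len(fields) == 4:
        # ratings line: UserID,MovieID,Rating,Timestamp
        user_id, movie_id, rating, _timestamp = fields
        yield movie_id, ('rating', user_id, float(rating))
    elif len(fields) == 3:
        # movies line: MovieID,MovieTitle,Genre
        movie_id, title, _genre = fields
        yield movie_id, ('movie', title)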

Step 2: You don’t need any mapper after Step 1. In this step, you need to create the rating pairs. The output should be (movie_name_1, movie_name_2), (movie_name_1, movie_name_2, rating1, rating2). Notice that the movie ID and user ID are dropped.
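Assuming the previous step ends up emitting key: user ID, value: (movie title, rating) as suggested in the Steps section, the pairing reducer could be sketched as below; itertools.combinations handles the “every two movies” part, and sorting first keeps the pair keys consistent (see the protip above):

from itertools import combinations

def reducer_pair_ratings(self, user_id, title_ratings):
    # title_ratings: all (movie title, rating) values this user produced in Step 1
    for (title1, r1), (title2, r2) in combinations(sorted(title_ratings), 2):
        yield (title1, title2), (title1, title2, r1, r2)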

Step 3: You need to do the computations in this step. The output should be (movie_name_1, movie_name_2), (movie_name_1, movie_name_2, average correlation, statistical correlation, cosine correlation, number of co-occurrences).

Step 4: In this step, you need to take the information from the argument parser and output the similar movies for the movies of interest.

Step 5: I add this step because the result from the last reducer is not yet sorted. Also, you can take parser information, such as the number of items (-k), and print the final results.