Wikipedia access log analysis using Apache Pig

We are going to use Pig to find highly popular pages on the Dutch Wikipedia. The dataset contains Wikipedia access logs from January through March 2011 and is 160 GB in size.

The data format is as follows:

en Barack_Obama 997 123091092

(Language Page Hits Size — the language/project code, the page title, the number of requests, and the number of bytes transferred)

Pig documentation: this. To get started, start up an EMR cluster similar to the one from the first practicum.
Go to the EMR Advanced cluster configuration screen and set the software configuration, hardware configuration, and EC2 security groups (re-use your "Hue" group from the first practical; you might have given it a different name). Alternatively, you can simply clone a cluster that you used earlier with Hue. Then launch your cluster, wait for Hue to come up, and access it.


In Hue, click on New Document/Pig script

Here is a Pig snippet that does the ugly loading part for you (also available as a plain text file):

REGISTER 's3://wikistats-lib/wikistat.jar';  -- register the jar that provides the UrlDecode UDF used below

SET default_parallel 10;  -- use 10 reduce tasks by default

SET job.name 'wikistats';

SET mapred.max.map.failures.percent 1;  -- tolerate up to 1% failed map tasks

-- load the logs; the '-tagFile' option of PigStorage prepends the source file name to every record
raw = LOAD '$DATASET' USING PigStorage(' ','-tagFile') AS (filename:chararray, lang:chararray, page:chararray, hits:long, size:long);

-- keep only the columns we need (drop the size)
rawf = FOREACH raw GENERATE filename, lang, page, hits;

-- keep only main Wikipedia projects (language codes without a dot) with more than one hit
onlywp = FILTER rawf BY (NOT (lang MATCHES '.*\\..*') AND hits > 1);

-- URL-decode the page title and derive the hour from the file name (pagecounts-yyyyMMdd-HHmmss.gz)
decoded = FOREACH onlywp GENERATE lang, nl.cwi.da.wikistat.UrlDecode(page) AS page, ToDate(SUBSTRING(filename,11,22),'yyyyMMdd-HH','UTC') AS date, hits;

-- keep only simple page titles (letters, digits, '_' and '-')
filtered = FILTER decoded BY (page MATCHES '[A-Za-z0-9_-]+');

-- show a small sample of what the script produces
ILLUSTRATE filtered;

-- your code here

-- finally something like this

-- STORE finalresult INTO 's3://yours3bucket/results42' using PigStorage();

When you run the script, the UI (a pop-up window) will ask for a dataset; use the small one for testing:

s3://wikistats-2011/pagecounts-20110201-170000.gz

Slightly bigger (once the script works on the small dataset):

s3://wikistats-2011/pagecounts-2011020*-170000.gz

Later, the big one:

s3://wikistats-2011/pagecounts-*.gz
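
The '$DATASET' in the LOAD statement is an ordinary Pig parameter, which is why the UI prompts for a value. As a minimal optional sketch (not required for the assignment), you can also give the parameter a default at the top of the script with Pig's parameter-substitution preprocessor, for example pointing at the small test file:

%default DATASET 's3://wikistats-2011/pagecounts-20110201-170000.gz'

A value supplied with -param on the command line takes precedence over such a %default.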

Task details

Using Pig,

●filter the dataset (variable "filtered") to only include pages from the Dutch Wikipedia (lang equals 'nl')

●filter out the Wikipedia Main page (page not equal to 'Hoofdpagina')

●for every month, list the five most popular pages (those with the largest sum of hits in that month); one possible shape for this step is sketched after this list

●Bonus, as before: visualize the results or link them to real events

●Bonus II: Filter out other meaningless pages that you might find in the results
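
For the grouping step, here is a rough sketch of one possible shape (not the required solution): it assumes the filtered relation from the loading snippet above, and the names nlpages, monthly, bymonth and top5 are purely illustrative.

-- Dutch Wikipedia only, without the main page
nlpages = FILTER filtered BY (lang == 'nl') AND (page != 'Hoofdpagina');

-- sum the hits per (year, month, page)
monthly = FOREACH (GROUP nlpages BY (GetYear(date), GetMonth(date), page)) GENERATE FLATTEN(group) AS (year, month, page), SUM(nlpages.hits) AS monthhits;

-- per month, keep the five pages with the highest monthly hit count
bymonth = GROUP monthly BY (year, month);
top5 = FOREACH bymonth {
  sorted = ORDER monthly BY monthhits DESC;
  best = LIMIT sorted 5;
  GENERATE FLATTEN(best);
};

On the small dataset, DUMP top5; is a quick way to check that the output has the shape you expect before scaling up.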

Hints:

●For the bigger dataset, resize your cluster to 10 workers (you can do this after the job has started running).

●Use ILLUSTRATE liberally to see what is happening to your data

●ILLUSTRATE might not show all tuples. On the small dataset, you can use DUMP variable; to show them all.

●On the large dataset, the Hue status will show 0% for a long time. Don't despair: you can check the Job Browser to see your jobs' progress. Also check the "monitoring" tab in the EMR cluster status.

Put the script, the output and the log into your report. And as always, make sure to shut down your cluster once you're done.

Here is what the "Container Pending" Graph (in the “monitoring” tab on the EMR cluster status page) should show after some time...