Team Members:

Nick Crowder, Szu-Kai “Andy” Hsu, Will Mecklenburg, Jeff Morris, David Nguyen

Computational Linguistics

Senior Capstone CS 4984

Virginia Polytechnic Institute and State University

Blacksburg, VA

Team Hurricane

December 12, 2014

Table of Contents

  1. Executive Summary
  2. User Manual
       Introduction
       Running Files
       Files for Users
  3. Developer Manual
       Introduction
       Files for Developers
  4. Introduction
       Group Collaboration
       Tools Used
  5. Unit 1 - Summarize with Word Frequencies
       Intro
       Data
       Conclusion
  6. Unit 2 - Summarize with Corpus Characteristics
       Intro
       List of Stopwords
       Results
       Conclusion
       First Cleaning Attempt
  7. Unit 3 - Summarize with Nouns and POS Tagging
       Intro
       Results
       Problems Encountered
       Conclusion
  8. Unit 4 - Classifying and Using N-Grams
       Intro
       Results
       Conclusion
  9. Unit 5 - Summarize with Named Entities
       Intro
       Results
       Problems Encountered
       Second Cleaning Attempt
       Conclusion
  10. Unit 6 - Summarize with Topics
       Intro
       Results
       Conclusion
  11. Unit 7 - Summarize with Indicative Sentences
       Intro
       Results
       Conclusion
  12. Unit 8 - Summarize with Filled in Templates
       Intro
       Results
       Conclusion
  13. Unit 9 - Summarize with an English Report
       Intro
       Results
       Issues
       Conclusion
  14. Conclusion
       Analysis of Project Management Tools
       Analysis of Summarization Techniques
       Solution Analysis
  15. Lessons Learned
  16. Acknowledgments
  17. References

Executive Summary:

This report presents the findings and results of our semester-long project to automatically generate summaries of two different collections of documents about hurricanes. The collections summarized include a small corpus about Typhoon Haiyan, containing ~2,000 files, and a large corpus about Hurricane Sandy, containing ~70,000 files. One topic mentioned continually throughout the paper is the cleaning of the corpora, as they initially contained many duplicates and a great deal of noise. Since many of the methods discussed in the paper do not run effectively on noisy datasets, we spent considerable time and effort cleaning the corpora.

The paper begins with the simpler methods used to summarize the collections: finding the most frequent words, using WordNet, and using a part-of-speech tagger. Next, the paper discusses how we used a classifier to determine whether documents were about hurricanes. We then describe using latent Dirichlet allocation (LDA) with Mahout to extract topics from the two collections, followed by k-means clustering with Mahout on the sentences in the two corpora to find a centroid, a representation of the average sentence in the corpus.

The last two techniques we describe both involve extracting information from the corpora with regular expressions. We first used the extracted data to fill out a template summarizing the two corpora. Then we used the same data alongside a grammar to generate a summary of each of the corpora. After each summarization technique is discussed in detail, the conclusion summarizes the strengths and weaknesses of each technique and then addresses the strengths and weaknesses of our specific solution for summarizing the two corpora. Lastly, we discuss the lessons we learned throughout the semester.

User Manual:

Introduction:

To use our Python scripts, a user must have Python 2.7 installed on their machine. The user must have the NLTK dependencies installed and must have downloaded the NLTK tools and corpora via nltk.download(). The user needs gensim installed to run certain files (specified by the file name). Lastly, the user needs access to a Hadoop cluster to run some of the files (specified by the file name). Before running any scripts, edit the paths within properties.py to point to the local corpora.
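
The commands below show one way to set up these dependencies from a terminal; this is a minimal sketch that assumes pip is available and does not pin specific versions:

pip install nltk gensim
python -c "import nltk; nltk.download()"

Running nltk.download() with no arguments opens the NLTK downloader, from which the required tools and corpora can be selected.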

Running Files:

All modules can be run from the command line using the default Python syntax:

python filename.py

If an IDE is being used, just run the specific Python file.

To run the mappers and reducers on Hadoop, see this example command:

hadoop jar pathToHadoopJar.jar -mapper mapper.py -reducer reducer.py -input InputDirectory/* -output outputDirectory/ -file mapper.py -file reducer.py

Replace pathToHadoopJar.jar with the path to the Hadoop streaming jar for that cluster, and replace mapper.py and reducer.py with the desired mapper and reducer files.
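
For orientation, the general shape of a Hadoop Streaming mapper and reducer is sketched below. This is an illustrative word-count pair, not our actual Mapper.py and Reducer.py:

# mapper.py (illustrative): emit each word with a count of 1
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))

# reducer.py (illustrative): sum counts per word; Hadoop Streaming sorts
# mapper output by key, so identical words arrive on consecutive lines
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))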

Files for Users:

  1. Classifier.py: creates a feature set for the input corpus and outputs the accuracy of the classifier
  2. LDAtutorialCode.py: outputs topics and the probability of each word occurring in the topics from the input corpus; gensim must be installed to run it
  3. Mapper.py: passes each noun to reducer.py; must be run on a Hadoop cluster with reducer.py
  4. Ngrammapper.py: passes each n-gram from the input corpus to ngramreducer.py; must be run on a Hadoop cluster with ngramreducer.py
  5. Ngramreducer.py: counts occurrences of each n-gram passed in by ngrammapper.py; must be run on a Hadoop cluster with ngrammapper.py
  6. Properties.py: contains paths to corpora and tools such as Stanford NER; the paths must be changed for personal use (see the example after this list)
  7. Reducer.py: counts occurrences of whatever items the mapper it is paired with passes to it, and outputs those counts
  8. Regex.py: extracts information from the input corpus using regexes and outputs a summary of the corpus using the grammar in the file
  9. RegexTemplate.py: extracts information from the input corpus using regexes and outputs the information in a pre-built template
  10. U2YourSmallStats.py: runs statistical analysis on the input corpus and plots information such as the number of words of each length and the percentage of words that are not stopwords
  11. Unit2.py: outputs the percentage of times each word in our relevant word list appears relative to the total words in the input corpus
  12. Unit3.py: outputs the most frequent nouns in the input corpus that are not in the stopword list
  13. Unit4.py: takes a small training set of labeled negative and positive documents, trains a classifier, runs the classifier on the rest of the collection, and outputs the accuracy of the classifier
  14. Unit5.py: uses chunking on the input corpus to find different patterns and count their occurrences
  15. Unit5Mapper.py: uses chunking on the input corpus to find different patterns and passes each one to reducer.py; must be run on a Hadoop cluster with reducer.py
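
As an example of the edit Properties.py requires, a hypothetical configuration is shown below; the variable names and paths are illustrative assumptions, not the actual contents of the file:

# Hypothetical example only; use the variable names Properties.py actually defines.
YOUR_SMALL_PATH = '/path/to/YourSmall'
YOUR_BIG_PATH = '/path/to/YourBig'
CLASS_EVENT_PATH = '/path/to/ClassEvent'
STANFORD_NER_PATH = '/path/to/stanford-ner'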

Developer Manual:

Introduction:

A developer must have Python 2.7 installed on their machine, along with the NLTK dependencies, and must have downloaded the NLTK tools and corpora via nltk.download(). The developer should also have gensim installed to work with some files (specified next to the file name) and needs access to a Hadoop cluster to work with certain files (specified next to the file name).

A developer interested in improving or extending the current project should begin by looking at Regex.py. Regex.py uses a grammar to summarize the corpora and regexes to pull out information to be used by the grammar.
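
The sketch below is not the code in Regex.py; it only illustrates the general pattern of pulling facts out of text with regular expressions and slotting them into generated English. The expressions and the sentence pattern are simplified assumptions:

import re

text = "Typhoon Haiyan made landfall on November 8 with winds of 195 mph."

# Simplified, illustrative patterns; Regex.py uses its own expressions and grammar.
wind = re.search(r'winds of (\d+)\s*mph', text)
date = re.search(r'landfall on (\w+ \d+)', text)

if wind and date:
    print('The storm made landfall on %s with winds of %s mph.'
          % (date.group(1), wind.group(1)))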

Files for Developers:

  1. Baseline.py: outputs the percentage of times each word in our relevant word list appears relative to the total words in the Brown, Reuters, and State of the Union corpora
  2. Classifier.py: creates a feature set for the input corpus and outputs the accuracy of the classifier
  3. LDAtutorialCode.py: outputs topics and the probability of each word occurring in the topics from the input corpus; gensim must be installed to run it
  4. Mapper.py: passes each noun to reducer.py; must be run on a Hadoop cluster with reducer.py
  5. Ngramfinder: finds and outputs n-grams (bigrams as-is) that do not contain stopwords
  6. Ngrammapper.py: passes each n-gram from the input corpus to ngramreducer.py; must be run on a Hadoop cluster with ngramreducer.py
  7. Ngramreducer.py: counts occurrences of each n-gram passed in by ngrammapper.py; must be run on a Hadoop cluster with ngrammapper.py
  8. Properties.py: contains paths to corpora and tools such as Stanford NER; the paths must be changed for personal use
  9. Reducer.py: counts occurrences of whatever items the mapper it is paired with passes to it, and outputs those counts
  10. Regex.py: extracts information from the input corpus using regexes and outputs a summary of the corpus using the grammar in the file
  11. RegexTemplate.py: extracts information from the input corpus using regexes and outputs the information in a pre-built template
  12. SentenceTokenization.py: splits up the input corpus by sentences
  13. Setence.py: makes a file for each sentence in the corpus and filters sentences by length; the sentence files can be put in a directory and have k-means run on them
  14. Statistics.py: returns the percentage of words in the input corpus not in the stopword list
  15. TextUtils.py: contains functions to filter numbers, words with non-alphabetic characters, and non-alphabetic characters from the input lists
  16. TextUtilsTag.py: contains functions for filtering empty lines and words with specified tags
  17. U2YourSmallStats.py: runs statistical analysis on the input corpus and plots information such as the number of words of each length and the percentage of words that are not stopwords
  18. Unit2.py: outputs the percentage of times each word in our relevant word list appears relative to the total words in the input corpus
  19. Unit3.py: outputs the most frequent nouns in the input corpus that are not in the stopword list
  20. Unit4.py: takes a small training set of labeled negative and positive documents, trains a classifier, runs the classifier on the rest of the collection, and outputs the accuracy of the classifier
  21. Unit5.py: uses chunking on the input corpus to find different patterns and count their occurrences
  22. Unit5Mapper.py: uses chunking on the input corpus to find different patterns and passes each one to reducer.py; must be run on a Hadoop cluster with reducer.py
  23. WordDB.py: contains a list of stop words that we created, used for filtering out noise

Introduction:

This paper pertains to the Computational Linguistics course, CS 4984, taught by Dr. Fox in Fall 2014. The course covers the use of various linguistics tools to analyze corpora, specifically file collections scraped from the Internet. The focus for our group was to analyze two different collections of documents about hurricanes that made landfall. We were initially given a ClassEvent corpus containing a collection of files about heavy rains in Islip, New York. The purpose of the ClassEvent was to rapidly test techniques on a small, clean set of data before attempting them on the small and large collections.

The YourSmall corpus, our smaller collection of files, contained two thousand files pertaining to Typhoon Haiyan; however, after removing insignificant files and duplicates, only 112 remained. The other collection, named YourBig, primarily focused on Hurricane Sandy. YourBig initially contained ~70,000 files but was reduced to 2,000 after deleting irrelevant files. We wrote all of our code in Python using Eclipse and gedit (on the cluster).

Our first attempt at summarizing the corpora used Hadoop, an open-source distributed storage and MapReduce framework, to process the large sets of data, with the goal of finding patterns in the data and discovering the higher-level topics of the corpora. Following this, we began using natural language processing (NLP) tools such as k-means clustering and Mahout to analyze the corpora in more meaningful ways. These efforts culminated in a generated template describing the event, which was filled in using data extracted from the documents.

Group Collaboration:

To work effectively as a group, we utilized a variety of techniques and tools to coordinate our efforts. Initially, we set up a Google Drive account to manage all of our files; however, we quickly realized this was ineffective for code management, so we then created a Git repository to maintain the Python scripts. In addition, we used Trello, an online task board, for assigning and tracking jobs. Lastly, we used GroupMe, an online messaging client with SMS features, for centralized communication.

Tools Used:

Hadoop: Hadoop is an open-source framework that we used to manage the 11-node cluster.

Cloudera: A way to run Hadoop locally, mainly for testing purposes.

LDA (latent Dirichlet allocation): LDA is used for obtaining topics from our corpora.

Mahout: Mahout is a tool used for clustering, classification, and filtering.

Python 2.7: A high-level programming language that is compatible with all of the previously mentioned tools.

WordNet: WordNet groups words with other semantically equivalent words.
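
As a quick illustration of the WordNet lookup (assuming NLTK 3; older versions expose lemmas and names as attributes rather than methods), a minimal synonym query for a seed word such as "storm" looks like this:

from nltk.corpus import wordnet

# Collect the lemma names from every synset of the seed word.
synonyms = set()
for synset in wordnet.synsets('storm'):
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())
print(synonyms)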

Unit 1

Intro:

This unit focused on the basics of linguistics and analyzing text. The first task was familiarizing ourselves with Python and the NLTK toolkit. The biggest issue we faced when running the various built-in commands (such as creating a frequency distribution) was the number of stop words showing up in our results. Valuable words such as “rain” were getting pushed towards the bottom of the most frequent words list by words such as “and”. To remedy this issue, we used NLTK’s stopword list to filter these words out at runtime. Another thing we began to notice in this unit was the importance of bigrams: in our results, we found that the preceding word could easily determine the meaning of the word that followed.
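
A minimal sketch of that filtering, assuming NLTK 3 and a corpus that has already been tokenized into lowercase words (the token list below is a tiny stand-in):

from nltk import FreqDist
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

# A tiny stand-in for the real tokenized corpus.
tokens = ['the', 'heavy', 'rain', 'and', 'wind', 'caused', 'flooding', 'and', 'damage']
content_words = [w for w in tokens if w.isalpha() and w not in stop]
print(FreqDist(content_words).most_common(10))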

Data:

We found that longer words tend to carry more meaning for a summary of the corpus. Additionally, collocations were much more indicative of the corpus than single words.

Conclusion:

We discovered that, without filtering out stop words, frequency is a poor descriptor of what the dataset is about. Looking at frequency alone also gives no context for how words are used. Collocations proved to be more relevant because the first word gives context to the second word and vice versa.
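
A sketch of how such collocations can be surfaced with NLTK's bigram collocation finder; the token list is again a small stand-in for the real corpus:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# A tiny stand-in for the real tokenized corpus.
tokens = 'heavy rain and strong wind brought heavy rain and coastal flooding'.split()
measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
print(finder.nbest(measures.pmi, 5))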

Unit 2

Intro:

Unit 2 focused on word frequency and relations. We started by manually creating a list of hurricane-related words. We then used WordNet to expand our list with similar words. Finally, we searched YourSmall and ClassEvent for the words in our list and recorded the relative frequency (# word occurrences / # total words) of each word. We also searched the Brown corpus for the words in our list to establish a baseline for their frequency in an unrelated corpus.
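
A sketch of the relative-frequency computation against the Brown baseline, with the word list truncated for illustration:

from nltk import FreqDist
from nltk.corpus import brown

# Frequency distribution over the lowercased Brown corpus.
freq = FreqDist(w.lower() for w in brown.words())
total = float(freq.N())

for target in ['hurricane', 'storm', 'rain']:  # truncated word list
    print('%s / %.4f%%' % (target, 100 * freq[target] / total))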

Our list of stopwords:

In addition to the Stanford NER list of stopwords, we had to add our own list of stopwords because of our unique collections. Many of these are web specific, since our corpus documents were scraped from the Internet (Figure 1).

'rss','subscribe','comments','abc','copyright','desktop',
'reddit','tweet','pinterest','careers','sitemap','email',
'glossary','youtube','facebook','imagesfullscreen','afpgetty',
'apfullscreen','news','people','said','epafullscreen','photos',
'trending','celebrity','agencyfullscreen','stories','captcha',
'audio','image','E-mail this to a friend','usa now','connect ',
'afp\/getty'

Figure 1. Custom stop word list

We were still getting a large amount of garbage in our most frequent words list because we had not yet cleaned our data effectively. This caused us to start expanding our list of stopwords.
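
One way to apply the expanded list is to merge it with a standard stopword list before filtering; NLTK's English list and a shortened custom list are used here for illustration:

from nltk.corpus import stopwords

# A shortened version of the custom web-specific stopword list in Figure 1.
custom_stopwords = {'rss', 'subscribe', 'tweet', 'facebook', 'said'}
stop = set(stopwords.words('english')) | custom_stopwords

tokens = ['subscribe', 'to', 'our', 'rss', 'feed', 'heavy', 'rain', 'flooded', 'tacloban']
print([w for w in tokens if w not in stop])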

Results:

Word / YourSmall / ClassEvent / Brown Baseline
hurricane / 0.0061% / 0.0246% / 0.0003%
storm / 0.0421% / 0.2539% / 0.0034%
rain / 0.0123% / 0.9628% / 0.0738%
wind / 0.0242% / 0.0295% / 0.0169%
eye / 0.0013% / 0.0059% / 0.0273%
tropical / 0.0068% / 0.0354% / 0.0022%
flood / 0.0059% / 0.6438% / 0.0044%
haiti / 0.0% / 0.0% / 0.0%
florida / 0.0% / 0.0% / 0.0%
damage / 0.0309% / 0.1063% / 0.0085%
injury / 0.0% / 0.0% / 0.0011%
displaced / 0.0026% / 0.0% / 0.0016%
water / 0.0184% / 0.3721% / 0.0292%
category / 0.0013% / 0.0% / 0.0016%
cyclone / 0.0069% / 0.0059% / 0.0%
death / 0.0113% / 0.0236% / 0.0103%
cloud / 0.0014% / 0.0532% / 0.0038%
monsoon / 0.0% / 0.0059% / 0.0%
blow / 0.0018% / 0.0% / 0.0049%
wave / 0.0053% / 0.0059% / 0.0077%

Table 1. Percentage of our words in each collection

Conclusion:

We did not account for the difference in jargon used for storms in different locations. When a storm occurs in the Pacific it is called a typhoon, and when it occurs in the Atlantic it is called a hurricane. This caused some of our expected words to have lower than expected relative frequencies.

We were able to determine the words that are frequently associated with the word “hurricane”. Those words tend to relate to some kind of destruction or loss, along with any word pertaining to a large storm. In addition, we began to realize that our word counts were not accounting for the varied ways of writing the same word, such as “Nov” and “November”.

One thing that initially surprised us was how infrequently the words from our WordNet expansion appeared. However, since the articles are pulled mainly from news sources, it is likely that they tend to use the most straightforward language in order to provide an easy reading experience for their audience. This would cause some words to be very prominent in the most frequent words list while their synonyms rarely occur.

First Cleaning Attempt:

We realized that we desperately needed to clean up YourBig in order to obtain accurate results when running the most frequent words script. Additionally, since we still had tens of thousands of files, everything we did on the Hadoop cluster took a large amount of time to process. We initially decided to delete files based on file size. We arbitrarily chose 10 KB as a starting point and looked at a large number of files around 10 KB to see if they contained relevant material. We determined that many files at 10 KB contained useful articles, so we moved our cutoff down to 8 KB. Most files around that size did not contain pertinent articles, so we proceeded to delete all files 8 KB and under. This left us with 30,000 files in YourBig and 1,500 in YourSmall. We also looked at the largest files and determined that they were all large sets of raw data, so we deleted those outliers as well. This removed only a few files from YourBig and none from YourSmall.
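
A sketch of the size-based filter, assuming the corpus is a local directory of files; the path is a placeholder, and the deletions are irreversible:

import os

CORPUS_DIR = '/path/to/YourBig'  # placeholder path
SIZE_CUTOFF = 8 * 1024           # 8 KB, the cutoff described above

for name in os.listdir(CORPUS_DIR):
    path = os.path.join(CORPUS_DIR, name)
    if os.path.isfile(path) and os.path.getsize(path) <= SIZE_CUTOFF:
        os.remove(path)          # delete files 8 KB and under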

Unit 3

Intro:

In Unit 3 we focused on tagging the words in our corpora by part of speech. We then used our part of speech (POS) tagging method to summarize the corpora by the most frequent nouns, as nouns are typically the most indicative of the content of a corpus.

We trained our POS tagger on a subset of 100 files from YourSmall, using a trigram tagger that backed off to a bigram tagger, then a unigram tagger, and finally defaulted to tagging words as nouns. After training the tagger, we ran it on ClassEvent and YourSmall and determined that nouns made up 50.4% of the words in the ClassEvent file and 55.6% of the words in the YourSmall file. We also noticed that nouns were more indicative of the topics of the corpora than verbs, adjectives, or any other part of speech, confirming the hypothesis for the unit. We also implemented a MapReduce solution to find the most frequent nouns in YourBig.
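
A minimal sketch of that backoff chain; for illustration it trains on NLTK's tagged Treebank sample rather than on our tagged YourSmall subset:

import nltk

# Tagged training sentences; the Treebank sample stands in for our training data.
train_sents = nltk.corpus.treebank.tagged_sents()

# Backoff chain: trigram -> bigram -> unigram -> default tag 'NN' (noun).
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

print(t3.tag('Heavy rain flooded the coastal streets'.split()))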