Frequent Word Combinations Mining and Indexing

Hemanth Gokavarapu Santhosh Kumar Saminathan

School of Informatics and Computing

Indiana University Bloomington

{hemagoka, sasamina}@indiana.edu

Abstract

The Google Autocomplete algorithm offers searches similar to the words that you might be typing. It works on frequent word combinations.

This inspired us to learn and implement the techniques involved in this process, namely frequent word combination mining and indexing. This document first describes the concept behind the project and then provides the methods and implementation details. We chose HBase as the distributed, open-source database, since it provides BigTable-like functionality on top of Hadoop.

Key Words: alphabetically sorted, excluding words, Apriori, mining.

1. Project Goal

The problem of finding the frequency of word combinations is considered one of the major problems in the field of cloud and distributed computing, as existing solutions only find the frequency of a single word. We cannot find the individual frequency of each word in a combination and derive the result from those counts. For example, for the combination 'cloud computing', we cannot count 'cloud' and 'computing' separately and combine the results, because the same words may also occur in combinations such as 'distributed cloud', 'computing field', and 'method of computing'.
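To make this concrete, the sketch below counts two-word combinations directly from a token stream; the class and method names are our own illustration, not part of any existing library.

    import java.util.HashMap;
    import java.util.Map;

    public class BigramCounter {
        // Counts two-word combinations directly from a token stream.
        public static Map<String, Integer> countBigrams(String[] tokens) {
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i + 1 < tokens.length; i++) {
                String bigram = tokens[i] + " " + tokens[i + 1];
                counts.merge(bigram, 1, Integer::sum);
            }
            return counts;
        }

        public static void main(String[] args) {
            String[] tokens = {"cloud", "computing", "in", "the",
                               "distributed", "cloud", "computing", "field"};
            // "cloud computing" -> 2, although "cloud" and "computing" each
            // appear twice on their own; the unigram counts alone cannot tell
            // us which pairs actually co-occurred.
            System.out.println(countBigrams(tokens));
        }
    }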

Our project focuses on finding the combinations of words without the error stated above. We use the concept of data mining and the Apriori algorithm to implement this project.

2. Survey

In this project we survey various topics before moving to the implementation. The survey topics include the Apriori algorithm, HBase, and MapReduce. In addition to the survey of these topics, we also carry out a detailed analysis of the design trade-offs in choosing HBase, Hadoop, or Twister for this project.

3. Approach

The concept of data mining is used in this project. Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. It also analyzes the relationships and patterns in stored transaction data based on open-ended user queries.

4. Architecture Design

The Apriori algorithm is used in this project to find the frequency of combinations of words. It is an influential algorithm for mining frequent itemsets for Boolean association rules. These association rules, which are derived from frequent itemsets, form a widely applied data mining approach. The algorithm performs a level-wise search using the frequent-itemset property: every subset of a frequent itemset must itself be frequent.

The Apriori algorithm computes candidate itemsets, which are refined at every iteration; when the candidate set becomes empty, the loop ends. The algorithm exploits this large-itemset property and is easily parallelized.
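A minimal, single-machine sketch of this loop is given below; all names are our own, transactions are documents represented as sets of words, and the sketch illustrates only the candidate-generation and pruning steps, not the parallel version.

    import java.util.*;

    public class AprioriSketch {
        // Returns all word sets whose support meets minSupport. A transaction
        // is a document (or line) represented as a set of words.
        public static Set<Set<String>> frequentItemsets(List<Set<String>> transactions,
                                                        int minSupport) {
            Set<Set<String>> result = new HashSet<>();
            // Level 1: count single words.
            Map<Set<String>, Integer> counts = new HashMap<>();
            for (Set<String> t : transactions)
                for (String w : t)
                    counts.merge(Collections.singleton(w), 1, Integer::sum);
            Set<Set<String>> frequent = filter(counts, minSupport);
            while (!frequent.isEmpty()) {          // loop ends when no candidates survive
                result.addAll(frequent);
                Set<Set<String>> candidates = generateCandidates(frequent);
                counts.clear();
                for (Set<String> t : transactions) // count candidate support in one pass
                    for (Set<String> c : candidates)
                        if (t.containsAll(c)) counts.merge(c, 1, Integer::sum);
                frequent = filter(counts, minSupport);
            }
            return result;
        }

        // Join step: union pairs of frequent k-itemsets into (k+1)-itemsets,
        // then prune any candidate with an infrequent k-subset (the Apriori property).
        private static Set<Set<String>> generateCandidates(Set<Set<String>> frequent) {
            Set<Set<String>> candidates = new HashSet<>();
            int k = frequent.iterator().next().size();
            for (Set<String> a : frequent)
                for (Set<String> b : frequent) {
                    Set<String> union = new HashSet<>(a);
                    union.addAll(b);
                    if (union.size() == k + 1 && allSubsetsFrequent(union, frequent))
                        candidates.add(union);
                }
            return candidates;
        }

        private static boolean allSubsetsFrequent(Set<String> candidate,
                                                  Set<Set<String>> frequent) {
            for (String w : candidate) {
                Set<String> subset = new HashSet<>(candidate);
                subset.remove(w);
                if (!frequent.contains(subset)) return false;
            }
            return true;
        }

        private static Set<Set<String>> filter(Map<Set<String>, Integer> counts,
                                               int minSupport) {
            Set<Set<String>> out = new HashSet<>();
            for (Map.Entry<Set<String>, Integer> e : counts.entrySet())
                if (e.getValue() >= minSupport) out.add(e.getKey());
            return out;
        }
    }

A MapReduce version would distribute the support-counting pass, which dominates the running time, across the input splits.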

HBase is a non-relational, distributed Hadoop database modeled after Google's Bigtable. The internal architecture of HBase is outlined below.

HBase handles two kinds of files: one for the write-ahead log and the other for the actual data storage. The HRegionServers primarily handle these files, but in certain scenarios even the HMaster has to perform low-level file operations. The actual files are in fact divided into smaller blocks when stored within the Hadoop Distributed File System (HDFS).
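As an illustration of how mined combination counts could be stored in and read back from HBase, the sketch below uses the standard HBase client API; the table name 'word_combinations' and column family 'f' are placeholders of our own, and the table is assumed to already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FrequencyStore {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("word_combinations"))) {
                // Row key: the word combination itself; one column holds its count.
                Put put = new Put(Bytes.toBytes("cloud computing"));
                put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("count"),
                              Bytes.toBytes(42L));
                table.put(put); // the write-ahead log records this before the store files

                Result r = table.get(new Get(Bytes.toBytes("cloud computing")));
                long count = Bytes.toLong(r.getValue(Bytes.toBytes("f"),
                                                     Bytes.toBytes("count")));
                System.out.println("cloud computing -> " + count);
            }
        }
    }

Using the combination itself as the row key also helps indexing: HBase keeps rows sorted lexicographically, so all combinations sharing a prefix can be retrieved with a single scan.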

5. Timeline

We are going to follow the timeline mentioned below.

1 week – Talking to the experts at FutureGrid.

1 week – Survey of HBase, the Apriori algorithm, and other design problems.

3 weeks – Implementation of the algorithm.

2 weeks – Testing the code, evaluating it, and gathering the results.

6. Validation Methods

There are many validation methods that could be applied to this project. We follow the basic approach of providing a large data input and checking the results against the expected output, as sketched below.
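One concrete form of this check, under our own naming, recounts a small sample with the single-machine counter sketched in Section 1 and compares it against the distributed job's output; any mismatch points to a bug in the pipeline.

    import java.util.Map;

    public class ValidationCheck {
        // Recounts bigrams on a small sample with the trusted single-machine
        // counter and compares against the distributed job's output.
        public static boolean agrees(String[] sampleTokens,
                                     Map<String, Integer> jobOutput) {
            Map<String, Integer> expected = BigramCounter.countBigrams(sampleTokens);
            return expected.equals(jobOutput);
        }
    }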
