Interim Design Report for the

Idaho State Board of Education Document Search Tool

Prepared by: Data Miners Senior Design Team

Team Members:

Dallas Stinger

Wenlong Huang

Aaron Phillips

Date: December 10, 2010

Executive Summary

The Data Mining team designed and implemented a software application that allows the Idaho State Board of Education to search a large collection of text-based documents. The application (ISBEDigger) is designed to run on Windows XP, Windows Vista, and Windows 7. It was written in the C# programming language, using the Microsoft .NET Framework.

The software has several run-time phases. The first is called indexing. Our software pre-analyzes all available documents, recording each unique word in the document, where it occurs within the document, and how many times it occurs. The results of this phase are stored in an index file, which is used to search the documents for keywords.

The second phase is searching. The software utilizes multiple techniques to search documents based on inexact search parameters. The first is what is known as “stemming”. Stemming involves finding the root word of the given search string, as well as “stems” of those words, in order to perform a broad search of the index file. Our software also uses a large thesaurus to build a list of synonyms, which are also used to search the index file.

The third phase is the results phase. When the user performs a search, a list of documents found to contain relevant information is displayed. The displayed files are ranked according to how well they reflect the initial search parameters. When the user selects a file, the text from the original file is read in and displayed, with relevant keywords highlighted.

ISBEDigger is currently able to recognize several different file-types. It can index Microsoft Word, Excel, and plaintext files, as well as PDF files that do not require optical character recognition.

Table Of Contents

1.0 Introduction

1.1 Background……………….………….……………………….………..…….…………...………....

1.2 Objective.....................................................................................................................................

1.3 Run-Time Modes…………………………….…….….…………….……..…….….………….….

2.0 Problem Definition

2.1 Search a Collection of Documents Efficiently……………….………..…….…….….…

2.2 Search Text Without Exact String Specification…………………………...…….…....

2.3 Handle Multiple File-Types……………………………………………….……...…….….…..

2.4 Multi-User Access From Windows-Based Computers….….………….….……..….

2.5 Correlation Between Documents…………………………………..………..……….……..

3.0 Concepts Considered

3.1 Operating System and Underlying Framework……………………..…….….…….…

3.2 Native VS. Web-Based Interface……………………………………..………..……….…….

3.3 Search a Collection of Documents Efficiently………………….…….…………..….….

3.4 Finding Useful Results Without Using Exact String Searching….….….…..……

3.5 Handling Multiple File-Types……………………………………………….….…..….……...

3.6 Correlation Between Documents………………………………………….…….…………

4.0 Concept Selection

4.1 Operating System and Underlying Framework………………….……………..…...

4.2 Native VS. Web-Based Interface…….…………………………………...……….……..…

4.3 Search a Collection of Documents Efficiently……………………..……….…….…...

4.4 Finding Useful Results Without Using Exact String Searching….……...……..

4.5 Handling Multiple File-Types………………………………………………….……....……

4.6 Correlation Between Documents…………………………….…….……..……….………

5.0 Selected Design

5.1 User Interface……………………………………………………….……….………………..…..

5.2 Reverse Indexing………………………………………………….……………..….………..….

5.3 Searching……………………………………………………….……..…….…….……………..….

5.4 Reading File-Types…………………………….….…….……….……………….…………..…

6.0 Future Work

6.1 Scheduled Delivery………………………………………….………….….….…….………..…

6.2 Future Semesters…………………………………..……………….…….…….………..………

Appendix A – Requirements Document………………………..…….…….…….……..………….

Appendix B – Searching Flowchart………………………………..…………….….…..…………….

1.0 Introduction

1.1 Background

The client for this project was the Idaho State Board of Education. The State Board currently has a large collection of data stored on their internal network. This large data set contains several different file-types. It includes meeting minutes from State Board meetings, as well as documents related to decisions brought to the board for approval. Due to the large size and semi-unorganized structure of this information, it is difficult to thoroughly search through and find relevant information at a later date which is a necessary task. Therefore, there is a desire to have a tool to search through the documents quickly for relevant information.

1.2 Objective

The goal of the Data Miners senior design team was to develop a piece of software that assists the State Board employees in searching the collection of documents for relevant information. We did this by creating an easy-to-use search tool that runs on most Windows-based computers, and can search several modern types of electronic text-based documents. We designed the software to search not only using the given search parameters, but also to allow the user to search effectively without knowing an exact text string. We did this by utilizing a well-known stemming algorithm, as well as through the use of a thesaurus. Our objective was to go through one iteration of the research/design/implementation/testing cycle, and deliver a working prototype by the end of the semester.

1.3 Run-Time Modes

The ISBEDigger program has three distinct modes of run-time operation. The first mode, which is required for basic functionality, is called indexing. Indexing involves pre-analyzing all documents that the user has selected for searching. The indexer will create a list of all unique words, where they occur, and how many times they appear in each document. This information is stored in a compact index file, which is used during the search phase. Once the user has indexed the desired files, he/she can enter search parameters using one of two search modes. The second mode is searching.

The software will then scan the index file using the given parameters, as well as parameters that are calculated using stemming and a thesaurus. Relevant documents are presented to the user in order of calculated relevancy. The third mode allows the user to select documents to preview, as well as access the original files. If the user chooses to preview a document, the document text will be displayed, with relevant words highlighted.

2.0 Problem Definition

Below we outline several major project requirements. See Appendix A for a complete listing.

2.1 Search a collection of documents efficiently

In order to find relevant information currently, a State Board employee must search through the shared network drive that contains the set of documents representing all meeting minutes and board decisions. This requires knowing which document to look in, when it was created, and where exactly in the document the pertinent information resides (some documents are over 1,600 pages in length). Our software needs to be able to search through all of this information and provide accurate results in a timely manner. Our goal was to provide search results in under one second, given reasonable search parameters.

2.2 Search text without exact string specification

An important requirement that our software needs to satisfy is the ability to provide useful search results without requiring exact keywords from the user. We spent significant resources researching and implementing software that would broaden the search results to include text that was related to the given search parameters.

2.3 Handle multiple file-types

The State Board has been accumulating electronic documents for almost 20 years. This means that the documents we need to search represent a wide variety of file-types. Our software should read Microsoft Word, Excel, PDF, WordPerfect, and plaintext files. Our software must also be able to display a preview of a selected document, as well as open the original file. This requirement involved a significant amount of research into programmatic processing of these file-types.

2.4 Multi-user access from Windows-based computers

The computers that the State Board employees use run Microsoft Windows, and as such our software must be able to run in that environment. We also must also allow for multiple employees to access the software at the same time.

2.5 Correlation between documents

The documents that are created by the State Board often have shared or related content between separate files. The State Board employees would like to be made aware of related information, as well as time-sensitive information that exists in the documents.

3.0 Concepts Considered

3.1 Operating System and Underlying Framework

One of the major requirements we were given was that our software must be able to run on Windows-based machines. We took this to mean Windows XP, Windows Vista, and Windows 7. This limited the choice of languages and libraries available to us. We considered several popular programming platforms:

· C#:

o A Microsoft product, C# is a large, well-documented platform that was an excellent candidate for our project. Two members of our team have prior experience with the language, and it is built from the ground-up to work on Windows. It also provides support for COM components, which would turn out to be crucial to our ability to support Word and Excel files.

· C++:

o All three team members have prior experience with C++, and it is able to run in a Windows environment. It is also the fastest of all the languages that we considered, which made it an excellent option, considering the data-processing-intensive indexing and searching that were required

· Python:

o Python was considered due to the ease with which we could have created a web interface for our application. The downside to Python is that because it is an interpreted language, it runs slower than our other candidates.

· Java:

o Java is well documented, and is also well supported in the Windows environment. However, it is not as tightly-couple with Windows as C#, and does not provide the speed boost that C++ does. Also, only one team member had prior experience with Java.

3.2 Native vs. Web-Based Interface

One of the requirements given was that the State Board wanted the application to be accessible from the web. This differs from the traditional solution, which would be to run the software on each user’s machine. There were several concepts to consider when we made this decision:

· Permissions:

o Some of the employees that use our software may not have access to all possible documents. Some documents will have file permissions that restrict viewing. This means that a system would have to be developed to determine who has access to what when accessing remotely. This would likely involve a database that logs remote user credentials.

· Indexing:

o If the application was web-based, the software would only have to index once day on a server, instead of once per user. This would simplify the indexing process.

· Machine Learning:

o If the software runs on a user’s machine, it becomes much easier to keep track of which documents an individual accesses most, and weight them accordingly in future searches.

· Research and Implementation Time Required:

o The time required to implement a web-based interface would be significantly higher than a native one. The prior experience of the team members, challenges that come with client-server communication, and security are all hindrances to a web based interface.

3.3 Searching Documents Efficiently

The major algorithmic problem we encountered was how to best search through the large collection of documents. We researched and considered two approaches:

· Real-Time Search:

o Real-time searching is simple and easy to implement. This algorithm involves opening each file in the system, and scanning each word for matches with the search criteria. The benefit of this method is that there is no initial overhead. Unfortunately, there becomes a significant increase in search time as the size of the text grows.

· Reverse-Indexing:

o Reverse-indexing is the process of analyzing all documents before doing any searching. The words, as well as their location and frequency, are recorded in an index file, which is then searched at run-time. This algorithm has a large initial overhead, but greatly decreases search time, and makes it possible to do inexact searching.

3.4 Finding Useful Results without Exact String Searching

Another major requirement that our system needed to satisfy was the ability to search for text without knowing exact keywords. We researched and implemented two separate methods in order to accomplish this:

· Stemming:

o Stemming involves finding the root word of all search parameters, as well as “stems” of this word, which are the combinations of the root word with possible suffixes. This allows us to provide search results that are extremely close in content, without needing an exact match between search keywords and results. We had the option of using one of several known stemming algorithms, or creating our own custom algorithm. A custom algorithm would provide results that are tailored to the context of the State Board, but would require much more time to implement.

· Thesaurus

o We decided to use a thesaurus to generate a larger set of keywords when searching. Like with the stemming algorithm, we were faced with the choice of creating our own thesaurus, which would have words relating to our specific domain, or using a pre-built thesaurus, which is much easier to implement.

3.5 Handling Multiple File-Types

We were given a broad range of file-types that needed to work with our system. We considered several file-types, as well as methods of implementation.

· Microsoft Word and Excel

o Microsoft Word and Excel files are two of the more common file-types used by the State Board. We considered writing a piece of software that would parse these files, as well as existing solutions. Writing a parser for these files would be an extremely time-consuming task, as Word/Excel files have a vast array of options that must be accounted for. The two viable options that we had were using COM interop components (provided by Microsoft), or use third-party software (pay-to-use).

· PDF

o Most of the sample files we had access to were of this type. There were several different options to consider here. First, we had to decide whether or not to attempt to read Optical Character Recognition (OCR) files, which are PDFs that are created from scanned images. The simple case, which are text-based PDF’s, are much easier to read. We also had to consider whether we wanted to write our own parser, or use third-party software.

· WordPerfect

o WordPerfect files are not abundant in the State Board’s system, as most have been converted to PDF or Word documents. Also, since it is an older file-type, it will be more difficult to incorporate with our system. These facts placed WordPerfect files low on our priority list.

· Plaintext

o Plaintext files are not abundant on the system, but are extremely simple to read.

We also faced a design decision in regards to how best structure our code to allow for easy re-use and extension. We designed a class hierarchy that would work well for the file-types we described above and allows more to be added in the future.

3.6 Correlation between documents

We considered several different methods for correlating related documents. We wanted a way to group documents together both automatically and manually. We proposed the concept of “tagging,” as well as date-correlation. Tagging involves creating and maintaining a set of tags that are applied both manually and automatically (by using a set of rules to classify documents) to documents that are added to the system. This would allow employees to group documents that contain information on similar projects or ideas. Date-correlation essentially means grouping documents together based on the dates that they were created or last modified.