JavaJumpStart Session #2

Building an Index for Words

This project differs quite a bit from the project you did during the first session. It will have no GUI, will interact with the user only via command-line arguments, will read and write files on disk, and, therefore, will be an application with a main method instead of an applet. In the first version of this project, the program will open the specified file, read the data word by word, and build a sorted index of words that lists every line number on which each word appears. To give you an idea of what the output to the screen looks like, here is the start of the word index produced for the Robert Frost poem “The Road Not Taken”.

a [1, 16, 18]

about [10]

ages [17]

all [20]

and [2, 3, 4, 7, 8, 11, 17, 18, 20]

another [13]

as [4, 6, 9]

back [15]

be [3, 16]

because [8]

bent [5]

better [7] and so forth

In the bonus part of this project, you will add a filter that will not put common English words in the index. Here is the start of the index when the filter is used:

ages [17]

another [13]

back [15]

because [8]

bent [5]

better [7]

black [12]

both [2, 11]

claim [7] and so forth

The steps described below assume you start with the program illustrated in class, WordFrequency2, and modify it to build a word index.

(Step 1) Some overall changes

·  Change the class name and the file name from WordFrequency2 to WordIndex

·  Change the name of the static class Count to the static class Index

(Step 2) Detailed changes to the static class Index

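·  The first field in the class can remain the String named word, but the second field will now be a collection of line numbers. It is suggested you use a TreeSet for this field (give it an appropriate name), since a TreeSet keeps its elements sorted and silently ignores duplicates, so a word that appears twice on one line is listed only once for that line.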

·  Change the name of the constructor as appropriate

·  The second parameter to the constructor will represent the line number where a new word is first encountered in the text being read, so you will have to create your new TreeSet and insert this value as the first value.

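·  Hint: the TreeSet is expecting to store objects, but the primitive type int is not an object. You can fix this by using the wrapper class Integer, as in: new Integer(<int variable>). On Java 5 and later, autoboxing performs this conversion automatically, and Integer.valueOf(<int variable>) is preferred over the constructor, which is deprecated as of Java 9.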
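
The changes in Step 2 might look roughly like the sketch below. It is only one possible shape, not the required solution: the outer class name IndexSketch and the field name lines are assumptions made for illustration, and the generic type TreeSet<Integer> assumes Java 5 or later.

```java
import java.util.TreeSet;

// Hypothetical wrapper class; in the lab, Index is a static class inside WordIndex.
class IndexSketch {

    static class Index {
        String word;             // the word being indexed
        TreeSet<Integer> lines;  // line numbers, kept sorted ("lines" is an assumed name)

        // lineNumber is the line where the word is first encountered
        Index(String word, int lineNumber) {
            this.word = word;
            this.lines = new TreeSet<>();
            this.lines.add(Integer.valueOf(lineNumber));
        }
    }

    public static void main(String[] args) {
        Index index = new Index("a", 16);
        index.lines.add(Integer.valueOf(1));
        index.lines.add(Integer.valueOf(18));
        index.lines.add(Integer.valueOf(16)); // duplicate line number, ignored by the set
        System.out.println(index.word + " " + index.lines); // prints: a [1, 16, 18]
    }
}
```

Note that the line numbers were added out of order and the set still prints them sorted, which is the reason for choosing a TreeSet here.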


(Step 3) Detailed changes to the main method declarations

·  Change the declaration of a HashMap (which is unsorted) to a TreeMap (which is sorted)

·  Move the declaration of the BufferedReader inside the try block; change the constructor argument from “new InputStreamReader(System.in)” to “new FileReader(args[0])”, since the program will attempt to open the file named by the first command-line argument

·  Add a new int variable lineNumber and initialize it to zero

·  Change the declaration “count” of type Count to “index” of type Index
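
As a rough illustration of the Step 3 declarations, the sketch below shows a TreeMap returning its keys in sorted order and a BufferedReader opened on args[0] inside a try block. The class name DeclarationsSketch and the helper sortedKeys are hypothetical, and the loop only counts lines as a stand-in for the real word-gathering code.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.TreeMap;

// Hypothetical class name; in the lab these declarations live in WordIndex.main.
class DeclarationsSketch {

    // Hypothetical helper: shows that a TreeMap hands back its keys in sorted order.
    static String sortedKeys(String... words) {
        TreeMap<String, Integer> map = new TreeMap<>();
        for (String w : words) {
            map.put(w, Integer.valueOf(0)); // the value is a placeholder here
        }
        return map.keySet().toString();
    }

    public static void main(String[] args) {
        System.out.println(sortedKeys("bent", "a", "because", "about"));
        // prints: [a, about, because, bent]

        if (args.length < 1) {
            System.out.println("usage: java DeclarationsSketch <text file>");
            return;
        }
        int lineNumber = 0; // starts at zero, incremented once per line read
        try {
            // Declared inside the try block because opening the file can fail
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            while (in.readLine() != null) {
                lineNumber++;
            }
            in.close();
            System.out.println(args[0] + " has " + lineNumber + " lines");
        } catch (IOException e) {
            System.err.println("Could not read " + args[0] + ": " + e.getMessage());
        }
    }
}
```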

(Step 4) Changes to the while loops that gather data

·  Remember to increment the line number at an appropriate place

·  Change “count” to “index” and Count to Index, as appropriate

·  If it is the first time the word is put in, call the Index constructor with appropriate values

·  If the word has already appeared in the index, add the line number to the TreeSet field of its index entry (don’t forget to convert the int to an Integer object)
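
The gathering loops of Step 4 could be sketched as follows. Since WordFrequency2 is not reproduced here, several details are assumptions: the use of StringTokenizer for the inner loop, the delimiter string, and the names GatherSketch, buildIndex, and lines. The method takes a BufferedReader so it can be tried on a string; in the lab the reader would wrap new FileReader(args[0]).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical class name, for illustration only.
class GatherSketch {

    static class Index {
        String word;
        TreeSet<Integer> lines = new TreeSet<>();

        Index(String word, int lineNumber) {
            this.word = word;
            lines.add(Integer.valueOf(lineNumber));
        }
    }

    // Assumed set of word separators; the class example may use a different one.
    static final String DELIMS = " \t\n\r.,;:!?\"()[]{}-";

    static TreeMap<String, Index> buildIndex(BufferedReader in) {
        TreeMap<String, Index> map = new TreeMap<>();
        int lineNumber = 0;
        try {
            String line;
            while ((line = in.readLine()) != null) {      // outer loop: one pass per line
                lineNumber++;                             // count the line as soon as it is read
                StringTokenizer st = new StringTokenizer(line, DELIMS);
                while (st.hasMoreTokens()) {              // inner loop: one pass per word
                    String word = st.nextToken().toLowerCase();
                    Index index = map.get(word);
                    if (index == null) {
                        // first time this word is seen
                        map.put(word, new Index(word, lineNumber));
                    } else {
                        // word already indexed: record one more line number
                        index.lines.add(Integer.valueOf(lineNumber));
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }

    public static void main(String[] args) {
        String text = "the cat sat\nthe dog\na cat";
        TreeMap<String, Index> map = buildIndex(new BufferedReader(new StringReader(text)));
        System.out.println(map.get("the").lines); // prints: [1, 2]
        System.out.println(map.get("cat").lines); // prints: [1, 3]
    }
}
```

Incrementing lineNumber immediately after readLine succeeds means the first line is numbered 1, which matches the sample output above.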

(Step 5) Changes to the printout of results once the entire file is read

·  Change “count” to “index” and Count to Index, as appropriate

·  The item you print out will no longer be count.i but will be the TreeSet in your index
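
For the printout in Step 5, a minimal sketch is shown below. Printing a TreeSet directly produces the bracketed, comma-separated form seen in the sample output, so no extra formatting code is needed. The names PrintSketch, format, and lines are assumptions for illustration.

```java
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical class name, for illustration only.
class PrintSketch {

    static class Index {
        String word;
        TreeSet<Integer> lines = new TreeSet<>();

        Index(String word, int lineNumber) {
            this.word = word;
            lines.add(Integer.valueOf(lineNumber));
        }
    }

    // Builds one "word [line, line, ...]" row per entry, in the map's sorted key order.
    static String format(TreeMap<String, Index> map) {
        StringBuilder out = new StringBuilder();
        for (Index index : map.values()) {
            out.append(index.word).append(' ').append(index.lines).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TreeMap<String, Index> map = new TreeMap<>();
        Index a = new Index("a", 1);
        a.lines.add(Integer.valueOf(16));
        a.lines.add(Integer.valueOf(18));
        map.put(a.word, a);
        map.put("about", new Index("about", 10));
        System.out.print(format(map));
        // prints:
        // a [1, 16, 18]
        // about [10]
    }
}
```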

Test your program and make sure it works correctly on various inputs. Three English-language text files are provided: “The Road Not Taken” by Robert Frost, “The Gettysburg Address” by Abraham Lincoln, and “I Have a Dream” by Martin Luther King, Jr. You can also test your program on a Java source file and see what it produces.

BONUS SECTION: Add the ability to read a file of “common words” for the given application and put in the index only the words that do not appear in that file.

Change the main method in the following ways

·  Change the class and file name to WordIndexWithFilter

·  Declare a HashSet called keywords; a HashSet is unsorted, which is all we need

·  Inside the try block, declare a BufferedReader called filter; use args[0] to open filter and args[1] to open in, the reader that already exists

·  Create an initial set of nested while loops that read in the words to be filtered and add them to the keywords HashSet; you can use the existing nested loops as a model, but this will be simpler. Don’t forget to convert each word to lower case and to trim off any white space.

·  In reading in the source text, add an if command that checks the condition !keywords.contains(word) before putting a word in the index
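
One way the filter could fit together is sketched below, under several assumptions: the names FilterSketch, readKeywords, and buildIndex are hypothetical, the delimiter string is a guess, and a bare TreeSet stands in for the Index object to keep the example short. Both methods take a BufferedReader so they can be tried on strings; in the lab, filter would wrap new FileReader(args[0]) and in would wrap new FileReader(args[1]).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical class name, for illustration only.
class FilterSketch {

    // Reads the words to be filtered out; lower-cased and trimmed before storing.
    static HashSet<String> readKeywords(BufferedReader filter) {
        HashSet<String> keywords = new HashSet<>();
        try {
            String line;
            while ((line = filter.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(line);
                while (st.hasMoreTokens()) {
                    keywords.add(st.nextToken().trim().toLowerCase());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return keywords;
    }

    // Builds the index, skipping any word found in keywords.
    static TreeMap<String, TreeSet<Integer>> buildIndex(BufferedReader in,
                                                        HashSet<String> keywords) {
        TreeMap<String, TreeSet<Integer>> map = new TreeMap<>();
        int lineNumber = 0;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                lineNumber++;
                StringTokenizer st = new StringTokenizer(line, " \t.,;:!?\"()-");
                while (st.hasMoreTokens()) {
                    String word = st.nextToken().toLowerCase();
                    if (!keywords.contains(word)) {   // the filter check
                        TreeSet<Integer> lines = map.get(word);
                        if (lines == null) {
                            lines = new TreeSet<>();
                            map.put(word, lines);
                        }
                        lines.add(Integer.valueOf(lineNumber));
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }

    public static void main(String[] args) {
        BufferedReader filter = new BufferedReader(new StringReader("The\nand a"));
        BufferedReader in = new BufferedReader(new StringReader(
                "Two roads diverged in a yellow wood,\nAnd sorry I could not travel both"));
        TreeMap<String, TreeSet<Integer>> index = buildIndex(in, readKeywords(filter));
        System.out.println(index.get("both"));        // prints: [2]
        System.out.println(index.containsKey("and")); // prints: false
    }
}
```

Because both the keywords and the source words are lower-cased before comparison, "And" at the start of a line is filtered just like "and" in the middle of one.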

Test your program and make sure it works correctly on various inputs. Two files for filtering are provided: common_english_words.txt for the 100 most common English words and java_keywords.txt for the reserved words in Java. Here is a sample command line:

java WordIndexWithFilter common_english_words.txt road_not_taken.txt

CONGRATULATIONS! You have completed the lab project associated with the second session of the JavaJumpStart Tutorial.