CPSC 335
Assignment 1: Extendible Hashing
Due: Friday, February 5th, 2016, 9:00 am via D2L
Introduction
The purpose of this assignment is to investigate the dynamic hashing algorithm. A hash table uses a hash function to map keys to table addresses. If the keys are character string identifiers, then there is a very large number of possible keys, much greater than the size of a hash table. The hash function maps the large number of possible strings into table addresses. A good hash function spreads out the entries in the table, avoiding collision (keys mapping to the same table address) as much as possible. When the entries are spread out, the hash function can be used to quickly enter and find keys in the table, in about O(1) time. The dynamic (extendible) hashing allows increasing or decreasing the size of the file dynamically, allocating only requires space for hash table. The method is described in detail in your course Drozdek textbook.
Program
Write a program in Java that implements an extendible hash symbol table. You are free to choose any hash function that you like, that returns the bitmask of the size of 32 bits for each word hashed (the bitmask will not be unique). Your program should do the following:
1. Take the following three parameters from the command line: the name of the file to take the keys from file_name, the number of keys to hash number_keys and the size of the bucket bucket_size. The program should be named as1. The keys represented as words, for this assignment you can assume that the size of the word is limited by 20 characters.
2. A file containing over 10000 English words used by the gnu ispell program is posted on the Course web site. You can use this file to test your program. The second smaller data set from the Horowitz book is also provided. You can use it for debugging.
3. Read keys from the file file_name sequentially. Enter each of the keys into the extendible hash table. Start with the table containing one key only, hash a word into the 32 bitmask (32 bits), place it in the first bucket, and continue hashing keys into this bucket until it is full. When the bucket becomes full, split it by two (following the rules described in the textbook) and create a second reference from the hash table onto the new bucket. Grow your data structure until the number of keys identified by the user: number_keys is reached.
4. Each bucket should be exactly 128 bytes, and can store variable number of words (stored character by character) depending on the word length. One of the ways to index words inside the bucket is to store
length_of_the_word1, word1, length_of_the_word2, word 2, ….
You can then perform sequential search within a bucket to find a specific word.
5. You must create an index within the bucket and performing binary search for a word in it to optimize processing
6. Re-read the set of keys of the specified size from the file, and look up each one in the hash table. Compute and print the average number of comparisons required to look up a key in the input file.
Written Report
Written report is an essential part of your assignment. Type and submit an electronic report (1-2 pages) following submission instructions posted on the web site that includes the following:
1. A brief description of your hash function to hash a word into 32 bitmask and justification of your choice (1 paragraph).
2. A brief description of your indexing method (within a bucket) and justification of your choice (1-2 paragraphs).
3. The average number of probes (obtained by experiments) required to look up keys in the large data file (words.txt) provided on the Course web site.
4. Analysis (theoretical) of how changing the bucket size would influence the number of probes required to look up keys and influence efficiency of hashing.
Hand in
Submit your source code and the electronic report according to your TA instructions. Read course late assignment policy in the Course Outline.
Marking
Assignment grades will be based on the two submitted components as described in the marking sheet. Source code that does not compile or produces run-time errors will receive maximum 3 points. Code that produces incorrect output will receive a maximum grade of 5 points.
Collaboration
The assignment must be done individually so everything that you hand in must be your original work, except for the code copied from the text. When someone else's code is used, you must acknowledge the source explicitly. Copying another student's work is an academic misconduct. Contact your TA if you have problems getting your code to work.