CS 3100 Programming Project #4
Fall 2001
HASHING SOCIAL SECURITY NUMBERS
DUE DATES:
· Level 03 Program is due Monday, December 3. Submit all your source files plus a script showing your test runs. You must demonstrate adequate testing of the program. Use the same procedure we have been using to send your files to me electronically. Send your file as an email attachment before midnight.
· Final Level Program is due Monday, December 10. Submit all your source files plus a script showing your test runs. You must demonstrate adequate testing of the program. Use the same procedure we have been using to send your files to me electronically. Send your file as an email attachment before midnight.
Write a program that implements a hash table of social security numbers.
The lists of social security numbers will be provided to you. Except for the input file of social security numbers, all input will come from standard input and all output will go to standard output. Here are the contents of a sample input file.
ssn15 – file containing 15 social security numbers
20
15
259-97-6223 808-93-0709 636-86-3660 399-88-2968
041-39-9350 397-67-7732 522-23-1899 762-23-8562
927-06-2355 087-75-3173 296-59-8912 052-61-3718
910-26-9707 080-42-6115 917-34-3630
Here are some sample (interactive) operations:
s
r 917-34-3630
r 911-34-3623
i 911-43-6298
d 808-93-0709
q
The first number in the input file (20 in the example) is an integer, minTableSize. After reading it the program will calculate the table size, and allocate a hash table of that size. The table size must be the smallest prime number not less than minTableSize. In the example above, the table size will be 23.
The second number in the input (15 in the example) is an integer, numKeys. After reading numKeys, the program will read numKeys keys (social security numbers) from the input, formatted as shown, and insert each one into the hash table. (No two of these keys are allowed to be equal.) There must be exactly numKeys keys following numKeys. In the example, there are exactly 15 keys.
The program will calculate the hash address by treating the key as a 9-digit number N and taking N % tableSize as the address. In the example above, 041-39-9350 hashes to the address
41399350 % 23 = 17
N.B.: C++ interprets an integer with a leading zero as an octal constant, and so the expression above would be incorrect if written as 041399350 % 23 = 17.
The program will use counters to keep track of how many keys have hashed to each possible address. The possible addresses are 0..tableSize-1. Note that these counter values are independent of the collision resolution policy. When a key is hashed, a counter corresponding to the original hash address is incremented, regardless of where the program decides to actually place the key.
After processing the numKeys key values, the program will output a tabular report showing how many of the locations in the hash table had their counter end up equal to
0, 1, 2,.. up to the largest size of any counter. The report will also compare that data with the numbers predicted by the "perfect randomizer" expression:
exp(-alpha) * (alpha ** k) / k!
This expression, simply put, is the expected frequency of locations in the table whose counter stops at k, if the hash function has no "bias." The frequency is expressed as a decimal fraction of the table size. The quantity alpha is the "load factor", i.e., the ratio of numKeys to tableSize.
For example, suppose that minTableSize is 1000 and numKeys is 800. Then tableSize is 1009, alpha = 800/1009 = 0.793, and for k = 3 the formula exp(-alpha) * (alpha ** k) / k! = (0.453) * (0.498) / 6 = 0.038 (approximately).
Thus, if we are hashing 800 keys into a table with 1009 slots, we would expect that a hash function that is a perfect randomizer would result in about 38 positions with counter equal to 3.
Your program's output will show the number 38 and the actual number of locations that had counter size 3, so that a comparison of the experimental and ideal values can be made. The report will be set up so that this comparison can be done for all counter sizes that occur.
One good way to do this is to print a 4-column report: first column is for the number k, second column for the predicted values, third column for the actual values, and fourth column expressing the actual value as a percentage of the ideal value.
After printing the tabular report, the program will process commands coming from the input. The possible commands are s, r, i, d, and q.
The s command stands for show. In response to this command, the program will print a representation of what is in the hash table. This should be a two-column report: first column is table indices, and second column is key(s) stored in the location given by the index.
A couple of things to note here: First, in this case you will be reporting on the locations where the keys are actually stored. If you are using open addressing, this is not necessarily the address the key hashes to initially. Second, if you are doing chaining, there could be more than one key in a given slot.
The s command is not meant to be used for large tables, so just write the code so it uses one line for each occupied array slot. Also, write the code so it entirely skips any empty slot. No index is written in that case and no line is used up on the output. The program has to write the social security number in "standard form" -- that is with dashes in the right places.
The r command stands for retrieve. This command requires a single social security number (a key) to appear right after the r. The program will find the key in the table, report which address the key hashes to, the current value of the counter for that address, and also which address the key is actually stored in. Each of these three pieces of information must be appropriately labeled. For example:
hash address: 4 ; counter: 3 ; location: 6
If the key is not found, an appropriate message must be printed.
The i command stands for insert. This command, just like r, requires a key value next in the input. The program will insert the key into the table, unless the table already contains such an element, or unless the table is full. If insertion fails the program will print an appropriate message. If insertion succeeds the program will print the same three pieces of information as the r command. Of course the counter value will be the new value: one greater than the old value, because of the insertion.
The d command stands for delete. This command requires a parameter just like r and i. In this case the program will delete the element if it is present. Like r and i, it will print the hash address, new counter value, and actual location. If the operation fails because the key was not found (the only valid reason) then the program will print an appropriate message.
The q command stands for quit.
There can be any number of s, r, i, and d commands after the initial input. They can appear mixed into any order. The numbers minTableSize and numKeys will both be positive integers and numKeys will be less than minTableSize.
ECHOING
Except for the list of keys that comes after numKeys, the program will echo each line of input before writing the output called for by that line of input. The output for different commands will be appropriately delimited by lines of pound signs (#) or some such character.
CALCULATION OF PRIMES
In this program you need to have a function that inputs a number m and returns the smallest prime number p >= m. We can discuss in class too, and I can give you some simple pseudo code.
DYNAMIC ALLOCATION OF ARRAYS
This program has to allocate space for the hash table based on the size value from the input. See page 171 in your textbook for information on how to allocate an array dynamically.
CHOICE OF IMPLEMENTATION
If you want to save time writing this program, consider using chaining as your collision resolution policy. Unless you get a lot of help by reusing classes from somewhere, chaining is probably the easiest way to implement. Deletion is a little complex when open addressing is used. You can use the linked list class from the textbook to implement chains.
Page 2 of 4