UNIT III Hashing and Sets

Objectives

To know about hashing

To know the difference between static and dynamic hashing

To know about path expressions

1. Hashing

Hashing is a method to store data in an array so that sorting, searching, inserting and deleting data is fast. For this, every record needs a unique key. The basic idea is not to search for the correct position of a record with comparisons but to compute the position within the array. The function that returns the position is called the 'hash function' and the array is called a 'hash table'.
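As a minimal illustration (assuming integer keys; real hash functions for strings and other types are more involved), a hash function can be as simple as taking the key modulo the table size:

unsigned int hash( unsigned int key, unsigned int table_size )
{
    return key % table_size;    /* the position is computed, not searched for */
}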

Types of hashing

Separate chaining

Open addressing

  • Linear probing
  • Quadratic probing
  • Double hashing

Separate chaining

Separate-chaining hashing uses an array as the primary hash table, except that the array is an array of lists of entries, each list initially empty. When an entry is inserted, it is inserted at the end of the list at the index corresponding to the hash code of the key in question. Searching the hash table now involves walking down the list at the given index; with good design, it should be a relatively short list.

In the strategy known as separate chaining, direct chaining, or simply chaining, each slot of the bucket array is a pointer to a linked list that contains the key-value pairs that hashed to the same location. Lookup requires scanning the list for an entry with the given key. Insertion requires adding a new entry record to either end of the list belonging to the hashed slot. Deletion requires searching the list and removing the element.

Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods.

The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket—that is, on the load factor.

For separate chaining, the worst-case scenario is when all entries are inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries, so the worst-case cost is proportional to the number n of entries in the table.
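The find and insert routines below assume declarations along the following lines (a sketch only: element_type is taken to be int here, and each list is assumed to carry a dummy header node, which is why find starts at L->next):

typedef int element_type;                /* assumed key type for this sketch */

typedef struct list_node *position;
struct list_node
{
    element_type element;
    position     next;
};
typedef position LIST;                   /* a list is named by its header node */

typedef struct hash_tbl *HASH_TABLE;
struct hash_tbl
{
    unsigned int table_size;
    LIST        *the_lists;              /* one header-node pointer per bucket */
};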

position find( element_type key, HASH_TABLE H )
{
    position p;
    LIST L;

    L = H->the_lists[ hash( key, H->table_size ) ];
    p = L->next;                                 /* skip the header node */
    while( (p != NULL) && (p->element != key) )  /* &&, not the bitwise & */
        p = p->next;
    return p;                                    /* NULL if not found */
}

void insert( element_type key, HASH_TABLE H )
{
    position pos, new_cell;
    LIST L;

    pos = find( key, H );
    if( pos == NULL )                            /* insert only if absent */
    {
        new_cell = (position) malloc( sizeof( struct list_node ) );
        L = H->the_lists[ hash( key, H->table_size ) ];
        new_cell->next = L->next;                /* link at the front of the list */
        new_cell->element = key;                 /* probably need strcpy for string keys */
        L->next = new_cell;
    }
}

Open addressing

In another strategy, called open addressing, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table. The name "open addressing" refers to the fact that the location ("address") of the item is not determined by its hash value.

Well-known probe sequences include:

  • Linear probing, in which the interval between probes is fixed (usually 1)
  • Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the starting value given by the original hash computation
  • Double hashing, in which the interval between probes is computed by another hash function

A drawback of all these open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array. In fact, even with good hash functions, their performance dramatically degrades when the load factor grows beyond 0.7 or so. Thus a more aggressive resize scheme is needed. Separate chaining works correctly with any load factor, although performance is likely to be reasonable only if it is kept below 2 or so. For many applications, these restrictions mandate the use of dynamic resizing, with its attendant costs.

Open addressing schemes also put more stringent requirements on the hash function: besides distributing the keys more uniformly over the buckets, the function must also minimize the clustering of hash values that are consecutive in the probe order. Using separate chaining, the only concern is that too many objects map to the same hash value; whether they are adjacent or nearby is completely irrelevant.

(i) Linear probing

The need for a rehash function arises when a collision occurs, that is, when two or more records hash to the same cell allocated in the hash table. Recall that the hash function stores a record by taking the remainder of the key's numerical equivalent divided by the number of allocated slots. The rehash function is then the method for finding a second, third, and so on, location for the record. One rehashing technique is linear probing, where rehashing is done by looking for the next empty space the record can occupy. The rehash function is the following:

rehash(n) = (n + 1) % k, where n is the position just tried and k is the number of allocated slots.

This method works in such a way that if the first location is not free, it goes to the next location and checks whether that one is free, and so on, until it finds a free location or finds none at all. By convention, an empty space is given the value -1, while a deleted item's space is given -2. In this way, finding an empty space is easy, and the search for a stored item is easier as well.

The following example illustrates the algorithm.

Suppose we have a hash table that can hold 9 entries, and the data to be stored are integers. To insert 27, we use hash(27) = 27 % 9 = 0, so 27 is stored at position 0. If 18 arrives next, then since 18 % 9 = 0 a collision occurs and rehashing is needed. Using linear probing, rehash(0) = (0 + 1) % 9 = 1; since position 1 is empty, 18 is stored there.

To retrieve data, the hash and rehash functions are used in the same way. Continuing the example, retrieving 18 is done by applying the hash function and checking whether the data at that position is the data sought; if not, the rehash function is applied, and so on, until either the correct location is found or an empty space (one with the value -1) is encountered, which means the data does not exist. This is sound because the search follows exactly the same path that was used to store the data, so encountering an empty space proves that the data is absent.

The -2 value for deleted items is useful because, when traversing the hash table, encountering a deleted cell does not end the traversal.
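A minimal sketch of linear probing over a table of 9 integer slots, using the -1 (empty) and -2 (deleted) conventions above (the names lp_find, lp_insert and TABLE_SIZE are illustrative):

#define TABLE_SIZE 9
#define EMPTY     (-1)
#define DELETED   (-2)

/* Probe forward from hash(key) until the key or an EMPTY slot is found.
   Returns the index of the key, or -1 if it is not in the table. */
int lp_find( int table[], int key )
{
    int i, pos = key % TABLE_SIZE;
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        if( table[pos] == key )
            return pos;
        if( table[pos] == EMPTY )         /* DELETED slots do not stop the search */
            return -1;
        pos = (pos + 1) % TABLE_SIZE;     /* rehash(n) = (n + 1) % k */
    }
    return -1;                            /* wrapped all the way around */
}

/* Insert key into the first EMPTY or DELETED slot on its probe path.
   Returns the index used, or -1 if the table is full. */
int lp_insert( int table[], int key )
{
    int i, pos = key % TABLE_SIZE;
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        if( table[pos] == EMPTY || table[pos] == DELETED )
        {
            table[pos] = key;
            return pos;
        }
        pos = (pos + 1) % TABLE_SIZE;
    }
    return -1;
}

With the example above, lp_insert stores 27 at position 0 and then, on the collision, stores 18 at position 1.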

(ii) Quadratic probing

This is a method of accessing data in nearly O(1) time when the data must be administered dynamically (mostly by inserting new or deleting existing data). The only data structure that guarantees O(1) access is the array, but access is O(1) only if the index of the element is known; otherwise a searching algorithm must be used (such as binary search), which has a complexity of O(log n).

Each data item to be stored is associated with a key to which a hash function is applied; the resulting hash value is used as an index to store the object in one of a number of "hash buckets" in a hash table. So for each object, the index at which it is stored in the hash table can be calculated. Sometimes more than one object has the same hash value; in that case a collision-resolution procedure must determine the position for the new object. With quadratic probing, free spaces are searched for starting from the collision point, using a quadratic step: the hash values probed for an object are calculated as h, h+1, h+4, h+9, ... .
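A sketch of the corresponding search with quadratic probing (same table conventions as the linear-probing sketch above; the offsets 0, 1, 4, 9, ... realize the sequence h, h+1, h+4, h+9, ...):

#define TABLE_SIZE 9
#define EMPTY     (-1)

int qp_find( int table[], int key )
{
    int i, pos;
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        pos = ( key % TABLE_SIZE + i * i ) % TABLE_SIZE;   /* h + i*i */
        if( table[pos] == key )
            return pos;
        if( table[pos] == EMPTY )
            return -1;
    }
    return -1;
}

Note that with a non-prime table size the quadratic sequence may not visit every slot; in practice the table size is chosen to be prime and the table is kept at most half full.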

(iii) Double hashing

Like all other forms of open addressing, double hashing becomes linear as the hash table approaches maximum capacity; the only solution to this is to rehash to a larger size. Double hashing uses the idea of applying a second hash function to the key when a collision occurs. The result of the second hash function is the number of positions from the point of collision at which to insert.

There are a couple of requirements for the second function:

  • it must never evaluate to 0
  • must make sure that all cells can be probed

In practice, though, a poorly chosen secondary hash function can evaluate to zero. For example, if we choose k = 5 with the following function:

hash2(key) = key % 5

then for any key that is a multiple of 5 the probe interval is 0, and the resulting sequence will always remain at the initial hash value. One possible solution is to change the secondary hash function to:

hash2(key) = 5 - (key % 5)

This ensures that the secondary hash function will always be non-zero.
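A sketch of a search using double hashing with the secondary function above (a prime table size of 11 is assumed here so that every probe step in 1..5 is co-prime with the table size and all cells can be probed):

#define DH_SIZE 11                     /* prime, so all cells can be probed */
#define EMPTY   (-1)

int dh_find( int table[], int key )
{
    int i, pos, step;
    step = 5 - ( key % 5 );            /* hash2: always in 1..5, never 0 */
    pos  = key % DH_SIZE;
    for( i = 0; i < DH_SIZE; i++ )
    {
        if( table[pos] == key )
            return pos;
        if( table[pos] == EMPTY )
            return -1;
        pos = (pos + step) % DH_SIZE;  /* advance by the second hash value */
    }
    return -1;
}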

Rehashing

Once the hash table gets too full, the running time of the operations will start to take too long, and insertions may fail. To solve this problem, a table at least twice the size of the original is built and the elements are transferred to the new table. The new size of the hash table:

  • should also be prime
  • will be used to calculate the new insertion spot (hence the name rehashing)

This is a very expensive operation! O(N) since there are N elements to rehash and the table size is roughly 2N. This is ok though since it doesn't happen that often.

Rehashing is applied when:

  • once the table becomes half full
  • once an insertion fails
  • once a specific load factor has been reached, where load factor is the ratio of the number of elements in the hash table to the table size
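A sketch of rehashing an open-addressing integer table into a prime-sized table at least twice as large (next_prime and probe_insert are illustrative helpers; EMPTY and DELETED as in the linear-probing sketch above):

#include <stdlib.h>

#define EMPTY   (-1)
#define DELETED (-2)

static int next_prime( int n )                 /* smallest prime >= n */
{
    int i;
    if( n <= 2 )
        return 2;
    for( ; ; n++ )
    {
        for( i = 2; i * i <= n; i++ )
            if( n % i == 0 )
                break;
        if( i * i > n )
            return n;
    }
}

static void probe_insert( int table[], int size, int key )
{
    int pos = key % size;
    while( table[pos] != EMPTY )               /* fresh table: no DELETED slots */
        pos = (pos + 1) % size;
    table[pos] = key;
}

/* Build a table at least twice the size and reinsert every live key: O(N). */
int *rehash( int *old_table, int old_size, int *new_size )
{
    int i, *new_table;

    *new_size = next_prime( 2 * old_size );    /* the new size should be prime too */
    new_table = malloc( *new_size * sizeof(int) );
    for( i = 0; i < *new_size; i++ )
        new_table[i] = EMPTY;

    for( i = 0; i < old_size; i++ )            /* skip EMPTY and DELETED slots */
        if( old_table[i] != EMPTY && old_table[i] != DELETED )
            probe_insert( new_table, *new_size, old_table[i] );

    free( old_table );
    return new_table;
}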

Deletion from a Hash Table

It is very difficult to delete an item from a hash table that uses rehashes for search and insertion. Suppose that a record r is placed at some specific location. We want to insert some other record r1 at the same location. We will have to insert the record in the next empty location after the specified original location. Suppose that the record r, which was there at the specified location, is then deleted.

Now, when we search for the record r1, the location where r used to be is empty, so the search will erroneously conclude that the record r1 is absent from the table. One possible solution to this problem is to mark the deleted record "deleted" rather than "empty" and to continue the search whenever a "deleted" position is encountered. But this is practical only when there are few deletions; otherwise an unsuccessful search will have to scan the entire table, since most of the positions will be marked "deleted" rather than "empty".
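A sketch of this lazy deletion for the linear-probing table above (the slot is marked DELETED, value -2, rather than EMPTY, so later searches keep probing past it):

#define TABLE_SIZE 9
#define EMPTY     (-1)
#define DELETED   (-2)

/* Returns 1 if key was found and marked deleted, 0 otherwise. */
int lp_delete( int table[], int key )
{
    int i, pos = key % TABLE_SIZE;
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        if( table[pos] == key )
        {
            table[pos] = DELETED;    /* not EMPTY: later records may lie beyond */
            return 1;
        }
        if( table[pos] == EMPTY )
            return 0;                /* key was never stored */
        pos = (pos + 1) % TABLE_SIZE;
    }
    return 0;
}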

Extendible hashing

Extendible hashing is a type of hash system which treats a hash as a bit string and uses a trie for bucket lookup. Because of the hierarchical nature of the system, re-hashing is an incremental operation (done one bucket at a time, as needed). This means that time-sensitive applications are less affected by table growth than by standard full-table rehashes.

Example

Assume that the hash function h(k) returns a binary number. The first i bits of each string will be used as indices to figure out where they will go in the "directory" (hash table). Additionally, i is the smallest number such that the first i bits of all keys are different.

Keys to be used:

h(k1) = 100100
h(k2) = 010110
h(k3) = 110110

Let's assume that for this particular example the bucket size is 1. The first two keys to be inserted, k1 and k2, can be distinguished by the most significant bit, so directory entry 0 points to the bucket holding k2 and entry 1 to the bucket holding k1.

Now, if k3 were to be hashed into the table, one bit would no longer be enough to distinguish all three keys (because k3 and k1 both have 1 as their leftmost bit), and because the bucket size is one, the table would overflow. Because comparing the first two most significant bits gives each key a unique location, the directory size is doubled to four entries: 00, 01, 10 and 11.

And so now k1 and k3 have a unique location, being distinguished by the first two leftmost bits. Because k2 is in the top half of the table, both 00 and 01 point to it because there is no other key to compare to that begins with a 0.

h(k4) = 011110

Now, k4 needs to be inserted, and it has the first two bits as 01..(1110), and using a 2 bit depth in the directory, this maps from 01 to Bucket A. Bucket A is full (max size 1), so it must be split; because there is more than one pointer to Bucket A, there is no need to increase the directory size.

What is needed is information about:

  1. The key size that maps the directory (the global depth), and
  2. The key size that has previously mapped the bucket (the local depth)

In order to distinguish the two action cases:

  1. Doubling the directory when a bucket becomes full
  2. Creating a new bucket, and re-distributing the entries between the old and the new bucket

Examining the initial case of an extendible hash structure, if each directory entry points to one bucket, then the local depth should be equal to the global depth. The number of directory entries is equal to 2^(global depth), and the initial number of buckets is equal to 2^(local depth). Thus if global depth = local depth = 0, then 2^0 = 1, so the initial directory is one pointer to one bucket.
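A sketch of the data structures involved, matching the example's 6-bit hash values and bucket size 1 (all names are illustrative):

#define HASH_BITS  6                 /* h(k) values are 6-bit strings here */
#define BUCKET_CAP 1                 /* bucket size 1, as in the example */

struct bucket
{
    int      local_depth;            /* bits used by the keys stored here */
    int      count;
    unsigned keys[BUCKET_CAP];
};

struct directory
{
    int            global_depth;     /* 2^global_depth directory entries */
    struct bucket **entries;         /* several entries may share a bucket */
};

/* Directory index of a hash value: its global_depth most significant bits. */
int dir_index( struct directory *d, unsigned h )
{
    return (int)( h >> (HASH_BITS - d->global_depth) );
}

/* On overflow: double the directory if local depth == global depth (case 1),
   otherwise split the bucket and redistribute its keys (case 2). */
int must_double( struct directory *d, struct bucket *b )
{
    return b->local_depth == d->global_depth;
}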

Back to the two action cases:

If the local depth is equal to the global depth, then there is only one pointer to the bucket, and no other directory pointers can map to the bucket, so the directory must be doubled (case 1).

If the bucket is full and the local depth is less than the global depth, then there exists more than one pointer from the directory to the bucket, and the bucket can be split (case 2).

Key 01 points to Bucket A, and Bucket A's local depth of 1 is less than the directory's global depth of 2. This means keys hashed to Bucket A have only used a 1-bit prefix (i.e. 0), and the bucket needs to have its contents split using keys 1 + 1 = 2 bits in length. In general, for any local depth d less than the global depth D, d must be incremented after a bucket split, and the new d is used as the number of bits of each entry's key when redistributing the entries of the former bucket into the new buckets.

Now, h(k4) = 011110 is tried again, with 2 bits 01.., and now key 01 points to a new bucket, but k2 is still in it (h(k2) = 010110 also begins with 01). If k2 had been 000110, with key 00, there would have been no problem, because k2 would have remained in the new bucket A' and bucket D would have been empty.

So Bucket D needs to be split, but a check of its local depth, which is 2, shows it is the same as the global depth, which is 2, so the directory must be doubled again in order to hold keys of sufficient detail, e.g. 3 bits.

  1. Bucket D needs to split because it is full.
  2. As D's local depth equals the global depth, the directory must double to increase the bit detail of keys.
  3. The global depth is incremented to 3 after the directory doubles.
  4. The new entry k4 is re-keyed with global depth 3 bits and ends up in D, which has local depth 2; this can now be incremented to 3, and D can be split into D' and E.
  5. The contents of the split bucket D, namely k2, are re-keyed with 3 bits, and k2 ends up in D'.
  6. k4 is retried and ends up in E, which has a spare slot.

Now, h(k2) = 010110 is in D, and h(k4) = 011110 is tried again with 3 bits, 011..; it points to bucket D, which already contains k2 and so is full. D's local depth is 2, but the global depth is now 3 after the directory doubling, so D can be split into buckets D' and E. The contents of D, k2, have h(k2) retried with the new global-depth bitmask of 3 bits, and k2 ends up in D'. Then the new entry k4 is retried with h(k4) bitmasked using the new global-depth bit count of 3, giving 011, which now points to the new, empty bucket E. So k4 goes in bucket E.

2. Sets

The disjoint-set data structure is an efficient data structure to solve the equivalence problem. Each routine requires only a few lines of code, and a simple array can be used. The implementation is also extremely fast, requiring constant average time per operation.

1. Equivalence Relations

A relation R is defined on a set S if for every pair of elements (a, b), a, b ∈ S, a R b is either true or false. If a R b is true, then we say that a is related to b. An equivalence relation is a relation R that satisfies three properties:

1. (Reflexive) a R a, for all a ∈ S.

2. (Symmetric) a R b if and only if b R a.

3. (Transitive) a R b and b R c implies that a R c.

Examples.

1. The relation ≤ is not an equivalence relation. Although it is reflexive, since a ≤ a, and transitive, since a ≤ b and b ≤ c implies a ≤ c, it is not symmetric, since a ≤ b does not imply b ≤ a.

2. Electrical connectivity, where all connections are by metal wires, is an equivalence relation.

The relation is clearly reflexive, as any component is connected to itself. If a is electrically connected to b, then b must be electrically connected to a, so the relation is symmetric. Finally, if a is connected to b and b is connected to c, then a is connected to c. Thus electrical connectivity is an equivalence relation.

3. Two cities are related if they are in the same country. It is easily verified that this is an equivalence relation. Suppose town a is related to b if it is possible to travel from a to b by taking roads. This relation is an equivalence relation if all the roads are two-way.

2. The Dynamic Equivalence Problem

The input is initially a collection of n sets, each with one element. This initial representation is that all relations (except reflexive relations) are false. Each set has a different element, so that Si ∩ Sj = ∅; this makes the sets disjoint. There are two permissible operations.

The first is find, which returns the name of the set (that is, the equivalence class) containing a given element.

The second operation adds relations. If we want to add the relation a ~ b, then we first see if a and b are already related. This is done by performing finds on both a and b and checking whether they are in the same equivalence class. If they are not, then we apply union. This operation merges the two equivalence classes containing a and b into a new equivalence class. From a set point of view, the result of ∪ is to create a new set Sk = Si ∪ Sj, destroying the originals and preserving the disjointness of all the sets. The algorithm to do this is frequently known as the disjoint set union/find algorithm for this reason.

This algorithm is dynamic because, during the course of the algorithm, the sets can change via the union operation. The algorithm must also operate on-line: when a find is performed, it must give an answer before continuing. Another possibility would be an off-line algorithm. Such an algorithm would be allowed to see the entire sequence of unions and finds. The answer it provides for each find must still be consistent with all the unions that were performed up until the find, but the algorithm can give all its answers after it has seen all the questions. The difference is similar to that between a written exam (which is generally off-line, since you only have to give the answers before time expires) and an oral exam (which is on-line, because you must answer the current question before proceeding to the next question).
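A minimal array-based sketch of the union/find structure (illustrative names; this is the simple version, without the union-by-size or path-compression improvements):

#define NUM_ELEMENTS 10

int parent[NUM_ELEMENTS];            /* parent[i] < 0 marks i as a set name */

void initialize( void )
{
    int i;
    for( i = 0; i < NUM_ELEMENTS; i++ )
        parent[i] = -1;              /* n singleton sets: all relations false */
}

/* find: return the name (root) of the set containing x */
int find( int x )
{
    while( parent[x] >= 0 )
        x = parent[x];
    return x;
}

/* union ("union" is a C keyword, hence set_union): merge the sets
   containing a and b if they are not already related */
void set_union( int a, int b )
{
    int ra = find( a ), rb = find( b );
    if( ra != rb )
        parent[rb] = ra;             /* Sk = Si U Sj: rb's tree joins ra's */
}

To add the relation a ~ b on-line, one performs find(a) and find(b); if the names differ, set_union(a, b) merges the two equivalence classes.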