Indexing and Hashing

Indexing

1.Introduction

Most queries require only a small fraction of the rows of a table(s). It will be very inefficient if we need to examine every row of the table(s) to determine the desired data. To facilitate the retrieval of rows, commonly we design additional access structures called indices.

There are two types of indices:

(a)ordered indices: based on a sorted ordering of the values

(b)hash indices: based on values distributed across a number of buckets. The bucket to which a value is assigned is determined by a function called a hashfunction.

An index is constructed using the values in one or more columns (or attributes) of a table. This attribute or set of attributes is called a search key. Note that unlike the primary key or candidate key, values of a search key may not be unique. An index allows us to identify those rows in a table with a given search-key value(s) quickly. The columns (or attributes) of a search key are also called indexing fields.

2.Ordered Indices

Two kinds of ordered indices:

(i)primary index : rows in file are sequentially ordered according to the search key values; also known as clustering index

(ii)secondary index: rows not ordered in search key order; also known as nonclustering index.

An index record in an index file has two fields: asearch-key field and a pointer field

An index is said to be dense if it has an index record for every search-key value in the table.

An index is said to be sparse if it has index records for only some of the search-key values appearing in the table.

2.1 Primary Index

The following figure shows a dense index for table account(branchName, accountNo, balance). The search key is branchName.

The following figure shows a sparse index for the same account table with the same search key branchName.

For a dense index, use binary search to locate the index record for a given search-key value. Then, follow the pointer in the index record to find the row(s) having the search-key value.

For a sparse index, use binary search to locate the index record with largest search-key value less than or equal to the given search-key value. Then, follow the pointer to the row in table and starting from that row, read the rows sequentially until the row with the given search-key value is found or a row with a higher search-key value is reached in which case no row in the table has the given search-key value.

Note that data is moved from disk to main memory a block (or page) at a time. For a sparse index, typically we have an index record for each block (or page); and the search-key value of an index record is the search-key value of the first record in the associated block. This first record is called the anchor record of the block.

A dense index has an index record for each search-key value. Thus, for a dense index, we can determine whether a given search-key value is in the table or not by simply checking the index alone. This cannot be done with a sparse index. As in general, an index record is much smaller than a row in a table, it is faster using a dense index to check whether a given search-key value is in the table.

2.2 Secondary Indices

A secondary index must be dense. This is because the rows are not ordered on the search-key of the secondary index.

If the search-key is a candidate key, each search key value is associated with only one row in the table. The pointer of each index record points directly to the row with the search-key value associated with the index record. Figure I (p.3) shows a secondary index on the table account(branchName, accountNo, balance). The table is ordered on branchName. The accountNo is a candidate key.

If the search-key is not a candidate key, multiple rows may have the same search-key value. Unlike the primary-index, we cannot just keep a pointer to the first row with the associated search-key value. This is because rows with identical search-key values may not be stored consecutively. In this case, we need an extra level of indirection. The following figure shows the structure of such a secondary index. The table account is now ordered on the accountNo. The search-key is branchName, which is not a candidate key. The pointer in each index record points to a bucket of pointers, each of which points to a row with the search-key value associated with the index record. For example, the first index record points to a bucket with three pointers, each pointing to a row with search-key value equal to 'Everygreen'.

3. Multilevel Indices

If the number of index records is very big, the cost of using the index can be large.

Note that if an index occupies b blocks of disk space, a binary search for a given search-key value requires log2b disk block reads.

Suppose that an index has 10,000 index records and 100 index records fit on a block. Then, the index needs 100 blocks of disk space. To do a binary search for a search-key value requires log2 100 or 7 block reads. Assuming a block read takes 30 milliseconds, the search will take 210 milliseconds.

To speed up the searching, a common solution is to construct an index on the index file if the index file gets too large. Typically, the index on the index is sparse. It contains an index record for each block of index records of the lower level index. If even this index becomes too big, we can construct another level of index. An index with two or more levels is called amultilevel index.

For the above-mentioned example, if we construct an index on the index (with 100 blocks of index records), it will have 100 index records, which fit on one block. Thus, to locate a row with a given search-key value, we

read in the block for the outer index
do a binary search to find the block containing the desired index record
read in the block found in the previous step
do a binary search to locate the index record for the row(s) with the given search-key value

This takes two disk block reads (versus 7 disk block reads if single level index is used). The following figure shows an example of a two-level sparse index.

4. B+ Tree Index

A B+ tree index is a balanced tree. Every path from the root to a leaf of the tree is of the same length. B+tree index structure is very commonly used, despite additional space requirement and performance overhead on insertion and deletion.

Each non-leaf node, other than the root, has between n/2 and n children, n being fixed for a particular B+ tree. If there are S search-key values, the path from root to a leaf is no longer than logn/2 S . The following figure shows a node of a B+tree.

In the node shown above, K1, K2,, …, Km-1 are (m - 1) search-key values, and P1, P2,, …, Pm are m pointers. The search-key values in a node are sorted, i.e., K1 < K2, < … < Km-1.

If the node is a leaf node, for j = 1, 2, …, m- 1, the pointer Pj points to either a row with search-key value Kj or to a bucket of pointers, each of which points to a row with search-key value Kj ; and Pm points to its sibling leaf node with larger search-key values. The following shows two leaf nodes of a B+tree.

Each leaf node can hold up to n-1 search-key values and contain as few as (n-1)/2 values, i.e., (n-1)/2  m-1  n-1 in the first figure. The ranges of values in the leaf nodes do not overlap.

Except for the root node, a non-leaf node of a B+ tree may hold up to n pointers and must hold at least n/2 pointers. (This value of n may or may not be the same value of n for a leaf node.) The root node may hold fewer than n/2 pointers.

For j = 2, 3, …, m-1, pointer Pj points to the subtree that contains search-key values less than Kj and bigger than or equal to Kj-1

Pointer Pm points to the part of the subtree that contains those search-key values bigger than or equal to Km-1(and less than the search-key values of any right sibling node)

Pointer P1 points to the part of the subtree that contains those search-key values less than K1 (and bigger than the search-key values of any left sibling node).

Note that Kj-1 is the minimum of the search-key values contained in the subtree pointed to by Pj (i.e., the search-key value of the left-most index record of the left-most leaf node of this subtree.)

The following is an example of a B+tree. The value of n for this B+tree is 3. For simplicity, the pointers to the rows of the account table have been omitted.

Notice that for example, Fordham in the root is the minimum of all search-key values in the subtree pointed to by the left pointer.

Finding records with a search-key value s

(i) Examine the root node for the smallest search-key value bigger than s.

(ii)If such a search-key is found in step (i), let Kj be this search-key. Then, follow the pointer Pj to the child node.

(iii)If no such search key is found in step (i) (i.e., s is bigger than or equal to all search-key values in the root node), then follow the last pointer Pm to the child node.

(iv)Repeat the above steps for the child node recursively until the leaf node containing the index record for the search-key value s is found.

(v) Search the leaf node found in step (iv) for the index record. If found, follow the pointer to get the desired records (or rows).

Note: Suppose a node of a B+tree has n nodes. If there are S search-key values, the path from the root node to a leaf node is no longer than logn/2 S .

Commonly, a node has the same size as a disk block, which is typically of size 4 K. Assume that a search-key has size 12 bytes and a disk-pointer has size 8 bytes. Each node can store around 200 index records (i.e., n = 200). Suppose we have 1,000,000 search-key values. With n = 200, the depth of the B+tree is log 200 1000000 = 3. That is, to lookup an index record with the desired search-key value needs 3 disk-block reads.

Inserting an index record into a B+tree

(i) Find the leaf node for storing the index record to be inserted.

(ii)If the leaf node is not full, insert the index record into the node in the proper position
(the index records need to be ordered on the search-key)

(iii)If the leaf node is full, then

(a) create a new node;

(b) put the first n/2 index records in the existing node and the remaining index records in the new node (usually the right sibling of the existing node). Thus, the new leaf node will have n n/2 = (n-1)/2 index records.

(c)insert a new index entry in the parent node and update its entries to reflect the changes in its child nodes. If the parent node is already full, then
--create a new node,
-- divide the pointers between the parent node and the new node in a fashion
similar to step (b),and
-- update the search-key values in the index entries.

This process may be propagated to the root node, in which case the root node may be split into 2 nodes and a new root node be created.

(iv)(Aside: if the parent node is full, then it must have (n-1) search-key values and n pointers. With the newly added index entry, before splitting, the parent node must have n search-keys and (n+1) pointers. Thus, after splitting the pointers, the parent node will have (n+1)/2 pointers and the new node will have (n+1)  (n+1)/2 = n/2 pointers.)

The following shows how the B+tree given on p.6 is constructed starting with a full node with two search-key values: Fordham and Mayflower. Note that n = 3 .

After insertion of the Dalton record and creation of a root node:

Just after insertion of the Binghamton record,

After splitting the records and update the root node:

Just after insertion of the Evergreen record,

After splitting the records and updating the root node,

After splitting the root and creating a new root,

Deleting an index record from a B+tree

(i)Find the index record to be deleted.

(ii)If the leaf node becomes empty after step (i), remove the leaf node and adjust the chain pointer of its left sibling. Then, check the siblings of the parent node of the deleted leaf node. Search-key entries and pointers in these nodes may be redistributed or the parent node may be combined with one of its sibling.

Consider the following B+tree.

After the deletion of the index record for Branch Fordham, the leaf node becomes empty.

Redistribute the search-key entries and the pointers in the parent node of the deleted leaf node and its sibling node.

Finally, updating the search-key entry in the root node:

Immediately after the deletion of index record for search-key value Evergreen, the associated leaf node becomes empty and hence, it is also deleted.

However, the parent node of the deleted leaf node and its sibling do not have enough pointers for redistribution. Hence, they are merged into a single node and their parent node (the root node) is deleted.

The first search-key value of the new node is Evergreen. In the subtree pointed to by the pointer to the right of this value Evergreen, the minimum search-key value is Fairfax. Thus, the Evergreen search-key value is changed to Fairfax.

B-tree

A B-tree index is similar to a B+tree index. Note that in a B+tree, some search key valuesappear twice. In the above B+tree, for example, the search key Mayflower occurs once in the root node and once in the rightmost leaf node. In the B-tree, no search key value is stored more than once. It achieves this by including an extra pointer field for each search key in a non-leaf node. Below is an example of a B-tree.

Evidently, B-tree requires less space than the B+tree does (by not storing any search key more than once). Also, in a look-up for asearch key value appearing in non-leaf nodes, the number of nodes needs to be accessed is smaller for a B-tree.

However, because more pointers are stored in a non-leaf node of a B-tree, the number of search key values in a non-leaf node is less than the number of search key values in a non-leaf node ofa B+tree. (Typically, a node is the size of a block.) Thus, a B-tree index may be deeper than a B+tree. That is, for a search key value in the leaf node of a B-tree, the lookup time may be higher than in the case of a B+tree. Furthermore, deletion in a B-tree is much more complicatedthan that in a B+tree. Thus, a B+tree is more commonly used.