2.Physical Database Design

Assume a relation R(A, B, C) is given; R is stored as an unordered file (assume heap blocks are 100% full) and contains 1000000 (1 million) tuples. Attributes A, B and C need 4 byte of storage each, and blocks have a size of 4096 Byte. Attribute A is a candidate key. Moreover, we assume that static hashing is used to implement index structures, and that index pointers require 4 byte of storage; furthermore, you can assume that pages of index blocks are 80% full and do not contain any overflow pages; each B-value occurs at an average 10000 times in the database and each C value occurs 10 times in the database. What index structures would you create to speed up the following 4 SQL statements (value is a placeholder for an integer constant)?

S1 / S2 / S3 / S4
Select A, C
from R
where B = value / Delete
from R
where C = value / Select sum (R.A)
from R / Insert
into R
values (3,4,5)
returns 10,000 answers / deletes 10 tuples / returns one answer / inserts 1 tuple

Describe which index structures you would create (justify your design!) trying to minimize the sum of the cost of executing the four sql-statements; indicate for each statement how you would implement it and give the cost for your chosen implementation plan (in number of block accesses including the formula you used to compute the cost).

Design:

Overview: Assuming these four statements occur with equal frequency, an index on A and an index on C should be created

Cost Evaluation:

Background calculations:

Bytes per tuple = (3 attributes / tuple) x (4 Bytes / attribute) = 12 Bytes / tuple

Tuples per block =  (4096 Bytes / block) x (1 tuple / 12 Bytes)  = 341 tuples / block

Total number of blocks = 1,000,000 tuples x (1 block / 341 tuples)  = 2930 blocks for R

Size of Single Attribute Index: 106x8/(0.8x4096)=2442 blocks

Size of Two Attribute Index: 106x12/(0.8x4096)=3633 blocks

Size of Three Attribute Index: 4883 blocks

Although this statement is an equality search on B, an index on B does not help. A hash index on B would return 10,000 block references, whereas there are only 2933 blocks in the entire heap! Furthermore, a hash index on B cannot be created without violating the assumption that the hash index does not create overflow pages (one 80% full index block can only hold 819 entries, but there are 10,000 entries for each B value).

Cost:

Reads: It would be best to scan the heap directly - 2933 blocks.

Writes: None (assuming result isn't written to disk).

Because of the relatively low number of tuples per C value, an index on C could be employed. Note that after the deletions, the A index must be updated as well.

Cost:

Reads: 1 (index on C block) + 10 (heap blocks) + 10 (index on A blocks) = 21 blocks

Writes: to enact deletes, all blocks that were read (index and heap) must also be modified and written back to disk (21 blocks)

Assuming the DBMS supports index-only evaluation of statements, this statement could be executed most efficiently using an index on A:

Cost:

Reads: 2442 blocks, as explained above.

Writes: None (assuming result isn't written to disk).

To insert a tuple onto the heap, any not-full block of the heap is accessed and written to (or a new block is added to the heap if all blocks are full). The two indexes must be updated to reflect insertion.

Cost:

Reads: 1 (not-full heap block) + 1 (index on A block) + 1 (index on B block) = 3 blocks

Writes: 1 (heap block) + 1 (index on A block) + 1 (index on B block) = 3 blocks

Cost Summary:

S1: 2930 reads

S2: 21 reads and 21 writes

S3: 2442 reads

S4: 3 reads and 3 writes

Total: 5396 reads and 24 writes= 5420 block accesses

3. B+ Trees

a) Assume an entry is deleted from a B+-tree of height 2 (that is, it has a root an intermediate layer and a leaf layer). Also assume with each node of the B+-tree corresponds to one block on the disk. How many different blocks have to be accessed (using read or write) in the worst case to perform this deletion? Give reasons for your answer!

Assumptions:

1. ) the data records themselves (not just pointers to them) are stored at the leaf nodes

2.) A non-root node only examines one of its siblings when it has less than the required number of entries

Answer:

The worst case to perform the deletion is the following scenario:

After deleting on entry in leaf node and intermediate level, underflow occurs at leaf level and intermediate level, merging 2 nodes at leaf level, merging 2 nodes of intermediate level and root node as new root. For example, deleting “24” in the following B+ tree:

Number of different blocks accesses calculation:

Number of different blocks access to finding the leaf node which has the entry need to be deleted and delete this entry: Height +1 = 2+1 = 3

After underflow occurs at leaf level, merge two leaf nodes in main memory and write back to the one of the previous data blocks.

Underflow occurs at intermediate level, merging 2 intermediate nodes (one of them was already visited before) and root nodes and writing back as new root node needs 2 different block accesses

Total number of different blocks access needed: 3+2 = 5

b) B+-trees are very popular for implementing index structures. Why is this the case? What properties of B+ trees make them a good choice for implementing index structures?

B+ trees are popular for implementing index structures because

1.)They allow entries to be found relatively quickly - the number of blocks that must be accessed to find an entry is equal to the height of the tree plus one, and the height of the tree is almost never more than six.

2.)They support ranged searches - hash indexes do not

3.)Insertions and deletions are inexpensive on average - most insertions and deletions do not change the structure of the tree and require only the number of block accesses required to navigate the tree plus one block write. Even in the worst case of tree structure change propagation all the way to the root, only a handful of extra blocks are accessed: (O(logF(N)=O(height-of-the-B+-tree).

4.) B+ trees are self reorganizing data structures that adapt their height and structure based on the number of entries stored.

c) Assume the tuples of a relation R(A,B,C) are stored in a B+-tree using attribute C as the search key with m=400 and p=300 of height 3 (one root, 2 more intermediate layer, one leaf layer) with each node of the tree corresponding to one block on the disk. Moreover, the following query Q1 is given that returns 2000 answers:

Select A

from R

where C < 5523 and C > 2345

How would you implement Q1 trying to take advantage of the B+-tree? Also compute the number of block accesses of the best implementation of the query as precisely as possible, and explain your computations!

Assumption: Leaf nodes are assume to be full containing m entries (if they would be only 2/3 full, 8 instead of 5 leaf nodes would have to be accessed)

Implementation: navigate the B+ tree to find the only leaf node that could possibly contain the search key C = 2346. Search this leaf node until a record with C 2346 is found. Starting with this record, progress through leaf nodes by following pointers to next sibling, recording A values of all records until a record with C 5523 is found[1].

Block accesses: 2000 answers requires that at least 2000 records x (1 leaf node/400 records) = 5 leaf nodes are accessed. However, it is very unlikely that the range of answers will start with the first record in the first leaf node, so it is assumed that 6 leaf nodes are accessed. Thus the cost are:

3 (navigation of tree to first leaf node (not including leaf node)) + 6 (leaf nodes) = 9 blocks

d) Assume that the following B+-tree with p=5 and m=3 is given. Furthermore, assume that the keys 1, 21, 24, 99, 34, 35 are deleted in the indicated order. Show how the tree looks like after each deletion. [3]

Deleting 1

Deleting 21

Deleting 24

Deleting 99

Deleting 34

Deleing 35

1.1.

[1] Be careful; entries inside a block are not necessarily sorted; however, if you see a key higher than 5523 you know that you can stop reading any additional leaf nodes.