Used Where We Have More Data Than Can We Can Fit in Main Memory

B-Trees

A B-tree is a balanced search tree designed to work well on magnetic disks or other direct-access secondary storage devices.

Used where we have more data than can we can fit in main memory

The objective of using a B-tree is to minimize the number of disk accesses.

Example

/ Issues:
Revolutions per second
Time it takes to move the read/write head.
Placement of data on the disk.

Machine executes 25,000,000 instructions per second

On a disk that spins at 7200 RPM, we can do 240 disk accesses per second.

The number of disk accesses is determined by the size of the page that needs to be read or written to disk.

If a page contains 4096 bytes, then we can read 240/4096=983040 bytes per second.

Assume that each page represents the income tax return for one person for one year. Also assume that 20,000,000 people submit income tax returns in a year. It will take approximately 83333 seconds to read all the data (i.e. 23 hours) to find the last page.

If the data in the file was stored as an AVL tree, it would still require 24.3 disk accesses on average (log2 20,000,000).

It would be nice if we could utilize more instructions and fewer disk accesses (1 disk access = 100,000 instructions).

Since the typical AVL tree is close to optimal height, we can’t go below approximately disk accesses using a binary search tree.

Our only alternative is to increase the number of branches at each node b. No longer a binary search tree but an K-ary search tree, where K is the number of branches at each node.

Whereas the height of an AVL tree is approximately , the height of a balanced K-ary search tree is approximately .

Example

A B-tree of order is a K-ary tree having the following properties (these properties describe one variant of a B-tree).

Every node contains the following information:

The number of keys currently stored in the node.

The keys stored such that key1 < key2 < ... < keym-1

A boolean value that is TRUE if the node is the root, and FALSE otherwise.

A boolean value that is TRUE if the node is a leaf, and FALSE otherwise.

If the node is not a leaf, it contains up to m pointers to its children. Otherwise the pointers are undefined (i.e. leaves have no children, of course).

The keys separate the ranges of keys stored in each sub-tree. If Ki is any key stored in a sub-tree, then

Key1≤ K1 < Key2 ≤ K2 < ... < Keym ≤ Km

Every leaf has the same depth, which is the tree’s height h.

All non-leaf nodes (except the root) must have between and m children.

All leaf nodes have between and L keys.

Note: For simplicity, we can assume here that m=L for the remainder of our discussion.

Attempting to insert a new value into a page may cause the page to be split.

Example 1 (splitting the root)

Example 2 (splitting a leaf)

Example 3 (splitting a non-leaf/non-root)

B-tree node structure

const int PAGE_SIZE = 5;
struct BTreeNode;
struct Keys
{
Object element;
BTreeNode *next;
}
struct BTreeNode
{
bool root;
bool leaf;
int n;
keys page[PAGE_SIZE];
};

Operations on B-trees

Create a new B-tree

Input:t, a pointer to a BTreeNode type.

Output:t, a pointer to the root node.

Create(t)

t = new BTreeNode
t->root = TRUE
t->leaf = TRUE
t->n = 0
for i = 0 to PAGE_SIZE-1
t->page[i].next = NULL

Simply initializes a root node for the B-tree.

Search for a value in the tree

Input:x, the value to find, and t, a pointer to the root node.

Output:TRUE, if x found, FALSE otherwise.

Find(x, t)

if t->leaf == TRUE
for i = 0 to t->n - 1
if t->page[i].element == x
return TRUE
return FALSE
else
i = t->n - 1
while i > 0 and x < t->page[i].element
i--;
return Find(x, t->page[i].next)