(a, b)-Trees
Aim: Allow efficient insertion, deletion, resizing of data in extremely large databases
Invariants
In an (a, b)-tree, a and b are the parameters, where a ≥ 2 and b ≥ 2a − 1.
Respectively, a and b refer to the minimum and maximum number of children (NOT keys) an internal (i.e., non-root, non-leaf) node can have.
| BST | (a, b)-tree |
| --- | --- |
| Each node has at most 2 children | Each node can have more than 2 children |
| Each node stores exactly 1 key | Each node can store multiple keys |
Every (a, b)-tree must satisfy the following 3 rules:
1. (a, b)-child Policy

| Node type | Min # keys | Max # keys | Min # children | Max # children |
| --- | --- | --- | --- | --- |
| Root | 1 | b − 1 | 2 | b |
| Internal | a − 1 | b − 1 | a | b |
| Leaf | a − 1 | b − 1 | 0 | 0 |
With the exception of leaves, note that the number of children is always one more than the number of keys.
2. Key Ranges
A non-leaf node (i.e., root or internal) must have one more child than its number of keys. This is to ensure that all value ranges due to its keys are covered in its subtrees. The permitted range of keys within a subtree is called its key range.
In particular, for a non-leaf node with k keys and k + 1 children,

- Its keys in sorted order are k_1 < k_2 < … < k_k
- The subtrees due to its keys are T_1, T_2, …, T_{k+1}

Then,

- First child T_1 has key range (−∞, k_1)
- Final child T_{k+1} has key range (k_k, ∞)
- All other children T_i, where 1 < i < k + 1, have key range (k_{i−1}, k_i)
3. Leaf Depth
All leaf nodes must be at the same depth from the root.
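The three rules above can be checked mechanically. Below is a minimal sketch of an invariant checker, where a node is modelled as a plain dict with a sorted "keys" list and a "children" list (empty at leaves); these names are illustrative, not from the notes.

```python
def check(node, a, b, lo=float("-inf"), hi=float("inf"), is_root=True):
    """Return the leaf depth of the subtree if rules 1-3 hold, else fail."""
    assert all(lo < k < hi for k in node["keys"]), "rule 2: key out of range"
    if not node["children"]:                        # leaf node
        assert is_root or a - 1 <= len(node["keys"]) <= b - 1, "rule 1"
        return 0
    assert len(node["children"]) == len(node["keys"]) + 1, "children = keys + 1"
    low = 2 if is_root else a                       # root needs only 2 children
    assert low <= len(node["children"]) <= b, "rule 1: child policy"
    bounds = [lo] + node["keys"] + [hi]             # key ranges per subtree
    depths = {check(c, a, b, bounds[i], bounds[i + 1], False)
              for i, c in enumerate(node["children"])}
    assert len(depths) == 1, "rule 3: leaves at different depths"
    return depths.pop() + 1

# A small (2, 4)-tree that satisfies all three rules:
example = {"keys": [20], "children": [
    {"keys": [10], "children": []},
    {"keys": [30, 40], "children": []},
]}
```

For instance, `check(example, 2, 4)` succeeds and reports a leaf depth of 1.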
You should realize that a regular BST is just an (a, b)-tree with a = b = 2.
B-Trees
Firstly, it is important to note that the B in B-Tree does not refer to Binary. Actually, no one really knows what the B stands for - it could be Balanced?
B-trees are simply (a, 2a)-trees. That is, they are a subcategory of (a, b)-trees such that b = 2a. For instance, when a = 2, we have a (2, 4)-tree. (This is sometimes referred to as a 2-3-4 tree.) An example of a (2, 4)-tree is given below:
Are B-trees balanced?
Yes! This is because of Rule 3: all leaves must be at the same depth from the root. This ensures height-balance, which implies balance. In fact, any two sibling subtrees have exactly the same height. This is due to the fact that a B-tree grows upward from the leaves.
What are the minimum and maximum heights of an (a, b)-tree with n keys?
The minimum height occurs when every node has the maximum possible number of children (i.e., the tree is dense), so the minimum height is O(log_b n); the maximum height is O(log_a n).
Now that we have established that an (a, b)-tree is balanced and its height is O(log_a n), let us see how this helps us with searching, insertion and deletion.
Searching
What data structure should we use for storing the keys and children in a node?
One possible way is to use an array of (key, subtree) pairs as follows:
Total search cost?
Height of the tree is O(log_a n).
Binary search (woah, binary search to the rescue again!) for the key at every node takes O(log_2 b) time. (Note that b is a constant, and so we consider this as constant time.)
Hence, total cost: O(log_2 b · log_a n) = O(log n).
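The search procedure, a binary search within each node followed by a descent into the matching subtree, can be sketched as follows. Nodes are modelled as dicts with a sorted "keys" list and a "children" list (empty at leaves); the layout is an illustrative assumption.

```python
import bisect

def search(node, key):
    while True:
        i = bisect.bisect_left(node["keys"], key)    # O(log2 b) per node
        if i < len(node["keys"]) and node["keys"][i] == key:
            return True                              # found in this node
        if not node["children"]:                     # reached a leaf: absent
            return False
        node = node["children"][i]                   # descend into subtree i

# A small (2, 4)-tree for demonstration:
leaf = lambda ks: {"keys": ks, "children": []}
tree = {"keys": [20], "children": [leaf([10]), leaf([30, 40])]}
```

Here `search(tree, 30)` returns True, while `search(tree, 25)` returns False.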
Insertion
The key idea is that just like in BSTs, we only insert at leaves here. So, the idea is to navigate to a suitable leaf and add the new key to its key list.
For example, consider the insertion of the key 71 into the (2, 4)-tree below:
But now what happens when you insert 72? You cannot have 4 keys at a single node! Something needs to be done to solve this problem!
Insertion may violate Rule 1: Recall that rule 1 enforces that nodes can have at most b − 1 keys. Insertion may cause the leaf nodes to grow too large. We need some operation to handle such cases.
Idea: Redistribute out the keys!
How can we redistribute the keys?
Over sibling nodes?
No! We cannot guarantee that rule 1 is still satisfied (i.e., now the sibling may be too large)!
Over a new node?
Yes! We can always guarantee that the postcondition adheres to rule 1.
Split Operation (Key-redistribution)
Choose the median key in the overpopulated node.
Use that to split the keylist into 2 halves
Left half goes into a new node
Move median key to the parent node (what do we do when the root gets overpopulated? it has no parents! don’t worry, we will deal with this too)
Link the two nodes to the parent.
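The split steps above can be sketched as follows, with nodes modelled as dicts holding a sorted "keys" list and a "children" list (the dict layout and the helper name are illustrative assumptions, not from the notes).

```python
def split(parent, idx):
    """Split parent["children"][idx], which holds one key too many."""
    node = parent["children"][idx]
    m = len(node["keys"]) // 2                      # position of the median
    median = node["keys"][m]
    new = {"keys": node["keys"][:m],                # left half -> new node
           "children": node["children"][:m + 1]}
    node["keys"] = node["keys"][m + 1:]             # right half stays in place
    node["children"] = node["children"][m + 1:]
    parent["keys"].insert(idx, median)              # offer median to the parent
    parent["children"].insert(idx, new)             # link new node before old
```

For example, splitting an overpopulated (2, 4)-tree leaf holding [10, 20, 30, 40] moves 30 up into the parent and leaves siblings [10, 20] and [40].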
If we perform the split operation on the example tree above, the following steps are executed:
Why must the split operation “offer” one key to the parent?
After splitting, the parent will have one more child than before, therefore it must also have one more key. Taking the (median) key from the node to be split is convenient.
Will the split procedure ever end up with nodes too small (i.e., too few keys) such that they violate rule 1?
No. The overpopulated node has b keys, and after removing the median:
Keylist size of LHS is ⌈(b − 1)/2⌉
Keylist size of RHS is ⌊(b − 1)/2⌋
In both cases, since b ≥ 2a − 1, the size of the keylist is at least a − 1.
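A quick numeric check of this argument: for every b ≥ 2a − 1, splitting a node that holds b keys (one over the maximum of b − 1) leaves both halves with at least a − 1 keys once the median is promoted (the ranges swept below are arbitrary).

```python
for a in range(2, 20):
    for b in range(2 * a - 1, 60):
        remaining = b - 1                        # b keys minus promoted median
        lhs, rhs = (remaining + 1) // 2, remaining // 2
        assert lhs >= a - 1 and rhs >= a - 1, (a, b)
```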
But there’s another problem: what happens if the parent node becomes over-filled after one of its children is split? Umm... no big deal. Just perform split on the parent node then. So, during insertion, in the worst case, you might have to perform a split at every node on the root-to-leaf path on the way back from the recursive calls. But each split operation takes O(1) time (essentially, it is something like a rotation, just moving some pointers here and there), and so the total time of insertion remains O(log n).
What about when the root is overpopulated? How do you split the root?
Follow the same procedure as before
Instead of elevating the split key to the parent (there’s no parent for the root), make it the new root! (This is exactly why we allow the root to have as few as 1 key, and there’s no minimum condition on the number of keys in the root.)
In other words, if v was the old root that became overpopulated, then the following steps explain how to split a root node:

- Pick the median key k as the split key
- Split v into LHS and RHS using k
- Create a new node u
- Move the LHS split from v to u
- Create a new empty node w
- Insert k into w
- Promote w to be the new root node
- Assign u and v to be the left and right children of w
- Assign the previous subtree associated with k to be the final child of u
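The root-split steps can be sketched as follows; u, v and w mirror the description above, and nodes are dicts with "keys"/"children" lists (an illustrative layout, not from the notes).

```python
def split_root(v):
    m = len(v["keys"]) // 2
    k = v["keys"][m]                                  # median / split key
    u = {"keys": v["keys"][:m],                       # u takes the LHS keys
         "children": v["children"][:m + 1]}           # ...and k's old subtree
    v["keys"] = v["keys"][m + 1:]                     # v keeps the RHS
    v["children"] = v["children"][m + 1:]
    w = {"keys": [k], "children": [u, v]}             # new root holding only k
    return w                                          # tree grows one level up
```

For example, an overfilled leaf-root [10, 20, 30, 40] in a (2, 4)-tree becomes a new root [30] with children [10, 20] and [40].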
Will the split operation ever violate rule 3?
No! A split on node v will create a new node v′, so:

- If v is an internal node, v′ will be at the same level, and nothing is pushed down since v′ was split off from v
- If v is a leaf node, v′ will also be a leaf node at the same level
- If v is the root node, everything is pushed down 1 level, since a new root is created above (maintaining the invariant that all leaf nodes are at the same depth)
Unlike BSTs, which grow the leaves downwards, (a, b)-trees grow the root upwards! Therefore, leaves are always guaranteed to be on the same level as one another.
It should be pretty clear that the cost of insertion into an (a, b)-tree with n keys is O(log n), and the number of split operations that need to be performed during an insertion is also O(log n).
Deletion
We only explain deleting keys from leaves here. To delete a key from an internal node, swap it with its successor, and then delete it from the leaf.
We first find the key. Then we delete it from the keylist. But wait... what if the number of keys in the node falls below a − 1?
Deletion may cause internal nodes to shrink too small (and hence, violate rule 1).
Idea: Join with an adjacent sibling!
Merge Operation
Let us explain this with an example. Consider a (2, 4)-tree in which you wanted to delete the element with key = 30. That would go as follows:
Merge is intuitively the reverse operation of split. In split, we promote the median key to form the LHS and RHS nodes. In merge, we demote the key in the parent separating the LHS and RHS to join them together.
Why can’t we just join the 2 children directly? Why do we need to bring down 1 key from the parent?
For the same reason as in split: after merging, the parent will have 1 less child than before; therefore, it must also have one less key than before. Moving that key into the newly merged node is convenient.
Can a merged node be too large such that it violates rule 1?
Yes, but that is not a problem, because we can then perform the split operation on that node to make it normal. FYI, when merge is followed by split, we call that a share operation. In summary, share = merge + split.
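The merge step can be sketched as follows, again with nodes as dicts holding "keys"/"children" lists (the layout and helper name are illustrative assumptions).

```python
def merge(parent, idx):
    """Merge parent["children"][idx] with its right sibling."""
    left = parent["children"][idx]
    right = parent["children"].pop(idx + 1)        # right node is discarded
    sep = parent["keys"].pop(idx)                  # demote the separating key
    left["keys"] += [sep] + right["keys"]          # left + separator + right
    left["children"] += right["children"]
    return left   # if it now violates rule 1, split it (merge + split = share)
```

For example, merging children [10] and [30] under separator 20 yields a single node [10, 20, 30], and the parent loses one key and one child.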
What about deleting a key from an internal node or root?
Think about AVL trees. Swap the key to be deleted with its predecessor/successor, which will always be in a leaf node. (This is critical: since we only explained deleting keys from leaf nodes, it would be a huge problem if the predecessor/successor of a key in an internal node were not in a leaf node, as our algorithm would then be unable to delete keys from internal nodes.) The predecessor is the right-most key in the left subtree, and the successor is the left-most key in the right subtree.
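Locating the successor of the i-th key of a non-leaf node can be sketched as follows: take the left-most key of subtree i + 1, which always sits in a leaf (dict-based nodes; names are illustrative).

```python
def successor_leaf(node, i):
    cur = node["children"][i + 1]      # right subtree of the i-th key
    while cur["children"]:             # walk left until a leaf is reached
        cur = cur["children"][0]
    return cur                         # swap keys[i] with cur["keys"][0]

# Example: the successor of key 20 at the root below is 30, found in a leaf.
leaf = lambda ks: {"keys": ks, "children": []}
tree = {"keys": [20],
        "children": [leaf([10]),
                     {"keys": [40], "children": [leaf([30]), leaf([50])]}]}
```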
Handling Duplicates
Instead of just storing keys, store a pair of (key, insertion order) where insertion order refers to the order when said key is inserted (e.g. timestamp)
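This works because pairs compare lexicographically: equal keys stay distinct and are ordered by insertion time. A tiny illustration in Python (sample values are made up):

```python
entries = [(42, 0), (17, 1), (42, 2)]           # (key, insertion order)
entries.sort()                                  # tuples compare element-wise
assert entries == [(17, 1), (42, 0), (42, 2)]   # duplicates kept and ordered
```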
Why B-Trees over BST?
Both B-trees and BSTs have O(log n) time for search, insert and delete operations. Then why do we need to learn about B-trees when we can use BSTs anyway?
In general, data is stored on disk in blocks of some size B (e.g., a few kilobytes). Each block stores a chunk of memory of size B. (Think of each block as an array of size B.) When accessing a memory location in some block, the entire block is read into memory. You can assume that your memory can hold some blocks at a time.
If you had an array of length n, how many blocks would you need to transfer to the memory to do a linear search for an element?
Ans: O(n/B) (since B is the amount of data in any 1 block; let’s say that each element takes up 1 unit of data)
What about doing a binary search on an array of length n?
Ans: O(log_2 (n/B)) block transfers. Think of this as doing a binary search on n/B blocks. Once your search finds the right block, it is loaded into memory and the rest of the search is free.
Now imagine you are storing your data in a (B, 2B)-tree (notice that you might choose a and b depending on how you want to optimize). Notice that each node in your B-tree can be stored in O(1) blocks. For example, one block stores the key list, one block stores the subtree links, and one block stores the other information (e.g., parent pointer, auxiliary information, etc.). Now what is the cost of searching a keylist in a B-tree node? What is the cost of splitting a B-tree node? What is the cost of merging or sharing a B-tree node?
Ans: O(1) block transfers!!! After you load the block into memory, everything happens so fast that the time taken is negligible compared to accessing data stored on disk.
So, what is the overall cost of searching a (B, 2B)-tree? What is the cost of inserting or deleting in a (B, 2B)-tree? (Here, B refers to the block size.)
Ans: O(log_B n) block transfers, i.e., the cost only depends on the height of the tree, since each of the operations to access a single node costs only O(1) block transfers. The important thing here is the base of the logarithm. Searching in a BST is O(log_2 n), while searching in a (B, 2B)-tree is O(log_B n). In practice, B is a pretty large number. For instance, if your disk has 16KB blocks (which is reasonably normal) and you set b accordingly (on the order of a thousand children per node), then given a 10TB database, your (B, 2B)-tree just requires 3 levels. The root of your B-tree will always stay in memory. For typical memory sizes (e.g., 256MB disk cache), the first level of your B-tree will also always be in memory. Thus, the cost of searching your 10TB database is typically one block transfer. For a 1000TB disk, you have 4 levels and 2 block transfers. That’s why it’s really hard to beat a well-implemented B-tree.
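A rough check of these level counts, under assumed sizes: 16 KB blocks and roughly 1024 children per node (both the branching factor and the layout are assumptions for the sake of arithmetic):

```python
import math

block = 16 * 2**10                       # 16 KB block size
b = 1024                                 # assumed children per node
levels_10tb = math.ceil(math.log(10 * 2**40 // block, b))      # 10 TB of blocks
levels_1000tb = math.ceil(math.log(1000 * 2**40 // block, b))  # 1000 TB
```

With these numbers, the 10TB database fits in 3 levels and the 1000TB one in 4, matching the figures above.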