Notes on tree data structures
A plain binary search tree can be implemented in a few dozen lines of code but decays into a linked list if keys are inserted in sorted order; the worst-case time complexity for insertion and retrieval is then linear instead of logarithmic.
Self-balancing binary trees are a class of data structures that offer logarithmic insertion and retrieval by proactively keeping the tree balanced upon insertion and deletion.
Why use binary trees? To implement key–value maps. Compared to hash tables, they do not require the choice of a hash function, tuning of parameters such as table size, or collision-handling strategies, and they can be traversed in sorted order. They are popular in functional languages (e.g., in OCaml's Base library) as they can be updated non-destructively. The downside is that insertion and retrieval take logarithmic instead of constant time.
- AVL trees use a set of four "rotation" operations (left, right, left–right, and right–left) to ensure that no node's left and right children differ in height by more than 1. Each node uses a couple of extra bits to store its "balance factor".
- The rotations are hard to wrap your head around; the diagrams in Levitin help. They are not that hard to implement; what's a lot trickier is updating the balance factor at each node after an insertion and possible rotation.
- My implementation in Go
- A more robust implementation in C
- Stores a parent pointer on each node to make the rebalancing easier; also packs the balance factor into the pointer.
- Red–black trees color each node of the tree as either red or black, and maintain invariants (every root-to-leaf path has the same number of black nodes; no red node has a red child) which guarantee that the longest path from any node to a leaf is at most twice the shortest. Like AVL trees, red–black trees use rotations to keep the heights balanced. Because the red–black invariants are looser, any AVL tree could be colored red–black, but not all red–black trees satisfy the AVL invariants. Concretely, red–black trees have faster insertions but slower retrievals.
- 2-3 trees allow nodes to have either 2 or 3 children. In the former case they are regular binary tree nodes; in the latter case, the node has 2 keys, and its descendants are partitioned between the left subtree (_ < key1), the middle subtree (key1 < _ < key2), and the right subtree (key2 < _). 2-3 trees are perfectly balanced – all leaf nodes are at the same level. To remain balanced, insertions can cause splits that percolate up the tree.
- B-trees generalize 2-3 trees: each node holds up to m − 1 keys (m children), where m is some fixed constant. Like 2-3 trees, they are perfectly balanced. All nodes except the root must be at least half full. Insertions are handled similarly to 2-3 trees. B-trees are commonly used in database systems (see, e.g., "Database File Format" for SQLite): if each node occupies a disk page, then retrievals from disk can be done efficiently.
Another tree data structure (not a self-balancing binary tree) is the trie, which represents a set of strings and can efficiently find all strings in the set matching a prefix (which hash sets cannot do). In a traditional trie, each node has an array of children whose length is equal to the number of characters in the 'alphabet'. Even for just the ASCII character set, this is wildly inefficient, and for Unicode it is totally intractable. A space-optimized version is the radix tree which stores whole prefixes instead of individual characters at each node.
Bibliography
- Chapter 6 of Algorithms by Anany Levitin covers AVL trees and 2-3 trees.
- "Red-black trees (rbtree) in Linux" (2007)
- "The Ubiquitous B-Tree" (1979)