Hash (Symbol) Table
A symbol table is an abstract data type that supports insert, search, delete, contains, and size operations. The important thing to note here is that, unlike a dictionary, a symbol table does not support successor or predecessor queries: it is unordered, i.e., there is no ordering of keys in the data structure.
Java hashCode()
Every object supports the method int hashCode(): by default, it returns a value derived from the memory location of the object, so every object hashes to a different value.
hashCode() always returns a 32-bit integer. In particular, if the keys are themselves 32-bit integers, every key can get a distinct hashCode (no collisions in this step!)
Rules regarding hashCode()
Always returns the same value, if the object hasn’t changed
If two objects are equal, then they return the same hashCode (but the converse is not necessarily true)
You must redefine the equals(Object obj) method to be consistent with hashCode()
Rules regarding equals(Object o)
Reflexive: x.equals(x) is true
Consistent: always returns the same answer
Null: x.equals(null) is always false
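To make these rules concrete, here is a minimal sketch (the Point class and its fields are invented for illustration) of overriding equals(Object o) consistently with hashCode():

```java
import java.util.Objects;

// Invented example class: equals() and hashCode() must agree.
public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;              // reflexive: x.equals(x) is true
        if (!(o instanceof Point)) return false; // also makes x.equals(null) false
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        // Equal objects return the same hashCode; the fields never change,
        // so the value is consistent across calls.
        return Objects.hash(x, y);
    }
}
```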
Java implementation of V get(Object key) (Uses Chaining)
Java int hash(int h)
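A simplified sketch in the style of pre-Java-8 java.util.HashMap (generics and null handling trimmed; assumes the table length is a power of two):

```java
// Sketch of get() with chaining: hash the key, pick a bucket, walk the chain.
public V get(Object key) {
    int h = hash(key.hashCode());           // pre-hash, then supplemental hash
    int index = h & (table.length - 1);     // bucket index; table length is a power of 2
    for (Entry<K, V> e = table[index]; e != null; e = e.next) {
        // Cheap hash comparison first, then the (possibly expensive) equals().
        if (e.hash == h && e.key.equals(key)) {
            return e.value;
        }
    }
    return null;  // key not found
}

// Supplemental hash: spreads the bits of the pre-hash so that keys differing
// only in high-order bits still land in different buckets.
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
```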
Before hash() is applied, the object is converted to an integer representation - that is called the pre-hash (in Java, hashCode() serves as the pre-hash).
java.util.HashMap
java.util.Map Interface
Map is a parameterized interface, parameterized by key type and value type. The keys and values need not be comparable.
No duplicate keys are allowed
No mutable keys are allowed - if you use an object as a key, then you can't modify that object later (mutating it could change its hashCode, so the table would look in the wrong bucket)
Although java.util.Map also supports some other operations, it is not necessarily efficient to work with them, since the map is not sorted.
In Java, TreeMap (a dictionary) supports far more operations than HashMap (a symbol table), but HashMap provides efficient implementations for the operations that it does support.
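A minimal usage sketch of the symbol-table operations via java.util.HashMap:

```java
import java.util.HashMap;
import java.util.Map;

public class SymbolTableDemo {
    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        table.put(4, "Devansh");                  // insert
        System.out.println(table.get(4));         // search -> Devansh
        System.out.println(table.containsKey(7)); // contains -> false
        table.remove(4);                          // delete
        System.out.println(table.size());         // size -> 0
    }
}
```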
Idea 1: Implement using an AVL tree (insert, search, and delete all take O(log n))
But what if we don’t want/need any ordering or successor/predecessor queries? Is it possible to sacrifice the ordering to gain some extra speed in insertion and searching? YES!
Attempt 1: Use a direct access table - indexed by keys
For example, if you wanted to store (4, Devansh) in the table, it would be stored at index 4 of the array/table.
Problems:
Too much space (in particular, if keys are 32-bit integers, then the table size > 4 billion)
How would you store non-integral keys (e.g., strings)?
How do you handle duplicates?
Hash Functions
Collisions
A collision occurs when two distinct keys hash to the same bucket. Since there are far more possible keys than buckets, collisions are unavoidable (pigeonhole principle). So, how do we deal with collisions? Chaining and Open Addressing, of course!
Chaining
Each bucket contains a linked list of items. So, all the keys that collide at a particular value are all inserted into the same bucket.
Question
Imagine that we implement a symbol table using a single-ended Queue as the data structure. What would be the worst-case time complexities of insert, delete and search operations? (Assume we do not try to insert duplicate keys. You can only access the queue using standard queue operations, e.g., enqueue and dequeue.)
Question
Assume that we're using a hash function to map keys to their respective positions in the symbol table. The time complexity of computing the hash is O(b) for an input key of size b, while array accesses and key comparisons take constant time. Assuming collisions are resolved by chaining with linked lists, m elements are present in the table, and duplicate values are never inserted, what is the worst-case time complexity of insert and search operations on keys containing n bits?
Simple Uniform Hashing Assumption (SUHA)
Every key is equally likely to map to every bucket.
Keys are mapped independently of previously mapped keys.
Question
Why don’t we just insert each key into a random bucket, i.e., why do we need a specific hash function at all?
Answer: Because then searching would be very very slow. How would you find the item when you need it?
Load
$$\text{load}(\text{hash table}) = \dfrac{n}{m} = \text{average number of items per bucket}$$
Let us calculate the expected search time in such a case.
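Under SUHA, the expected chain length is easy to compute: each of the n keys lands in any fixed bucket with probability 1/m, so

$$E[\text{chain length}] = \sum_{i=1}^{n} \frac{1}{m} = \frac{n}{m} = \alpha$$

Hence a search takes expected O(1 + α) time: O(1) to compute the hash and index the bucket, plus O(α) to scan the chain. Keeping m = Ω(n) makes this O(1) in expectation.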
Question
Open Addressing
Advantages
No linked lists!
All data is stored directly in the table
One item per slot
Logic: On collision, probe a (deterministic) sequence of buckets until you find an empty one
So now we need a new hash function h(key, i), where the extra argument i is the probe number: h(key, 1), h(key, 2), ... gives the sequence of buckets to try for that key.
Probing Code (Need not be Linear Probing)
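A hedged Java sketch (all names invented) with the probe sequence factored out into probe(key, i), using linear probing by default:

```java
// Invented sketch of open addressing; probe(key, i) can be any deterministic
// probe sequence (linear probing shown here).
class OpenAddressingTable<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final Entry<K, V>[] table;

    @SuppressWarnings("unchecked")
    OpenAddressingTable(int capacity) {
        table = (Entry<K, V>[]) new Entry[capacity];
    }

    // i-th bucket to try for this key: linear probing.
    private int probe(K key, int i) {
        int base = (key.hashCode() & 0x7fffffff) % table.length;
        return (base + i) % table.length;
    }

    void insert(K key, V value) {
        for (int i = 0; i < table.length; i++) {
            int b = probe(key, i);
            if (table[b] == null) {              // empty slot: claim it
                table[b] = new Entry<>(key, value);
                return;
            }
        }
        throw new IllegalStateException("table is full"); // in practice: resize
    }

    V search(K key) {
        for (int i = 0; i < table.length; i++) {
            int b = probe(key, i);
            if (table[b] == null) return null;   // gap in probe sequence: not present
            if (table[b].key.equals(key)) return table[b].value;
        }
        return null;
    }
}
```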
Deleting a Key
If you just remove the key from the table and set its slot to null, a later search may return key-not-found even though the key exists, because of the "gap" this leaves in the probe sequence.
Simple solution: Don't set the slot to null on deletion; set it to DELETED to indicate that the slot is empty now, but there was an element here before.
If you encounter a DELETED value while probing for a bucket to insert into, you can feel free to insert the new item there (because DELETED is just a placeholder that can be safely overwritten).
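Continuing the sketch above, deletion with a DELETED tombstone might look like this (search treats a tombstone as "keep probing", while insert may reuse the slot):

```java
// Tombstone sentinel: the slot is free for insertion, but search must not
// stop here, because the key may live further along the probe sequence.
@SuppressWarnings("rawtypes")
private static final Entry DELETED = new Entry(null, null);

void delete(K key) {
    for (int i = 0; i < table.length; i++) {
        int b = probe(key, i);
        if (table[b] == null) return;   // hit a real gap: key is not in the table
        if (table[b] != DELETED && table[b].key.equals(key)) {
            table[b] = DELETED;         // leave a tombstone instead of null
            return;
        }
    }
}

// In insert(): treat DELETED like an empty slot:
//   if (table[b] == null || table[b] == DELETED) { ... }
// In search(): skip DELETED entries instead of returning null.
```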
Hash Functions
2 Properties of a good hash function:
1. The probe sequence h(key, 1), h(key, 2), ..., h(key, m) enumerates all m buckets, so if an empty slot exists, probing will find it.
2. It satisfies the Uniform Hashing Assumption (UHA): every key is equally likely to be mapped to every permutation of the buckets, independent of all other keys.
Problem with Linear Probing: Clusters
But, in practice linear probing is very fast! Why?
Because of caching! It is cheap to access nearby array cells, since a whole block of contiguous array cells is loaded into the cache at once. For example, if you want to access A[17], the cache might load A[10..50]; it then takes almost 0 cost to access any cell in that range. Recall that block transfer time is far, far greater than memory access time.
Good hashing functions
We saw that linear probing does not meet the second criterion of a good hash function, i.e., it does not satisfy UHA. So the question is: how do we get a good hash function that satisfies UHA? Double hashing, of course!
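The standard textbook form of double hashing:

$$h(k, i) = \big(h_1(k) + i \cdot h_2(k)\big) \bmod m$$

where h1 and h2 are two ordinary hash functions. If h2(k) is always relatively prime to m (e.g., m prime and 1 ≤ h2(k) < m), then the probe sequence of every key visits all m buckets.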
Performance of Open Addressing
Claim: under the Uniform Hashing Assumption, if the load is α = n/m < 1, then the expected cost of each operation is at most 1/(1 − α).
Proof of Claim:
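A sketch of the standard argument: under UHA, each probe lands on an occupied slot with probability at most α = n/m, so the probability that more than i probes are needed is at most α^i. Summing the geometric series,

$$E[\text{number of probes}] \le \sum_{i=0}^{\infty} \alpha^{i} = \frac{1}{1-\alpha}$$

For example, at load α = 0.5 this is 2 expected probes; at α = 0.9 it is 10.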
Summary of Open Addressing
Advantages:
Saves space - Empty slots vs linked lists (in chaining)
Rarely allocate memory (only during table resizing) - no new list-node allocations
Better cache performance - table all in one place in memory. Fewer accesses to bring the table into cache. Linked lists can wander all over memory.
Disadvantages:
More sensitive to choice of hash function: clustering is a common problem in case of linear probing
Table Resizing
Throughout this discussion of table resizing, assume that we are using hashing with chaining and our hash function obeys simple uniform hashing (although this is a false assumption - it can never be true in practice)
Idea: Start with a small constant table size. Grow and shrink the table as necessary.
For each item in the old hash table, compute the new hash function and copy the item to its new bucket.
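A minimal sketch of this rehash step (invented names, assuming chaining with buckets stored as an array of linked lists):

```java
import java.util.LinkedList;
import java.util.List;

// Rebuild the chaining table at a new size. Every item must be re-hashed,
// because the bucket index depends on the table size.
@SuppressWarnings("unchecked")
void resize(int newSize) {
    List<Entry<K, V>>[] oldBuckets = buckets;
    buckets = (List<Entry<K, V>>[]) new List[newSize];
    for (List<Entry<K, V>> chain : oldBuckets) {
        if (chain == null) continue;
        for (Entry<K, V> e : chain) {
            int b = (e.key.hashCode() & 0x7fffffff) % newSize;  // new bucket index
            if (buckets[b] == null) buckets[b] = new LinkedList<>();
            buckets[b].add(e);
        }
    }
}
```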
But... How much should we grow by? When should we grow?
Idea 1: Increment table size by 1
Then every insertion into a full table triggers an O(n) resize, so n insertions cost 1 + 2 + ... + n = O(n²) in total.
Idea 2: Double table size
For example, let your initial table size be 8 (it's always good practice to keep your table size a power of 2). Then resizes happen when the table reaches sizes 8, 16, 32, ..., and each resize costs time proportional to the table size at that point.
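The standard doubling analysis: the resize costs form a geometric series, so growing from size 8 up to n items costs

$$8 + 16 + 32 + \cdots + n \le 2n = O(n)$$

in total across all n insertions, i.e., O(1) amortized per insertion.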
Idea 3: Squaring table size
This overshoots: a full table of n items is resized to n² slots, so almost all of the space is wasted.
Shrinking Table
When the table becomes too big, we need to shrink the table. Again, the question is: When do we shrink?
Attempt 1
A natural first rule: grow (m → 2m) when n = m, and shrink (m → m/2) when n < m/2.
This is actually a really bad idea! Consider the following problem: right after doubling, the table is exactly half full, so one deletion triggers a shrink; after shrinking, the table is full again, so one insertion triggers a grow.
You might have to resize at every insertion and deletion
Attempt 2
A better rule: grow (m → 2m) when n = m, but shrink (m → m/2) only when n < m/4.
Claim (very important invariants that you should be able to observe for any kind of growth/shrink rules given): (1) immediately after any resize, n ≈ m/2; (2) hence at least ~m/4 insertions or deletions must occur before the next resize.
These two claims can be easily proven by observation, and together they imply that each O(m) resize is paid for by Ω(m) cheap operations, i.e., O(1) amortized.
Insertions and deletions are amortized because of table resizing (an insertion/deletion can trigger a table resize once in a while, which is expensive)
Insertions are not expected-time (no randomness is involved, because we do not search for duplicates; we just prepend to the chain)
Searches are expected-time (the chain length in each bucket depends on the randomness of the hash function) and not amortized (a search never triggers a table resize)
Suppose we follow these rules for an implementation of an open-addressing hash table, where n is the number of items in the hash table and m is the size of the hash table:
(a) If n = m, then the table is quadrupled (resize m to 4m)
(b) If n < m/4, then the table is shrunk (resize m to m/2)
What is the minimum number of insertions between 2 resize events? What about deletions?
Solution: Insertions: m/2 + 1; Deletions: 1.
Suppose a newly inserted entry causes n = m. Then the table is resized to m′ = 4m. If we now delete an entry, the number of entries becomes n − 1 = m − 1 = m′/4 − 1 < m′/4, so the table resizes again. Hence, the minimum number of deletions between two resizes is 1.
After that shrink, the new table size is m′′ = m′/2 = 2m, and the number of insertions before it expands is m′′ − (n − 1) = m′′ − (m′′/2 − 1) = m′′/2 + 1. Note that the answer is calculated relative to the current table size.
Designing good hash functions
Simple Uniform Hashing does not exist! It is a false assumption made merely to simplify the analysis and understanding. Even UHA is a false assumption.
This is because keys are not random. There’s lots of regularity between keys and often there are mysterious patterns too! Patterns in keys can induce patterns in hash functions unless you’re very careful.
Consider the following example: hash a string by summing up the values of its characters.
Umm, this is also a bad hash function. Lots of words collide (any two anagrams, for a start) and you won't get a uniform distribution, because most words are short and so the sum will likely be low.
But pretty good hash functions exist (e.g., SHA-256 is a cryptographically secure hash function)
The moral of the story is simple: don’t ever design your own hash function unless you absolutely need to.
Designing Hash Functions
This is similar to pseudorandom generators - there’s no real randomness. Instead, the sequence of numbers generated just looks random.
For every hash function, there is some set of keys which leads to bad outputs!
But, if you know the keys in advance, you can choose a hash function that is always good. If you then change the keys, it might be bad again.
There are 2 common techniques used to design hash functions:
Division Method
Multiplication Method
Division Method
Collisions are unlikely if the keys are random.
Overall, the division method is popular (and easy) but not always the most effective. This is also because division is often slow.
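For reference, the standard textbook forms of the two methods:

Division method: $h(k) = k \bmod m$, where m is best chosen to be a prime not too close to a power of 2 (if m = 2^p, the hash just keeps the low-order p bits of the key).

Multiplication method: $h(k) = \lfloor m \,(kA \bmod 1) \rfloor$ for a constant 0 < A < 1 (Knuth suggests A ≈ (√5 − 1)/2 ≈ 0.618). Here the choice of m is not critical, so a power of 2 works fine.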
Questions on Hashing
Implementing sets using Hash Tables
Consider the following implementations of sets. How would intersect and union be implemented for each of them? (a) a hash table with open addressing; (b) a hash table with chaining.
Table Resizing question