Hash (Symbol) Table

Symbol (Hash) Table

A symbol table is an abstract data type that supports insert, search, delete, contains and size methods. The important thing to note here is that unlike a dictionary which supports successor or predecessor queries, a symbol table is unordered. That is, there is no ordering of keys in the data structure.
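A minimal sketch of this ADT in Java (the interface name and method signatures are illustrative, not a prescribed API):

// Illustrative sketch of the symbol table ADT described above.
interface SymbolTable<K, V> {
	void insert(K key, V value);   // add (or overwrite) a key-value pair
	V search(K key);               // return the value mapped to key, or null if absent
	void delete(K key);            // remove the key (and its value) if present
	boolean contains(K key);       // is the key in the table?
	int size();                    // number of key-value pairs stored
}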

Java hashCode()

Every object supports the method int hashCode(). By default, it returns a value based on the memory location of the object, so every distinct object hashes to a different location.

The hashcode is always a 32-bit integer. Therefore, every 32-bit integer key can get a different hashcode (without any collisions at this step!)

// implementation of hashCode() for integers (32-bit)

public int hashCode() {
	return value;
}

// implementation of hashCode() for Long (64-bit)
// there will be collisions because the number of items is twice the possible number of hashcodes.
// In particular, for every hashcode, there will be 2 Longs that have that same hashCode.
// Note: x >>> n removes the last n (least significant) bits from x
// So, XOR is performed between the first 32 bits and the last 32 bits of the Long
public int hashCode() {
	return (int) (value ^ (value >>> 32));
}

// implementation of hashCode() for Strings is a little more complicated
public int hashCode() {
	int h = hash;
	if (h == 0 && count > 0) {
		int off = offset;
		char val[] = value;
		int len = count;
		for (int i = 0; i < len; i++) {
			h = 31*h + val[off++]; // we choose 31 because it is prime and also 2^5 - 1 (close to a power of 2)
		}
		hash = h;
	}
	return h;
}

Rules regarding hashCode()

  • Always returns the same value, if the object hasn’t changed

  • If two objects are equal, then they return the same hashCode (but the converse is not necessarily true)

  • You must redefine the equals(Object obj) method to be consistent with hashCode()

Rules regarding equals(Object o)

  • Reflexive: x.equals(x) is true

  • Symmetric: x.equals(y) ⟺ y.equals(x)

  • Transitive: x.equals(y) ∧ y.equals(z) ⟹ x.equals(z)

  • Consistent: always returns the same answer

  • Null: x.equals(null) is always false
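As an illustration of how these two sets of rules fit together, here is one way a simple key class could override both methods consistently (the Point class and its fields are made up for this example, not part of the course code):

// Hypothetical key class illustrating a consistent equals()/hashCode() pair.
class Point {
	private final int x;
	private final int y;

	Point(int x, int y) { this.x = x; this.y = y; }

	@Override
	public boolean equals(Object o) {
		if (this == o) return true;               // reflexive
		if (!(o instanceof Point)) return false;  // also handles null: null is never a Point
		Point p = (Point) o;
		return x == p.x && y == p.y;              // symmetric and transitive
	}

	@Override
	public int hashCode() {
		return 31 * x + y;                        // equal Points always produce equal hashcodes
	}
}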

Java implementation of V get(Object key) (Uses Chaining)

public V get(Object key) {
   if (key == null) return getForNullKey();
   int hash = hash(key.hashCode());
   for (Entry<K,V> e = table[indexFor(hash, table.length)]; e != null; e = e.next) {
      Object k;
      // Java checks that the key is equal to the entry's key in the hash table before returning its value
      if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
         return e.value;
   }
   return null;
}

Java int hash(int h)

static int hash(int h) {
   h ^= (h >>> 20) ^ (h >>> 12);
   return h ^ (h >>> 7) ^ (h >>> 4);
}

Before hash() is applied, the object is converted to an integer representation - that is called a pre-hash.

java.util.HashMap

java.util.Map Interface

Map is a parameterized interface: parameterized by key and value. The key and value need not be comparable.

No duplicate keys are allowed

No mutable keys are allowed - if you use an object as a key then you can’t modify that object later

Although java.util.Map also supports further operations, it is not necessarily efficient to work with them since the map is not sorted.

In Java, TreeMap (a dictionary) supports far more operations than HashMap (a symbol table), but HashMap provides efficient implementations for the operations that it does support.

Idea 1: Implement using AVL tree

If we implement a symbol table using an AVL tree, the cost of insertion is $O(\log n)$ and the cost of searching is also $O(\log n)$. Moreover, since it stores the keys in an ordered fashion, it also supports successor and predecessor queries.

But what if we don’t want/need any ordering or successor/predecessor queries? Is it possible to sacrifice the ordering to gain some extra speed in insertion and searching? YES!

Our aim is to implement a symbol table with cost of insertion and cost of searching $O(1)$.

It is a well-known fact that any comparison-based sorting algorithm requires at least $\Omega(n \log n)$ comparisons. Furthermore, the fastest comparison-based searching algorithms (e.g. binary search) require at least $\Omega(\log n)$ comparisons. So, how is it possible to achieve $O(1)$ for insertion and searching?? Simple - we don’t use comparison-based operations (at least not directly - i.e., we don’t directly compare the elements, we locate them based on their hashes 😅).

Attempt 1: Use a direct access table - indexed by keys

For example, if you wanted to store (4, Devansh) in the table, it would be stored at index 4 of the array/table.

Problems:

  1. Too much space (in particular, if keys are integers, then the table size > 4 billion)

  2. How would you store non-integer keys?

  3. How do you handle duplicates?
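As a rough sketch of the direct-access idea before we fix these problems with hashing (class and method names are illustrative):

// Illustrative direct-access table: the key itself is the array index.
// Works only for small, non-negative integer keys - the problems above
// (space, non-integer keys, duplicates) are exactly what hashing addresses.
class DirectAccessTable {
	private final String[] table;

	DirectAccessTable(int maxKey) {
		table = new String[maxKey + 1];   // one slot per possible key
	}

	void insert(int key, String value) { table[key] = value; }

	String search(int key) { return table[key]; }   // null if the key is absent
}

For instance, insert(4, "Devansh") stores the value directly at index 4, matching the example above.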

Hash Functions

We use hash functions to map a huge universe of possible keys (out of which we have $n$ actual keys) to $m$ buckets. (You don’t know which keys are going to be inserted; new keys can be inserted, old keys can be deleted, etc.)

So, we define a hash function $h : U \rightarrow \{1,\dots,m\}$, where $U$ is the universe of possible keys (the permissible keys). We store the key $k$ in bucket $h(k)$.

Time complexity for insertion: time to compute $h$ + time to access the bucket. If we assume that the hash function takes $O(1)$ time to compute, then we can achieve insertion in $O(1)$.

Collisions

We say that 2 distinct keys $k_1, k_2$ collide if $h(k_1) = h(k_2)$.

By the pigeonhole principle, it is impossible to choose a hash function that guarantees 0 collisions (since the size of the universe of keys is much larger than $m$).

So, how do we deal with collisions? Chaining and Open Addressing, of course!

Chaining

Each bucket contains a linked list of items. So, all the keys that collide at a particular value are all inserted into the same bucket.

How does this affect the running time of insert? No change! Still $O(\text{cost}(h))$ to compute the hash function + $O(1)$ to add the item to the front of the linked list. This is if we assume that we are not checking for duplicate keys while inserting.

How does this affect the running time of search? Well, now once we find the bucket in $O(\text{cost}(h))$ time using the hash function, we may need to traverse the entire linked list to find the key. So, it can take up to $O(\text{length of the longest chain})$, which is $O(n)$ in the worst case (when all the inserted keys collide into the same bucket - as obviously intended by the adversary who is running your program).

Similarly, deletion also takes $O(n + \text{cost}(h))$ time.
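A minimal sketch of hashing with chaining under the assumptions above (fixed table size, no duplicate check on insert; the class and field names are illustrative):

import java.util.LinkedList;

// Illustrative chained hash table: each bucket holds a linked list of entries.
class ChainedHashTable<K, V> {
	private static class Entry<K, V> {
		final K key; V value;
		Entry(K key, V value) { this.key = key; this.value = value; }
	}

	private final LinkedList<Entry<K, V>>[] buckets;

	@SuppressWarnings("unchecked")
	ChainedHashTable(int m) {
		buckets = new LinkedList[m];
		for (int i = 0; i < m; i++) buckets[i] = new LinkedList<>();
	}

	private int bucketFor(K key) {
		return Math.floorMod(key.hashCode(), buckets.length);   // map the pre-hash to a bucket index
	}

	// O(cost(h)) + O(1): prepend to the chain, no duplicate check
	void insert(K key, V value) {
		buckets[bucketFor(key)].addFirst(new Entry<>(key, value));
	}

	// O(cost(h)) + O(length of the chain): worst case O(n) if everything collides
	V search(K key) {
		for (Entry<K, V> e : buckets[bucketFor(key)])
			if (e.key.equals(key)) return e.value;
		return null;
	}
}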

Question

Imagine that we implement a symbol table using a single-ended Queue as the data structure. What would be the worst-case time complexities of insert, delete and search operations? (Assume we do not try to insert duplicate keys. You can only access the queue using standard queue operations, e.g., enqueue and dequeue.)

Answer: Insertion takes $O(1)$; search and deletion take $O(n)$.

Question

Assume that we're using a hash function to map keys to their respective positions in the symbol table. The time complexity for computing the hash is $O(b)$ for an input key of size $b$, while array access and key comparison take constant time. Assuming collisions are resolved by chaining with linked lists, $m$ elements are present in the table, and there are never duplicate values inserted, what is the worst-case time complexity of insert and search operations on keys containing $n$ bits?

Answer: Insertion $O(n)$, Search $O(n + m)$

Simple Uniform Hashing Assumption (SUHA)

  1. Every key is equally likely to map to every bucket.

  2. Keys are mapped independently of previously mapped keys.

Question

Why don’t we just insert each key into a random bucket, i.e., why do we need a specific hash function at all?

Answer: Because then searching would be very very slow. How would you find the item when you need it?

Load

The load of a hash table is the expected number of items per bucket. If we assume that we have $n$ items and $m$ buckets, then

$$\text{load}(\text{hash table}) = \dfrac{n}{m} = \text{average number of items per bucket}$$

So, Expected Search Time = $O(1)$ + expected number of items per bucket (assuming it takes $O(1)$ for hash function computation and array access).

Let us calculate the expected search time in such a case.

Let $X$ be a random variable defined by:

$$X(i,j) = \begin{cases} 1 & \text{if item } i \text{ is put in bucket } j \\ 0 & \text{otherwise} \end{cases}$$

We define the random variable in this way so that for any bucket $j$, $\sum_i X(i,j)$ gives us the number of items in that bucket.

$P(X(i,j) = 1) = \dfrac{1}{m}$ for any value of $i, j$. This is simply because there are $m$ possible buckets for any item and each can be picked with equal probability.

$E(X(i,j)) = P(X(i,j) = 1)\times 1 + P(X(i,j) = 0)\times 0 = \dfrac{1}{m}$

So, the expected number of items in bucket $b$ would be:

$E\left[\sum_i X(i,b)\right] = \sum_i E(X(i,b)) = \dfrac{n}{m}$, by linearity of expectation. (Each item contributes $1$ to the bucket it is in.)

Hence, Expected Search Time = $O(1) + \dfrac{n}{m}$. (So, we take $m = \Omega(n)$ buckets, e.g. $m = 2n$.)

(But it is important to realise that the worst-case search time (not expected running time!) is still $O(n)$.)

Question

What if you insert $n$ elements into your hash table, which has $n$ buckets? What is the expected maximum cost?

Answer: This is like throwing $n$ balls into $n$ bins. The expected maximum number of balls in a bin is $O(\log n)$. Actually, a tighter bound is $\Theta(\log n / \log\log n)$ → it’s not that trivial to prove.

Open Addressing

Advantages

  • No linked lists!

  • All data is stored directly in the table

  • One item per slot

Logic: On collision, probe a deterministic sequence of buckets until you find an empty one.

Example of a probe sequence: linear probing (if bucket $h(k)$ is taken, then look at bucket $h(k) + 1$ and so on, until you find a bucket that is empty).

So now, for this, we need a new kind of hash function,

$h(key, i) : U \rightarrow \{1,2,\dots, m\}$, where $key$ = the item to map and $i$ = the probe number ($i - 1$ = the number of collisions encountered so far).
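For instance, linear probing can be written as the following probe function (a sketch only; it uses 0-indexed buckets, as in a Java array, rather than the $\{1,\dots,m\}$ convention above):

// Sketch of linear probing as a probe function (0-indexed buckets).
// f is an ordinary hash function mapping a key to {0, ..., m-1}; i is the probe number starting at 1.
static int linearProbe(int fOfKey, int i, int m) {
	return Math.floorMod(fOfKey + (i - 1), m);   // try f(k), f(k)+1, f(k)+2, ... wrapping around
}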

Probing Code (Need not be Linear Probing)

hash-insert(key, data) {
	int i = 1;
	while (i <= m) {
		int bucket = h(key, i);
		if (T[bucket] == null) { // Found an empty bucket
			T[bucket] = {key, data}; // Insertion
			return success;
		}
		i++;
	}
	throw new TableFullException(); // you visited every bucket and checked that it was full :(
}

hash-search(key) {
	int i = 1;
	while (i <= m) {
		int bucket = h(key, i);
		if (T[bucket] == null) // Empty bucket! If this is empty, we know that the key cannot appear
		// later in the probe sequence, because it would have been placed here during insertion
			return key-not-found;
		if (T[bucket].key == key) // Full bucket
			return T[bucket].data;
		i++;
	}
	return key-not-found; // Exhausted entire table
}

Deleting a Key

If you just remove the key from the table and set it to null, during search, the hash-search function may return key-not-found even if the key exists (because of the “gap” in the probe sequence).

Simple solution: Don’t set the value to null on deletion, set it to DELETED to indicate that the slot is empty now but there was an element here before.

If you encounter a DELETED value while probing for a bucket to insert, you can feel free to insert the new item there (because DELETED is just a placeholder that can be safely overwritten)
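A sketch of deletion with a tombstone, in the same pseudocode style as the probing code above (DELETED is a sentinel distinct from null and from any real entry; hash-search must also keep probing past DELETED slots):

hash-delete(key) {
	int i = 1;
	while (i <= m) {
		int bucket = h(key, i);
		if (T[bucket] == null) // Empty bucket: the key is not in the table
			return key-not-found;
		if (T[bucket] != DELETED && T[bucket].key == key) {
			T[bucket] = DELETED; // Leave a tombstone instead of null so later probes keep going
			return success;
		}
		i++;
	}
	return key-not-found; // Exhausted entire table
}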

Hash Functions

2 Properties of a good hash function:

  1. $h(key, i)$ enumerates all possible buckets as $i$ iterates from $1$ to $m$ (so that you only throw a TableFullException() when the table is actually full and you have made sure that there are no empty slots).

    1. For every key $key$, and for every bucket $j$, $\exists \ i$ such that $h(key, i) = j$

    2. Equivalently, the probe sequence is a permutation of $\{1,2,\dots,m\}$

  2. Uniform Hashing Assumption (UHA) - every key is equally likely to be mapped to every permutation, independent of every other key. (There are $m!$ possible permutations of a probe sequence, where $m$ = table size.)

Note that linear probing satisfies the first property but does not satisfy the second property! (For example, it is impossible to find a key $k$ whose hash generates the permutation $2,1,3$ for a table of size 3. This is because the only permutations generated by linear probing are $1,2,3$, $2,3,1$ and $3,1,2$.)

Problem with Linear Probing: Clusters

If there is a cluster (a lot of elements placed in contiguous buckets), there is a higher probability that the next $h(k)$ will hit the cluster: if $h(k,1)$ hits any bucket in the cluster, the item gets added to the end of the cluster and the cluster grows bigger. This is an example of “rich get richer”.

If the table is $1/4$ full, there will be clusters of size $\Theta(\log n)$. This ruins constant-time performance for searching and inserting.

But, in practice linear probing is very fast! Why?

Because of caching! It is cheap to access nearby array cells (since an entire block of contiguous array cells is loaded into the cache). For example, if you want to access A[17], the cache loads A[10...50]. Then, it takes almost 0 cost to access any cells in that range. Recall that block transfer time is far greater than memory access time.

So, even though there will be clusters of $\Theta(\log n)$ when the table is $1/4$ full, the cache may hold the entire cluster. So, this makes linear probing no worse than a wacky probe sequence that circumvents the issue of clusters.

Good hashing functions

We saw that linear probing does not meet the second criterion of a good hash function, i.e., it does not satisfy UHA. So, the question is: how do we get a good hash function that satisfies UHA? Double hashing, of course!

Start with 2 ordinary hash functions satisfying SUHA: $f(k), g(k)$.

Define a new hash function: $h(k, i) = (f(k) + i \times g(k)) \bmod m$

Since $f(k)$ is pretty good, $h(k,1)$ is “almost random”.

Since $g(k)$ is pretty good, the probe sequence becomes “almost random” (we just added some “noise” to the value to increase randomness). (We need to take $\bmod \ m$ at the end so that we get a value between $0$ and $m-1$, which we can map to a bucket of the table of size $m$.)

$$h(k,i) = (f(k) + i\times g(k)) \bmod m$$

Claim: if $g(k)$ is relatively prime (co-prime) to $m$, then $h(k,i)$ hits all buckets (i.e., it generates a permutation of $\{1,\dots,m\}$).

Proof: Suppose not. Then, for some distinct $i, j < m$ (since the probe sequence does not hit all buckets, there must be two equal values of $h(k,\cdot)$ for two distinct probe numbers $i, j$ - by the pigeonhole principle):

$$\begin{aligned} (f(k)+ ig(k)) \bmod m &= (f(k) + jg(k)) \bmod m \\ ig(k) \bmod m &= jg(k) \bmod m \quad \text{(since } (a+b) \bmod m = ((a \bmod m) + (b \bmod m)) \bmod m\text{)} \\ (i - j)g(k) &\equiv 0 \pmod m \\ \implies \ & g(k) \text{ is not relatively prime to } m \text{, since } i,j < m \end{aligned}$$

Example: If $m = 2^r$, then choose $g(k)$ to be odd for all keys $k$.
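A minimal sketch of a double-hashing probe function under these definitions (0-indexed buckets and the particular choices of f and g are illustrative assumptions; m is taken to be a power of 2 so that forcing g(k) to be odd makes it co-prime to m):

// Illustrative double hashing: h(k, i) = (f(k) + i * g(k)) mod m, with m a power of 2.
static int f(int key, int m) {
	return Math.floorMod(key * 31, m);        // "pretty good" first hash (illustrative only)
}

static int g(int key, int m) {
	return Math.floorMod(key * 17, m) | 1;    // force the step size to be odd, hence co-prime to m = 2^r
}

static int doubleHashProbe(int key, int i, int m) {
	return Math.floorMod(f(key, m) + i * g(key, m), m);
}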

Performance of Open Addressing

Let the load $\alpha = n/m$ = average number of items per bucket. Assume $\alpha < 1$ (the table is not full).

Claim: For $n$ items in a table of size $m$, assuming uniform hashing, the expected cost of an operation (insert, search, delete) is $\leq \dfrac{1}{1-\alpha}$.

Example: if $\alpha = 90\%$, then $E[\#\text{probes}] = 10$.

Proof of Claim:

Say you have already inserted $n$ items and you want to insert the next element $k$.

The probability that the first bucket you try (i.e., $h(k,1)$) is full is $\dfrac{n}{m}$. Then, the probability that the second bucket is full, given that the first bucket is full, is $\dfrac{n-1}{m-1}$. So, you need to probe again: the probability that the third bucket is also full, given that the first two are full, is $\dfrac{n-2}{m-2}$, and so on.

Expected cost: $1 + \dfrac{n}{m}\left( 1 + \dfrac{n-1}{m-1}\left( 1 + \dfrac{n-2}{m-2} \dots\right)\right)$

Note: $\dfrac{n - i}{m - i} \leq \dfrac{n}{m} = \alpha$ (only when $\dfrac{n}{m} < 1$; for example, $\dfrac{4-1}{3-1} = \dfrac{3}{2} > \dfrac{4}{3}$).

Therefore, the expected cost is $\leq 1 + \alpha(1 + \alpha(1 + \dots)) = 1 + \alpha + \alpha^2 + \dots = \dfrac{1}{1-\alpha}$.

If $n < m$, the expected cost of searching in a hash table using open addressing is $O(1)$. Is this true or false?

False. Let $n = m - 1$. Then $\alpha = \dfrac{n}{m} = \dfrac{m-1}{m}$, so $\dfrac{1}{1-\alpha} = \dfrac{1}{1/m} = m = n + 1$, so it will take $O(n)$ to search for an item in a hash table with $n$ items and $n+1$ buckets. Moreover, we know that when we use linear probing, even if the table is only a quarter full, we will have clusters of size $\Theta(\log n)$ in expectation. So, the cost of searching in a hash table using open addressing can be at least $\log n$.

You are only allowed to use $O\left(\dfrac{1}{1-\alpha}\right)$ under the assumption of UHA. Linear probing, quadratic probing, etc. do not satisfy UHA and you cannot use this time complexity!

Only if $m$ (the table size) is prime and $\alpha < 0.5$ can we guarantee that quadratic probing finds an empty bucket (if one exists). Sometimes, quadratic probing finds an empty slot in the table only after more than $m$ tries (which is why some implementations wait longer before throwing a TableFullException()).

Summary of Open Addressing

Advantages:

  • Saves space - Empty slots vs linked lists (in chaining)

  • Rarely allocate memory (only during table resizing) - no new list-node allocations

  • Better cache performance - table all in one place in memory. Fewer accesses to bring the table into cache. Linked lists can wander all over memory.

Disadvantages:

  • More sensitive to choice of hash function: clustering is a common problem in case of linear probing

  • More sensitive to load (than chaining) - performance degrades very badly as $\alpha \rightarrow 1$

Chaining vs. Open Addressing:

  • Chaining: when $m == n$, we can still add new items to the hash table, and we can still search efficiently.

  • Open Addressing: when $m == n$, the table is full and we cannot add any more items, and we cannot search efficiently.

Table Resizing

Throughout this discussion of table resizing, assume that we are using hashing with chaining and our hash function obeys simple uniform hashing (although this is a false assumption - it can never be true in practice)

We know that the expected search time is $O(1 + n/m)$.

So, the optimal size of the table is $m = \Theta(n)$.

The problem is we don’t know $n$ in advance. If $m < 2n$, then there’s a possibility of too many collisions, and if $m > 10n$, there’s too much wasted space. (Of course, the numbers 2 and 10 here are arbitrary, but they provide some intuition as to why we need to grow and shrink our table as new elements are added and old elements are deleted.)

Idea: Start with a small constant table size. Grow and shrink the table as necessary.

  1. Choose a new table size $m'$

  2. Choose a new hash function $h'$ (Why do we need a new hash function? Because the hash function depends on the table size! $h: U \rightarrow \{1,\dots,m\}$. Java hides this by doing the hashing itself.)

  3. For each item in the old hash table, compute the new hash function and copy the item to its new bucket.

How much time would this take? Well, it takes $O(m)$ to access the $m$ old buckets. In each of the $m$ buckets, we need to access the elements themselves; there are $n$ elements, so that takes $O(n)$ time. ALSO, allocating memory takes time proportional to the memory size being allocated (that is, initialising a large array takes some time), so allocating a table of size $m'$ takes $O(m')$.

Hence, the total time to grow/shrink the table is $O(m + m' + n)$.

But... How much should we grow by? When should we grow?

Idea 1: Increment table size by 1

This is a ridiculously bad idea because each time you add a new element, you need to create an entirely new table. This takes too much time (in particular, the total time for inserting $n$ items would be $O(n^2)$). This is because you are thinking short-term: each time you insert after growing the table, the table becomes full again, so when you insert again, you need to resize.

Note that this is also true for incrementing the table size by any constant amount (i.e., $m' = m + c$ will also take $O(n^2)$ for large values of $n$, no matter how large $c$ is). Unfortunately, even seasoned software engineers write code that grows the table size by a constant amount like 1000, forgetting that it is still a bad design decision which leads to $O(n^2)$ complexity.

Idea 2: Double table size

If $n == m$: $m' = 2m$ (when the table is full, create a new table whose size is double, so there is sufficient space for further insertions).

Then, you perform expansions only $O(\log n)$ times while inserting $n$ items. So, the total cost of inserting $n$ items is $O(n)$.

For example, let your initial table size be 8 (it’s always good practice to keep your table size a power of 2). Then the table doubles through sizes $8, 16, 32, \dots$ up to about $n$, so the total copying work over all resizes is $8 + 16 + 32 + \dots + n < 2n$.

Cost of resizing: $O(n)$

Cost of inserting $n$ items + resizing: $O(n)$

Most insertions take $O(1)$ time (since no resize is required). Some insertions are expensive (linear cost).

The average cost per operation is $O(1)$. In fact, the amortized cost of insertion is $O(1)$ - think of depositing 3 dollars each time you insert an element, with each insert operation itself costing 1 dollar. You use the additional 2 dollars to grow the table (assuming growing to a size $m$ requires $m$ dollars, so each element needs to contribute 2 dollars to the expansion process). Observe that you grow a table of size $m$ only after filling it completely, and at least $m/2$ new items were added since the previous resize, so the deposits cover the cost of the growth. Similarly for deletion: deposit 3 dollars for each delete operation - 1 dollar to fund the delete operation itself and 2 dollars for future shrink operations. So, each operation requires a constant amount of money and hence is amortized $O(1)$.
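A sketch of the doubling rule in code, on top of the chaining scheme from earlier (names are illustrative; only the resize logic is the point here):

import java.util.LinkedList;

// Illustrative growth-by-doubling on top of chaining. Only insert and resize are shown.
class GrowableHashTable<K> {
	private LinkedList<K>[] buckets;   // values omitted to keep the sketch short
	private int n = 0;                 // number of items currently stored

	@SuppressWarnings("unchecked")
	GrowableHashTable() {
		buckets = new LinkedList[8];   // small constant initial size
		for (int i = 0; i < buckets.length; i++) buckets[i] = new LinkedList<>();
	}

	void insert(K key) {
		if (n == buckets.length) resize(2 * buckets.length);  // double when full: amortized O(1)
		buckets[Math.floorMod(key.hashCode(), buckets.length)].addFirst(key);
		n++;
	}

	@SuppressWarnings("unchecked")
	private void resize(int newSize) {
		LinkedList<K>[] old = buckets;
		buckets = new LinkedList[newSize];                    // O(m') to allocate the new table
		for (int i = 0; i < newSize; i++) buckets[i] = new LinkedList<>();
		for (LinkedList<K> chain : old)                       // O(m + n) to rehash every item
			for (K key : chain)
				buckets[Math.floorMod(key.hashCode(), newSize)].addFirst(key);
	}
}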

Idea 3: Squaring table size

If doubling the table size is good, squaring the table size must be better, right? Well, no, not really. In fact, inserting $n$ items then takes $O(n^2)$ time. This is because you’re allocating far too much memory at each resize. For example, if your table size is 64 and you insert the 65th element, the table size grows to $64 \times 64 = 4096$!

Each resize takes $O(n^2)$ time. So, the average cost of each operation is $O(n)$.

Shrinking Table

When the table becomes too big, we need to shrink the table. Again, the question is: When do we shrink?

Attempt 1

If $n == m$: $m' = 2m$ (grow), and if $n < m/2$: $m' = m/2$ (shrink).

This is actually a really bad idea! Consider the following problem: right after growing from $m$ to $2m$ (so $n = m = m'/2$), a single deletion makes $n < m'/2$ and triggers a shrink back to $m$; the table is then nearly full again, so a single insertion triggers another grow.

You might have to resize at every insertion and deletion.

Attempt 2

If $n == m$: $m' = 2m$ (grow when the table is full).

If $n < m/4$: $m' = m/2$ (shrink when the table is less than a quarter full, so that after you halve the size, there is still sufficient space for insertions).
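These two rules can be sketched as thin wrappers around insert and delete (the helper rehashInto(newSize) is assumed to be the "copy every item into a new table" routine described earlier; all names are illustrative):

// Illustrative resize policy (Attempt 2): grow when full, shrink when under a quarter full.
// n = number of items currently stored, m = current table size;
// insert(key, value) and delete(key) are assumed to maintain n themselves.
void insertWithResize(K key, V value) {
	if (n == m) rehashInto(2 * m);    // grow: at least m/2 items were inserted since the last resize
	insert(key, value);
}

void deleteWithResize(K key) {
	delete(key);
	if (n < m / 4) rehashInto(m / 2); // shrink: at least m/4 items were deleted since the last resize
}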

Claim (very important invariants that you should be able to observe for any kind of growth/shrink rules given):

  • Every time you double a table of size $m$, at least $m/2$ new items were added (after your previous grow/shrink operation).

  • Every time you shrink a table of size $m$, at least $m/4$ items were deleted (after your previous grow/shrink operation).

These two claims can be easily proven by observation.

By analysing the amortized time complexity, we find that insertion and deletion both take $O(1)$ amortized.

Note that search takes expected time $O(1)$ (not amortized!).

  • Insertions and deletions are amortized because of table resizing (i.e., an insertion/deletion algorithm can trigger a table resize once in a while - which is expensive)

  • Inserts are not randomized (because we are not searching for duplicates)

  • Searches are expected (because of randomization of length of chains in each bucket) and not amortized (since no table resizing on a search)

Suppose we follow these rules for an implementation of an open-addressing hash table, where $n$ is the number of items in the hash table and $m$ is the size of the hash table: (a) if $n = m$, then the table is quadrupled (resize $m$ to $4m$); (b) if $n < m/4$, then the table is shrunk (resize $m$ to $m/2$). What is the minimum number of insertions between 2 resize events? What about deletions?

Solution: Insertions: $m/2 + 1$; Deletions: $1$. Suppose that a newly inserted entry causes $n = m$. Then the table is resized to $m' = 4m$. If we now delete one entry, the number of entries becomes $n - 1 = m - 1 = m'/4 - 1 < m'/4$, so the table resizes again. Hence, the minimum number of deletions before a resize is 1.

At this point, the new table size is $m'' = m'/2 = 2m$. The number of insertions before it expands is $m'' - (n - 1) = m'' - (m''/2 - 1) = m''/2 + 1$. Note that the answer is calculated relative to the current table size.

Designing good hash functions

Simple Uniform Hashing does not exist! It is a false assumption made merely to simplify the analysis and understanding. Even UHA is a false assumption.

This is because keys are not random. There’s lots of regularity between keys and often there are mysterious patterns too! Patterns in keys can induce patterns in hash functions unless you’re very careful.

Consider the following example:

What if we want to hash strings, and we keep one bucket for each of the 26 letters from a through z? Then, we define our hash function to be $h(\text{string}) = $ first letter of the string. For example, $h(\text{“hippopotamus”}) = h$.

This is a bad hash function! Why? Because many fewer words start with the letter x than with the letter s. So, the chain in bucket s will be huge, while the chain in bucket x will be very small.

Okay, what if we try another function: one bucket for each number from 1 to $26 \times 28$ (since the longest word in the English language has 28 letters), and we define the hash function to be $h(\text{string}) = $ sum of the letters. E.g. $h(\text{“hat”}) = 8 + 1 + 20 = 29$.

Umm, this is also a bad hash function. Lots of words collide, and you won’t get a uniform distribution because most words are short, so the sum will likely be small.

But pretty good hash functions exist (e.g. SHA-256 is a cryptographically secure hash function).

The moral of the story is simple: don’t ever design your own hash function unless you absolutely need to.

Designing Hash Functions

Goal: Find a hash function whose values look random (it is impossible to get $100\%$ randomness since we are using a deterministic algorithm).

This is similar to pseudorandom generators - there’s no real randomness. Instead, the sequence of numbers generated just looks random.

For every hash function, there is some set of keys which leads to bad outputs!

But, if you know the keys in advance, you can choose a hash function that is always good. However, if you change the keys, then it might be bad again.

There are 2 common techniques used to design hash functions:

  1. Division Method

  2. Multiplication Method

Division Method

$h(k) = k \bmod m$

Example: If $m = 7$, then $h(3) = 3$, $h(10) = 3$, $h(8) = 1$.

Two keys, $k_1$ and $k_2$, collide when $k_1 \bmod m = k_2 \bmod m$.

Collision is unlikely if keys are random.

(Bad) Idea: choose $m = 2^x$ because it is very fast to calculate $k \bmod m$ via bit shifts.

Recall that $001001 >> 2 = 0010$ (just remove the last 2 bits).

Then, it can be shown that $k \bmod 2^x = k - ((k >> x) << x)$.
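A quick sketch checking this identity (the equivalent bitmask form k & (m - 1) is a standard trick for powers of two; the specific numbers are illustrative):

// Illustrative check: for m = 2^x, k mod m can be computed with shifts (or a mask).
class ModPowerOfTwoDemo {
	public static void main(String[] args) {
		int x = 4, m = 1 << x;                 // m = 16
		int k = 1_000_003;                     // any non-negative key
		int viaShifts = k - ((k >> x) << x);   // k mod 2^x using the identity above
		int viaMask   = k & (m - 1);           // equivalent bitmask form
		System.out.println(k % m == viaShifts && viaShifts == viaMask);  // prints true
	}
}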

What is the problem with this? Input keys are often regular. Assume that the input keys are all even. Then $h(k) = k \bmod m$ is always even.

Note: The remainder when $y$ is divided by $x$ (i.e. $y \% x$) is always a multiple of every common divisor (in fact, of the $\gcd$) of $y$ and $x$. (Easy to prove.)

So, if $d$ is a divisor of $m$ and of every key $k$, then what fraction of the table is used? $1/d$.

Because all the remainders generated are multiples of $d$, there are only $m/d$ usable buckets in the table. Hence, out of $m$ buckets, only $m/d$ buckets would be used, i.e., $1/d$ of the table is used (you use only 1 slot out of every $d$ slots).

So, we choose $m$ to be a prime number (to minimize the chance of a common divisor between $m$ and the keys). It should also not be too close to a power of 2 or a power of 10 (since those are common patterns in input keys).

Overall, the division method is popular (and easy) but not always the most effective. This is also because division is often slow.

Questions on Hashing

Implementing sets using Hash Tables

Consider the following implementations of sets. How would intersect and union be implemented for each of them? (a) Hash table with open addressing (b) hash table with chaining

(a) Solution: For intersect, we iterate through the bins of one set $a$ and check whether each element is also present in the other set $b$. Elements that are present in both are placed in the result set $r$, which can be initialised to an appropriate capacity since we know the sizes of $a$ and $b$. Under the uniform hashing assumption, the expected complexity is $O\left(m_a + n_a\left(\dfrac{1}{1-\alpha_b} + \dfrac{1}{1-\alpha_r} \right)\right)$, where $m_a$ denotes the size of hash table $a$ and $n_a$ the number of entries in hash table $a$. For union, we can iterate through the elements in set $b$ and insert them into set $a$. By a similar analysis as for intersect, this is in $O\left(m_b + n_b\left(\dfrac{1}{1-\alpha_a}\right)\right)$, where $m_b$ and $n_b$ are defined similarly.

(b) We can use the same strategies for intersect and union as in (a). The runtimes of these solutions are $O(m_a + \sum_{k\in a} \text{len}(h_b(k)))$ and $O(m_b + \sum_{k\in b} \text{len}(h_a(k)))$ respectively, where $\text{len}(h_t(x))$ denotes the number of entries in bucket $h_t(x)$ of table $t$. In practice, this solution would be good enough, as the length of the chains should be $O(1)$ (under SUHA and an appropriate load factor). But to mitigate bad hash functions, Java implements an interesting strategy where buckets exceeding a threshold size are turned into a tree. While this increases insertion time to $O(\log\ \text{size}(h(x)))$, it improves searches within a bucket to the same complexity. Using an ordered structure like a tree also means that, in the very specific scenario where the two sets’ capacities and hash functions are the same, unions and intersects can be done in exactly $O(\text{size}(a) + \text{size}(b) + m)$.
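A hedged sketch of the iterate-and-probe strategy from (b), using java.util.HashSet for brevity (not the course's own table types):

import java.util.HashSet;
import java.util.Set;

// Illustrative intersect/union over hash-based sets: iterate over one set and probe the other.
class HashSetOps {
	static <T> Set<T> intersect(Set<T> a, Set<T> b) {
		Set<T> r = new HashSet<>();
		for (T x : a)                 // iterate over a, probe b
			if (b.contains(x)) r.add(x);
		return r;
	}

	static <T> Set<T> union(Set<T> a, Set<T> b) {
		Set<T> r = new HashSet<>(a);  // copy a, then insert everything from b
		r.addAll(b);
		return r;
	}
}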

Table Resizing Question

Answer: $O(1)$.

