Disjoint Set Union
Also known as Union Find Disjoint Set
Problem: Given some elements, a user can perform two types of operations: Union(u, v) (in which case the sets containing u and v are merged) and Find(u, v) (in which case we need to determine whether u and v are in the same set). Our aim is to make both operations as fast as possible.
This is a problem on dynamic connectivity. Union means that the two objects are connected and Find asks us to check whether there is a path connecting the two objects.
Model each object as a node, with each edge representing connectedness. Union will take O(1) since you just need to connect the two nodes. Find will take O(V + E) since you need to search for a path between the nodes (e.g., with a BFS/DFS). In this case, a connected component represents a maximal set of mutually connected objects (obviously the edges are undirected).
But can we make Find faster?
Since the number of objects is fixed, we can use an array to store the componentId to which the object belongs. To convert an object to an integer (to be used in an array index) we can use a HashMap with open addressing (since we absolutely don’t want collisions - otherwise their indices will get messed up).
So array[i] stores the component identifier of object i. We can let the component identifier be the lowest object id in the set.
Then, two objects are connected if they have the same component identifier. Hence, Find becomes O(1). But what about Union?
Now, to Union two components, we need to scan the entire array and change the componentId of every object in one set to match the other. This takes O(n).
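This quick-find approach can be sketched in Python (a minimal sketch; the class name is my own, and objects are assumed to be integers 0..n-1):

```python
class QuickFind:
    """Quick-find: component_id[i] stores the component identifier of object i."""

    def __init__(self, n):
        self.component_id = list(range(n))  # each object starts in its own component

    def find(self, u, v):
        # O(1): connected iff they share a component identifier
        return self.component_id[u] == self.component_id[v]

    def union(self, u, v):
        # O(n): relabel every object in v's component to u's component
        cu, cv = self.component_id[u], self.component_id[v]
        if cu == cv:
            return
        for i in range(len(self.component_id)):
            if self.component_id[i] == cv:
                self.component_id[i] = cu
```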
If you think of a connected component as being in the shape of a tree, each node in the tree stores the root of the tree to uniquely identify the tree. So two nodes are in the same set (tree) if they have the same root.
We use the same idea as before, but instead of storing the root of the tree, we only store the parent of each object. As before, two objects are connected if they are part of the same tree.
Notice that this tree need not be a binary tree (a node can have many children).
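A minimal quick-union sketch along these lines (names are my own; a root is a node that is its own parent):

```python
class QuickUnion:
    """Quick-union: parent[i] stores the parent of node i; a root is its own parent."""

    def __init__(self, n):
        self.parent = list(range(n))

    def root(self, u):
        while self.parent[u] != u:  # walk up to the root; O(height)
            u = self.parent[u]
        return u

    def find(self, u, v):
        # two nodes are in the same set iff they have the same root
        return self.root(u) == self.root(v)

    def union(self, u, v):
        ru, rv = self.root(u), self.root(v)
        if ru != rv:
            self.parent[ru] = rv  # attach one root under the other
```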
In deciding which root becomes the new root (during Union), we can ensure that our trees stay more balanced by making the smaller tree a child of the larger tree's root. We decide which tree becomes the "parent" using their weights (sizes).
Well, it is easy to see that the height of our tree only increases when another tree of equal or greater weight is added to it. If we add a smaller tree as a child of our tree, the height of our tree does not increase. In other words, the height only increases when the total size at least doubles.
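Weighted union can be sketched as follows (a hypothetical sketch; the size is only meaningful at roots, so it only needs updating there):

```python
class WeightedQuickUnion:
    """Quick-union, but always hang the smaller tree under the larger root,
    keeping the height O(log n)."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n  # size[r] is valid only when r is a root

    def root(self, u):
        while self.parent[u] != u:
            u = self.parent[u]
        return u

    def find(self, u, v):
        return self.root(u) == self.root(v)

    def union(self, u, v):
        ru, rv = self.root(u), self.root(v)
        if ru == rv:
            return
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru          # ru is now the larger root
        self.parent[rv] = ru         # smaller tree becomes a child
        self.size[ru] += self.size[rv]
```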
After finding the root during some find or union operation, we can set the parent of each traversed node to the root (while we are traversing the nodes on the leaf-to-root path, we can store these nodes in an array and, once we find the root, assign the root to be the parent of all of them). Essentially, we are trying to make the height of the tree as small as possible by "flattening" or "compressing" the path between the node and the root.
It turns out that for path compression to work well, we don’t even need to assign the root to be the parent of every traversed node. Even if we make every node in the path point to its grandparent, it works just as well!
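Both flavours of compression can be sketched as standalone helpers operating on a parent list (function names are my own):

```python
def root_full_compression(parent, u):
    """Find the root of u, then point every traversed node directly at it."""
    path = []
    while parent[u] != u:
        path.append(u)   # remember the leaf-to-root path
        u = parent[u]
    for node in path:
        parent[node] = u  # flatten: everyone now points at the root
    return u


def root_path_halving(parent, u):
    """Cheaper one-pass variant: point each visited node at its grandparent."""
    while parent[u] != u:
        parent[u] = parent[parent[u]]
        u = parent[u]
    return u
```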
Tarjan (1975) gives an upper bound of O(m * α(m, n)) for the running time of m operations of Union-Find with weighted union and path compression on n objects.
Can we do better than this? Nope! Tarjan also proved that it is impossible to achieve linear time (although we did get really close to linear time!)
What is the worst-case running time of the find operation in Union-Find with path compression (but no weighted union)?
Here's another algorithm for Union-Find based on a linked list. Each set is represented by a linked list of objects, and each object is labelled (e.g., in a hash table) with a set identifier that identifies which set it is in. Also, keep track of the size of each set (e.g., using a hash table). Whenever two sets are merged, relabel the objects in the smaller set and merge the linked lists. What is the running time for performing m Union and Find operations, if there are initially n objects each in their own set? More precisely, there is: (i) an array id where id[j] is the set identifier for object j; (ii) an array size where size[k] is the size of the set with identifier k; (iii) an array list where list[k] is a linked list containing all the objects in set k.
Assume for the purpose of this problem that you can append one linked list on to another in O(1) time.
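This linked-list scheme can be sketched as follows (hypothetical names; Python lists stand in for the linked lists, and appending one list to another is assumed O(1) as in the problem statement):

```python
class LinkedListUF:
    """Linked-list Union-Find: relabel the smaller set on each merge."""

    def __init__(self, n):
        self.id = list(range(n))                 # id[j]: set identifier of object j
        self.size = {k: 1 for k in range(n)}     # size[k]: size of set k
        self.lists = {k: [k] for k in range(n)}  # list[k]: objects in set k

    def find(self, u, v):
        return self.id[u] == self.id[v]          # O(1)

    def union(self, u, v):
        k1, k2 = self.id[u], self.id[v]
        if k1 == k2:
            return
        if self.size[k1] < self.size[k2]:
            k1, k2 = k2, k1                      # k2 is the smaller set
        for obj in self.lists[k2]:               # relabel the smaller set
            self.id[obj] = k1
        self.lists[k1] += self.lists[k2]         # append (O(1) with real linked lists)
        self.size[k1] += self.size[k2]
        del self.lists[k2], self.size[k2]
```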
The table below shows the running time of each operation of find and union under the approaches discussed:

| Approach | Find | Union |
| --- | --- | --- |
| Quick-Find (componentId array) | O(1) | O(n) |
| Quick-Union (parent array) | O(n) | O(n) |
| Weighted Union | O(log n) | O(log n) |
| Weighted Union + Path Compression | O(α(m, n)) amortized | O(α(m, n)) amortized |
Some applications of Union-Find:
- In a maze, we can find whether 2 locations are connected using the find operation
- In a game, we can check whether we can get from one state to another
- We can find the least common ancestor in a tree
- In a graph, we can find connected components (e.g. largest connected component in a graph with common factor)
The problem with this implementation is that while the intuition is correct, there is no guarantee that our tree has height O(log n). Both Find and Union rely on travelling up the entire height of the tree to find the root, and this may take up to O(n). (It shouldn't be called QuickUnion since it can't even perform union quickly lol)
We need to make some optimisations to make sure our tree has height O(log n) and is as flat as possible.
Okay, this may give us a slight improvement over the previous method, but how do we know that our heights are now O(log n)?
Claim: A tree of height h has size at least 2^h. Equivalently, the height of a tree of size n is at most log2(n).
Proof by induction: How do you get a tree of height h + 1? You make a tree of height h the child of another tree. By the induction hypothesis, this tree of height h has size at least 2^h. Moreover, since you are making it a child of the other tree rather than the other way around, we know that the size of the other tree is at least 2^h as well (by our union-by-weight rule). So, the total size of our final resultant tree of height h + 1 is at least 2^h + 2^h = 2^(h+1). Hence, proved.
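The induction can be written out compactly (LaTeX sketch):

```latex
% Induction on the height h of a tree built by weighted union.
% Base case: a single node has height 0 and size 1 = 2^0.
% Inductive step: height becomes h+1 only when a tree T_1 of height h
% is attached under the root of a tree T_2 with size(T_2) >= size(T_1).
\begin{align*}
  \mathrm{size}(T_1) &\ge 2^{h} && \text{(induction hypothesis)} \\
  \mathrm{size}(T_2) &\ge \mathrm{size}(T_1) \ge 2^{h} && \text{(union-by-weight rule)} \\
  \mathrm{size}(T)   &= \mathrm{size}(T_1) + \mathrm{size}(T_2) \ge 2^{h+1},
\end{align*}
% so a tree of size $n$ has height at most $\log_2 n$.
```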
Now, since both our find and union operations just traverse the tree once, they run in O(log n) time.
Note that we could also do union-by-rank (instead of union-by-weight), in which case rank(root) <= log2(n). We could have also done union-by-height. In fact, the important property is that the weight/rank/size/height of a subtree does not change except at the root (so we only need to update the root when we union) and, moreover, the rank/height only increases when the tree size doubles.
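The union-by-rank variant can be sketched like this (hypothetical helper functions; rank is an upper bound on height and is updated only at the root, and only when two equal-rank trees meet):

```python
def root(parent, u):
    """Walk up to the root of u (root is its own parent)."""
    while parent[u] != u:
        u = parent[u]
    return u


def union_by_rank(parent, rank, u, v):
    """Attach the lower-rank root under the higher-rank root."""
    ru, rv = root(parent, u), root(parent, v)
    if ru == rv:
        return
    if rank[ru] < rank[rv]:
        ru, rv = rv, ru      # ru is now the higher-rank root
    parent[rv] = ru
    if rank[ru] == rank[rv]:
        rank[ru] += 1        # rank grows only when equal-rank trees merge
```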
We have already achieved O(log n) for find and union. But can we do any better? It turns out we can!
Theorem (Tarjan, 1975): Starting from empty, any sequence of m union/find operations on n objects takes O(m * α(m, n)) time.
Here, α is the inverse Ackermann function (in our universe, its value is always less than 5). BUT the inverse Ackermann is not a constant, i.e., α(m, n) is not O(1).
Remember that O(α(m, n)) is the AMORTIZED cost of an operation in UFDS with path compression and weighted union - NOT the worst case of a single operation.
Ans: O(n) (in the worst case, the tree is completely unbalanced - a straight line). Path compression won't help if you keep calling union on the root of the tree and adding the entire tree as a subtree of that one node.
What is the worst-case running time of a Find operation on a UFDS that is using both weighted union and path compression? Assume there are n elements, initially each in its own set.
Ans: O(log n). Think of what would happen if you kept calling Union on the roots of distinct trees - there would be no path compression performed! Path compression ensures that once you have travelled a particular root-to-leaf path entirely, you don't need to travel it again. But you at least need to travel it once in order to do the path compression in the first place. More generally, if a Find or Union operation has not been called on u or v or any of their ancestors (except the root), then there is no effective "path compression". So, the worst case would still be O(log n), but notice that after you call Find(u, v), all subsequent calls of Find on any of u's or v's ancestors would take O(1) since they are directly connected to the root (assuming you don't perform more Union operations in between).
In short, when you use weighted union, you guarantee that the height of the tree is always O(log n) (since you use it for every call to Union) and so, Union and Find cannot take more than that. But path compression has its limitations and can be side-stepped by an adversarial input that ensures no effective path compression is taking place at all.
Find operations obviously cost O(1). For union operations, the cost is O(size of the smaller set). The only expensive part is relabelling the objects in list[k2]. And notice that, just like in weighted union, each time we union two sets, the size of the smaller set at least doubles. So each object can be relabelled at most log2(n) times (as we can double the size of a set at most log2(n) times). Note that since there are m union operations, the biggest set after those operations has size O(m), and as each object in that set was updated at most O(log m) times, the total cost is O(m log m). Of course, notice that any one operation can be expensive (different union operations have different costs). For example, the last union operation might be combining two sets of size m/2 and hence have cost O(m), while the first union operation would have a cost of O(1). The appending of one linked list to the end of another is easily done in O(1) through manipulation of the head and tail pointers.
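The relabelling bound can be checked empirically (a hypothetical experiment, merging n singleton sets pairwise in rounds - the worst case for relabelling; each object is relabelled at most log2(n) times):

```python
def count_relabels(n):
    """Merge n singleton sets pairwise in rounds, always relabelling
    the smaller set, and count the total number of relabels."""
    id_ = list(range(n))
    size = {k: 1 for k in range(n)}
    members = {k: [k] for k in range(n)}
    relabels = 0
    sets = list(range(n))
    while len(sets) > 1:
        next_round = []
        for i in range(0, len(sets) - 1, 2):
            k1, k2 = sets[i], sets[i + 1]
            if size[k1] < size[k2]:
                k1, k2 = k2, k1          # k2 is the smaller set
            for obj in members[k2]:
                id_[obj] = k1            # one relabel per moved object
                relabels += 1
            members[k1] += members[k2]
            size[k1] += size[k2]
            next_round.append(k1)
        if len(sets) % 2:
            next_round.append(sets[-1])  # odd set out waits for next round
        sets = next_round
    return relabels
```

For n = 8 this performs 12 relabels ((n/2) * log2(n)), comfortably within the n * log2(n) = 24 bound.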
Note: actually, with path compression alone, the first time you run find or union on a node, it is possible that it takes O(n), since you are not ensuring a balanced tree.