Disjoint Set Union
Also known as Union Find Disjoint Set
Problem: Given some elements, a user can perform two types of operations: Union(u, v) (in which case the sets containing u and v are merged) and Find(u, v) (in which case we need to determine whether u and v are in the same set). Our aim is to make both operations as fast as possible.
This is a problem on dynamic connectivity. Union means that the two objects are connected and Find asks us to check whether there is a path connecting the two objects.
Model each object as a node, with each edge representing connectedness. Union will take O(1) since you just need to connect the two nodes. Find will take O(V + E) since you need to search for a path between the nodes (e.g., with a BFS/DFS). In this case, a connected component represents a maximal set of mutually connected objects (obviously the edges are undirected).
But can we make Find faster?
Since the number of objects is fixed, we can use an array to store the componentId to which the object belongs. To convert an object to an integer (to be used in an array index) we can use a HashMap with open addressing (since we absolutely don’t want collisions - otherwise their indices will get messed up).
So array[i] stores the component identifier of object i. We can let the component identifier be the lowest object id in the set.
Then, two objects are connected if they have the same component identifier. Hence, Find becomes O(1). But what about Union?
Now, to Union two components, we need to scan the entire array and change the componentId of every object in one set to match the other. This takes O(n).
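This quick-find approach can be sketched in Python (a minimal sketch; the class name is my own, and objects are assumed to be integers 0..n-1):

```python
class QuickFind:
    """Quick-find: component_id[i] stores the component identifier of object i."""

    def __init__(self, n):
        self.component_id = list(range(n))  # each object starts in its own component

    def find(self, u, v):
        # O(1): connected iff they share a component identifier
        return self.component_id[u] == self.component_id[v]

    def union(self, u, v):
        # O(n): relabel every object in v's component to u's component
        cu, cv = self.component_id[u], self.component_id[v]
        if cu == cv:
            return
        for i in range(len(self.component_id)):
            if self.component_id[i] == cv:
                self.component_id[i] = cu
```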
If you think of a connected component as being in the shape of a tree, each node in the tree stores the root of the tree to uniquely identify the tree. So two nodes are in the same set (tree) if they have the same root.
We use the same idea as before, but instead of storing the root of the tree, we only store the parent of each object. As before, two objects are connected if they are part of the same tree.
Notice that this tree need not be a binary tree (a node can have many children).
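A minimal quick-union sketch along these lines (names are my own; a root is a node that is its own parent):

```python
class QuickUnion:
    """Quick-union: parent[i] stores the parent of node i; a root is its own parent."""

    def __init__(self, n):
        self.parent = list(range(n))

    def root(self, u):
        while self.parent[u] != u:  # walk up to the root; O(height)
            u = self.parent[u]
        return u

    def find(self, u, v):
        # two nodes are in the same set iff they have the same root
        return self.root(u) == self.root(v)

    def union(self, u, v):
        ru, rv = self.root(u), self.root(v)
        if ru != rv:
            self.parent[ru] = rv  # attach one root under the other
```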
In deciding which root becomes the new root (during Union), we can ensure that our trees stay more balanced by making the smaller tree a child of the larger tree's root. We decide which tree becomes the "parent" using their weights (sizes).
Well, it is easy to see that the height of our tree only increases when another tree of equal or greater weight is added to it. If we add a smaller tree as a child of our tree, the height of our tree does not increase. In other words, the height only increases when the total size at least doubles.
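Weighted union can be sketched as follows (a hypothetical sketch; the size is only meaningful at roots, so it only needs updating there):

```python
class WeightedQuickUnion:
    """Quick-union, but always hang the smaller tree under the larger root,
    keeping the height O(log n)."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n  # size[r] is valid only when r is a root

    def root(self, u):
        while self.parent[u] != u:
            u = self.parent[u]
        return u

    def find(self, u, v):
        return self.root(u) == self.root(v)

    def union(self, u, v):
        ru, rv = self.root(u), self.root(v)
        if ru == rv:
            return
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru          # ru is now the larger root
        self.parent[rv] = ru         # smaller tree becomes a child
        self.size[ru] += self.size[rv]
```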
After finding the root during some find or union operation, we can set the parent of each traversed node to the root (while we are traversing the nodes on the leaf-to-root path, we can store these nodes in an array and, once we find the root, assign the root to be the parent of all of them). Essentially, we are trying to make the height of the tree as small as possible by "flattening" or "compressing" the path between the node and the root.
It turns out that for path compression to work well, we don’t even need to assign the root to be the parent of every traversed node. Even if we make every node in the path point to its grandparent, it works just as well!
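Both flavours of compression can be sketched as standalone helpers operating on a parent list (function names are my own):

```python
def root_full_compression(parent, u):
    """Find the root of u, then point every traversed node directly at it."""
    path = []
    while parent[u] != u:
        path.append(u)   # remember the leaf-to-root path
        u = parent[u]
    for node in path:
        parent[node] = u  # flatten: everyone now points at the root
    return u


def root_path_halving(parent, u):
    """Cheaper one-pass variant: point each visited node at its grandparent."""
    while parent[u] != u:
        parent[u] = parent[parent[u]]
        u = parent[u]
    return u
```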
Tarjan (1975) gives an upper bound of O(m * α(m, n)) for the running time of m operations of Union-Find with weighted union and path compression on n objects.
Can we do better than this? Nope! Tarjan also proved that it is impossible to achieve linear time (although we did get really close to linear time!)
What is the worst-case running time of the find operation in Union-Find with path compression (but no weighted union)?
Here's another algorithm for Union-Find based on a linked list. Each set is represented by a linked list of objects, and each object is labelled (e.g., in a hash table) with a set identifier that identifies which set it is in. Also, keep track of the size of each set (e.g., using a hash table). Whenever two sets are merged, relabel the objects in the smaller set and merge the linked lists. What is the running time for performing m Union and Find operations, if there are initially n objects each in their own set? More precisely, there is: (i) an array id where id[j] is the set identifier for object j; (ii) an array size where size[k] is the size of the set with identifier k; (iii) an array list where list[k] is a linked list containing all the objects in set k.
Assume for the purpose of this problem that you can append one linked list on to another in O(1) time.
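This linked-list scheme can be sketched as follows (hypothetical names; Python lists stand in for the linked lists, and appending one list to another is assumed O(1) as in the problem statement):

```python
class LinkedListUF:
    """Linked-list Union-Find: relabel the smaller set on each merge."""

    def __init__(self, n):
        self.id = list(range(n))                 # id[j]: set identifier of object j
        self.size = {k: 1 for k in range(n)}     # size[k]: size of set k
        self.lists = {k: [k] for k in range(n)}  # list[k]: objects in set k

    def find(self, u, v):
        return self.id[u] == self.id[v]          # O(1)

    def union(self, u, v):
        k1, k2 = self.id[u], self.id[v]
        if k1 == k2:
            return
        if self.size[k1] < self.size[k2]:
            k1, k2 = k2, k1                      # k2 is the smaller set
        for obj in self.lists[k2]:               # relabel the smaller set
            self.id[obj] = k1
        self.lists[k1] += self.lists[k2]         # append (O(1) with real linked lists)
        self.size[k1] += self.size[k2]
        del self.lists[k2], self.size[k2]
```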
The table below shows the running time of each operation of find and union under the approaches discussed:

| Approach | Find | Union |
| --- | --- | --- |
| Quick-Find (componentId array) | O(1) | O(n) |
| Quick-Union (parent array) | O(n) | O(n) |
| Weighted Union | O(log n) | O(log n) |
| Weighted Union + Path Compression | O(α(m, n)) amortized | O(α(m, n)) amortized |
Some applications of Union-Find:
- In a maze, we can find whether 2 locations are connected using the find operation
- In a game, we can check whether we can get from one state to another
- We can find the least common ancestor in a tree
- In a graph, we can find connected components (e.g. largest connected component in a graph with common factor)
The problem with this implementation is that while the intuition is correct, there is no guarantee that our tree has height O(log n). Both Find and Union rely on travelling up the entire height of the tree to find the root, and this may take up to O(n). (It shouldn't be called QuickUnion since it can't even perform union quickly lol)
We need to make some optimisations to make sure our tree has height O(log n) and is as flat as possible.
Okay, this may give us a slight improvement over the previous method, but how do we know that our heights are now O(log n)?
Claim: A tree of height h has size at least 2^h. Equivalently, the height of a tree of size n is at most log2(n).
Proof by induction: How do you get a tree of height h + 1? You make a tree of height h the child of another tree. By the induction hypothesis, this tree of height h has size at least 2^h. Moreover, since you are making it a child of the other tree rather than the other way around, we know that the size of the other tree is at least 2^h as well (by our union-by-weight rule). So, the total size of our final resultant tree of height h + 1 is at least 2^h + 2^h = 2^(h+1). Hence, proved.
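The induction can be written out compactly (LaTeX sketch):

```latex
% Induction on the height h of a tree built by weighted union.
% Base case: a single node has height 0 and size 1 = 2^0.
% Inductive step: height becomes h+1 only when a tree T_1 of height h
% is attached under the root of a tree T_2 with size(T_2) >= size(T_1).
\begin{align*}
  \mathrm{size}(T_1) &\ge 2^{h} && \text{(induction hypothesis)} \\
  \mathrm{size}(T_2) &\ge \mathrm{size}(T_1) \ge 2^{h} && \text{(union-by-weight rule)} \\
  \mathrm{size}(T)   &= \mathrm{size}(T_1) + \mathrm{size}(T_2) \ge 2^{h+1},
\end{align*}
% so a tree of size $n$ has height at most $\log_2 n$.
```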
Now, since both our find and union operations just traverse the tree once, they run in O(log n) time.
Note that we could also do union-by-rank (instead of union-by-weight), in which case rank(root) <= log2(n). We could have also done union-by-height. In fact, the important property is that the weight/rank/size/height of a subtree does not change except at the root (so we only need to update the root when we union) and, moreover, the rank/height only increases when the tree size doubles.
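The union-by-rank variant can be sketched like this (hypothetical helper functions; rank is an upper bound on height and is updated only at the root, and only when two equal-rank trees meet):

```python
def root(parent, u):
    """Walk up to the root of u (root is its own parent)."""
    while parent[u] != u:
        u = parent[u]
    return u


def union_by_rank(parent, rank, u, v):
    """Attach the lower-rank root under the higher-rank root."""
    ru, rv = root(parent, u), root(parent, v)
    if ru == rv:
        return
    if rank[ru] < rank[rv]:
        ru, rv = rv, ru      # ru is now the higher-rank root
    parent[rv] = ru
    if rank[ru] == rank[rv]:
        rank[ru] += 1        # rank grows only when equal-rank trees merge
```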
We have already achieved O(log n) for find and union. But can we do any better? It turns out we can!
Theorem (Tarjan, 1975): Starting from empty, any sequence of m union/find operations on n objects takes O(m * α(m, n)) time.
Here, α is the inverse Ackermann function (in our universe, its value is always less than 5). BUT the inverse Ackermann is not a constant, i.e., α(m, n) is not O(1).
Remember that O(α(m, n)) is the AMORTIZED cost of an operation in UFDS with path compression and weighted union - NOT the worst case of a single operation.
Ans: O(n) (in the worst case, the tree is completely unbalanced - a straight line). Path compression won't help if you keep calling union on the root of the tree and adding the entire tree as a subtree of that one node.
What is the worst-case running time of a Find operation on a UFDS that is using both weighted union and path compression? Assume there are n elements, initially each in its own set.
Ans: O(log n). Think of what would happen if you kept calling Union on the roots of distinct trees - there would be no path compression performed! Path compression ensures that once you have travelled a particular root-to-leaf path entirely, you don't need to travel it again. But you at least need to travel it once in order to do the path compression in the first place. More generally, if a Find or Union operation has not been called on u or v or any of their ancestors (except the root), then there is no effective "path compression". So, the worst case would still be O(log n), but notice that after you call Find(u, v), all subsequent calls of Find on any of u's or v's ancestors would take O(1) since they are directly connected to the root (assuming you don't perform more Union operations in between).
In short, when you use weighted union, you guarantee that the height of the tree is always O(log n) (since you use it for every call to Union) and so, Union and Find cannot take more than that. But path compression has its limitations and can be side-stepped by an adversarial input that ensures no effective path compression is taking place at all.
Find operations obviously cost O(1). For union operations, the cost is O(size of the smaller set). The only expensive part is relabelling the objects in list[k2]. And notice that, just like in weighted union, each time we union two sets, the size of the smaller set at least doubles. So each object can be relabelled at most log2(n) times (as we can double the size of a set at most log2(n) times). Note that since there are m union operations, the biggest set after those operations has size O(m), and as each object in that set was updated at most O(log m) times, the total cost is O(m log m). Of course, notice that any one operation can be expensive (different union operations have different costs). For example, the last union operation might be combining two sets of size m/2 and hence have cost O(m), while the first union operation would have a cost of O(1). The appending of one linked list to the end of another is easily done in O(1) through manipulation of the head and tail pointers.
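The relabelling bound can be checked empirically (a hypothetical experiment, merging n singleton sets pairwise in rounds - the worst case for relabelling; each object is relabelled at most log2(n) times):

```python
def count_relabels(n):
    """Merge n singleton sets pairwise in rounds, always relabelling
    the smaller set, and count the total number of relabels."""
    id_ = list(range(n))
    size = {k: 1 for k in range(n)}
    members = {k: [k] for k in range(n)}
    relabels = 0
    sets = list(range(n))
    while len(sets) > 1:
        next_round = []
        for i in range(0, len(sets) - 1, 2):
            k1, k2 = sets[i], sets[i + 1]
            if size[k1] < size[k2]:
                k1, k2 = k2, k1          # k2 is the smaller set
            for obj in members[k2]:
                id_[obj] = k1            # one relabel per moved object
                relabels += 1
            members[k1] += members[k2]
            size[k1] += size[k2]
            next_round.append(k1)
        if len(sets) % 2:
            next_round.append(sets[-1])  # odd set out waits for next round
        sets = next_round
    return relabels
```

For n = 8 this performs 12 relabels ((n/2) * log2(n)), comfortably within the n * log2(n) = 24 bound.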
Note: actually, with path compression alone, the first time you run find or union on a node, it is possible that it takes O(n), since you are not ensuring a balanced tree.