SCC (Tarjan's and Kosaraju's)
Last updated
Was this helpful?
Last updated
Was this helpful?
Recall that a connected component of an undirected graph is defined as follows: and are said to be in the same connected component if, and only if, there is a path from to .
But in case of directed graphs, it is possible to have a path from to but not from to to . Then, how do we define connectedness?
We define a Strongly Connected Component (SCC) as follows:
A subgraph is an SCC of graph if
there is a path from to and there is a path from to . (Here denotes the vertices in subgraph )
There is no bigger subgraph of that contains as its subgraph and satisfies property (1) - In other words, the SCC should be maximally sized (as big as possible)
Note:
If a graph is “strongly connected”, it has exactly 1 strongly connected component.
Any DAG with vertices has strongly connected components (since there are no cycles! so, even if there is a path from to , it is impossible to find a path from to due to absence of cycles)
This shows that any strongly connected component must have a cycle (or it must be an isolated node) for all its nodes to be reachable from each other.
Another cool property is that the graph of SCCs is a directed acyclic graph. In particular, when each SCC is replaced by one representative node in the graph, the resulting graph is acyclic. This is because different SCCs cannot be part of a cycle - otherwise, they could merge to form a larger SCC since all nodes would be reachable from them.
There are various algorithms to find SCCs. Some of the include:
Kosaraju’s algorithm
Tarjan’s algorithm
Both these algorithms run in time (though I much prefer Kosaraju's algorithm because it's more elegant).
It uses 2 DFS passes to determine the SCCs.
Run DFS from each unvisited node, and add it to a stack once you’re done processing it (i.e., store the nodes in reverse post-order manner)
Reverse all the edges of the graph
Pop from stack, run dfs and assign nodes to their components if they’re not already done so. Each component is identified by its root / “leader” which is the highest node in the stack (first node in its component to be processed in this step).
The code itself is easy to understand, but the main question is: why does this work?? What is the key invariant here??
The key intuition (and proof sketch) is as follows:
If u
is above v
in the stack, then either u
and v
are in the same component (so the relative ordering does not matter anyway) XOR there is NO path from v
→ u
(i.e., exactly one of these is true).
This must be true because if there were a path from v
→ u
but they were not in the same component, then that means there is no path from u
→ v
. Then, it’s impossible for u
to finish being processed after v
. Why??
Because if you visit v
, and u
is already finished processing, then u
is already in the stack while v
is not → so v
is above u
. If u
is not on the stack but is marked visited while you are at v
, this means there’s a path from u
→ v
, but this would mean that they’re in the same component (since by our hypothesis, there is also a path from v
→ u
). So, the only remaining case is that u
is not visited when you arrive at v
, and then you would reach u
eventually via the path from v
→ u
, finish u
and push it to the stack before v
.
Now, when we reverse the edges, the above invariant for the reversed graph becomes: If u
is above v
in the stack, then there is either NO path from u
→ v
, XOR u
and v
are in the same component.
Note that reversing all edges doesn’t change which nodes belong to which strongly connected component, since “strongly” implies bidirectional connectivity for any pair of nodes in the same SCC anyway.
This is why, in the second DFS-pass, if we start at u
and reach v
, it must mean that they’re in the same SCC. Why?
Since there is a u
→ v
path in the reversed graph (hence, v
→ u
in the original graph), AND u
is above v
in the stack, the only way this can happen is if there was also a u
→ v
path in the original graph, as we’ve shown in (1). So, there is a path from u
→ v
as well as v
→ u
in the original graph, and hence, they must be in the same SCC.
The invariant of the algorithm is this (The Stack Invariant):
At the end of the call that visits v
and its descendants, we know whether v
itself has a path to any node earlier on the stack. If so, the call returns, leaving v
on the stack to preserve the invariant. If not, then v
must be the root of its strongly connected component, which consists of v
together with any nodes later on the stack than v
(such nodes all have paths back to v
but not to any earlier node, because if they had paths to earlier nodes then v
would also have paths to earlier nodes which is false). The connected component rooted at v
is then popped from the stack and returned, again preserving the invariant.
Simply put, Tarjan’s algorithm maintains a stack of valid nodes from which to update low-link values from. Nodes are added to the stack as they are explored for the first time. Nodes are removed from the stack each time a complete SCC is found. (There is no path back to the same SCC from which you are leaving - because then you can merge those nodes to form a larger SCC)
Before going into the details, we need to understand what a low-link value of a node is: the low-link value of a node is the smallest (lowest) node index that is reachable from that node (and on the stack) when doing a DFS (including itself).
All nodes with the same low-link value belong to the same SCC.
A node started a SCC if its index is equal to its low-link value.
Each node v
is assigned a unique integer v.index
, which numbers the nodes consecutively in the order in which they are discovered. It also maintains a value v.lowlink
that represents the smallest index of any node on the stack known to be reachable from v
through v
's DFS subtree, including v
itself. Therefore v
must be left on the stack if v.lowlink < v.index
, whereas v must be removed as the root of a strongly connected component if v.lowlink == v.index
. The value v.lowlink
is computed during the depth-first search from v
, as this finds the nodes that are reachable from v
.
The pseudocode is as follows:
The algorithm takes a as input, and produces a of the graph's into the graph's strongly connected components. Each vertex of the graph appears in exactly one of the strongly connected components. Any vertex that is not on a directed cycle forms a strongly connected component all by itself: for example, a vertex whose in-degree or out-degree is 0, or any vertex of an acyclic graph.
The basic idea of the algorithm is this: a depth-first search (DFS) begins from an arbitrary start node (and subsequent depth-first searches are conducted on any nodes that have not yet been found). As usual with depth-first search, the search visits every node of the graph exactly once, declining to revisit any node that has already been visited. Thus, the collection of search trees is a of the graph. The strongly connected components will be recovered as certain subtrees of this forest. The roots of these subtrees are called the "roots" of the strongly connected components. Any node of a strongly connected component might serve as a root, if it happens to be the first node of a component that is discovered by search.
Nodes are placed on a in the order in which they are visited. When the depth-first search recursively visits a node v
and its descendants, those nodes are not all necessarily popped from the stack when this recursive call returns. The crucial is that a node remains on the stack after it has been visited if and only if there exists a path in the input graph from it to some node earlier on the stack. In other words, it means that in the DFS a node would be only removed from the stack after all its connected paths have been traversed. When the DFS will backtrack it would remove the nodes on a single path and return to the root in order to start a new path.