SCC and Tarjan's Algorithm
Last updated
Last updated
Recall that a connected component of an undirected graph is defined as follows: and are said to be in the same connected component if, and only if, there is a path from to .
But in case of directed graphs, it is possible to have a path from to but not from to to . Then, how do we define connectedness?
We define a Strongly Connected Component (SCC) as follows:
A subgraph is an SCC of graph if
there is a path from to and there is a path from to . (Here denotes the vertices in subgraph )
There is no bigger subgraph of that contains as its subgraph and satisfies property (1) - In other words, the SCC should be maximally sized (as big as possible)
Note:
If a graph is “strongly connected”, it has exactly 1 strongly connected component.
Any DAG with vertices has strongly connected components (since there are no cycles! so, even if there is a path from to , it is impossible to find a path from to due to absence of cycles)
This shows that any strongly connected component must have a cycle (or it must be an isolated node) for all its nodes to be reachable from each other.
Another cool property is that the graph of SCCs is a directed acyclic graph. In particular, when each SCC is replaced by one representative node in the graph, the resulting graph is acyclic. This is because different SCCs cannot be part of a cycle - otherwise, they could merge to form a larger SCC since all nodes would be reachable from them.
There are various algorithms to find SCCs. Some of the include:
Tarjan’s algorithm
Kosaraju’s algorithm
Both these algorithms run in time.
The algorithm takes a directed graph as input, and produces a partition of the graph's vertices into the graph's strongly connected components. Each vertex of the graph appears in exactly one of the strongly connected components. Any vertex that is not on a directed cycle forms a strongly connected component all by itself: for example, a vertex whose in-degree or out-degree is 0, or any vertex of an acyclic graph.
The basic idea of the algorithm is this: a depth-first search (DFS) begins from an arbitrary start node (and subsequent depth-first searches are conducted on any nodes that have not yet been found). As usual with depth-first search, the search visits every node of the graph exactly once, declining to revisit any node that has already been visited. Thus, the collection of search trees is a spanning forest of the graph. The strongly connected components will be recovered as certain subtrees of this forest. The roots of these subtrees are called the "roots" of the strongly connected components. Any node of a strongly connected component might serve as a root, if it happens to be the first node of a component that is discovered by search.
The invariant of the algorithm is this (The Stack Invariant):
Nodes are placed on a stack in the order in which they are visited. When the depth-first search recursively visits a node v
and its descendants, those nodes are not all necessarily popped from the stack when this recursive call returns. The crucial invariant property is that a node remains on the stack after it has been visited if and only if there exists a path in the input graph from it to some node earlier on the stack. In other words, it means that in the DFS a node would be only removed from the stack after all its connected paths have been traversed. When the DFS will backtrack it would remove the nodes on a single path and return to the root in order to start a new path.
At the end of the call that visits v
and its descendants, we know whether v
itself has a path to any node earlier on the stack. If so, the call returns, leaving v
on the stack to preserve the invariant. If not, then v
must be the root of its strongly connected component, which consists of v
together with any nodes later on the stack than v
(such nodes all have paths back to v
but not to any earlier node, because if they had paths to earlier nodes then v
would also have paths to earlier nodes which is false). The connected component rooted at v
is then popped from the stack and returned, again preserving the invariant.
Simply put, Tarjan’s algorithm maintains a stack of valid nodes from which to update low-link values from. Nodes are added to the stack as they are explored for the first time. Nodes are removed from the stack each time a complete SCC is found. (There is no path back to the same SCC from which you are leaving - because then you can merge those nodes to form a larger SCC)
Before going into the details, we need to understand what a low-link value of a node is: the low-link value of a node is the smallest (lowest) node index that is reachable from that node (and on the stack) when doing a DFS (including itself).
All nodes with the same low-link value belong to the same SCC.
A node started a SCC if its index is equal to its low-link value.
Each node v
is assigned a unique integer v.index
, which numbers the nodes consecutively in the order in which they are discovered. It also maintains a value v.lowlink
that represents the smallest index of any node on the stack known to be reachable from v
through v
's DFS subtree, including v
itself. Therefore v
must be left on the stack if v.lowlink < v.index
, whereas v must be removed as the root of a strongly connected component if v.lowlink == v.index
. The value v.lowlink
is computed during the depth-first search from v
, as this finds the nodes that are reachable from v
.
The pseudocode is as follows: