SCC (Tarjan's and Kosaraju's)

Problem

Recall that a connected component of an undirected graph is defined as follows: $v$ and $w$ are said to be in the same connected component if, and only if, there is a path from $v$ to $w$ .

But in case of directed graphs, it is possible to have a path from $v$ to $w$ but not from to $w$ to $v$ . Then, how do we define connectedness?

We define a Strongly Connected Component (SCC) as follows:

A subgraph $H$ is an SCC of graph $G$ if

$\forall v, w \in V_H$ there is a path from $v$ to $w$ and there is a path from $w$ to $v$ . (Here $V_H$ denotes the vertices in subgraph $H$ )
There is no bigger subgraph of $G$ that contains $H$ as its subgraph and satisfies property (1) - In other words, the SCC should be maximally sized (as big as possible)

Note:

If a graph is “strongly connected”, it has exactly 1 strongly connected component.
Any DAG with $n$ vertices has $n$ strongly connected components (since there are no cycles! so, even if there is a path from $v$ to $w$ , it is impossible to find a path from $w$ to $v$ due to absence of cycles)

This shows that any strongly connected component must have a cycle (or it must be an isolated node) for all its nodes to be reachable from each other.

Another cool property is that the graph of SCCs is a directed acyclic graph. In particular, when each SCC is replaced by one representative node in the graph, the resulting graph is acyclic. This is because different SCCs cannot be part of a cycle - otherwise, they could merge to form a larger SCC since all nodes would be reachable from them.

There are various algorithms to find SCCs. Some of the include:

Kosaraju’s algorithm
Tarjan’s algorithm

Both these algorithms run in $O(V+E)$ time (though I much prefer Kosaraju's algorithm because it's more elegant).

Kosaraju's Algorithm

It uses 2 DFS passes to determine the SCCs.

Run DFS from each unvisited node, and add it to a stack once you’re done processing it (i.e., store the nodes in reverse post-order manner)

stack = []
visited = set()
for i in range(n):
	if i not in visited:
		dfs(i)

def dfs(node):
	visited.add(node)
	for v in graph[node]:
		if v not in visited:
			dfs(v)
	stack.append(node)

Reverse all the edges of the graph

temp = defaultdict(list)
for node in graph:
	for v in graph[node]:
		temp[v].append(node)
grah = temp

Pop from stack, run dfs and assign nodes to their components if they’re not already done so. Each component is identified by its root / “leader” which is the highest node in the stack (first node in its component to be processed in this step).

visited = set()
component = [-1] * n
while stack:
	curr = stack.pop()
	if curr in visited: continue
	assign(curr, curr)
	
def assign(node, root):
	visited.add(node)
	component[node] = root # node belongs to root's (leader) component
	for v in graph[node]:
		if v not in visited:
			assign(v, root)

The code itself is easy to understand, but the main question is: why does this work?? What is the key invariant here??

The key intuition (and proof sketch) is as follows:

If u is above v in the stack, then either u and v are in the same component (so the relative ordering does not matter anyway) XOR there is NO path from v → u (i.e., exactly one of these is true).
This must be true because if there were a path from v → u but they were not in the same component, then that means there is no path from u → v. Then, it’s impossible for u to finish being processed after v. Why??
Because if you visit v, and u is already finished processing, then u is already in the stack while v is not → so v is above u. If u is not on the stack but is marked visited while you are at v, this means there’s a path from u → v, but this would mean that they’re in the same component (since by our hypothesis, there is also a path from v → u ). So, the only remaining case is that u is not visited when you arrive at v, and then you would reach u eventually via the path from v → u, finish u and push it to the stack before v.
Now, when we reverse the edges, the above invariant for the reversed graph becomes: If u is above v in the stack, then there is either NO path from u → v, XOR u and v are in the same component.
Note that reversing all edges doesn’t change which nodes belong to which strongly connected component, since “strongly” implies bidirectional connectivity for any pair of nodes in the same SCC anyway.
This is why, in the second DFS-pass, if we start at u and reach v, it must mean that they’re in the same SCC. Why?
Since there is a u → v path in the reversed graph (hence, v → u in the original graph), AND u is above v in the stack, the only way this can happen is if there was also a u → v path in the original graph, as we’ve shown in (1). So, there is a path from u → v as well as v → u in the original graph, and hence, they must be in the same SCC.

Tarjan’s Algorithm

The algorithm takes a directed graph as input, and produces a partition of the graph's vertices into the graph's strongly connected components. Each vertex of the graph appears in exactly one of the strongly connected components. Any vertex that is not on a directed cycle forms a strongly connected component all by itself: for example, a vertex whose in-degree or out-degree is 0, or any vertex of an acyclic graph.

The basic idea of the algorithm is this: a depth-first search (DFS) begins from an arbitrary start node (and subsequent depth-first searches are conducted on any nodes that have not yet been found). As usual with depth-first search, the search visits every node of the graph exactly once, declining to revisit any node that has already been visited. Thus, the collection of search trees is a spanning forest of the graph. The strongly connected components will be recovered as certain subtrees of this forest. The roots of these subtrees are called the "roots" of the strongly connected components. Any node of a strongly connected component might serve as a root, if it happens to be the first node of a component that is discovered by search.

The invariant of the algorithm is this (The Stack Invariant):

Nodes are placed on a stack in the order in which they are visited. When the depth-first search recursively visits a node v and its descendants, those nodes are not all necessarily popped from the stack when this recursive call returns. The crucial invariant property is that a node remains on the stack after it has been visited if and only if there exists a path in the input graph from it to some node earlier on the stack. In other words, it means that in the DFS a node would be only removed from the stack after all its connected paths have been traversed. When the DFS will backtrack it would remove the nodes on a single path and return to the root in order to start a new path.

At the end of the call that visits v and its descendants, we know whether v itself has a path to any node earlier on the stack. If so, the call returns, leaving v on the stack to preserve the invariant. If not, then v must be the root of its strongly connected component, which consists of v together with any nodes later on the stack than v (such nodes all have paths back to v but not to any earlier node, because if they had paths to earlier nodes then v would also have paths to earlier nodes which is false). The connected component rooted at v is then popped from the stack and returned, again preserving the invariant.

Simply put, Tarjan’s algorithm maintains a stack of valid nodes from which to update low-link values from. Nodes are added to the stack as they are explored for the first time. Nodes are removed from the stack each time a complete SCC is found. (There is no path back to the same SCC from which you are leaving - because then you can merge those nodes to form a larger SCC)

Before going into the details, we need to understand what a low-link value of a node is: the low-link value of a node is the smallest (lowest) node index that is reachable from that node (and on the stack) when doing a DFS (including itself).

All nodes with the same low-link value belong to the same SCC.

A node started a SCC if its index is equal to its low-link value.

Each node v is assigned a unique integer v.index, which numbers the nodes consecutively in the order in which they are discovered. It also maintains a value v.lowlink that represents the smallest index of any node on the stack known to be reachable from v through v's DFS subtree, including v itself. Therefore v must be left on the stack if v.lowlink < v.index, whereas v must be removed as the root of a strongly connected component if v.lowlink == v.index. The value v.lowlink is computed during the depth-first search from v, as this finds the nodes that are reachable from v.

The pseudocode is as follows:

algorithm tarjan is
    input: graph G = (V, E)
    output: set of strongly connected components (sets of vertices)

    index := 0
    S := empty stack
    for each v in V do
        if v.index is undefined then
            strongconnect(v)
        end if
    end for

    function strongconnect(v)
        // Set the depth index for v to the smallest unused index
        v.index := index
        v.lowlink := index
        index := index + 1
        S.push(v)
        v.onStack := true

        // Consider successors of v
        for each (v, w) in E do
            if w.index is undefined then
                // Successor w has not yet been visited; recurse on it
                strongconnect(w)
                v.lowlink := min(v.lowlink, w.lowlink)
            else if w.onStack then
                // Successor w is in stack S and hence in the current SCC
                // If w is not on stack, then (v, w) is an edge pointing to an SCC already found and must be ignored
                // Note: The next line may look odd - but is correct.
                // It says w.index not w.lowlink; that is deliberate and from the original paper
                v.lowlink := min(v.lowlink, w.index)
            end if
        end for

        // If v is a root node, pop the stack and generate an SCC
        if v.lowlink = v.index then
            start a new strongly connected component
            repeat
                w := S.pop()
                w.onStack := false
                add w to current strongly connected component
            while w ≠ v
            output the current strongly connected component
        end if
    end function

PreviousDAG and Topological Sort NextSSSP (Bellman-Ford and Dijkstra)

Last updated 9 months ago

Was this helpful?