DAG and Topological Sort
Problem
A directed graph is a graph in which the edges are unidirectional. That is, there is an edge from to (it does not say anything about a way to get from to ).
For each node, we define the in-degree of a node to be the number of incoming edges, and the out-degree of a node to be the number of outgoing edges.
Similar to an undirected graph, we can use an adjacency matrix or adjacency list to represent it. In case of an adjacency list, the node is in the linked list at index of the array if there is a directed edge from to . That is, the linked list stores the outgoing edges of each node.
In case of a matrix, if is the adjacency matrix of a directed graph , then , i.e., there is a directed edge from to .
Like before, an adjacency list takes space while an adjacency matrix takes space
A directed acyclic graph (DAG) is a directed graph with no cycles.
One can view a directed acyclic graph as a partial order.
Recall that a partial order is a relation that is
Reflexive ( )
Antisymmetric ( ),and
Transitive ( )
A total order is a partial order in which all the elements are comparable:
In other words, if a partial order describes a sequence of events with dependencies such that one event should take place before another event if there is an edge from the event to the other event, then it can be represented as a DAG. It is natural to ask, then, given all the constraints on the ordering of events, is there a possible order that respects all these constraints? That is, given a partial order (the order in which the events can be performed) (in the form of a DAG), we want to find a total order (the actual order in which you will perform those events). This total order is a topological order.
It should be obvious why there cannot be a cycle - a cycle would denote a cyclic dependency and so, there is no way to perform any of the events, since you’d need to do the other one before, and the same is true for the other events in the cycle. For example, if CS1101S has CS2040S as a prerequisite and CS2040S has CS1101S as a prerequisite, there is no way (order) in which we can do the modules while respecting the prerequisites.
A topological sort of a DAG is not unique since there can be multiple total orders for a partial order. More generally, if there is a set of constraints regarding order you need to follow while performing a sequence of steps, you can still do those steps in multiple orderings.
A topological ordering of a DAG can be visually interpreted as arranging all the nodes in a horizontal line such that the edges only point forward (towards the right).
There are 2 algorithms that we can use to find a topological sort
Post-order DFS
Kahn’s algorithm
Post-order DFS
Process each node when it is last visited. This builds up the sequential ordering of tasks backwards. Try out a few examples to convince yourself that the algorithm works. When it reaches a node with no outgoing edges (base case of DFS), the algorithm appends it to the end of the schedule. Then it backtracks to find the tasks that need to be done before that. It prepends nodes and hence, the nodes that need to be done last are put in the schedule first (and they get pushed down each time another node is prepended)
Since this is basically DFS, it takes to find a topological ordering.
Kahn’s algorithm
Kahn’s algorithm does not use DFS. In fact, you can think of it as using a variant of BFS since it goes “level” by “level”.
Moreover, it builds the schedule starting from the first to the last (unlike Post-order DFS which builds it backwards)
Repeat until the graph is empty:
Find S = the set of all nodes with no incoming edges (this forms the stuff that you can directly do - put in the front of the schedule. These have no prerequisites).
Append all nodes in S to the the end of the schedule. (you are free to do these events now since you have done all the things you need to do before them)
Remove all edges adjacent to nodes in S
Remove nodes S from the graph (and so you move to the next level of nodes that you can freely do as their prerequisites have also been completed)
Kahn’s algorithm, like post-order DFS, also runs in time. This is because, each node is processed exactly once (when it is appended to the schedule) and each edge is processed exactly once (when you are finding the edges adjacent to nodes in S to remove).
However, since nodes are being removed from the graph, you don’t need to maintain a visited array. On the other hand, you need some data structure to store the indegrees of each node (perhaps, a priority queue so you can get the node with least indegree - in fact, you need the indegree to be exactly 0)
Actually a priority queue would not be efficient. Rather we can just use an ordinary queue (or even a stack works) to store the current set of nodes with indegree 0 and an array to store the current indegree of each node. First traverse the graph to find the nodes with indegree = 0 and add them to the queue. Then each time you dequeue a node, look at all the outgoing edges and subtract 1 from it's neighbors’s indegree in the array. If the indegree becomes 0, add it to the queue. If not, move on.
It can be shown that every DAG has a node with indegree = 0 and another node with outdegree = 0. (think about what would happen if this weren’t the case - there has to be some way to start the schedule, and some node to finish at. If not, it would give rise to cycles.)
Longest Path
Given a graph, how would we find the longest path between 2 nodes? In fact, this problem is NP-hard. There is no better solution we can come up with apart from brute force. However, for a DAG, we can use a neat trick to help us. Multiply all the edge weights by -1. Then, find shortest path using DAG_SSSP (topological sort followed by relaxing edges). Multiply all the edge weights by -1 again and return the absolute value of the length of the shortest path we obtained.
This does not work in general because it is possible for a cycle to exist within a graph. Then, when -1 is multiplied to all weights, it may become a negative weight cycle. In such a case, shortest paths are not defined since you can keep going in loops in the cycle and get smaller and smaller distances.
Last updated