Graph Traversals: Depth-First, Breadth-First...
A graph traversal is the process of systematically visiting each node in a graph. This can be tricky because many graphs have cycles (loops).
The two most common graph traversal algorithms are breadth-first search (BFS) and depth-first search (DFS). Both of these traversal algorithms can be viewed as refinements of the tricolor algorithm, ones in which choices are made in a way that leads to their differing behavior.
The Tricolor Algorithm
(Heavily from https://andrewcmyers.github.io/oodds/lecture.html?id=traversals)
Abstractly, graph traversal can be expressed in terms of the tricolor algorithm due to Dijkstra and others. In this algorithm, graph nodes are assigned one of three colors that can change over time:
- White nodes are undiscovered nodes that have not been seen yet in the current traversal and may even be unreachable.
- Black nodes are reachable nodes that the algorithm has already visited and is finished processing.
- Gray nodes are nodes that have been discovered but that have not yet been visited yet. These nodes form a frontier between white and black. (Extending the frontier analogy, the black nodes are sometimes called the settled nodes.)
The progress of the algorithm is depicted by... . Initially there are no black nodes and the root is gray. As the algorithm progresses, white nodes turn into gray nodes and gray nodes turn into black nodes. Eventually there are no gray nodes left and the algorithm is done. All remaining nodes are either white or black, with the black nodes consisting of exactly the nodes reachable from the roots.
The algorithm maintains a key two-part invariant, called the black–white invariant or tricolor invariant:
- The root is not white.
- There is no edge from a black node to a white node.
Because there is no edge from a black node to a white node, any edge exiting a black node goes either to a black node or to a gray frontier node.
This invariant clearly holds initially, and because it is true at the end, we know that any remaining white nodes cannot be reached from the black nodes.
The algorithm pseudo-code is as follows:
- Color the root node gray and all other nodes white;
-
While some gray node g exists:
- Color some white successors of g gray;
If g has no white successors, optionally color black.
This algorithm is abstract enough to describe many different graph traversals. It allows the particular implementation to choose the node n from among the gray nodes; it allows choosing which and how many white successors to color gray; and it allows delaying the coloring of gray nodes black. We say that such an algorithm is nondeterministic because its next step is not uniquely defined. However, as long as it does some work on each gray node that it picks, any implementation that can be described in terms of this algorithm will finish. Moreover, because the black–white invariant is maintained, it must reach all reachable nodes in the graph.
One advantage of defining graph search in terms of the tricolor algorithm is that the tricolor algorithm works even when gray nodes are processed concurrently, as long as the black–white invariant is maintained. The invariant ensures that whatever graph traversal we choose, the algorithm will work even when different gray nodes are handled concurrently.
Relying on a distance Field
(Heavily from https://andrewcmyers.github.io/oodds/lecture.html?id=traversals)
We can compute distances explicitly during the traversal by replacing the colors with a distance field in each vertex, as in:
frontier = new Queue();
frontier.push(root);
root.distance = 0;
for (v != root)
v.distance = ∞;
while (frontier not empty) {
g = frontier.pop();
foreach (g → v) {
if (v.distance == ∞) {
frontier.push(v);
v.distance = g.distance + 1;
}
} // foreach
}
Here the white nodes are those whose distance field is ∞, the gray nodes are those whose distance field is finite and that are in frontier, and the black nodes are those whose distance field is finite and that are not in frontier. When a new white node is discovered, its distance is set to a finite value one greater than its predecessor, and it is pushed onto the queue; this corresponds to turning the node from white to gray. All nodes on the queue are gray.
When a node v is popped off the queue,
- In breadth-first search, we look at all its successors; any that are white are colored gray and pushed onto the queue.
- In depth-first search, we take the
first
unvisited or white successor; we color it gray and push it onto the queue.
When done, v changes from gray to black.
In breadth-first search frontier is a first-in, first-out (FIFO) queue. The final distance value of a node is the length of the minimum path from a root to that node. All the nodes on the queue have distance within one of each other.
In breadth-first search, there is a set of nodes to be popped off, at some distance k from the source, and another set of elements, later on the queue, at distance k+1. Every time a new node is pushed onto the queue, it is at distance k+1 until all the nodes at distance k are gone, and k then increases by one.
Depth-First Search
A depth first search (function) performs a depth-first traversal of the vertices in a directed graph. When possible, a depth-first traversal chooses a vertex adjacent to the current vertex to visit next. If all adjacent vertices have already been discovered, or there are no adjacent vertices, then the algorithm backtracks to the last vertex that had undiscovered neighbors. Once all reachable vertices have been visited, the algorithm selects from any remaining undiscovered vertices and continues the traversal. The algorithm finishes when all vertices have been visited.
Depth-first search is useful for categorizing edges in a graph, and for imposing an ordering on the vertices.
Implementing Depth-First Search
We use two data structures to implement Depth-First Search: a stack and a set.
Firstly, a stack is used to allow backtracking. Whenever we visit a node, we push it (or its ID) onto the stack. When we want to backtrack, we call pop() on the stack (remove its top element).
Secondly, a set (visited) is used to avoid revisiting nodes in graphs with cycles. Before visiting a node, and pushing it onto the stack, we check that it is not already in the visited set. Conversely, when we do visit a node, we also add it into the visited set.
Similar to breadth-first search, color markers (or some other analogous scheme) are often used to keep track of which vertices have been discovered. White marks vertices that have yet to be discovered, gray marks a vertex that is discovered but still has vertices adjacent to it that are undiscovered. A black vertex is discovered vertex that is not adjacent to any white vertices. This could be implemented through a map from nodes to colours, such as a std::map<int, colour> in C++, where the key is a numerical identifier for vertices and the value is a colour, possibly an enum type.
This map is a substitute for a visited set and simplifies classifying edges into tree, forward, back and cross edges, as discussed below.
Tree, Forward, Back, and Cross Edges (in Depth-First Search)
(From https://www.geeksforgeeks.org/dsa/tree-back-edge-and-cross-edges-in-dfs-of-graph/)
- Tree Edge
- It is an edge which is present in the tree obtained after applying DFS on the graph.
- Forward Edge
- It is an edge (u, v) such that v is a descendant of u but the edge is not part of the DFS tree.
- Back edge
- It is an edge (u, v) such that v is the ancestor of node u but the edge is not part of the DFS tree.
- Cross Edge
- It is an edge that connects two nodes such that they do not hold either an ancestor or a descendant relationship between them.
Finding Tree, Forward, Back, and Cross Edges (in Depth-First Search)
(From https://www.geeksforgeeks.org/dsa/tree-back-edge-and-cross-edges-in-dfs-of-graph/)
The idea is to perform a Depth-First Search (DFS) traversal of the directed graph while tracking discovery and finish times to classify edges into tree, forward, back, and cross edges based on their relationship with visited nodes and the DFS call stack.
Procedure
- Initialize discovery and finish time arrays, a visited array, and an inStack array to track nodes in the current DFS recursion stack.
- Start DFS from each unvisited node, updating discovery times and marking nodes as visited and in the stack.
- If an adjacent node v is unvisited, classify the edge (u, v) as a tree edge and recursively call DFS on v.
- If v is already visited and present in the recursion stack, classify (u, v) as a back edge, indicating a cycle.
- If v is already visited but not in the recursion stack and discovery[u] < discovery[v], classify (u, v) as a forward edge, meaning v is a descendant of u.
- If v is already visited, not in the stack, and discovery[u] > discovery[v], classify (u, v) as a cross edge, indicating a connection between different DFS trees.
- After processing all neighbors of u, remove u from the stack and update its finish time.
Breadth-First Search*
Breadth-first search is so called because it explores nodes in the order of their distance from the root, where distance is defined as the minimum path length from a root to the node.