Graph Traversals: Depth-First, Breadth-First...
A graph traversal is the process of systematically visiting each node in a graph. This can be tricky because many graphs have cycles (loops).
The two most common graph traversal algorithms are breadth-first search (BFS) and depth-first search (DFS). Both of these traversal algorithms can be viewed as refinements of the tricolor algorithm, ones in which choices are made in a way that leads to their differing behavior.
The Tricolor Algorithm
(Heavily from https://andrewcmyers.github.io/oodds/lecture.html?id=traversals)
Abstractly, graph traversal can be expressed in terms of the tricolor algorithm due to Dijkstra and others. In this algorithm, graph nodes are assigned one of three colors that can change over time:
- White nodes are undiscovered nodes that have not been seen yet in the current traversal and may even be unreachable.
- Black nodes are reachable nodes that the algorithm has already visited and is finished processing.
- Gray nodes are nodes that have been discovered but that have not yet been visited yet. These nodes form a frontier between white and black. (Extending the frontier analogy, the black nodes are sometimes called the settled nodes.)
The progress of the algorithm is depicted by... . Initially there are no black nodes and the root is gray. As the algorithm progresses, white nodes turn into gray nodes and gray nodes turn into black nodes. Eventually there are no gray nodes left and the algorithm is done. All remaining nodes are either white or black, with the black nodes consisting of exactly the nodes reachable from the roots.
The algorithm maintains a key two-part invariant, called the black–white invariant or tricolor invariant:
- The root is not white.
- There is no edge from a black node to a white node.
Because there is no edge from a black node to a white node, any edge exiting a black node goes either to a black node or to a gray frontier node.
This invariant clearly holds initially, and because it is true at the end, we know that any remaining white nodes cannot be reached from the black nodes.
The algorithm pseudo-code is as follows:
- Color the root node gray and all other nodes white;
-
While some gray node g exists:
- Color some white successors of g gray;
If g has no white successors, optionally color black.
This algorithm is abstract enough to describe many different graph traversals. It allows the particular implementation to choose the node n from among the gray nodes; it allows choosing which and how many white successors to color gray; and it allows delaying the coloring of gray nodes black. We say that such an algorithm is nondeterministic because its next step is not uniquely defined. However, as long as it does some work on each gray node that it picks, any implementation that can be described in terms of this algorithm will finish. Moreover, because the black–white invariant is maintained, it must reach all reachable nodes in the graph.
One advantage of defining graph search in terms of the tricolor algorithm is that the tricolor algorithm works even when gray nodes are processed concurrently, as long as the black–white invariant is maintained. The invariant ensures that whatever graph traversal we choose, the algorithm will work even when different gray nodes are handled concurrently.
Relying on a distance Field
(Heavily from https://andrewcmyers.github.io/oodds/lecture.html?id=traversals)
We can compute distances explicitly during the traversal by replacing the colors with a distance field in each vertex, as in:
frontier = new Queue();
frontier.push(root);
root.distance = 0;
for (v != root)
v.distance = ∞;
while (frontier not empty) {
g = frontier.pop();
foreach (g → v) {
if (v.distance == ∞) {
frontier.push(v);
v.distance = g.distance + 1;
}
} // foreach
}
Here the white nodes are those whose distance field is ∞, the gray nodes are those whose distance field is finite and that are in frontier, and the black nodes are those whose distance field is finite and that are not in frontier. When a new white node is discovered, its distance is set to a finite value one greater than its predecessor, and it is pushed onto the queue; this corresponds to turning the node from white to gray. All nodes on the queue are gray.
When a node v is popped off the queue,
- In breadth-first search, we look at all its successors; any that are white are colored gray and pushed onto the queue.
- In depth-first search, we take the
first
unvisited or white successor; we color it gray and push it onto the queue.
When done, v changes from gray to black.
In breadth-first search frontier is a first-in, first-out (FIFO) queue. The final distance value of a node is the length of the minimum path from a root to that node. All the nodes on the queue have distance within one of each other.
In breadth-first search, there is a set of nodes to be popped off, at some distance k from the source, and another set of elements, later on the queue, at distance k+1. Every time a new node is pushed onto the queue, it is at distance k+1 until all the nodes at distance k are gone, and k then increases by one.
Depth-First Search
A depth first search (function) performs a depth-first traversal of the vertices in a directed graph. When possible, a depth-first traversal chooses a vertex adjacent to the current vertex to visit next. If all adjacent vertices have already been discovered, or there are no adjacent vertices, then the algorithm backtracks to the last vertex that had undiscovered neighbors. Once all reachable vertices have been visited, the algorithm selects from any remaining undiscovered vertices and continues the traversal. The algorithm finishes when all vertices have been visited.
Depth-first search is useful for categorizing edges in a graph, and for imposing an ordering on the vertices.
Implementing Depth-First Search
We use two data structures to implement Depth-First Search: a stack and a set.
Firstly, a stack is used to allow backtracking. Whenever we visit a node, we push it (or its ID) onto the stack. When we want to backtrack, we call pop() on the stack (remove its top element).
Secondly, a set (visited) is used to avoid revisiting nodes in graphs with cycles. Before visiting a node, and pushing it onto the stack, we check that it is not already in the visited set. Conversely, when we do visit a node, we also add it into the visited set.
Similar to breadth-first search, color markers (or some other analogous scheme) are often used to keep track of which vertices have been discovered. White marks vertices that have yet to be discovered, gray marks a vertex that is discovered but still has vertices adjacent to it that are undiscovered. A black vertex is discovered vertex that is not adjacent to any white vertices. This could be implemented through a map from nodes to colours, such as a std::map<int, colour> in C++, where the key is a numerical identifier for vertices and the value is a colour, possibly an enum type.
This map is a substitute for a visited set and simplifies classifying edges into tree, forward, back and cross edges, as discussed below.
Tree, Forward, Back, and Cross Edges (in Depth-First Search)
(From https://www.geeksforgeeks.org/dsa/tree-back-edge-and-cross-edges-in-dfs-of-graph/)
- Tree Edge
- It is an edge which is present in the tree obtained after applying DFS on the graph.
- Forward Edge
- It is an edge (u, v) such that v is a descendant of u but the edge is not part of the DFS tree.
- Back edge
- It is an edge (u, v) such that v is the ancestor of node u but the edge is not part of the DFS tree.
- Cross Edge
- It is an edge that connects two nodes such that they do not hold either an ancestor or a descendant relationship between them.
Finding Tree, Forward, Back, and Cross Edges (in Depth-First Search)
(From https://www.geeksforgeeks.org/dsa/tree-back-edge-and-cross-edges-in-dfs-of-graph/)
The idea is to perform a Depth-First Search (DFS) traversal of the directed graph while tracking discovery and finish times to classify edges into tree, forward, back, and cross edges based on their relationship with visited nodes and the DFS call stack.
Procedure
- Initialize discovery and finish time arrays, a visited array, and an inStack array to track nodes in the current DFS recursion stack.
- Start DFS from each unvisited node, updating discovery times and marking nodes as visited and in the stack.
- If an adjacent node v is unvisited, classify the edge (u, v) as a tree edge and recursively call DFS on v.
- If v is already visited and present in the recursion stack, classify (u, v) as a back edge, indicating a cycle.
- If v is already visited but not in the recursion stack and discovery[u] < discovery[v], classify (u, v) as a forward edge, meaning v is a descendant of u.
- If v is already visited, not in the stack, and discovery[u] > discovery[v], classify (u, v) as a cross edge, indicating a connection between different DFS trees.
- After processing all neighbors of u, remove u from the stack and update its finish time.
Breadth-First Search
How breadth-first search works
(From https://www.codecademy.com/article/breadth-first-search-bfs-algorithm)
The breadth-first search algorithm works by exploring all neighboring nodes before moving to nodes at the next level of depth. It starts from a chosen source node and visits all nodes directly connected to it, then proceeds level by level until all reachable nodes have been explored. This ensures that nodes are visited in order of their distance from the source.
These are the steps how breadth-first search operates:
- Start from the source node: Choose a starting node to begin traversal.
- Mark the source as visited: Keep a record of visited nodes to avoid repetition or loops.
- Enqueue the source node: Use a queue (FIFO structure) to store nodes to be explored.
- Process the queue: BFS dequeues a node, visits its unvisited neighbors, marks them visited, and enqueues them.
- Move level by level: Continue until the queue is empty, visiting all nodes layer by layer.
This systematic, level-based approach ensures that BFS finds the shortest route from the source to any other node in an unweighted graph.
BFS algorithm pseudocode
The pseudocode for the breadth-first search algorithm outlines the logical steps of the traversal process. It shows how we use a queue to explore nodes level by level, marking each as visited to avoid repetition. This simple structure forms the foundation for implementing BFS in any programming language:
BFS(Graph, start_node):
create a queue Q
mark start_node as visited
enqueue start_node into Q
while Q is not empty:
current_node = dequeue(Q)
for each neighbor of current_node:
if neighbor is not visited:
mark neighbor as visited
enqueue neighbor into Q
Breadth-first search time and space complexity
The time complexity of BFS algorithm depends on the number of nodes and edges in the graph. In the worst case, BFS visits every node and examines every edge once. For a graph having V vertices and E edges, the time complexity is O(V + E). This linear relationship makes BFS efficient for both sparse and dense graphs, ensuring that all reachable nodes are processed systematically.
In terms of space complexity, BFS requires additional memory to keep track of visited nodes and to store nodes in the queue. In the worst case, the queue may contain all nodes at a given level, which can be proportional to the number of vertices. Therefore, the space complexity of BFS is O(V).
Real-world applications of breadth-first search
The breadth-first search algorithm is widely used across various domains:
- Shortest path finding: Used in unweighted graphs like road maps or network routing.
- Web crawlers: Search engines use BFS to crawl websites level by level.
- Social networks: Helps find degrees of connection (e.g., “friends of friends”).
- AI and robotics: BFS aids in state-space search and pathfinding.
These practical applications demonstrate BFS’s versatility and reliability in problem-solving.
Best-First Search (Informed Search)
Best-First Search Algorithm was developed in response to the limitations of earlier search algorithms, such as Breadth-First Search (BFS) and Depth-First Search (DFS).
BFS and DFS are systematic uninformed search algorithms that are not designed to handle large search spaces efficiently. BFS blindly explores all nodes at a given depth level before moving on to the next level, while DFS explores a path to the deepest possible node before backtracking and exploring another path. Best-First Search, on the other hand, relies on a heuristic.
Informed search algorithms are more efficient than uninformed in terms of processor time, memory space and performance.
The idea is to use a priority queue or heap to store edges ordered by their cost as given by an evaluation function. It otherwise operates similarly to the Breadth-First Search algorithm.
The Best-First Search algorithm has limitations, though, such as the potential for getting stuck in local optima and being sensitive to the choice of heuristic function. It also requires a large amount of memory and computational resources, especially in large search spaces, and does not guarantee finding the globally optimal solution in all cases.
Today, it is widely used in many problem domains, including robotics, natural language processing, computer vision, game playing, and optimization. Its ability to find optimal solutions efficiently and its adaptability to a wide range of problem domains make Best-First Search Algorithm a powerful tool in these fields.
Implementation
The best first search method for searching the graph space employs two lists for traversal tracking. An Open
list that keeps track of the current immediate
nodes that are available for traversal and a CLOSED
list that keeps track of the nodes that have already been traversed.
These are the steps in the algorithm:
- Create 2 empty lists: OPEN and CLOSED
- Start from the initial node (say N) and put it in the
ordered
OPEN list - Repeat the next steps until the GOAL node is reached
- If the OPEN list is empty, then EXIT the loop returning
False - Select the first/top node (say N) in the OPEN list and move it to the CLOSED list. Also, capture the information of the parent node
- If N is a GOAL node, then move the node to the Closed list and exit the loop returning
True. The solution can be found by backtracking the path - If N is not the GOAL node, expand node N to generate the
immediate
next nodes linked to node N and add all those to the OPEN list - Reorder the nodes in the OPEN list in ascending order according to an evaluation function f(n)
Time Complexity: This algorithm will traverse the shortest path first in the queue. The time complexity of the algorithm is given by O(n*log n).
Variants of Best First Search Algorithm
One popular variant of best-first search is A* search, which balances the estimated cost of reaching the goal with the actual cost of reaching each node in the search space.
Weighted A* search is a modification of A* search that allows for adjusting the weight given to the heuristic function.
Greedy best-first search is another variant that only considers the estimated cost to the goal and does not take into account the cost of reaching each node. It applies a heuristic that attempts to predict how close the end of a path is to a solution (or, goal), so that paths which are judged to be closer to a solution (or, goal) are expanded first.
Each of these algorithms has its strengths and weaknesses depending on the problem domain and the quality of the heuristic function.
Greedy Best-First Search
Using a greedy algorithm, expand the first successor of the parent. After a successor is generated:
- If the successor's heuristic (its score) is better than its parent, the successor is set at the front of the queue (with the parent reinserted directly behind it), and the loop restarts.
- Else, the successor is inserted into the queue (in a location determined by its heuristic value). The procedure will evaluate the remaining successors (if any) of the parent.
Below is a pseudocode example of this algorithm, where queue represents a priority queue which orders nodes based on their heuristic distances from the goal. This implementation keeps track of visited nodes, and can therefore be used for undirected graphs. It can be modified to retrieve the path.
function GBS(start, target):
mark start as visited
add start to queue
while queue is not empty:
current_node = vertex of queue with min distance to target
remove current_node from queue
foreach neighbor n of current_node:
if n not in visited then:
if n is target:
return n
else:
mark n as visited
add n to queue
return failure
...
Pseudo Code
This id the pseudo-code for the best_first_search(problem, heuristic) function:
function best_first_search(problem, heuristic):
open_list = [problem.initial_state]
while open_list is not empty:
current_node = select_node_with_lowest_heuristic(open_list, heuristic)
if problem.is_goal_state(current_node):
return current_node
successors = problem.get_successors(current_node)
for successor in successors:
open_list.append(successor)
current_node.visited = True
open_list.sort(key=heuristic)
return failure
And this id the pseudo-code for the select_node_with_lowest_heuristic(open_list, heuristic) function:
function select_node_with_lowest_heuristic(open_list, heuristic):
min_heuristic_value = infinity
min_heuristic_node = None
for node in open_list:
if not node.visited and heuristic(node) < min_heuristic_value:
min_heuristic_value = heuristic(node)
min_heuristic_node = node
return min_heuristic_node