tree

Searching a Tree of Objects with Linq, Revisited

A while back, I wrote about searching through a tree using linq to objects. That post was mostly snippets of code about delegates, lambda’s, yield and how it applies to linq — more a technical exploration than an example. So I thought I’d follow it up with concrete extension methods to make virtually any tree searchable by Linq.

Linq, IEnumerable<T>, yield

All that is required to search a tree with Linq is creating a list of all nodes in the tree. Linq to Objects can operate on IEnumerable<T>. Really, Linq to objects is a way of expressing operations we’ve been doing forever in loops with if/else blocks. That means there isn’t any search magic going on, it is a linear traversal of all elements in a set and examining each to determine whether it matches our search criteria.

To turn a tree into a list of node we need to walk and collect all children of every node. A simple task for a recursive list that carries along a list object to stuff every found node into. But there is a better way, using yield to return each item as it is encountered. Now we don’t have to carry along a collection. Iterators using yield implement a pattern in which a method can return more than once. For this reason, a method using yield in C# must return an IEnumerable, so that the caller gets a handle to an object it can traverse the result of the multiple return values.

IEnumerable is basically an unbounded set. This is also the reason why unlike collections, it does not have a Count Property. It is entirely possible for an enumerator to return an infinite series of items.

Together IEnumerable<T> and yield are a perfect match for our problem, i.e. recursively walking a tree of nodes and return an unknown number of nodes.

Two types of Tree Traversal

Depth First

In depth-first traversal, the algorithm will dig continue to dig down a nodes children until it reaches a leaf node (a node without children), before considering the next child of the current parent node.

Breadth First

In breadth-first traversal, the algorithm will return all nodes at a particular depth first before considering the children at the next level. I.e. First return all the nodes from level 1, then all nodes from level 2, etc.

Tree to IEnumerable<T> Extension methods

public static class TreeToEnumerableEx
{
 public static IEnumerable<T> AsDepthFirstEnumerable<T>(this T head, Func<T, IEnumerable<T>> childrenFunc)
 {
  yield return head;
  foreach (var node in childrenFunc(head))
  {
   foreach (var child in AsDepthFirstEnumerable(node, childrenFunc))
   {
    yield return child;
   }
  }
 }

 public static IEnumerable<T> AsBreadthFirstEnumerable<T>(this T head, Func<T, IEnumerable<T>> childrenFunc)
 {
  yield return head;
  var last = head;
  foreach(var node in AsBreadthFirstEnumerable(head,childrenFunc))
  {
   foreach(var child in childrenFunc(node))
   {
    yield return child;
    last = child;
   }
   if(last.Equals(node)) yield break;
  }
 }
}

This static class provides two extension methods that can be used on any object, as long as it’s possible to express a function that returns all children of that object, i.e. the object is a node in some type of tree and has a method or property for accessing a list of its children.

An Example

Let’s use a hypothetical Tree model defined by this Node class:

public class Node
{
 private readonly List<Node> children = new List<Node>();

 public Node(int id)
 {
  Id = id;
 }

 public IEnumerable<Node> Children { get { return children; } }

 public Node AddChild(int id)
 {
  var child = new Node(id);
  children.Add(child);
  return child;
 }

 public int Id { get; private set; }
}

Each node simply contains a list of children and has an Id, so that we know what node we’re looking at. The AddChild() method is a convenience method so we don’t expose the child collection and no node can ever be added as a child twice.

The calling convention for a depth-first collection is:

IEnumerable<Node> = node.AsDepthFirstEnumerable(n => n.Children);

The lambda expression n => n.Children is the function that will return the children of a node. It simply states given n, return the value of the Children property of n. A simple test to verify that our extension works and to show us using the extension in linq looks like this:

[Test]
public void DepthFirst()
{
 // build the tree in depth-first order
 int id = 1;
 var depthFirst = new Node(id);
 var df2 = depthFirst.AddChild(++id);
 var df3 = df2.AddChild(++id);
 var df4 = df2.AddChild(++id);
 var df5 = depthFirst.AddChild(++id);
 var df6 = df5.AddChild(++id);
 var df7 = df5.AddChild(++id);

 // find all nodes in depth-first order and select just the Id of each node
 var IDs = from node in depthFirst.AsDepthFirstEnumerable(x => x.Children)
        select node.Id;

 // confirm that this list of IDs is in depth-first order
 Assert.AreEqual(new int[] { 1, 2, 3, 4, 5, 6, 7 }, IDs.ToArray());
}

For breadth-first collections, the calling convention is:

IEnumerable<Node> = node.AsBreadthFirstEnumerable(n => n.Children);

Again, we can test that the extension works like this:

[Test]
public void BreadthFirst()
{
 // build the tree in breadth-first order
 var id = 1;
 var breadthFirst = new Node(id);
 var bf2 = breadthFirst.AddChild(++id);
 var bf3 = breadthFirst.AddChild(++id);
 var bf4 = bf2.AddChild(++id);
 var bf5 = bf2.AddChild(++id);
 var bf6 = bf3.AddChild(++id);
 var bf7 = bf3.AddChild(++id);

 // find all nodes in breadth-first order and select just the Id of each node
 var IDs = from node in breadthFirst.AsBreadthFirstEnumerable(x => x.Children)
       select node.Id;

 // confirm that this list of IDs is in depth-first order
 Assert.AreEqual(new int[] { 1, 2, 3, 4, 5, 6, 7 }, IDs.ToArray());
}

Searching Trees

The tree used in the example is of course extremely simple, i.e. it doesn’t even have any worthwhile data to query attached to a node. But these extension methods could be used on a node of any kind of tree, allowing the full power of Linq, grouping, aggregation, sorting, projection, etc. to be used on the tree.

As a final note, you may wonder, why bother with depth-first vs. breadth first? After all, in the end we do examine every node! There is however one particular case where the choice of algorithm can be very important: You are looking for one match or a particular number of matches. Since we are using yield, we can terminate the traversal at any time. Using the FirstOrDefault() extension on our Linq expression, the traversal would stop as soon as one match is found. And if have any knowledge where that node might be in the tree, the choice of search algorithm can be a significant performance factor.

Searching a Tree of Objects with Linq

UPDATE: Posted a follow-up here.
I’ve finally had legitimate use for LINQ to Objects, not just to make the syntax cleaner, but also to simplify the underlying code and provide me a lot of flexibility without significant work.

The scenario

I have a tree of objects that have both a type and a name. The name is unique, the Type is not. The interface is this:

public interface INode
{
  int Id { get; }
  string Name { get; set; }
  string Type { get; set; }
  List<INode> Children { get; }
}

I want to be able to find a single named Node in the tree and I want to be able to retrieve a collection of all nodes for a particular type. The searchable interface could be expressed as this:

public interface ISearchableNode : INode
{
  INode FindByName(string name);
  IEnumerable<INode> FindByType(string name);
}

Both require me to walk the tree and examine each node, so clearly I just want to have one walk routine and generically evaluate the node in question. In C# 2.0 parlance, that means I could pass an anonymous delegate into my generic find routine and have it recursively iterate through all the children. I also pass along a resultset to be populated.

The signature for the evaluation delegate looks like this:

delegate bool FindDelegate(INode node);

but since I’m using C# 3.0 (i.e. .NET 3.5) I can use lambda expressions to avoid creating a delegate and simplify my syntax. Instead of FindDelegate, I can simply use Func<INode,bool>:

// Instead of this:
private void Find(INode node, List<INode> resultSet, FindDelegate findDelegate);
// called like this for a Name search:
Find(this, resultSet, delegate(INode node) { return node.Name == name; });

// I can use this:
private void Find(INode node, List<INode> resultSet, Func<INode, bool> f)
// called like this:
Find(this, resultSet, node => node.Name == name);

Thus giving me the following implementation for ISearchableNode:

public INode FindByName(string name)
{
  List<INode> resultSet = new List<INode>();
  Find(this, resultSet, x => x.Name == name);
  return resultSet.FirstOrDefault();
}

public IEnumerable<INode> FindByType(string type)
{
  List<INode> resultSet = new List<INode>();
  Find(this, resultSet, x => x.Type == type);
  return (IEnumerable<INode>)resultSet;
}

private void Find(INode node, List<INode> resultSet, Func<INode, bool> f)
{
  if (f(node))
  {
    resultSet.Add(node);
  }
  foreach (INode child in node.Children)
  {
    Find(child, resultSet, f);
  }
}

Problem solved, move on… Well, except there is significant room for improvement. Here are the two main issues that ought to be resolved:

  1. Syntax is limited to two types of searches and exposing the generic find makes for an ugly syntax. It would be much nicer if queries to the tree could be expressed in LINQ syntax.
  2. It’s also inefficient for the Name search, since I’m walking the entire tree, even if the first node matched the criteria.

LINQ to Hierarchical Data

In order to use LINQ to objects, I need to either create a custom query provider or implement IEnumerable. The latter is significantly simpler and could be expressed using the following interface:

public interface IQueryableNode : IEnumerable<INode> { }

Ok, ok, I don’t even need an interface, I could just implement IEnumerable<INode>… But what does that actually mean? In the simplest sense, I’m iterating over the node’s children, however, I also with do descend into the children’s children and so on. So a simple foreach won’t do. I could just do the same tree walking with a resultset as I did above and return the Enumerator of the resulting list to implement the interface, but C# 2.0 introduced a much more useful way to implement non-linear Iterators, i.e. the yield keyword. Instead of building a list to be interated over, yield let’s the iterating code return values as they are found, which means it can be used for recursive iteration. Thus the GetEnumerator is implemented simply as follows

#region IEnumerable<Node> Members
public IEnumerator<INode> GetEnumerator()
{
  yield return this;
  foreach (Node child in Children)
  {
    foreach (Node subchild in child)
    {
      yield return subchild;
    }
  }
}
#endregion

#region IEnumerable Members
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
  return this.GetEnumerator();
}
#endregion

Nice and simple and ready for LINQ.

Searching for all Nodes of a type becomes

var allBar = from n in x
                where n.Type == "bar"
                select n;
foreach (Node n in allBar)
{
  // do something with that node
}

and the search for a specifically named node becomes

INode node = (from n in x
              where n.Name == "i"
              select n).FirstOrDefault();

But the real benefit of this approach is that I don’t have hard-coded search methods, but can express much more complex queries in a very natural syntax without any extra code on the Node.

Deferred execution

As it turns out, using yield for the recursive iteration also solved the second issue. As yield returns values as it encounters them during iteration, the search doesn’t happen until the query is executed. And one of the side effects of LINQ syntax is that creating a query does not execute it until the result set is iterated over. Therefore, FirstOrDefault() actually short-circuits the query as soon as the first match (and in case of Name, it’s going to be the only match) is hit.

By arne on | .net | 4 comments
Tags: , , , , ,