LINQ: Immutability vs. Deferred execution

For the last couple of nights I've been playing with some LINQ to SQL and a whole lot of LINQ to Objects. Coming up with complex regular expressions used to be one of my favorite puzzles, but coming up with complex projections and transformations through LINQ is quickly taking its place. Simple LINQ is well documented, but documentation on aggregation is a lot sparser. I expect to write more of that up once I feel more comfortable with the syntax.

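In the meantime, I wanted to write up a non-obvious observation about deferred execution with LINQ. Considering the gotchas with lambdas, it's tempting to extend the lessons learned there to LINQ, since it is after all deferred execution. But what's different with LINQ is that, while execution is deferred, the query itself is immutable once built: its source is bound when the query is defined, not when it runs. (Strictly speaking, only providers such as LINQ to SQL build expression trees; LINQ to Objects builds a chain of iterators and delegates, but the observation applies to both.) I came across this trying to do some simple query re-use.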

Let's start with a simple DTO:

public class Order
{
  public Order(int id, int val, bool buyOrder)
  {
    Id = id;
    Value = val;
    IsBuyOrder = buyOrder;
  }
  public int Id { get; set; }
  public int Value { get; set; }
  public bool IsBuyOrder { get; set; }
}

And a set of this data:

Order[] orders = new Order[]
{
  new Order(1,2,true),
  new Order(2,2,false),
  new Order(3,4,true),
  new Order(4,4,false),
  new Order(5,6,true),
  new Order(6,6,false),
};

Let's split those into buy and sell orders:

var buyOrders = from order in orders
                where order.IsBuyOrder
                select order;

var sellOrders = from order in orders
                 where !order.IsBuyOrder
                 select order;

If we want to find the buy and the sell order with a value of 2, you'd think we could write one query and re-use it for both. Since both queries result in an IEnumerable<Order>, how about we define a query source variable, build the query against it, and then assign either of the above queries to that variable?

IEnumerable<Order> orders2 = null;

var orderAtTwo = from order in orders2
                 where order.Value == 2
                 select order;

orders2 = buyOrders;
int buyOrderId = orderAtTwo.First().Id;

orders2 = sellOrders;
int sellOrderId = orderAtTwo.First().Id;

Console.WriteLine("buy Id: {0}, sell Id: {1}", buyOrderId, sellOrderId);

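Since the query is deferred until we call .First() on it, that seems like a reasonable syntax. Except this will result in a System.ArgumentNullException, because the query grabbed the value of orders2 (null) at query definition, even though the query won't be executed until later. In fact, the exception is thrown on the line that defines orderAtTwo, before .First() is ever reached: the query expression compiles to orders2.Where(...), and Where checks its source argument immediately. Giving orders2 a new value afterwards does not change the reference already stored in the immutable expression tree.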
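To see where the binding happens, it helps to remember that the compiler rewrites the query expression into a call to the Where extension method, with the source passed as an ordinary argument. The following is a minimal sketch of that behavior using plain integers instead of Order objects; the class and method names (SourceBindingDemo and friends) are my own illustration, not part of the original code. It contrasts the source, which is read when the query is defined, with a local variable used inside the lambda, which is read when the query runs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SourceBindingDemo
{
    // Returns true if merely defining a query over a null source throws.
    public static bool DefiningOverNullThrows()
    {
        IEnumerable<int> source = null;
        try
        {
            // Method-syntax equivalent of: from n in source where n == 2 select n
            var query = source.Where(n => n == 2);
            return false;
        }
        catch (ArgumentNullException)
        {
            return true; // thrown at definition, before any enumeration
        }
    }

    // The source reference is captured when the query is defined...
    public static int FirstAfterReassigningSource()
    {
        IEnumerable<int> source = new[] { 1, 2, 3 };
        var query = source.Where(n => n > 1);
        source = new[] { 10, 20, 30 }; // the query still holds the old array
        return query.First();
    }

    // ...but a local used inside the lambda is read at execution time.
    public static int FirstAfterReassigningThreshold()
    {
        int threshold = 1;
        var query = new[] { 1, 2, 3 }.Where(n => n > threshold);
        threshold = 2; // the closure sees the new value
        return query.First();
    }

    public static void Main()
    {
        Console.WriteLine(DefiningOverNullThrows());         // True
        Console.WriteLine(FirstAfterReassigningSource());    // 2
        Console.WriteLine(FirstAfterReassigningThreshold()); // 3
    }
}
```

So the familiar lambda gotcha (captured variables are evaluated late) still applies to whatever is inside the where clause, but not to the sequence named in the from clause.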

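A way around this is to replace the actual contents of orders2 instead of reassigning the variable. However, for us to do that, we first have to turn the query source into a mutable collection, such as a List<Order>, and define the query against it.

// orders2 is now a mutable list instead of a null IEnumerable<Order>
List<Order> orders2 = new List<Order>();

var orderAtTwo = from order in orders2
                 where order.Value == 2
                 select order;

orders2.Clear();
orders2.AddRange(buyOrders);
int buyOrderId = orderAtTwo.First().Id;

orders2.Clear();
orders2.AddRange(sellOrders);
int sellOrderId = orderAtTwo.First().Id;

Console.WriteLine("buy Id: {0}, sell Id: {1}", buyOrderId, sellOrderId);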

This gives us the expected output:

buy Id: 1, sell Id: 2

Even if we put aside the awkwardness of clearing out a list and stuffing data back in, this code has another unfortunate side effect. .AddRange() actually executes the query passed to it, so we execute our buy and sell queries in full to populate orders2 and then execute orderAtTwo against each of the resulting collections. The beauty of LINQ is that if you create a query from a query, you're not running multiple queries, but building a more complex query to be executed. So what we really want is query "re-use" that results in a single expression tree at execution time.
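The difference between composing a query and copying its results can be made visible by counting how often the inner predicate runs. This is only a sketch with made-up names (CompositionDemo, Evens, InnerCalls) and integers rather than orders: composing runs nothing, First() on the composed query pulls just enough items to find a match, and AddRange() runs the inner query to completion.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CompositionDemo
{
    // Counts how many times the inner predicate has been evaluated.
    public static int InnerCalls;

    static readonly int[] Numbers = { 1, 2, 3, 4, 5, 6 };

    // The "inner" query: even numbers, counting each predicate call.
    public static IEnumerable<int> Evens(IEnumerable<int> source)
    {
        return source.Where(n => { InnerCalls++; return n % 2 == 0; });
    }

    // Composing a query on top of another query executes nothing.
    public static int CallsAfterComposing()
    {
        InnerCalls = 0;
        var composed = Evens(Numbers).Where(n => n > 2);
        return InnerCalls;
    }

    // First() pulls items through the composed query only until one matches (4).
    public static int CallsAfterFirst()
    {
        InnerCalls = 0;
        var composed = Evens(Numbers).Where(n => n > 2);
        composed.First();
        return InnerCalls;
    }

    // AddRange() enumerates the whole inner query up front.
    public static int CallsAfterAddRange()
    {
        InnerCalls = 0;
        var list = new List<int>();
        list.AddRange(Evens(Numbers));
        return InnerCalls;
    }

    public static void Main()
    {
        Console.WriteLine(CallsAfterComposing()); // 0
        Console.WriteLine(CallsAfterFirst());     // 4
        Console.WriteLine(CallsAfterAddRange());  // 6
    }
}
```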

To achieve this, we need to move the shared query into a separate method such as:

private IEnumerable<Order> GetTwo(IEnumerable<Order> source)
{
  return from order in source
         where order.Value == 2
         select order;
}

and the code becomes:

int buyOrderId = GetTwo(buyOrders).First().Id;
int sellOrderId = GetTwo(sellOrders).First().Id;

Console.WriteLine("buy Id: {0}, sell Id: {1}", buyOrderId, sellOrderId);

This gives the same output as above, and we're only running two queries, each against the original collection. The method call means that we don't get to re-use an expression tree as such: each call builds a new one, combining the expression tree passed to it with the one the method builds itself. What gets re-used is the query definition, which is what we were after in the first place.
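One possible variation on the helper method, offered as a sketch rather than as the way to do it, is to write it as an extension method so it chains like the built-in operators, and to pass the value in as a parameter. The names OrderQueries, WithValue and ExtensionDemo are hypothetical; the Order class and data are the ones from above, and the buy and sell queries are written in method syntax here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Order
{
    public Order(int id, int val, bool buyOrder)
    {
        Id = id;
        Value = val;
        IsBuyOrder = buyOrder;
    }
    public int Id { get; set; }
    public int Value { get; set; }
    public bool IsBuyOrder { get; set; }
}

public static class OrderQueries
{
    // Hypothetical helper: like GetTwo, but chainable and parameterized.
    // Each call returns a new deferred query layered on top of 'source'.
    public static IEnumerable<Order> WithValue(this IEnumerable<Order> source, int value)
    {
        return source.Where(order => order.Value == value);
    }
}

public static class ExtensionDemo
{
    public static string Describe()
    {
        Order[] orders =
        {
            new Order(1, 2, true),
            new Order(2, 2, false),
            new Order(3, 4, true),
            new Order(4, 4, false),
            new Order(5, 6, true),
            new Order(6, 6, false),
        };

        var buyOrders = orders.Where(order => order.IsBuyOrder);
        var sellOrders = orders.Where(order => !order.IsBuyOrder);

        // Nothing has executed yet; First() runs each composed query once.
        int buyOrderId = buyOrders.WithValue(2).First().Id;
        int sellOrderId = sellOrders.WithValue(2).First().Id;

        return string.Format("buy Id: {0}, sell Id: {1}", buyOrderId, sellOrderId);
    }

    public static void Main()
    {
        Console.WriteLine(Describe()); // buy Id: 1, sell Id: 2
    }
}
```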