ILoggable

A place to keep my thoughts on programming

 Subscribe

geekblog
[at]
claassen [dot] net

Powered by Blogger

Friday, February 01, 2008

LINQ: Immutability vs. Deferred execution

The last couple of nights I've been playing with some Linq to Sql and a whole lot of Linq to Objects and I have to say where coming up with complex Regular Expressions used to be one of my favorite puzzles, coming up with complex projections and transformations through Linq is quickly taking its place. Simple Linq is well documented, but when it comes to aggregation, it's a lot sparser. I expect to write more of that up once I feel more comfortable with the syntax.

In the meantime, I wanted to write up some non-obvious observation about deferred execution with Linq. Considering the gotchas with lambdas, it's easy to extend the lessons learned to linq, since it is after all deferred execution. But what's different with Linq is that, while execution is deferred, the expression tree built via a query is also immutable. I came across this trying to do some simple query re-use.

Let's start with a simple DTO:

public class Order
{
  public Order(int id, int val, bool buyOrder)
  {
    Id = id;
    Value = val;
    IsBuyOrder = buyOrder;
  }
  public int Id { get; set; }
  public int Value { get; set; }
  public bool IsBuyOrder { get; set; }
}
And a set of this data:
Order[] orders = new Order[]
{
  new Order(1,2,true),
  new Order(2,2,false),
  new Order(3,4,true),
  new Order(4,4,false),
  new Order(5,6,true),
  new Order(6,6,false),
};

Let's split those into buy and sell orders:

var buyOrders = from order in orders
          where order.IsBuyOrder
          select order;

var sellOrders = from order in orders
                 where !order.IsBuyOrder
                 select order;

If we want to find the buy and the sell order with a value of 2, you'd think we could write one query and re-use it for both of those queries. Since both queries results in IEnumerable<Order>, how about we define a query source and assign the value of either above query.

IEnumerable<Order> orders2 = null;

var orderAtTwo = from order in orders2
                 where order.Value == 2
                 select order;

orders2 = buyOrders;
int buyOrderId = orderAtTwo.First().Id;

orders2 = sellOrders;
int sellOrderId = orderAtTwo.First().Id;

Console.WriteLine("buy Id: {0}, sell Id: {1}", buyOrderId, sellOrderId);
Since the query is deferred until we call .First() on it, that seems like a reasonable syntax. Except this will result in an System.ArgumentNullException because our query grabbed a reference to orders2 at query definition, even though the query won't be executed until later. Giving orders2 a new value does not change the original reference in the immutable expression tree.

A way around this is to replace the actual contents of orders2. However, for us to do that, we have to turn it into the query source into a collection first.

orders2.Clear();
orders2.AddRange(buyOrders);
int buyOrderId = orderAtTwo.First().Id;

orders2.Clear();
orders2.AddRange(sellOrders);
int sellOrderId = orderAtTwo.First().Id;

Console.WriteLine("buy Id: {0}, sell Id: {1}", buyOrderId, sellOrderId);
This gives us the expected
buy Id: 1, sell Id: 2
Let's put aside the awkwardness of clearing out a list and stuffing data back in, this code has another unfortunate sideeffect. .AddRange() actually executes the query passed to it, so we execute our buy and sell queries to populate orders2 and then execute orderAtTwo twice against those collections. The beauty of linq is that if you create a query from a query, your not running multiple queries, but building a more complex query to be executed. So, what we really want is query "re-use" that results in single expression trees at execution time.

To achieve this, we need to move the shared query into a separate method such as:

private IEnumerable<Order> GetTwo(IEnumerable<Order> source)
{
  return from order in source
         where order.Value == 2
         select order;
}
and the code becomes:
int buyOrderId = GetTwo(buyOrders).First().Id;
int sellOrderId = GetTwo(sellOrders).First().Id;

Console.WriteLine("buy Id: {0}, sell Id: {1}", buyOrderId, sellOrderId);
This gives the same output as above, and we're only running two queries, each against the original collection. The method call means that we don't get to re-use an expression tree, since it builds a new one, combining the expression tree passed to it with the one it builds itself.

Labels: , ,

Friday, January 11, 2008

More Deferred Execution Fun: foreach and delegation scope

This is closely related to my last post on deferred execution gotchas and its basically more "if you inline delegated code, you may easily overlook scope side-effects". This time it's about dealing with foreach and using the local each item for deferred execution.
public void SpawnActions()
{
  foreach (ActionContext context in contexts)
  {
    int id = context.Id;
    Action<int> callback =  
      (workerNumber) =>
      {
        Console.WriteLine("{0} Id: {1}/{2}", workerNumber, id, context.Id);
      };
    ThreadPool.QueueUserWorkItem(new WaitCallback(FutureExecute), callback);
  }
}

public void FutureExecute(object state)
{
  int id = worker++;
  Action<int> callback = state as Action<int>;
  Thread.Sleep(500);
  callback(id);
}

The output looks like this:

0 Id: 0/9
1 Id: 1/9
2 Id: 2/9
3 Id: 3/9
4 Id: 4/9
5 Id: 5/9
6 Id: 6/9
7 Id: 7/9
8 Id: 8/9
9 Id: 9/9
So while the foreach scope variable context is kept alive for the deferred execution, it turns out that foreach re-uses the variable on each pass through the loop and therefore when the Action is later executed, each one has a reference to the last context. So what we need to do is create a true local variable for context so that the lambda's scope can hold on to the reference we want.
  foreach (ActionContext context in contexts)
  {
    int id = context.Id;
    // locally scoped variable
    ActionContext c2 = context;
    Action<int> callback =  
      (workerNumber) =>
      {
        Console.WriteLine("{0} Id: {1}/{2}", workerNumber, id, c2.Id);
      };
    ThreadPool.QueueUserWorkItem(new WaitCallback(FutureExecute), callback);
  }

And now our results are a bit more what we expected:

0 Id: 0/0
1 Id: 1/1
2 Id: 2/2
3 Id: 3/3
4 Id: 4/4
5 Id: 5/5
6 Id: 6/6
7 Id: 7/7
8 Id: 8/8
9 Id: 9/9

Labels: , , , ,

Sunday, December 30, 2007

The dangers of deferred execution

I recently wrote about Action & Func, which along with Lambda expression let you do easy inline callbacks like this:
Utility.ActionDownloader.Download(
  Configuration.GetAssetUri(dto.Url),
  (Downloader d) =>
  {
    FloatContainer c = (FloatContainer)XamlReader.Load(d.ResponseText);
    c.Initialize(dto);
  });
i.e. I can call a downloader and inline pass it a bit of code to execute once the download completes. But the catch of course is that looking at the code, and following the usual visual tracing of flow hides the fact that c.Initialize(dto) doesn't get called until some asynchronous time in the future. Now, that's always been a side-effect of delegates, but until they became anonymous and inline, the visual deception of code that looks like it's in the current flow scope but isn't wasn't there.

What happened was that I needed my main routine to execute some code after FloatContainer was initialized, and by habit i created an Initialized event on FloatContainer. Of course this was superfluous, since my lambda expression called the synchronous Initialize, i.e my action could be placed inline after that call to c.Initialize(dto) and be guaranteed to be called after initialization had completed.

This scenario just meant I created some superfluous code. However, I'm sure as I use lambda expression more, there will be more pitfalls of writing code that doesn't consider that its execution time is unknown, as is the state of the objects tied to the scope of the expression.

This last bit about objects tied to the expression scope is especially tricky and I think we will see some help in terms of Immutable concepts weaving their way into C# 3.x or 4.0, as the whole functional aspect of lambda expressions really work best when dealing with objects that cannot change state. Eric Lippert's been laying the groundwork in a number of posts on the subject and while he constantly disclaims that his ponderings are not a roadmap for C#, I am still going to assume that his interest and recognition of the subject of Immutables will have some impact in a future revision of the language. Well, I at least hope it does.

Labels: , , ,

Thursday, December 13, 2007

Action & Func: Never write another delegate

With lambda expressions in C#, the Func generic delegate and it's variations have been getting a lot of attention. So naturally, you might think that the lambda syntax is just a shortcut for creating anonymous delegates, whether they return values or not.

First let's look at the evolution of delegates from 1.1 to now. Delegates, simply are the method equivalent of function pointers. They let you pass a method call as an argument for later execution. The cool thing (and a garbage collection pitfall) is that a delegate creates a lexical closure, i.e. the delegate carries with it the object that the method gets called on. For garbage collection this means that a delegate prevents an object from being collection. That's why it's important to unsubscribe from those events you subscribed to.

But I digress. Let's define a delegate that returns an Integer and a method that matches that delegate:

delegate int IntProducerDelegate();

public int x = 0;
public int IntProducer()
{
  return x++;
}

With the original .NET 1.0 syntax we'd create the delegate like this:

IntProducerDelegate p1 = new IntProducerDelegate(IntProducer);

Now we can call p1() and get an integer back, and since it's closure, each time we call p1() the originating objects x increases as does our return value.

Then, in .Net 2.0 we got anonymous delegates.

IntProducerDelegate p2 = delegate { return IntProducer(); };

// or with IntProducer's action inlined...
IntProducerDelegate p3 = delegate { return x++; };

This got rid of the need to create a method just to pass along a closure that manipulated our object at a later time. The other thing that anonymous delegates re-inforce is that delegates just care about signature. IntProducerDelegate can get assigned any delegate that takes no argument and returns an int. That sounds like a perfect scenario for a delegate and in .NET 3.5, we got just that, a set of generic delegates called Func. Using Func, we quickly get to our lambda expression replacing the original delegate syntax like this:

// create a new Func delegate just like the IntProducerDelegate
IntProducerDelegate p3 = new Func<int>(IntProducer);

// which means that we don't need IntProducerDelegate at all anymore
Func<int> p4 = delegate { return x++; };

// and the anonymous delegate can also be shorthanded with a lambda expression
Func<int> p5 = () => { return x++; };
// which says, given that we take no argument "()", execute and return the following "return x++;"

However, before there ever was Func, .Net 2.0 introduced the generic delegate Action, which is a natural counterpart to Func, encapsulating a method that does not return anything. Following through the example of the producer, we'll create a consumer like:

delegate void IntConsumerDelegate(int i);

public void IntConsumer(int i)
{
  Console.WriteLine("The number is {0}", i);
}

Now following the same evolution of syntax we get this:

IntConsumerDelegate c1 = new IntConsumerDelegate(IntConsumer);

IntConsumerDelegate c2 = new Action<int>(IntConsumer);

Action<int> c3 = delegate(int i) { Console.WriteLine("The number is {0}", i); };

Action<int> c4 = (i) => { Console.WriteLine("The number is {0}", i); };

So lambda syntax can be used to create either a Func or an Action. And that also means that we never have to explicitly need to create another delegate, being able to use a variation of these two generic delegates as our arsenal for storing lambda expressions of all kinds.

Labels: , , ,

Wednesday, November 28, 2007

Searching a Tree of Objects with Linq

I've finally had legitimate use for LINQ to Objects, not just to make the syntax cleaner, but also to simplify the underlying code and provide me a lot of flexibility without significant work.

The scenario

I have a tree of objects that have both a type and a name. The name is unique, the Type is not. The interface is this:

public interface INode
{
  int Id { get; }
  string Name { get; set; }
  string Type { get; set; }
  List<INode> Children { get; }
}

I want to be able to find a single named Node in the tree and I want to be able to retrieve a collection of all nodes for a particular type. The searchable interface could be expressed as this:

public interface ISearchableNode : INode
{
  INode FindByName(string name);
  IEnumerable<INode> FindByType(string name);
}
Both require me to walk the tree and examine each node, so clearly I just want to have one walk routine and generically evaluate the node in question. In C# 2.0 parlance, that means I could pass an anonymous delegate into my generic find routine and have it recursively iterate through all the children. I also pass along a resultset to be populated.

The signature for the evaluation delegate looks like this:

delegate bool FindDelegate(INode node);
but since I'm using C# 3.0 (i.e. .NET 3.5) I can use lambda expressions to avoid creating a delegate and simplify my syntax. Instead of FindDelegate, I can simply use Func<INode,bool>:
// Instead of this:
private void Find(INode node, List<INode> resultSet, FindDelegate findDelegate);
// called like this for a Name search:
Find(this, resultSet, delegate(INode node) { return node.Name == name; });

// I can use this:
private void Find(INode node, List<INode> resultSet, Func<INode, bool> f)
// called like this:
Find(this, resultSet, node => node.Name == name);

Thus giving me the following implementation for ISearchableNode:

public INode FindByName(string name)
{
  List<INode> resultSet = new List<INode>();
  Find(this, resultSet, x => x.Name == name);
  return resultSet.FirstOrDefault();
}

public IEnumerable<INode> FindByType(string type)
{
  List<INode> resultSet = new List<INode>();
  Find(this, resultSet, x => x.Type == type);
  return (IEnumerable<INode>)resultSet;
}

private void Find(INode node, List<INode> resultSet, Func<INode, bool> f)
{
  if (f(node))
  {
    resultSet.Add(node);
  }
  foreach (INode child in node.Children)
  {
    Find(child, resultSet, f);
  }
}

Problem solved, move on... Well, except there is significant room for improvement. Here are the two main issues that ought to be resolved:

  1. Syntax is limited to two types of searches and exposing the generic find makes for an ugly syntax. It would be much nicer if queries to the tree could be expressed in LINQ syntax.
  2. It's also inefficient for the Name search, since I'm walking the entire tree, even if the first node matched the criteria.

LINQ to Hierarchical Data

In order to use LINQ to objects, I need to either create a custom query provider or implement IEnumerable. The latter is significantly simpler and could be expressed using the following interface:

public interface IQueryableNode : IEnumerable<INode>
{
}
Ok, ok, I don't even need an interface, I could just implement IEnumerable<INode>... But what does that actually mean? In the simplest sense, I'm iterating over the node's children, however, I also with do descend into the children's children and so on. So a simple foreach won't do. I could just do the same tree walking with a resultset as I did above and return the Enumerator of the resulting list to implement the interface, but C# 2.0 introduced a much more useful way to implement non-linear Iterators, i.e. the yield keyword. Instead of building a list to be interated over, yield let's the iterating code return values as they are found, which means it can be used for recursive iteration. Thus the GetEnumerator is implemented simply as follows
#region IEnumerable<Node> Members
public IEnumerator<INode> GetEnumerator()
{
  yield return this;
  foreach (Node child in Children)
  {
    foreach (Node subchild in child)
    {
      yield return subchild;
    }
  }
}
#endregion

#region IEnumerable Members
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
  return this.GetEnumerator();
}
#endregion
Nice and simple and ready for LINQ.

Searching for all Nodes of a type becomes

var allBar = from n in x
                where n.Type == "bar"
                select n;
foreach (Node n in allBar)
{
  // do something with that node
}
and the search for a specifically named node becomes
INode node = (from n in x
              where n.Name == "i"
              select n).FirstOrDefault();
But the real benefit of this approach is that I don't have hard-coded search methods, but can express much more complex queries in a very natural syntax without any extra code on the Node.

Deferred execution

As it turns out, using yield for the recursive iteration also solved the second issue. As yield returns values as it encounters them during iteration, the search doesn't happen until the query is executed. And one of the side effects of LINQ syntax is that creating a query does not execute it until the result set is iterated over. Therefore, FirstOrDefault() actually short-circuits the query as soon as the first match (and in case of Name, it's going to be the only match) is hit.

Labels: , , , , ,