Linq2MongoDB: Building a Linq Provider for MongDB
This weekend has been a
hack-a-thon, trying to build a simple linq provider for
MongoDB. I'm using
Sam Corder, et al.'s excellent C# MongoDB Driver as the query pipeline, so my provider really is just a translator from Linq syntax to Mongo Document Query syntax. I call it a
hack-a-thon, because it's my first linq provider attempt and, boy, is that query translator state machine ugly already. However, I am covering every bit of syntax with tests, so that once i understand it all better, i can rewrite the translator in a cleaner fashion.
My goals for this provider is to replace a document storage layer i've built for a new notify.me project using NHibernate against mysql. This is in no way a judgment against NHibernate. It just happens that for this project, my schema is a heavily denormalized json document database. While fluent NHibernate made it a breeze to let me map it into mysql, it's really an abuse of an RDBMS. It was a case of prototyping with what you know, but now it's time to evaluate whether a document database is the way to go.
Replacing existing NHibernate code does mean that, eventually, i want the provider to work with POCO entities and use a fully strong-typed query syntax. But that layer will be built on top of the string-key based version i'm building right now. The string-key based version will be the primary layer, so that you never loose any of the schema-less flexibility of MongoDB, unless you choose to.
Basic MongoDB queries
So, lacking an entity with named properties to map against, what does the syntax look like right now? First thing we need is an IQueryable<Document> which is created like this:
var mongo = new Mongo();
var queryable = mongo["db"]["collection"].AsQueryable();
Given the queryable, the queries can be built using the
Document indexer like this:
var q = from d in queryable where (string)d["foo"] == "bar" select d;
The
Document returns an
object, which means a cast is unfortunately required on one side of the conditional. Alternatively,
Equals, either the static or instance version, also works, alleviating the need for a cast:
var q = from d in queryable where Equals(d["foo"], "bar") select d;
// OR
var q = from d in queryable where d["foo"].Equals("bar") select d;
Better, but it's not as nice as operator syntax would be, if we could get rid of the casts..
As it turns out there is a number of query operators in MongoDB that don't have an equivalent syntax in Linq, so a helper class to generate query expression was already needed. The helper is instantiated via the Document extension method .Key(key), giving us the opportunity to overload operators for the various types recognized by MongoDB's BSON. This allows for the following conditional syntax:
var q = from d in queryable
where d.Key("type") == "customer" &&
d.Key("created") >= DateTime.Parse("2009/09/27")
d.Key("status") != "inactive"
select d;
IN and NOT IN
In addition to normal conditional operators, the query expression helper class also defines IN and NOT IN syntax:
var in = from d in queryable where d.Key("foo").In("bar", "baz") select d;
var notIn = from d in queryable where d.Key("foo").NotIn("bar", "baz") select d;
The helper will be the point of extension to support more of MongoDB's syntax, so that most query definitions will use the
d.Key(key) syntax.
findOne, limit and skip
Linq has matching counter parts of MongoDB's
findOne(),
limit() and
skip(), in
First or
FirstOrDefault,
Take and
Skip respectively, and the current version of Linq provider already supports them.
What's missing?
There is a lot in Linq that will likely never be supported, since MongoDB is not a relational DB. That means joins, sub-queries, etc. will not covered by the provider. Anything that does map to MongoDB's capabilities, though, will be added over time. The low hanging fruit are
Count() and
order by, with
group by following thereafter.
Surprisingly, || (or conditionals) are not going to happen as fast, since aside from or type queries using the .In syntax, it is not directly supported by MongoDB. In order to perform || queries, the query has to be written as a javascript function, which would basically mean that as soon as a single || shows up in the where clause the query translato would have to rewrite all other conditions in javascript as well. So, that's a bit more on the nice to have end of the spectrum of priorities.
Ready to go!
I will most likely concentrate on the low hanging fruit and then work on the POCO query layer next, since my goal is to be able to try out MongoDB as an alternative to my NHibernate code.
All that said, the code described above works now and is ready for some test driving. It's currently only in my branch on github, but I hope it will make it into the master soon.
Labels: c#, fluent interface, linq, mongodb, mysql, NHibernate, nosql
Composing remote and local linq queries
One of the cool things with Linq is that queries are composable, i.e.
you can add further query constraints by selecting from an existing query. Nothing is executed until you try to read from the query. This allows IQueryable to compose all the added constraints and transform it into the underlying query structure, most commonly SQL.
However it does come with the pitfall that there are a lot of things legal in Linq expressions that will die at run time. This happens because an expression may not have an equivalent syntax in the transformed language, like calling a method as part of a where clause.
This does not mean that you can't use linq for the portion of the query that is not executable by the provider. As long as you know what expression is affected, you can use query composition to build a query that executes some part remotely and some part against the object model in memory.
Let's suppose we wish to execute the following query:
var q = from e in session.Queryable<Entry>()
where e.Created > DateTime.Parse("2009/9/1")
&& e.Created < DateTime.Now
&& e.Tags.Contains('foo')
select e;
But our query provider doesn't understand the extension method that allows us to check the list of Tags. In order for this query to work, that portion must be executed against the result set of the date range query. We could coerce the first portion to a list or array and then query that portion, but that would just force the date query to be materialized before we could prune the set. Instead we want to feed the stream of matching entries into a second query, composing a new query that contains both portions as a single query and won't access the database until we iterate over it.
To accomplish this I created an extension method that coerces the query into a sequence that yields each item as it is returned by the database query:
public static class LinqAdapter {
public static IEnumerable<T> AsSequence<T>(this IEnumerable<T> enumerable) {
foreach(var item in enumerable) {
yield return item;
}
}
}
UPDATE: As Scott points out in the comments, my AsSequence just re-implements what is already available in Ling as AsEnumerable. So the above really just serves to explain how AsEnumerable defers execution to enumeration rather than query definition.
Anyway, AsSequence or AsEnumerable allows me to compose the query from server and local expressions like this:
var q = from e in session.Queryable<Entry>()
where e.Created > DateTime.Parse("2009/9/1")
&& e.Created < DateTime.Now
select e;
q = from e in q.AsSequence() where e.Tags.Contains('foo') select e;
When q is enumerated, the first expression is converted to SQL and executes against the database. Each item returned from the database is then fed into the second query, which checks its expression and yields the item to the caller, should the expression match. Since
q.AsSequence() is used as part of query composition,
it does not force the first expression to execute at the time of query definition as q.ToList() would. The additional benefit is that even when
q.AsSequence() is executed, it never builds the entire result set in memory as a list to iterate over, but rather just streams each database query result item through its own expression evaluation.
Of course, this still have the performance implications of sending data across the wire and filtering it locally. However, this is not an uncommon problem when SQL alone cannot provide all the filtering. The benefit of this approach is reduced memory pressure on execution, better control when execution occurs and the ability to use Linq syntax to do the secondary filtering.
Labels: c#, linq
Hiding NHibernate with Linq
Been using NHibernate a lot recently, and just love having POCO data objects. This means that i can just hand out objects and manipulate them and then save them back to the DB without ever exposing my persistence model. For this purpose, I'd been using Data Access Objects to give me access to the objects. But like stored procedures, after a while you DAOs have this mess of Find methods on them that either have many overloads or, even worse, optionally nullable parameters. Never mind the acrobatics you jump through once you want to provide methods that query entities, based on some other entity, that's abstracted by its on DAO.
The option was to expose the NHibernate query api (talking ICriteria here, never touched HQL because the reason i went NHibernate in the first place is because i didn't want queries in strings), but that didn't taste right, so i added more methods on my DAO instead, hiding the criteria queries.
Once i started playing with the experimental Linq support, i didn't change my abstraction, just started using Linq instead of criteria inside the DAO. But last week, I finally broke down and just exposed session.Linq<T>(); as an IQueryable<T> on my DAO. For example my Account DAO, AccountService now simply looks like this:
public interface IAccountService
{
Account CreateAccount(JObject json);
Account GetAccount(int accountId);
IQueryable<Account> Accounts { get; }
void Save(Account account);
}
NHibernate is completely invisible, I have full query capabilities and even mocking has become easier, just returning a
List<T>.AsQueryable() from my mock DAO. I was able to remove all my Find methods. I did leave the by Id accessor GetAccount in there, just because it's the predominant use case, and i didn't want to litter code with identical Linq queries.
I have to say, i like this approach a lot.
Labels: DAO, IQueryable, linq, NHibernate
Boolean Algebra for Tag queries
Currently working on a tag query engine and couldn't find anything all that useful in the published approaches. I want to do arbitrary boolean algebras against a set of tags in a database, which seems to be out of scope of SQL approaches. All the various tagging schemas out there reduce to either just
AND or just
OR queries, but not complex logic. However, I want to be able to do something like:
(foo+bar)|(foo+baz)|(bar+^baz)
If there is a way to do this with SQL, i'd love to know. But the way i look at it, i really have to fetch all tags for each item and then do apply that formula to the list of tags on the item.
But let's break down the matching problem itself into something i can execute. Let's assume I've got a simple parser that can turn the above into an AST. Really, i can decompose any variation into three operations, AND(a,b), OR(a,b) and NOT(a). And I can represent those with some simple Func<> definitions:
Func<bool, bool, bool> AND = (a, b) => a && b;
Func<bool, bool, bool> OR = (a, b) => a || b;
Func<bool, bool> NOT = (a) => !a;
Assuming that i have boolean tokens for
foo,
bar and
baz, the expressions becomes:
OR(AND(foo, bar), OR(AND(foo, baz), AND(bar,NOT(baz))))
Now, the above expression can be expressed as a function that takes three booleans describing the presence of the mentioned tags, ignoring any other tags that the item has, returning a boolean indicating a successful match. In C# that expression would look like this:
Func<bool[], bool> f = x => OR(AND(x[0], x[1]), OR(AND(x[0], x[2]), AND(x[1],NOT(x[2]))));
Next we need to generate this boolean map from the list of tags on the item. Assuming that the tag list is a list of strings, we can define an extension methods on IEnumerable<string> to generate the boolean map like this:
public static bool[] GetBinaryMap(this IEnumerable<string> tags, string[] mask) {
var map = new bool[mask.Length];
foreach(var x in tags) {
for(var i = 0; i < mask.Length; i++) {
if(x == mask[i]) {
map[i] = true;
}
}
}
return map;
}
And with this we can define a linq query that will return us all matching items:
var mask = new[] { "foo", "bar", "baz"};
Func<bool[], bool> f = x => OR(AND(x[0], x[1]), OR(AND(x[0], x[2]), AND(x[1],NOT(x[2]))));
var match = from item in items
where f(item.Tags.GetBinaryMap(mask))
select item;
Clearly this is isn't the fastest executing query, since we first had to create our items, each item in which has a collection of tags. But there is a lot of optimizations left on the table here, such as using our tag mask to pre-qualify items, breaking down the AST into sub-matches that could be used against a cache to find items, etc.
But at least we have a fairly simple way to take complex boolean algebra on tags and convert them into something that we can evaluate generically
Labels: boolean algebra, c#, lambda, linq, tags
Searching a Tree of Objects with Linq, Revisited
A while back,
I wrote about searching through a tree using linq to objects. That post was mostly snippets of code about delegates, lambda's, yield and how it applies to linq -- more a technical exploration than an example. So I thought I'd follow it up with concrete extension methods to make virtually any tree searchable by Linq.
Linq, IEnumerable<T>, yield
All that is required to search a tree with Linq is creating a list of all nodes in the tree. Linq to Objects can operate on IEnumerable<T>. Really, Linq to objects is a way of expressing operations we've been doing forever in loops with if/else blocks. That means there isn't any search magic going on, it is a linear traversal of all elements in a set and examining each to determine whether it matches our search criteria.
To turn a tree into a list of node we need to walk and collect all children of every node. A simple task for a recursive list that carries along a list object to stuff every found node into. But there is a better way, using yield to return each item as it is encountered. Now we don't have to carry along a collection. Iterators using yield implement a pattern in which a method can return more than once. For this reason, a method using yield in C# must return an IEnumerable, so that the caller gets a handle to an object it can traverse the result of the multiple return values.
IEnumerable is basically an unbounded set. This is also the reason why unlike collections, it does not have a Count Property. It is entirely possible for an enumerator to return an infinite series of items.
Together IEnumerable<T> and yield are a perfect match for our problem, i.e. recursively walking a tree of nodes and return an unknown number of nodes.
Two types of Tree Traversal
Depth First
In depth-first traversal, the algorithm will dig continue to dig down a nodes children until it reaches a leaf node (a node without children), before considering the next child of the current parent node.
Breadth First
In breadth-first traversal, the algorithm will return all nodes at a particular depth first before considering the children at the next level. I.e. First return all the nodes from level 1, then all nodes from level 2, etc.
Tree to IEnumerable<T> Extension methods
public static class TreeToEnumerableEx
{
public static IEnumerable<T> AsDepthFirstEnumerable<T>(this T head, Func<T, IEnumerable<T>> childrenFunc)
{
yield return head;
foreach (var node in childrenFunc(head))
{
foreach (var child in AsDepthFirstEnumerable(node, childrenFunc))
{
yield return child;
}
}
}
public static IEnumerable<T> AsBreadthFirstEnumerable<T>(this T head, Func<T, IEnumerable<T>> childrenFunc)
{
yield return head;
var last = head;
foreach(var node in AsBreadthFirstEnumerable(head,childrenFunc))
{
foreach(var child in childrenFunc(node))
{
yield return child;
last = child;
}
if(last.Equals(node)) yield break;
}
}
}
This static class provides two extension methods that can be used on any object, as long as it's possible to express a function that returns all children of that object, i.e. the object is a node in some type of tree and has a method or property for accessing a list of its children.
An Example
Let's use a hypothetical Tree model defined by this Node class:
public class Node
{
private readonly List<Node> children = new List<Node>();
public Node(int id)
{
Id = id;
}
public IEnumerable<Node> Children { get { return children; } }
public Node AddChild(int id)
{
var child = new Node(id);
children.Add(child);
return child;
}
public int Id { get; private set; }
}
Each node simply contains a list of children and has an Id, so that we know what node we're looking at. The AddChild() method is a convenience method so we don't expose the child collection and no node can ever be added as a child twice.
The calling convention for a depth-first collection is:
IEnumerable<Node> = node.AsDepthFirstEnumerable(n => n.Children);
The lambda expression n => n.Children is the function that will return the children of a node. It simply states given n, return the value of the Children property of n. A simple test to verify that our extension works and to show us using the extension in linq looks like this:
[Test]
public void DepthFirst()
{
// build the tree in depth-first order
int id = 1;
var depthFirst = new Node(id);
var df2 = depthFirst.AddChild(++id);
var df3 = df2.AddChild(++id);
var df4 = df2.AddChild(++id);
var df5 = depthFirst.AddChild(++id);
var df6 = df5.AddChild(++id);
var df7 = df5.AddChild(++id);
// find all nodes in depth-first order and select just the Id of each node
var IDs = from node in depthFirst.AsDepthFirstEnumerable(x => x.Children)
select node.Id;
// confirm that this list of IDs is in depth-first order
Assert.AreEqual(new int[] { 1, 2, 3, 4, 5, 6, 7 }, IDs.ToArray());
}
For breadth-first collections, the calling convention is:
IEnumerable<Node> = node.AsBreadthFirstEnumerable(n => n.Children);
Again, we can test that the extension works like this:
[Test]
public void BreadthFirst()
{
// build the tree in breadth-first order
var id = 1;
var breadthFirst = new Node(id);
var bf2 = breadthFirst.AddChild(++id);
var bf3 = breadthFirst.AddChild(++id);
var bf4 = bf2.AddChild(++id);
var bf5 = bf2.AddChild(++id);
var bf6 = bf3.AddChild(++id);
var bf7 = bf3.AddChild(++id);
// find all nodes in breadth-first order and select just the Id of each node
var IDs = from node in breadthFirst.AsBreadthFirstEnumerable(x => x.Children)
select node.Id;
// confirm that this list of IDs is in depth-first order
Assert.AreEqual(new int[] { 1, 2, 3, 4, 5, 6, 7 }, IDs.ToArray());
}
Searching Trees
The tree used in the example is of course extremely simple, i.e. it doesn't even have any worthwhile data to query attached to a node. But these extension methods could be used on a node of any kind of tree, allowing the full power of Linq, grouping, aggregation, sorting, projection, etc. to be used on the tree.
As a final note, you may wonder, why bother with depth-first vs. breadth first? After all, in the end we do examine every node! There is however one particular case where the choice of algorithm can be very important: You are looking for one match or a particular number of matches. Since we are using yield, we can terminate the traversal at any time. Using the FirstOrDefault() extension on our Linq expression, the traversal would stop as soon as one match is found. And if have any knowledge where that node might be in the tree, the choice of search algorithm can be a significant performance factor.
Labels: .net, breadth-first, c#, depth-first, Extension method, linq, tree, yield
db40 indexing and query performance
Indexing on
db4o is a bit non-transparent, imho. There's barely a blurp in their Documentation app and it just tells you how to create an index and how to remove it. But you can't easily inspect that one exists, or whether it's being used. So i spent a good bit of time today trying to figure out why my queries were so slow, was an index created and if so, was it being used? The final answer is, if querying is slow in db4o, you're not using an index, because, OMG, you'll know when you do an indexed query.
Index basics
Given an object such as
public class Foo
{
public string Bar;
}
you create an index, globally
(meh) for that object on all databases you create thereafter, with this call:
Db4oFactory.Configure().ObjectClass(typeof(Foo)).ObjectField("Bar").Indexed(true);
So far, straight forward enough. But let's say you're using a property? Well, db4o does its magic by inspecting your underlying storage fields, so you have to index them, not the properties that expose them. That means if our object was supposed to have a readonly property
Bar, like this:
public class Foo
{
private string bar;
public Foo(string bar)
{
this.bar = bar;
}
public string Bar { get { return bar; } }
}
then the field you need to index is actually the private member
bar:
Db4oFactory.Configure().ObjectClass(typeof(Foo)).ObjectField("bar").Indexed(true);
Given this idiosyncrasy, the obvious question is
"what about automatic properties?". Well, as of right now the answer is, no such luck, because you'd have to reflect the underlying storage field that is created and index it, and you don't get any guarantees that field is named the same from compiler to compiler or version to version. That probably also means, that automatic properties are dangerous all around, because you may never get your data back if the storage changes, although on that conclusion i'm just speculating wildly.
Query performance
Index in hand, I decided to populate a DB, always checking if the existing item already existed, using a db4o native query. That started at 1 ms query time and then linearly increased with every item added. That sure didn't seem like an indexed search to me. I finally discovered a useful resource on the db4o site, but unfortunately it's behind a login, so google didn't help me find it and my
link to it will only take you to the login. That's a shame because this bit of information ought to be somewhere in big bold letters!
You
must have the following DLLs available for Native Queries to be optimized into SODA queries, which apparently is the format that hits the index:
- Db4obects.Db4o.Instrumentation.dll
- Db4objects.Db4o.NativeQueries.dll
- Mono.Cecil.dll
- Cecil.FlowAnalysis.dll
The query will execute fine, regardless of their presence, but the performance difference between the optimized, index using query and the unoptimized native query is orders of magnitude. My queries went from 100-500ms to 0.01ms, just by dropping those DLLs into my executable directory. Yeah, that's a useful change.
Interestingly enough, the same is not required for linq queries. They seem to hit the index without the extra help (although just to even run, Mono.Cecil and Cecil.FlowAnalysis need to be present, so here you at least get an error). There currently appears to be about 1ms overhead for parsing linq into SODA, but i'll take that hit for the syntactic sugar.
Conclusions
I'm pretty happy with simplicity and performance of
db4o so far. It seems like an ideal local, queryable persistence layer. The way it works does want to make me abstract my data model into simple data objects that are then converted into business entities. I'd rather have the attribute based markup of ActiveRecord, but that's not a deal breaker.
Labels: c#, db4o, linq, query performance
Db4o on .NET and Mono
After failing to get a cross-platform sample of NHibernate/Sqlite going, I decided to try out Db4o. This is for a simple, local object persistence layer anyhow, nothing more than a local cache, so db4o sounded perfect.
The initial DLLs for 7.4 worked beautifully on .NET but ran into problems on Mono. Apparently db4o imports FlushFileBuffers from kernel32.dll if your build target is not CF or mono. And in its call to FlushFileBuffers it uses FileStream.SafeFileHandle.DangerousGetHandle() which it not yet implemented under Mono, resulting in this exception:
Unhandled Exception: System.NotImplementedException: The requested feature is no
t implemented.
at System.IO.FileStream.get_SafeFileHandle () [0x00000]
at Sharpen.IO.RandomAccessFile.Sync () [0x00000]
at Db4objects.Db4o.IO.RandomAccessFileAdapter.Sync () [0x00000]
...
I found
this page on the Db4o site, which suggested just falling back to
FileStream.Handle. However, that for me just resulted in this:
Unhandled Exception: System.EntryPointNotFoundException: FlushFileBuffers
at (wrapper managed-to-native) Sharpen.IO.RandomAccessFile:FlushFileBuffers (intptr)
at Sharpen.IO.RandomAccessFile.Sync () [0x00000]
at Db4objects.Db4o.IO.RandomAccessFileAdapter.Sync () [0x00000]
...
So, i simply defined MONO as a compilation symbol in visual studio and rebuilt it. I figure the only time this code will run on Windows is during testing, so treating it as mono is fine. And that did solve my issues and i now have a DLL for db40 7.4 that works beautifully across .NET and mono from a single build.
Being a Linq nut, I immediately decided to skip the Native Query syntax and dive into using the Linq syntax instead. Which worked great on mono 2.0.1, but unfortunately on the current Redhat rpm (stuck back in 1.9.1 lang), the Linq implementation isnt' quite complete and you get this:
Unhandled Exception: System.NotImplementedException: The requested feature is not implemented.
at System.Linq.Expressions.MethodCallExpression.Emit (System.Linq.Expressions.EmitContext ec) [0x00000]
at System.Linq.Expressions.LambdaExpression.Emit (System.Linq.Expressions.EmitContext ec) [0x00000]
at System.Linq.Expressions.LambdaExpression.Compile () [0x00000]
at Db4objects.Db4o.Linq.Expressions.SubtreeEvaluator.EvaluateCandidate (System.Linq.Expressions.Expression expression) [0x00000]
...
But falling back from this syntax:
var items = from RosterItem r in db
where r.CreatedAt > DateTime.Now.Subtract(TimeSpan.FromMinutes(10))
select r;
to the NativeQuery syntax (with delegates replaced by lambda's):
var items = db.Query<RosterItem>(r => r.CreatedAt > DateTime.Now.Subtract(TimeSpan.FromMinutes(10)));
It's still a fairly compact and straight forward syntax, so until i finish setting up my own Centos mono RPMs, i'll stick to this syntax.
I need to run db4o through some more serious paces, but I like what I see so far.
Labels: .net, c#, db4o, linq
Saying LINQ is about databases is missing its true benefits
Just came across a
long discussion about
LINQ in Java on the ODBMS blog (thanks to
Miguel's tweet). There is some excellent discussion in there, but aside from a couple of people the discussion seemed to largely center on
- It's from MS so it's a bad idea to copy, and
- I don't think it adds anything over normal SQL syntax
The first is an unfortunate dismissal of a very powerful functional language construct because of its origin. And the second illustrates that the commenter does not truly understand what LINQ brings to the language in the first place.
Of course, I'd bet that 90% of .NET developers, if polled, would also equate LINQ with "type-safe SQL in the language", so this isn't a dig against Java people. Hopefully as Parallel LINQ gains some traction, this simplification will loose it's hold on people.
LINQ or Language INtegrated Query is really a functional way of expression operations on collections. And if you decompose a lot of code, anywhere where you are using loops to manipulating collections, LINQ is likely to create a more concise and powerful expression for the same operation. And being functional, the implementation of how LINQ does this is opaque to the caller. The caller simply describes what operations should be done on the data sources, allowing for optimization of the operations based on the data sources. That means they could be in a database, they could come from XML, or REST calls or simply exist as an in-memory object graph. But none of these things change the transformations desired.
I've only used LINQ once for SQL, although I use LINQ to objects, i.e. against IEnumerable<T>, almost daily and it's done away with a lot of foreach with temporary variables, temporary lists, etc. But even that scenario against the apparently now deprecated DLINQ or Linq2Sql illustrated how it wasn't just about replacing SQL with type-safe syntax, it allowed me to use one syntax for both database and local operations.
This project included doing a bunch of analytical processing against the data, including projection combining a number data sources. Not all of this was expressible in SQL and some of it wasn't a wise use of live DB queries, and performing the additional work in memory was a lot more efficient. Traditionally this type of work would exhibit a fairly obvious syntactic break between the local and the DB operations. And moving some part (say a sort of a sub-set) from SQL to local or vice versa would be a significant re-write. But using LINQ, the syntax was identical, it was merely a matter of deciding at what point the query should be turned into concrete data vs. a cursor against the DB. This is simply done by turning any IEnumerable to a List, forcing immediate execution of the query represented by the IEnumerable source. Either way, local or remote, the syntax stayed the same the power of where processing should happen was in my hands.
I do support the goal of getting something akin to LINQ in java, but I sure hope they don't attack the problem by creating some DB-centric query DSL. The greatest benefit of LINQ in .NET, imho, is that instead of hacking its syntax into the language, the building blocks of anonymous delegates, lambda expressions, anonymous types, var syntax and object initializers. Each one of these pieces is a fundamental part of C# 3.0 and can be used independently, but together they allow LINQ to exist. Discussion of common fluent APIs for databases or whole new languages like SBQL miss the benefit of "Language Integrated" in LINQ.
Going to be interesting to follow how this evolves, since I personally think that LINQ is one of the key differentiators between Java and C# that isn't just syntactic sugar (even though many think it is just that).
Labels: c#, java, linq
LINQ: Immutability vs. Deferred execution
The last couple of nights I've been playing with some
Linq to Sql and a whole lot of
Linq to Objects and I have to say where coming up with complex Regular Expressions used to be one of my favorite puzzles, coming up with complex projections and transformations through Linq is quickly taking its place. Simple Linq is well documented, but when it comes to aggregation, it's a lot sparser. I expect to write more of that up once I feel more comfortable with the syntax.
In the meantime, I wanted to write up some non-obvious observation about deferred execution with Linq. Considering the gotchas with lambdas, it's easy to extend the lessons learned to linq, since it is after all deferred execution. But what's different with Linq is that, while execution is deferred, the expression tree built via a query is also immutable. I came across this trying to do some simple query re-use.
Let's start with a simple DTO:
public class Order
{
public Order(int id, int val, bool buyOrder)
{
Id = id;
Value = val;
IsBuyOrder = buyOrder;
}
public int Id { get; set; }
public int Value { get; set; }
public bool IsBuyOrder { get; set; }
}
And a set of this data:
Order[] orders = new Order[]
{
new Order(1,2,true),
new Order(2,2,false),
new Order(3,4,true),
new Order(4,4,false),
new Order(5,6,true),
new Order(6,6,false),
};
Let's split those into buy and sell orders:
var buyOrders = from order in orders
where order.IsBuyOrder
select order;
var sellOrders = from order in orders
where !order.IsBuyOrder
select order;
If we want to find the buy and the sell order with a value of 2, you'd think we could write one query and re-use it for both of those queries. Since both queries results in IEnumerable<Order>, how about we define a query source and assign the value of either above query.
IEnumerable<Order> orders2 = null;
var orderAtTwo = from order in orders2
where order.Value == 2
select order;
orders2 = buyOrders;
int buyOrderId = orderAtTwo.First().Id;
orders2 = sellOrders;
int sellOrderId = orderAtTwo.First().Id;
Console.WriteLine("buy Id: {0}, sell Id: {1}", buyOrderId, sellOrderId);
Since the query is deferred until we call
.First() on it, that seems like a reasonable syntax. Except this will result in an
System.ArgumentNullException because our query grabbed a reference to
orders2 at query definition, even though the query won't be executed until later. Giving
orders2 a new value does not change the original reference in the immutable expression tree.
A way around this is to replace the actual contents of orders2. However, for us to do that, we have to turn it into the query source into a collection first.
orders2.Clear();
orders2.AddRange(buyOrders);
int buyOrderId = orderAtTwo.First().Id;
orders2.Clear();
orders2.AddRange(sellOrders);
int sellOrderId = orderAtTwo.First().Id;
Console.WriteLine("buy Id: {0}, sell Id: {1}", buyOrderId, sellOrderId);
This gives us the expected
Let's put aside the awkwardness of clearing out a list and stuffing data back in, this code has another unfortunate sideeffect.
.AddRange() actually executes the query passed to it, so we execute our buy and sell queries to populate
orders2 and then execute
orderAtTwo twice against those collections.
The beauty of linq is that if you create a query from a query, your not running multiple queries, but building a more complex query to be executed. So, what we really want is query "re-use" that results in single expression trees at execution time.
To achieve this, we need to move the shared query into a separate method such as:
private IEnumerable<Order> GetTwo(IEnumerable<Order> source)
{
return from order in source
where order.Value == 2
select order;
}
and the code becomes:
int buyOrderId = GetTwo(buyOrders).First().Id;
int sellOrderId = GetTwo(sellOrders).First().Id;
Console.WriteLine("buy Id: {0}, sell Id: {1}", buyOrderId, sellOrderId);
This gives the same output as above, and we're only running two queries, each against the original collection. The method call means that we don't get to re-use an expression tree, since it builds a new one, combining the expression tree passed to it with the one it builds itself.
Labels: c#, lambda, linq
Searching a Tree of Objects with Linq
UPDATE: Posted a follow-up
here.
I've finally had legitimate use for LINQ to Objects, not just to make the syntax cleaner, but also to simplify the underlying code and provide me a lot of flexibility without significant work.
The scenario
I have a tree of objects that have both a type and a name. The name is unique, the Type is not. The interface is this:
public interface INode
{
int Id { get; }
string Name { get; set; }
string Type { get; set; }
List<INode> Children { get; }
}
I want to be able to find a single named Node in the tree and I want to be able to retrieve a collection of all nodes for a particular type. The searchable interface could be expressed as this:
public interface ISearchableNode : INode
{
INode FindByName(string name);
IEnumerable<INode> FindByType(string name);
}
Both require me to walk the tree and examine each node, so clearly I just want to have one walk routine and generically evaluate the node in question. In C# 2.0 parlance, that means I could pass an anonymous delegate into my generic find routine and have it recursively iterate through all the children. I also pass along a resultset to be populated.
The signature for the evaluation delegate looks like this:
delegate bool FindDelegate(INode node);
but since I'm using C# 3.0 (i.e. .NET 3.5) I can use
lambda expressions to avoid creating a delegate and simplify my syntax. Instead of
FindDelegate, I can simply use
Func<INode,bool>:
// Instead of this:
private void Find(INode node, List<INode> resultSet, FindDelegate findDelegate);
// called like this for a Name search:
Find(this, resultSet, delegate(INode node) { return node.Name == name; });
// I can use this:
private void Find(INode node, List<INode> resultSet, Func<INode, bool> f)
// called like this:
Find(this, resultSet, node => node.Name == name);
Thus giving me the following implementation for ISearchableNode:
public INode FindByName(string name)
{
List<INode> resultSet = new List<INode>();
Find(this, resultSet, x => x.Name == name);
return resultSet.FirstOrDefault();
}
public IEnumerable<INode> FindByType(string type)
{
List<INode> resultSet = new List<INode>();
Find(this, resultSet, x => x.Type == type);
return (IEnumerable<INode>)resultSet;
}
private void Find(INode node, List<INode> resultSet, Func<INode, bool> f)
{
if (f(node))
{
resultSet.Add(node);
}
foreach (INode child in node.Children)
{
Find(child, resultSet, f);
}
}
Problem solved, move on... Well, except there is significant room for improvement. Here are the two main issues that ought to be resolved:
- Syntax is limited to two types of searches and exposing the generic find makes for an ugly syntax. It would be much nicer if queries to the tree could be expressed in LINQ syntax.
- It's also inefficient for the Name search, since I'm walking the entire tree, even if the first node matched the criteria.
LINQ to Hierarchical Data
In order to use LINQ to objects, I need to either create a custom query provider or implement IEnumerable. The latter is significantly simpler and could be expressed using the following interface:
public interface IQueryableNode : IEnumerable<INode>
{
}
Ok, ok, I don't even need an interface, I could just implement IEnumerable<INode>... But what does that actually mean? In the simplest sense, I'm iterating over the node's children, however, I also with do descend into the children's children and so on. So a simple foreach won't do. I could just do the same tree walking with a resultset as I did above and return the Enumerator of the resulting list to implement the interface, but C# 2.0 introduced a much more useful way to implement non-linear Iterators, i.e. the yield keyword. Instead of building a list to be interated over, yield let's the iterating code return values as they are found, which means it can be used for recursive iteration. Thus the GetEnumerator is implemented simply as follows
#region IEnumerable<Node> Members
public IEnumerator<INode> GetEnumerator()
{
yield return this;
foreach (Node child in Children)
{
foreach (Node subchild in child)
{
yield return subchild;
}
}
}
#endregion
#region IEnumerable Members
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return this.GetEnumerator();
}
#endregion
Nice and simple and ready for LINQ.
Searching for all Nodes of a type becomes
var allBar = from n in x
where n.Type == "bar"
select n;
foreach (Node n in allBar)
{
// do something with that node
}
and the search for a specifically named node becomes
INode node = (from n in x
where n.Name == "i"
select n).FirstOrDefault();
But the real benefit of this approach is that I don't have hard-coded search methods, but can express much more complex queries in a very natural syntax without any extra code on the Node.
Deferred execution
As it turns out, using yield for the recursive iteration also solved the second issue. As yield returns values as it encounters them during iteration, the search doesn't happen until the query is executed. And one of the side effects of LINQ syntax is that creating a query does not execute it until the result set is iterated over. Therefore, FirstOrDefault() actually short-circuits the query as soon as the first match (and in case of Name, it's going to be the only match) is hit.
Labels: .NET 3.5, c#, lambda, linq, tree, yield