Recently, a lot of people have been lamenting the sheep-like mentality of picking an RDBMS (and with it an ORM) as the way to model persistence, without first considering solutions that do not suffer from the object-relational impedance mismatch.
Many of the arguments for having to use an RDBMS are easily shot down, such as the relentless requirement for ad hoc reporting against production data. (If your OLTP and OLAP are the same DB, you are doing it wrong™.) But just because the arguments for picking an RDBMS are often ill-considered doesn’t make the reasons for abandoning it any better: they, too, seem to suffer from a lack of depth of consideration.
Let me be clear that I do my best to stay away from RDBMSs whenever I can. I have plenty of scars from supporting large production DB environments over the years, and there are lots of pain points in writing web applications against an RDBMS. I, too, love schema-less, document and object databases. They make so much sense. I rabidly follow the MongoDB and Riak mailing lists and prototype projects with them and other NoSQL tech, such as db4o. However, following those lists, it is clear to me that a) they are still re-discovering lessons painfully learned by RDBMS folks and b) my knowledge of working with these systems when something goes wrong is woefully behind my knowledge of the same for RDBMSs.
So yes, marvel at the simplicity of mapping your object model to a document model, or even serialize that object graph using an object or graph DB. But don’t just concentrate on what they do better for development, ignoring the day-to-day production support issues. Take a minute and see if you can answer these questions for yourself:
Every RDBMS has some kind of profiling tool and process list. And on the ORM side, Ayende’s Uberprof is doing a fantastic job of bringing additional transparency to many ORMs. Do you have any similar tools for the alternative persistence layer? Do you know what’s blocking your writes, your reads? What’s slowing down your map/reduce? What indexes, if applicable, are being hit? And if you’re using a sharded setup, profiling just got an order of magnitude more complicated.
Key/value stores are much faster than even primary key hits on an RDBMS. And document databases let you store the entire data hierarchy instead of normalizing it across foreign key tables, making graph retrieval cheap too.
But as NoSQL goes beyond simple key retrieval with query APIs and map/reduce, concurrency concerns sneak back in along with the query power. Many NoSQL stores are still using single-threaded concurrency per node or at least per data silo (read: table locking). In RDBMS land, MySQL was the last one to solve that, and it did so 6-7 years ago.
Another set of tools you are guaranteed to find with any RDBMS are utilities for recovering corrupted data and index files. Or at the very least utilities for extracting data from them in case of catastrophic failure.
With many NoSQL stores using memory-mapped files, corruption on power loss or DB crash is not uncommon. Does your persistence choice have ways to recover those files?
Most DBs have non-blocking DB dumps. Almost all have replication. Both are valid mechanisms.
Some NoSQL stores use replication to address the problem; others seem to punt on it by duplicating data redundantly across nodes. But unless your redundant/replica nodes are geographically separated, it’s not the same as being able to go back to a backup on catastrophic loss.
So you say, you don’t care if your data gets corrupted or that you can’t do live backups, because it all gets replicated to a safe server. Well, much like going back to tape only to discover that your back-up process hasn’t actually backed up anything, do you have the tools to ensure that your replicas are up to date and didn’t get the corruption replicated into them?
A lot of these production-level and back-up related issues are not even something developers think about, because with the maturity of RDBMSs, their maintenance and back-up are often tightly integrated into the sysadmin’s processes. If you don’t think you need to care about the above questions, chances are you have others doing it for you. And in that case, it’s vital that your sysadmins are versed in the NoSQL tool you are choosing before you throw the operations requirements over the wall at them.
Maybe you have all those questions covered for your NoSQL tool of choice. Google, Facebook, LinkedIn do. But likely, you don’t. Maybe you don’t have them covered for any RDBMS that you know either. But here’s the difference: These problems have been tackled in painstaking detail in thousands of RDBMS production environments. So, when you hit a wall with an RDBMS, chances are you can find an answer and get yourself out of that production mess.
The relative novelty and deployment size of most NoSQL solutions means you can’t easily fall back on established production experience. Until you have that same certainty when, not if, you face problems in production, you can’t really say that you objectively evaluated all choices and found NoSQL to be the superior solution to your problem.
This weekend has been a hack-a-thon, trying to build a simple Linq provider for MongoDB. I’m using Sam Corder, et al.’s excellent C# MongoDB Driver as the query pipeline, so my provider really is just a translator from Linq syntax to Mongo Document Query syntax. I call it a hack-a-thon because it’s my first Linq provider attempt and, boy, is that query translator state machine ugly already. However, I am covering every bit of syntax with tests, so that once I understand it all better, I can rewrite the translator in a cleaner fashion.
My goal for this provider is to replace a document storage layer I’ve built for a new notify.me project using NHibernate against MySQL. This is in no way a judgment against NHibernate. It just happens that for this project, my schema is a heavily denormalized JSON document database. While Fluent NHibernate made it a breeze to map it into MySQL, it’s really an abuse of an RDBMS. It was a case of prototyping with what you know, but now it’s time to evaluate whether a document database is the way to go.
Replacing existing NHibernate code does mean that, eventually, I want the provider to work with POCO entities and use a fully strongly typed query syntax. But that layer will be built on top of the string-key based version I’m building right now. The string-key based version will be the primary layer, so that you never lose any of the schema-less flexibility of MongoDB, unless you choose to.
So, lacking an entity with named properties to map against, what does the syntax look like right now? First thing we need is an IQueryable<Document> which is created like this:
var mongo = new Mongo();
var queryable = mongo["db"]["collection"].AsQueryable();
Given the queryable, the queries can be built using the Document indexer like this:
var q = from d in queryable where (string)d["foo"] == "bar" select d;
The Document indexer returns an object, which means a cast is unfortunately required on one side of the conditional. Alternatively, Equals, either the static or instance version, also works, alleviating the need for a cast:
var q = from d in queryable where Equals(d["foo"], "bar") select d;
// OR
var q = from d in queryable where d["foo"].Equals("bar") select d;
Better, but it’s not as nice as operator syntax would be, if we could get rid of the casts.
As it turns out, there are a number of query operators in MongoDB that don’t have an equivalent syntax in Linq, so a helper class to generate query expressions was already needed. The helper is instantiated via the Document extension method .Key(key), giving us the opportunity to overload operators for the various types recognized by MongoDB’s BSON. This allows for the following conditional syntax:
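To make the three equality forms above a bit more concrete, here is a rough sketch of how a translator might turn them into a key/value query document. It is not the provider’s actual code: the Document class is a stand-in for the driver’s type (assumed here to be a string-keyed bag of objects), and EqualityTranslator and its helpers are names made up for illustration. It only relies on System.Linq.Expressions, and it expects the indexer on the left-hand side of the comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

// Stand-in for the driver's Document (assumption: a string-keyed bag of objects).
public class Document : Dictionary<string, object> { }

// Hypothetical translator, for illustration only.
public static class EqualityTranslator
{
    // Turns d => (string)d["foo"] == "bar" into the query document { foo: "bar" }.
    public static Document Translate(Expression<Func<Document, bool>> predicate)
    {
        var query = new Document();
        Visit(predicate.Body, query);
        return query;
    }

    static void Visit(Expression node, Document query)
    {
        switch (node)
        {
            // a && b: both sides contribute keys to the same query document.
            case BinaryExpression b when b.NodeType == ExpressionType.AndAlso:
                Visit(b.Left, query);
                Visit(b.Right, query);
                break;
            // (string)d["foo"] == "bar"
            case BinaryExpression b when b.NodeType == ExpressionType.Equal:
                query[KeyOf(b.Left)] = Evaluate(b.Right);
                break;
            // Equals(d["foo"], "bar"), the static version
            case MethodCallExpression m when m.Method.Name == "Equals"
                                             && m.Object == null && m.Arguments.Count == 2:
                query[KeyOf(m.Arguments[0])] = Evaluate(m.Arguments[1]);
                break;
            // d["foo"].Equals("bar"), the instance version
            case MethodCallExpression m when m.Method.Name == "Equals"
                                             && m.Object != null && m.Arguments.Count == 1:
                query[KeyOf(m.Object)] = Evaluate(m.Arguments[0]);
                break;
            default:
                throw new NotSupportedException("Cannot translate: " + node);
        }
    }

    // Extracts "foo" from d["foo"], ignoring any cast wrapped around the indexer.
    static string KeyOf(Expression e)
    {
        while (e is UnaryExpression u && u.NodeType == ExpressionType.Convert)
            e = u.Operand;
        if (e is MethodCallExpression call && call.Method.Name == "get_Item")
            return (string)Evaluate(call.Arguments[0]);
        throw new NotSupportedException("Expected a Document indexer, got: " + e);
    }

    // Computes the value side; anything that is not a plain constant is compiled and run.
    static object Evaluate(Expression e)
    {
        return e is ConstantExpression c
            ? c.Value
            : Expression.Lambda(e).Compile().DynamicInvoke();
    }
}
```

Calling EqualityTranslator.Translate(d => (string)d["foo"] == "bar") should yield a document with the single entry foo = "bar", and the two Equals forms should produce the same result.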
var q = from d in queryable
        where d.Key("type") == "customer" &&
              d.Key("created") >= DateTime.Parse("2009/09/27") &&
              d.Key("status") != "inactive"
        select d;
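For the curious, here is a minimal sketch of how such an operator-overloading helper could work. It is an illustration under assumptions, not the provider’s real implementation: QueryKey, KeyExtensions and KeyQueryTranslator are hypothetical names, Document is again a stand-in for the driver’s type, and the right-hand side is typed as object for brevity rather than overloaded per BSON type. The idea is that the operators only need to exist so the where clause compiles; the translator then reads the comparison out of the expression tree and maps it to MongoDB’s $ne and $gte operators.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

// Stand-in for the driver's Document (assumption: a string-keyed bag of objects).
public class Document : Dictionary<string, object> { }

// Hypothetical helper: its operators are never executed, only inspected in the tree.
public sealed class QueryKey
{
    public string Name { get; }
    public QueryKey(string name) { Name = name; }

    static bool QueryOnly() =>
        throw new InvalidOperationException("Only meaningful inside a query expression.");

    public static bool operator ==(QueryKey key, object value) => QueryOnly();
    public static bool operator !=(QueryKey key, object value) => QueryOnly();
    public static bool operator >=(QueryKey key, object value) => QueryOnly();
    public static bool operator <=(QueryKey key, object value) => QueryOnly();

    public override bool Equals(object obj) => base.Equals(obj);
    public override int GetHashCode() => base.GetHashCode();
}

public static class KeyExtensions
{
    public static QueryKey Key(this Document document, string key) => new QueryKey(key);
}

public static class KeyQueryTranslator
{
    public static Document Translate(Expression<Func<Document, bool>> predicate)
    {
        var query = new Document();
        Visit(predicate.Body, query);
        return query;
    }

    static void Visit(Expression node, Document query)
    {
        if (!(node is BinaryExpression b))
            throw new NotSupportedException("Cannot translate: " + node);
        switch (b.NodeType)
        {
            case ExpressionType.AndAlso:
                Visit(b.Left, query);
                Visit(b.Right, query);
                break;
            case ExpressionType.Equal:              // { key: value }
                query[KeyOf(b.Left)] = Evaluate(b.Right);
                break;
            case ExpressionType.NotEqual:           // { key: { $ne: value } }
                query[KeyOf(b.Left)] = new Document { ["$ne"] = Evaluate(b.Right) };
                break;
            case ExpressionType.GreaterThanOrEqual: // { key: { $gte: value } }
                query[KeyOf(b.Left)] = new Document { ["$gte"] = Evaluate(b.Right) };
                break;
            default:
                throw new NotSupportedException("Cannot translate: " + node);
        }
    }

    // d.Key("type") is a static call Key(d, "type"); the key name is the second argument.
    static string KeyOf(Expression e)
    {
        if (e is MethodCallExpression call && call.Method.Name == "Key")
            return (string)Evaluate(call.Arguments[1]);
        throw new NotSupportedException("Expected d.Key(...), got: " + e);
    }

    static object Evaluate(Expression e)
    {
        return e is ConstantExpression c
            ? c.Value
            : Expression.Lambda(e).Compile().DynamicInvoke();
    }
}
```

Returning bool from the operators is what lets the conditions be chained with && in a normal where clause. The price is that the helper is useless outside an expression tree, which this sketch makes explicit by throwing.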
In addition to normal conditional operators, the query expression helper class also defines IN and NOT IN syntax:
var inQuery = from d in queryable where d.Key("foo").In("bar", "baz") select d;
var notInQuery = from d in queryable where d.Key("foo").NotIn("bar", "baz") select d;
The helper will be the point of extension to support more of MongoDB’s syntax, so that most query definitions will use the d.Key(key) syntax.
Linq has matching counterparts to MongoDB’s findOne(), limit() and skip() in First or FirstOrDefault, Take and Skip respectively, and the current version of the Linq provider already supports them.
There is a lot in Linq that will likely never be supported, since MongoDB is not a relational DB. That means joins, sub-queries, etc. will not be covered by the provider. Anything that does map to MongoDB’s capabilities, though, will be added over time. The low-hanging fruit are Count() and order by, with group by following thereafter.
Surprisingly, || (or conditionals) are not going to happen as fast, since aside from or-type queries using the .In syntax, they are not directly supported by MongoDB. In order to perform || queries, the query has to be written as a javascript function, which would basically mean that as soon as a single || shows up in the where clause, the query translator would have to rewrite all other conditions in javascript as well. So, that’s a bit more on the nice-to-have end of the spectrum of priorities.
I will most likely concentrate on the low-hanging fruit and then work on the POCO query layer next, since my goal is to be able to try out MongoDB as an alternative to my NHibernate code.
All that said, the code described above works now and is ready for some test driving. It’s currently only in my branch on GitHub, but I hope it will make it into master soon.
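As an illustration of how Skip and Take can be read back out of a query, here is a small sketch that walks the method-call chain of a Linq expression and collects the values a translator could pass on to skip() and limit(). PagingReader is a made-up name, and the sketch glosses over details a real provider has to care about, such as the order of Skip and Take or the First overload that takes a predicate.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical helper, for illustration only.
public static class PagingReader
{
    // Walks a chain such as source.Skip(20).Take(10) from the outermost call inward.
    public static (int? Skip, int? Limit) Read(Expression expression)
    {
        int? skip = null, limit = null;
        while (expression is MethodCallExpression call &&
               call.Method.DeclaringType == typeof(Queryable))
        {
            switch (call.Method.Name)
            {
                case "Skip":
                    skip = (int)((ConstantExpression)call.Arguments[1]).Value;
                    break;
                case "Take":
                    limit = (int)((ConstantExpression)call.Arguments[1]).Value;
                    break;
                case "First":
                case "FirstOrDefault":
                    limit = 1; // findOne() behaves like a limit of one
                    break;
            }
            // The first argument of every Queryable operator is the query it was applied to.
            expression = call.Arguments[0];
        }
        return (skip, limit);
    }
}
```

For example, PagingReader.Read(queryable.Skip(20).Take(10).Expression) should come back with a skip of 20 and a limit of 10 for any IQueryable source.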
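To show why a single || drags every other condition along with it, here is a sketch of what translating a whole where clause into a javascript predicate might look like. Again, this is an assumption-laden illustration and not the provider’s code: JavaScriptTranslator is a hypothetical name, Document is a stand-in for the driver’s type, only string comparisons via the cast syntax are handled, and keys are assumed to be valid javascript identifiers. The generated function refers to the document being tested as this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

// Stand-in for the driver's Document (assumption: a string-keyed bag of objects).
public class Document : Dictionary<string, object> { }

// Hypothetical translator, for illustration only.
public static class JavaScriptTranslator
{
    public static string Translate(Expression<Func<Document, bool>> predicate)
    {
        return "function() { return " + Visit(predicate.Body) + "; }";
    }

    static string Visit(Expression node)
    {
        if (node is BinaryExpression b)
        {
            switch (b.NodeType)
            {
                case ExpressionType.OrElse:
                    return "(" + Visit(b.Left) + " || " + Visit(b.Right) + ")";
                case ExpressionType.AndAlso:
                    // Once || is present, && has to be expressed in javascript as well.
                    return "(" + Visit(b.Left) + " && " + Visit(b.Right) + ")";
                case ExpressionType.Equal:
                    return "this." + KeyOf(b.Left) + " == " + Literal(b.Right);
            }
        }
        throw new NotSupportedException("Cannot translate: " + node);
    }

    // Extracts "foo" from (string)d["foo"].
    static string KeyOf(Expression e)
    {
        while (e is UnaryExpression u && u.NodeType == ExpressionType.Convert)
            e = u.Operand;
        if (e is MethodCallExpression call && call.Method.Name == "get_Item"
            && call.Arguments[0] is ConstantExpression key)
            return (string)key.Value;
        throw new NotSupportedException("Expected a Document indexer, got: " + e);
    }

    // Renders the value side as a quoted javascript string literal.
    static string Literal(Expression e)
    {
        var value = e is ConstantExpression c
            ? c.Value
            : Expression.Lambda(e).Compile().DynamicInvoke();
        if (value is string s)
            return "\"" + s.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";
        throw new NotSupportedException("Only string values are handled in this sketch.");
    }
}
```

Given d => (string)d["status"] == "active" || (string)d["status"] == "trial", this produces function() { return (this.status == "active" || this.status == "trial"); }, and any && conditions in the same clause end up inside that function too.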