Recently a lot of people have been lamenting the sheep-like mentality of picking an RDBMS (and with it an ORM) as the way to model persistence, without first considering solutions that do not suffer from the object-relational impedance mismatch.
Many of the arguments for having to use an RDBMS are easily shot down, such as the relentless requirement for ad hoc reporting against production data. (If your OLTP and OLAP are the same DB, you are doing it wrong™.) But while the arguments for picking an RDBMS are often ill-considered, the reasons for abandoning it seem to suffer from a similar lack of depth.
Let me be clear that I do my best to stay away from RDBMSs whenever I can. I have plenty of scars from supporting large production DB environments over the years, and there are lots of pain points in writing web applications against an RDBMS. I, too, love schema-less, document and object databases. They make so much sense. I rabidly follow the MongoDB and Riak mailing lists and prototype projects with them and other NoSQL tech, such as Db4o. However, following those lists, it is clear to me that a) they are still re-discovering lessons painfully learned by RDBMS folks and b) my knowledge of working with these systems when something goes wrong is woefully behind my knowledge of the same for RDBMSs.
Pick the best tool
So yes, marvel at the simplicity of mapping your object model to a document model, or even serialize that object graph using an object or graph DB. But don’t just concentrate on what they do better for development, ignoring the day-to-day production support issues. Take a minute and see if you can answer these questions for yourself:
Can you troubleshoot performance problems?
Every RDBMS has some kind of profiling tool and process list. And on the ORM side, Ayende's Uberprof is doing a fantastic job of bringing additional transparency to many ORMs. Do you have any similar tools for the alternative persistence layer? Do you know what's blocking your writes, your reads? What's slowing down your map/reduce? What indices, if applicable, are being hit? And if you're using a sharded setup, profiling just got an order of magnitude more complicated.
What about concurrency on non-key accesses?
Key/value stores are much faster than even primary key hits on an RDBMS. And document databases let you store the entire data hierarchy instead of normalizing it across foreign key tables, making graph retrieval cheap too.
But as NoSQL goes beyond simple key retrieval with query APIs and map/reduce, concurrency concerns sneak back in along with the query power. Many NoSQL stores still use single-threaded concurrency per node, or at least per data silo (read: table locking). In RDBMS land, MySQL was the last one to solve that, and it did so 6-7 years ago.
What tools do you have to recover a corrupted data file?
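To make the "table locking" comparison concrete, here is a minimal sketch in Python. It does not model any particular database: TableLockStore, RowLockStore and second_writer_can_proceed are hypothetical names invented for this illustration. It only shows why one lock per data silo stalls an unrelated writer, while one lock per key does not.

```python
import threading
from collections import defaultdict


class TableLockStore:
    """Hypothetical store: one lock guards the whole data set (table locking)."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_lock(self, key):
        # The key is ignored: every writer competes for the same lock.
        return self._lock.acquire(blocking=False)

    def unlock(self, key):
        self._lock.release()


class RowLockStore:
    """Hypothetical store: one lock per key (row-level locking)."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    def try_lock(self, key):
        with self._guard:
            lock = self._locks[key]
        return lock.acquire(blocking=False)

    def unlock(self, key):
        self._locks[key].release()


def second_writer_can_proceed(store):
    """Hold a write lock on one key, then try to lock a different key."""
    assert store.try_lock("user:1")
    try:
        acquired = store.try_lock("user:2")
        if acquired:
            store.unlock("user:2")
        return acquired
    finally:
        store.unlock("user:1")


if __name__ == "__main__":
    print("table lock:", second_writer_can_proceed(TableLockStore()))
    print("row lock:  ", second_writer_can_proceed(RowLockStore()))
```

With the single lock, the write to the second key has to wait even though it touches unrelated data. That is the behavior to ask about before putting a query-heavy workload on a store.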
Another set of tools you are guaranteed to find with any RDBMS are utilities for recovering corrupted data and index files. Or at the very least utilities for extracting data from them in case of catastrophic failure.
With many NoSQL stores using memory-mapped files, corruption on power loss or DB crash is not uncommon. Does your persistence choice have ways to recover those files?
What’s your backup strategy?
Most DBs have non-blocking DB dumps. Almost all have replication. Both are valid mechanisms.
Some NoSQL stores use replication to address the problem; others seem to punt on it by using redundant data duplication across nodes. But unless your redundant/replica nodes are geographically distributed, it's not the same as being able to go back to a backup on catastrophic loss.
How do you know your replicas are working?
So you say you don't care if your data gets corrupted or that you can't do live backups, because it all gets replicated to a safe server. Well, much like going back to tape only to discover that your backup process hasn't actually backed up anything, do you have the tools to ensure that your replicas are up to date and didn't get the corruption replicated into them?
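If your store doesn't ship such a tool, a crude version of the check can be sketched by hand. The Python below is an illustration only, under the assumption that you can read every record from both the primary and the replica into plain dicts keyed by record ID. record_digest and compare_replica are hypothetical helpers, not part of any NoSQL product's API.

```python
import hashlib
import json


def record_digest(record):
    """Stable SHA-256 digest of a JSON-serializable record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compare_replica(primary, replica):
    """Report keys that are missing, extra, or different on the replica."""
    missing = sorted(k for k in primary if k not in replica)
    extra = sorted(k for k in replica if k not in primary)
    mismatched = sorted(
        k
        for k in primary
        if k in replica and record_digest(primary[k]) != record_digest(replica[k])
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    primary = {
        "a": {"name": "Ann", "age": 31},
        "b": {"name": "Bob", "age": 42},
        "c": {"name": "Cy", "age": 27},
    }
    replica = {
        "a": {"age": 31, "name": "Ann"},  # same content, different key order
        "b": {"name": "Bob", "age": 24},  # diverged value
    }
    print(compare_replica(primary, replica))
```

Note that this only catches a replica that has diverged from or fallen behind the primary. Corruption that was faithfully replicated looks identical on both sides, so catching it would additionally require checksums recorded at write time and stored apart from the data.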
Do your sysadmins share your comfort level?
A lot of these production-level and backup-related issues are not even something developers think about, because with the maturity of RDBMSs, their maintenance and backup are often tightly integrated into the sysadmin's processes. If you don't think you need to care about the above questions, chances are you have others doing it for you. And in that case, it's vital that your sysadmins are versed in the NoSQL tool you are choosing before you throw the operations requirements over the wall at them.
The tool you know
Maybe you have all those questions covered for your NoSQL tool of choice. Google, Facebook, LinkedIn do. But likely, you don’t. Maybe you don’t have them covered for any RDBMS that you know either. But here’s the difference: These problems have been tackled in painstaking detail in thousands of RDBMS production environments. So, when you hit a wall with an RDBMS, chances are you can find an answer and get yourself out of that production mess.
The relative novelty and smaller deployment base of most NoSQL solutions mean you can't easily fall back on established production experience. Until you have that same certainty for when, not if, you face problems in production, you can't really say that you objectively evaluated all choices and found NoSQL to be the superior solution to your problem.