NoSQL is the new Arcadia
Recently, a lot of people have been lamenting the sheep-like mentality of picking an RDBMS (and with it an ORM) as the way to model persistence, without first considering solutions that do not suffer from the object-relational impedance mismatch.
Many of the arguments for having to use an RDBMS are easily shot down, such as the relentless requirements for ad hoc reporting against production data (if your OLTP and OLAP are the same DB, you are doing it wrong). But just because the arguments for picking an RDBMS are often ill-considered, the reasons for abandoning it also seem to suffer from a lack of depth of consideration.
Let me be clear that I do my best to stay away from RDBMSs whenever I can. I have plenty of scars from supporting large production DB environments over the years, and there are lots of pain points in writing web applications against RDBMSs. I, too, love schema-less, document and object databases. They make so much sense. I rabidly follow the MongoDB and Riak mailing lists and prototype projects with them and other NoSQL tech, such as Db4o. However, following those lists, it is clear to me that a) they are still re-discovering lessons painfully learned by RDBMS folks and b) my knowledge of working with these systems when something goes wrong is woefully behind my knowledge of the same for RDBMSs.
Pick the best tool
So yes, marvel at the simplicity of mapping your object model to a document model, or even serialize that object graph using an object or graph DB. But don't just concentrate on what they do better for development, ignoring the day-to-day production support issues. Take a minute and see if you can answer these questions for yourself:
Can you troubleshoot performance problems?
Every RDBMS has some kind of profiling tool and process list. And on the ORM side, Ayende's Uberprof is doing a fantastic job of bringing additional transparency to many ORMs. Do you have any similar tools for the alternative persistence layer? Do you know what's blocking your writes, your reads? What's slowing down your map/reduce? What indexes, if applicable, are being hit? And if you're using a sharded setup, profiling just got an order of magnitude more complicated.
What about concurrency on non-key accesses?
Key/value stores are much faster than even primary key hits on an RDBMS. And document databases let you store the entire data hierarchy instead of normalizing it across foreign key tables, making graph retrieval cheap too.
But as NoSQL goes beyond simple key retrieval with query APIs and map/reduce, concurrency concerns sneak back in along with the query power. Many NoSQL stores are still using single-threaded concurrency per node or at least per data silo (read: table locking). In RDBMS land, MySQL was the last one to solve that, and it did it 6-7 years ago.
What tools do you have to recover a corrupted data file?
Another set of tools you are guaranteed to find with any RDBMS are utilities for recovering corrupted data and index files. Or at the very least utilities for extracting data from them in case of catastrophic failure.
With many NoSQL stores using memory mapped files, corruption on power loss or DB crash is not uncommon. Does your persistence choice have ways to recover those files?
What's your backup strategy?
Most DBs have non-blocking DB dumps. Almost all have replication. Both are valid mechanisms.
Some NoSQL stores use replication to address the problem; others seem to punt on it by using redundant data duplication across nodes. But unless your redundant/replica nodes are geographically separated, it's not the same as being able to go back to a backup on catastrophic loss.
How do you know your replicas are working?
So you say you don't care if your data gets corrupted or that you can't do live backups, because it all gets replicated to a safe server. Well, much like going back to tape only to discover that your backup process hasn't actually backed up anything, do you have the tools to ensure that your replicas are up to date and didn't get the corruption replicated into them?
Do your sysadmins share your comfort level?
A lot of these production-level and backup-related issues are not even something developers think about, because with the maturity of RDBMSs their maintenance and backup are often tightly integrated into the sysadmin's processes. If you don't think you need to care about the above questions, chances are you have others doing it for you. And in that case, it's vital that your sysadmins are versed in the NoSQL tool you are choosing before you throw the operations requirements over the wall at them.
The tool you know
Maybe you have all those questions covered for your NoSQL tool of choice. Google, Facebook, LinkedIn do. But likely, you don't. Maybe you don't have them covered for any RDBMS that you know either. But here's the difference: These problems have been tackled in painstaking detail in thousands of RDBMS production environments. So, when you hit a wall with an RDBMS, chances are you can find an answer and get yourself out of that production mess.
The relative novelty and deployment size of most NoSQL solutions means you can't easily fall back on established production experience. Until you have that same certainty when, not if, you face problems in production, you can't really say that you objectively evaluated all choices and found NoSQL to be the superior solution to your problem.
Labels: db4o, mongodb, nosql, rdbms, riak
db4o 7.4 binaries for mono
As I talked about recently, the standard binaries for db4o have some issues with mono, so I recompiled the unmodified source with the MONO configuration flag. I've packed up both the debug and release binaries and you can get them here. These are just the binaries (plus license). It's not the full db4o package. If you want the full package, just get it directly from the db4o site, set the MONO config flag and have Visual Studio rebuild the package.
This package should show up on the official db4o mono page shortly as well.
Labels: c#, db4o, mono
db4o indexing and query performance
Indexing on db4o is a bit non-transparent, imho. There's barely a blurb in their Documentation app, and it just tells you how to create an index and how to remove it. But you can't easily check whether one exists, or whether it's being used. So I spent a good bit of time today trying to figure out why my queries were so slow: was an index created and, if so, was it being used? The final answer is, if querying is slow in db4o, you're not using an index, because, OMG, you'll know when you do an indexed query.
Index basics
Given an object such as
public class Foo
{
    public string Bar;
}
you create an index, globally (meh) for that object on all databases you create thereafter, with this call:
Db4oFactory.Configure().ObjectClass(typeof(Foo)).ObjectField("Bar").Indexed(true);
So far, straightforward enough. But let's say you're using a property? Well, db4o does its magic by inspecting your underlying storage fields, so you have to index them, not the properties that expose them. That means if our object was supposed to have a read-only property Bar, like this:
public class Foo
{
    private string bar;
    public Foo(string bar)
    {
        this.bar = bar;
    }
    public string Bar { get { return bar; } }
}
then the field you need to index is actually the private member bar:
Db4oFactory.Configure().ObjectClass(typeof(Foo)).ObjectField("bar").Indexed(true);
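If you're not sure which field names a class really declares, plain reflection can list them. The following is only a minimal sketch: the helper name StorageFields is made up for illustration, and it uses nothing from db4o, just System.Reflection. It returns every instance field declared on a type, including private and compiler-generated ones.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Same shape as the Foo class above: a private field behind a read-only property.
public class Foo
{
    private string bar;
    public Foo(string bar) { this.bar = bar; }
    public string Bar { get { return bar; } }
}

// Hypothetical helper, not part of db4o.
public static class StorageFields
{
    // Returns the names of all instance fields declared directly on the type,
    // public or private, including any compiler-generated ones.
    public static List<string> NamesOf(Type type)
    {
        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public
            | BindingFlags.NonPublic | BindingFlags.DeclaredOnly;
        var names = new List<string>();
        foreach (FieldInfo field in type.GetFields(flags))
        {
            names.Add(field.Name);
        }
        return names;
    }
}
```

Calling StorageFields.NamesOf(typeof(Foo)) gives back the single name bar, which is the string to hand to ObjectField. Fields the compiler generates on your behalf show up in the same list under whatever name the compiler picked.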
Given this idiosyncrasy, the obvious question is "what about automatic properties?". Well, as of right now the answer is, no such luck, because you'd have to reflect the underlying storage field that is created and index it, and you don't get any guarantees that the field is named the same from compiler to compiler or version to version. That probably also means that automatic properties are dangerous all around, because you may never get your data back if the storage changes, although on that conclusion I'm just speculating wildly.
Query performance
Index in hand, I decided to populate a DB, always checking whether the item being added already existed, using a db4o native query. That started at 1 ms query time and then increased linearly with every item added. That sure didn't seem like an indexed search to me. I finally discovered a useful resource on the db4o site, but unfortunately it's behind a login, so Google didn't help me find it and my link to it will only take you to the login. That's a shame because this bit of information ought to be somewhere in big bold letters!
You must have the following DLLs available for Native Queries to be optimized into SODA queries, which apparently is the format that hits the index:
- Db4objects.Db4o.Instrumentation.dll
- Db4objects.Db4o.NativeQueries.dll
- Mono.Cecil.dll
- Cecil.FlowAnalysis.dll
The query will execute fine regardless of their presence, but the performance difference between the optimized, index-using query and the unoptimized native query is orders of magnitude. My queries went from 100-500ms to 0.01ms, just by dropping those DLLs into my executable directory. Yeah, that's a useful change.
Interestingly enough, the same is not required for Linq queries. They seem to hit the index without the extra help (although just to even run, Mono.Cecil and Cecil.FlowAnalysis need to be present, so here you at least get an error). There currently appears to be about 1ms overhead for parsing Linq into SODA, but I'll take that hit for the syntactic sugar.
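For comparison, this is roughly what a hand-written SODA query against the indexed field looks like. Treat it as a sketch from my recollection of the db4o 7.x .NET API (IObjectContainer.Query(), IQuery.Constrain, IQuery.Descend, IQuery.Execute), so check the names against your version; the SodaLookup helper is hypothetical and it needs a reference to Db4objects.Db4o.dll to compile.

```csharp
using System;
using Db4objects.Db4o;
using Db4objects.Db4o.Query;

public class Foo
{
    private string bar;
    public Foo(string bar) { this.bar = bar; }
    public string Bar { get { return bar; } }
}

// Hypothetical helper, not part of db4o.
public static class SodaLookup
{
    // Returns true if a Foo whose storage field "bar" equals value is stored in db.
    public static bool Exists(IObjectContainer db, string value)
    {
        IQuery query = db.Query();
        // Restrict candidates to Foo instances.
        query.Constrain(typeof(Foo));
        // Descend into the storage field (not the property) and match on it.
        query.Descend("bar").Constrain(value);
        IObjectSet result = query.Execute();
        return result.Count > 0;
    }
}
```

Note that the query descends into the field name bar, the same name that was indexed, not the property Bar.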
Conclusions
I'm pretty happy with the simplicity and performance of db4o so far. It seems like an ideal local, queryable persistence layer. The way it works does make me want to abstract my data model into simple data objects that are then converted into business entities. I'd rather have the attribute-based markup of ActiveRecord, but that's not a deal breaker.
Labels: c#, db4o, linq, query performance
Db4o on .NET and Mono
After failing to get a cross-platform sample of NHibernate/Sqlite going, I decided to try out Db4o. This is for a simple, local object persistence layer anyhow, nothing more than a local cache, so db4o sounded perfect.
The initial DLLs for 7.4 worked beautifully on .NET but ran into problems on Mono. Apparently db4o imports FlushFileBuffers from kernel32.dll if your build target is not CF or mono. And in its call to FlushFileBuffers it uses FileStream.SafeFileHandle.DangerousGetHandle(), which is not yet implemented under Mono, resulting in this exception:
Unhandled Exception: System.NotImplementedException: The requested feature is not implemented.
at System.IO.FileStream.get_SafeFileHandle () [0x00000]
at Sharpen.IO.RandomAccessFile.Sync () [0x00000]
at Db4objects.Db4o.IO.RandomAccessFileAdapter.Sync () [0x00000]
...
I found this page on the Db4o site, which suggested just falling back to FileStream.Handle. However, that for me just resulted in this:
Unhandled Exception: System.EntryPointNotFoundException: FlushFileBuffers
at (wrapper managed-to-native) Sharpen.IO.RandomAccessFile:FlushFileBuffers (intptr)
at Sharpen.IO.RandomAccessFile.Sync () [0x00000]
at Db4objects.Db4o.IO.RandomAccessFileAdapter.Sync () [0x00000]
...
So, I simply defined MONO as a compilation symbol in Visual Studio and rebuilt it. I figure the only time this code will run on Windows is during testing, so treating it as mono is fine. And that did solve my issues, and I now have a DLL for db4o 7.4 that works beautifully across .NET and mono from a single build.
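To illustrate why the symbol makes a difference, here is a minimal sketch of the general pattern. This is not db4o's actual source: the FileSync class is made up, and I'm only assuming the real code is shaped something like this. With MONO defined, the kernel32 import and the SafeFileHandle call are never compiled in.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical illustration of the pattern, not code from db4o.
public static class FileSync
{
#if !MONO
    // Windows-only: asks the OS to write its buffers for the file to disk.
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FlushFileBuffers(IntPtr handle);
#endif

    public static void Sync(FileStream stream)
    {
        // Always push the managed buffer down to the OS.
        stream.Flush();
#if !MONO
        // Only compiled when MONO is not defined, so a MONO build never
        // touches SafeFileHandle or kernel32.
        FlushFileBuffers(stream.SafeFileHandle.DangerousGetHandle());
#endif
    }
}
```

The trade-off is that a MONO build running on Windows skips the OS-level flush as well, which is what I meant by treating it as mono being fine for testing.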
Being a Linq nut, I immediately decided to skip the Native Query syntax and dive into using the Linq syntax instead. That worked great on mono 2.0.1, but unfortunately on the current Red Hat rpm (stuck back in 1.9.1 land), the Linq implementation isn't quite complete and you get this:
Unhandled Exception: System.NotImplementedException: The requested feature is not implemented.
at System.Linq.Expressions.MethodCallExpression.Emit (System.Linq.Expressions.EmitContext ec) [0x00000]
at System.Linq.Expressions.LambdaExpression.Emit (System.Linq.Expressions.EmitContext ec) [0x00000]
at System.Linq.Expressions.LambdaExpression.Compile () [0x00000]
at Db4objects.Db4o.Linq.Expressions.SubtreeEvaluator.EvaluateCandidate (System.Linq.Expressions.Expression expression) [0x00000]
...
Fortunately, I could fall back from this syntax:
var items = from RosterItem r in db
            where r.CreatedAt > DateTime.Now.Subtract(TimeSpan.FromMinutes(10))
            select r;
to the NativeQuery syntax (with delegates replaced by lambdas):
var items = db.Query<RosterItem>(r => r.CreatedAt > DateTime.Now.Subtract(TimeSpan.FromMinutes(10)));
It's still a fairly compact and straightforward syntax, so until I finish setting up my own CentOS mono RPMs, I'll stick to this syntax.
I need to run db4o through some more serious paces, but I like what I see so far.
Labels: .net, c#, db4o, linq