mindtouch

Sharing data without sharing data state

I'm taking a break from Promise for a post or two to jot down some stuff that I've been thinking about while discussing future enhancements to MindTouch Dream with @bjorg. In Dream all service-to-service communication is done via HTTP (although the traffic may never hit the wire). This is very powerful and flexible, but it also has performance drawbacks, which have led to many data sharing discussions.

Whether you are using data as a message payload or just putting data in a cache, you want sender and receiver to be unable to see each other's interactions with that data, which is what would happen if the data were a shared, mutable instance. Allowing shared modification, whether on purpose or by accident, can have very problematic consequences:

  1. Data corruption: Unless you wrap the data with a lock, two threads could try to modify the data at the same time.
  2. Difference in distributed behavior: As soon as the payload crosses a process boundary, it ceases to be shared, so changing the topology changes the data's behavior.

There are a number of different approaches for dealing with this, each a trade-off in performance and/or usability. I'll use caching as the use case, since it's a bit more universal than message passing, but the same patterns apply.

Cloning

A naive implementation of a cache might just be a dictionary. Sure, you've wrapped the dictionary access with a mutex, so that you don't get corruption accessing the data. But multiple threads would still have access to the same instance. If you aren't aware of this sharing, expect to spend lots of time trying to debug this behavior. If you are lucky, the program crashes because of an access violation of some sort. If you are unlucky, nothing crashes and you get strange data corruption that you won't even know about until your data is in shambles.
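
To make the problem concrete, here is a minimal sketch of such a naive cache. The Settings payload and all class names are hypothetical, made up for illustration. The lock protects the dictionary, but every caller still gets the very same instance back:

```csharp
using System.Collections.Generic;

// Hypothetical payload type, used only for illustration.
public class Settings {
    public string Theme { get; set; } = "";
}

// Naive cache: the lock guards the dictionary, not the objects stored in it.
public class NaiveCache {
    private readonly object _sync = new object();
    private readonly Dictionary<string, Settings> _items = new Dictionary<string, Settings>();

    public void Put(string key, Settings value) {
        lock (_sync) { _items[key] = value; }
    }

    public Settings Get(string key) {
        lock (_sync) { return _items[key]; }
    }
}

public static class NaiveCacheDemo {
    // Returns the theme a second reader sees after the first reader edits "its" instance.
    public static string Run() {
        var cache = new NaiveCache();
        cache.Put("user:1", new Settings { Theme = "light" });

        var first = cache.Get("user:1");
        first.Theme = "dark"; // no lock is held here, and none would help

        var second = cache.Get("user:1");
        return second.Theme; // same instance, so the edit leaked through the cache
    }
}
```

Calling NaiveCacheDemo.Run() returns "dark": the first reader's local edit has silently become the cached value for everybody else.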

Easy, we'll just clone the data going into the cache. Hrm, but now two threads getting the value are still messing with each other. Ok, fine, we'll clone it coming out of the cache instead. Ah, but then the original thread may still be manipulating its instance while others are getting the data, so the cache keeps changing. That kind of invalidates the purpose of caching data.

So, with cloning we have to copy the data going in and coming back out. That's quite a bit of copying and in the case that the data goes into the cache and expires before someone uses it, it's a wasted copy to boot.
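
A minimal sketch of the copy-in/copy-out cache described above. The Profile type and the clone delegate are assumptions for illustration; any deep-copy function would do:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical payload with a hand-written copy method.
public class Profile {
    public string Name { get; set; } = "";
    public Profile Clone() { return new Profile { Name = Name }; }
}

public class CloningCache<T> {
    private readonly object _sync = new object();
    private readonly Dictionary<string, T> _items = new Dictionary<string, T>();
    private readonly Func<T, T> _clone;

    public CloningCache(Func<T, T> clone) { _clone = clone; }

    public void Put(string key, T value) {
        var copy = _clone(value); // copy in: later edits by the originator stay out
        lock (_sync) { _items[key] = copy; }
    }

    public T Get(string key) {
        T stored;
        lock (_sync) { stored = _items[key]; }
        return _clone(stored);    // copy out: receivers never touch the stored instance
    }
}
```

Constructed as new CloningCache<Profile>(p => p.Clone()), the cached value stays stable no matter what originator or receivers do to their own instances, at the price of two copies per round trip.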

Immutability

If you've paid any attention to concurrency discussions, you've heard the refrain from the functional camp that data should be immutable. Every modification of the data should be a new copy, with the original unchanged. This is certainly ideal for sharing data without sharing state. It's also a degenerate version of the cloning approach above, in that we are constantly cloning, whether we need to or not.

Unless your language supports immutable objects at a fundamental level, you are likely to be building this by hand. There are certainly ways of mitigating the cost, such as lazy cloning or journaling, i.e. figuring out when to copy what in order to stay immutable. But you are likely going to be building a lot of plumbing.

But if the facilities exist and if the performance characteristics are acceptable, Immutability is the safest solution.
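
For a sense of what "building this by hand" looks like in C#, here is a minimal sketch. The Address type and its With* methods are hypothetical, and getter-only auto-properties assume C# 6 or later:

```csharp
// Hand-built immutable type: state is set once in the constructor and never changes.
public sealed class Address {
    public string Street { get; }
    public string City { get; }

    public Address(string street, string city) {
        Street = street;
        City = city;
    }

    // Every "modification" returns a new instance and leaves this one untouched.
    public Address WithStreet(string street) { return new Address(street, City); }
    public Address WithCity(string city) { return new Address(Street, city); }
}
```

An instance like this can be handed to a cache or to another thread as-is. The plumbing cost shows up in the With* methods, which have to be written for every field of every payload type.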

Serialization

So far I've ignored the distributed case, i.e. sending a message across process boundaries or sharing a cache between processes. Both Cloning and Immutability rely on manipulating process memory. The moment the data needs to cross process boundaries, you need to convert it into a format that can be re-assembled into the same graph, i.e. you need to serialize and deserialize the data.

Serialization is another form of Immutability, since you've captured the data state and can re-assemble it into the original state with no ties to the original instance. So Serialization/Deserialization is a form of Cloning and can be used as an engine for Immutability as well. And it goes across the wire? Sign me up, it's all I need!
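
As a rough illustration of serialization doubling as a cloning engine, here is a sketch that round-trips an object through JSON. It assumes System.Text.Json, which is much newer than this post and is just one of many possible serializers. The Order type is hypothetical:

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical payload; only public properties take part in the round trip.
public class Order {
    public string Id { get; set; } = "";
    public List<string> Items { get; set; } = new List<string>();
}

public static class SerializationClone {
    // Deep copy by serializing and deserializing. The intermediate string is
    // exactly what would be sent if the data had to cross a process boundary.
    public static T DeepClone<T>(T value) {
        string json = JsonSerializer.Serialize(value);
        return JsonSerializer.Deserialize<T>(json)!;
    }
}
```

The clone shares nothing with the original, including nested collections, which is what makes this attractive as a one-size-fits-all copy mechanism.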

Just like Immutability, if the performance characteristics are acceptable, it's a great solution. And of course, all serializers are not equal. .NET's default serializer, I believe, exists as a schoolbook example of how not to do it. It's by far the slowest, biggest and least flexible one. On the other end of the scale, Google's protobuf is the fastest and most compact I've worked with, but there are some flexibility concessions to be made. BSON is a decent compromise when more flexibility is needed. A simple, fast and small enough serializer for .NET that I like is @karlseguin's Metsys.Little. Regardless of which one you pick, even the best serializer is still a lot slower than copying in-process memory, never mind not having to copy that memory at all.

Freeze

It would be nice to avoid the implicit copies and only copy or serialize/deserialize when we need to. What we need is a way for the originator to declare that no more changes will be made to the data, and for the receivers of the data to declare whether they intend to modify the retrieved data, providing the following usage scenarios:

  • Originator and receiver won't change the data: same instance can be used
  • Originator will change data, receiver won't: need to copy in, but not coming out
  • Originator won't change the data, receiver will: can put instance in, but need to copy on the way out

In Ruby, freeze is a core language concept (although I confess I don't know how to get a mutable instance back again, or whether it works on object graphs as well). To let the originator and receiver declare their intended use of data in .NET, we could require data payloads to implement an interface, such as this:

public interface IFreezable<T> {
  bool IsFrozen { get; }

  void Freeze(); // freeze instance (no-op on frozen instance)
  T FreezeDry(); // return a frozen clone or if frozen, the current instance
  T Thaw();      // return an unfrozen clone (regardless whether instance is frozen)
}

On submitting the data, the container (cache or message pipeline) will always call FreezeDry() and store the returned instance. If the originator does not intend to modify the instance submitted further, it can Freeze() it first, turning the FreezeDry() that the container does into a no-op.

On receipt of the data, the instance is always frozen, which is fine for any reference use. But should the receiver need to change it for local state tracking, or submitting the changed version, it can always call Thaw() to get a mutable instance.
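
To show how the pieces fit together, here is a minimal sketch of a payload implementing the interface and a container that always calls FreezeDry(). The Message payload and FreezingCache are hypothetical. This is one possible reading of the contract, not the WIP code mentioned below:

```csharp
using System;
using System.Collections.Generic;

public interface IFreezable<T> {
    bool IsFrozen { get; }
    void Freeze();
    T FreezeDry();
    T Thaw();
}

// Hypothetical payload; the guard in the setter is what gives "frozen" teeth.
public class Message : IFreezable<Message> {
    private string _body = "";

    public bool IsFrozen { get; private set; }

    public string Body {
        get { return _body; }
        set {
            if (IsFrozen) throw new InvalidOperationException("instance is frozen");
            _body = value;
        }
    }

    public void Freeze() { IsFrozen = true; }

    public Message FreezeDry() {
        if (IsFrozen) return this;               // already frozen: no copy needed
        var clone = new Message { Body = _body };
        clone.Freeze();
        return clone;
    }

    public Message Thaw() { return new Message { Body = _body }; }
}

public class FreezingCache<T> where T : class, IFreezable<T> {
    private readonly object _sync = new object();
    private readonly Dictionary<string, T> _items = new Dictionary<string, T>();

    public void Put(string key, T value) {
        var frozen = value.FreezeDry(); // no-op if the originator froze it first
        lock (_sync) { _items[key] = frozen; }
    }

    public T Get(string key) {
        lock (_sync) { return _items[key]; } // always a frozen instance
    }
}
```

An originator that calls Freeze() before Put() pays for no copy at all, and a receiver only pays for one if it calls Thaw(). A real payload with nested objects would also have to freeze or copy the whole graph, which this sketch glosses over.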

While IFreezable certainly offers some benefits, it'd be a pain to add to every data payload we want to send. This kind of plumbing is a perfect scenario for AOP, since it's a concern of the data consumer, not of the data. In my next post, I'll talk about some approaches to avoid the plumbing. In the meantime, the WIP code for that post can be found on github.

About Concurrent Podcast #3: Coroutines

Posted a new episode of the Concurrent Podcast over on the MindTouch developer blog. This time Steve and I delve into Coroutines, a programming pattern we use extensively in MindTouch 2009 and one that I’m also trying out as an alternative to my actor based Xmpp code in Notify.me.

Since there isn’t a native coroutine framework in C#, we’re using the one provided by MindTouch Dream. It’s built on top of the .NET iterator pattern (i.e. IEnumerable and yield) and makes the assumption that all coroutines are asynchronous methods using Dream’s Result<T> object for coordinating the producer and consumer of a return value. Steve’s previously blogged about Result. Since those posts, there have also been a lot of performance and capability improvements to Result committed to trunk, primarily providing robust cancellation with resource cleanup callbacks. For background on coroutines, you can also check out previous posts I’ve written.
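
For readers who haven’t seen iterators used this way, here is a minimal sketch of the underlying idea in plain C#. It deliberately does not use Dream’s API: the Worker coroutine and the round-robin scheduler are made up for illustration, and a real framework would resume a coroutine when its pending result completes rather than simply taking turns.

```csharp
using System.Collections.Generic;

public static class CoroutineSketch {
    // A coroutine is an iterator: each "yield return" is a point where it suspends.
    private static IEnumerator<string> Worker(string name, int steps, List<string> log) {
        for (int i = 1; i <= steps; i++) {
            log.Add(name + " step " + i);
            yield return name; // suspend here; the scheduler decides when to resume
        }
    }

    // Minimal round-robin scheduler: resume each coroutine in turn until all finish.
    public static List<string> Run() {
        var log = new List<string>();
        var ready = new Queue<IEnumerator<string>>();
        ready.Enqueue(Worker("A", 2, log));
        ready.Enqueue(Worker("B", 1, log));

        while (ready.Count > 0) {
            var current = ready.Dequeue();
            if (current.MoveNext()) ready.Enqueue(current); // suspended again, requeue
            else current.Dispose();                         // ran to completion
        }
        return log;
    }
}
```

Each Worker reads as one linear block of code, yet the log comes out interleaved ("A step 1", "B step 1", "A step 2") because control returns to the scheduler at every yield.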

The cool thing about asynchronous coroutines compared to an actor model is that call/response based actions can be written as a single linear block of code, rather than separate message handlers whose contiguous flow can only be determined by examining the message dispatcher. With a message dispatcher that can correlate message responses with suspended coroutines, sending and waiting for a message in a coroutine can be made to look like a method call without blocking the thread. Especially with message passing concurrency, that is vital, since a response isn’t in any way guaranteed to happen.

I’m due to write another post on how to use Dream’s coroutine framework, but in the meantime I highly recommend checking out Dream from MindTouch’s svn. Lots of cool concurrency stuff in there. Trunk is under heavy development as we work towards Dream profile 2.0, but 1.7.0 is stable and production proven.

Concurrent Podcast and Producer/Consumer approaches

As usual, I’ve been blogging over on the MindTouch Developer blog, and since the topics I post about over there have a pretty strong overlap with what I’d post here, I figured I might as well start cross-posting about it here.

Aside from various technical posts, Steve Bjork and I have started recording a Podcast about concurrent programming. It’s currently 2 episodes strong, with a third one coming soon. Information on past and future posts can always be found here.

Today’s post on the MindTouch dev blog is about the producer/consumer pattern and how I moved from using dedicated workers with a blocking queue to using Dream’s new ElasticThreadPool to dispatch work.

Blogging on MindTouch Dev Blog

Once again, there’s been extended silence over here. I have several article drafts that keep getting the short end of my time in favor of coding. In the meantime, I have blogged a couple of articles over on the MindTouch Dev blog.

In general, I’ve just been spending a lot of time working on RESTful services, both for MindTouch and designing the new Notify.me REST API.

By arne on | Uncategorized | A comment?

Yet another blog to divert my attention

I’m now posting on yet another blog, diverting my attention from here. Just added a post on the MindTouch Developers blog about the upcoming pubsub/event system for Deki.


Dream access control

Just finished an article over on the MindTouch blog about tweaking Dream’s default access patterns. I really like how Dream uses cookies, something you don’t often see in REST services. Generally it’s all about X-My-Cool-Auth-Header business, which is yet another manual burden for developers. Not sure if this originated because people did raw http requests and either didn’t know that most http request mechanisms have cookie support (even curl has a cookie jar), or whether it was a dislike of cookies.

The article also briefly touches on Prologues and Epilogues, a topic I need to go into in more detail some time in the future. Basically every Feature call can have n pre and post actions that can do anything from checking authentication to mutating the request (think accepting data in JSON or XML and having a prologue and epilogue do transformations on the way in and out, so that the feature itself doesn’t have to worry about the data format but can assume that it always gets XML). The system kind of reminds me of apache handler chaining from mod_perl.


Dream for REST and async

I’ve been doing a lot of work inside of MindTouch Dream as of late over at MindTouch and I’m really digging it. Steve’s put together an awesome framework for doing asynchronous programming on .NET and for being able to treat all access as RESTful resources in your server side code.

Now, coming from a very Dependency Injection heavy design philosophy, Dream has been a bit of an adjustment for me, but the capabilities of Dream, especially the coroutine approach for dealing with requests, are very powerful and fairly intuitive once you get your head around them.

In an effort to ease the Dream learning curve and cement my understanding of the code base, I’ll be blogging articles about it as I go along, and cross-posting them to the MindTouch developer wiki as well. My first article was a continuation of Steve’s Asynchronicity library series, this one about coroutines (read: yield) in Dream.

I’ve been using the C# Web Server project for my REST work up until recently, but I’m currently in the process of migrating it over to Dream. It just removes all the legwork and fits much better into the async workflow of the rest of notify.me.

Clearly I am biased, but seriously, if you need to build REST interfaces in .NET, Dream beats anything you can roll on your own in a reasonable amount of time, and definitely is about 1000% more powerful than trying to force WCF down the REST or even POX path.

A case for TDD

I know, it’s been rather quiet here lately. I’ve just been slammed with coding, so writing things up is falling behind. In addition, my blogging time’s going to be split between techblog.notify.me and www.mindtouch.com/blog. And I’m behind on both of those as well. I should have some fun Dream and DekiScript stuff for the mindtouch blog and some asynchronous programming for the notify.me blog soon. As soon as I can get myself to stop coding again.

So what makes me stop coding for a minute to babble on? It’s just a quick case study of why TDD is important.

I’ve been practicing Inversion of Control/Dependency Injection for about a year and a half, and while I’ve eased my way into it, I’m pretty much at the “an interface for every class” stage of having everything abstracted so I can easily mock things. But here and there, I take in third party assemblies for my projects. And most of the time, they are not well interfaced. Generally I try to create a facade that is interfaced, so I can test my interaction in isolation. But depending on how many secondary classes their code uses, sometimes my facade gets lazy, leaving places I can’t mock.

Now, I’m pretty religious about test coverage, but I do have holes where my facade leaves untestable bits. And this is where TDD shows its worth. Because when a feature is added or a refactor happens, almost with 100% certainty, the bugs that manage to get into production are in the code that doesn’t have test coverage.

The lesson here is that the time saved in not building a properly mockable facade, thereby torpedoing my testability, is repaid manyfold in debugging later as bugs make it into production. meh.

By arne on | .net, geek | A comment?