
Type-safe actor messaging approaches

For notify.me I hand-rolled a simple actor system to handle all Xmpp traffic. Every user in the system has their own actor that maintains their Xmpp state, tracking online status, resources, resource capabilities, notification queues and command capabilities. When a message comes in, either via our internal notification queues or from the user, a simple dispatcher sends the message on to the actor, which handles it and responds with a message that the dispatcher either hands off to the Xmpp bot for formatting and delivery to the client, or sends to our internal queues for propagation to other parts of the notify.me system.

This has worked flawlessly for over two years now, but its ad-hoc nature makes it a fairly high-touch system in terms of extensibility, which has led me to building a more general actor system. Originally Xmpp was our backbone transport among actors in the notify.me system, but at this point I would like to use Xmpp only as an edge transport, and otherwise use in-process mailboxes, serializing via protobuf for remote actors. I still love the Xmpp model for distributing work, since nodes can just come up anywhere, sign into a chatroom and report for work. You get broadcast, online monitoring, point-to-point messaging, etc. all for free. But it means all messages go across the Xmpp backbone, which has a bit of overhead, and with thousands of actors I'd rather stay in-process when possible. No point going out to the Xmpp server and back just to talk to the actor next to you. I will likely still use Xmpp for Actor Host nodes to discover each other, but the actual inter-node communication will be direct HTTP-RPC (no, it's not RESTful, if it's just messaging).

Defining the messaging contract as an interface

One design approach I'm currently playing with is using actors that expose their contract via an interface. Keeping the share-nothing philosophy of traditional actors, you still won't have a reference to an actor, but since you know its type, you know exactly what capabilities it has. That means that rather than having a single receive point on the actor and making it responsible for routing the message internally based on message type (a capability that lends itself better to composition), messages can arrive directly at their endpoints by signature. Another benefit is that testing the actor's behavior is separate from testing its routing rules.

public interface IXmppAgent {
    void Notify(string subject, string body);
    OnlineStatus QueryStatus();
}

Given this contract we could just proxy the calls. So our mailbox could have a proxy factory like this:

public interface IMailbox {
    TRecipient For<TRecipient>(string id);
}

allowing us to send messages like this:

var proxy = _mailbox.For<IXmppAgent>("foo@bar.com");
proxy.Notify("hey", "how'd you like that?");
var status = proxy.QueryStatus();

But messaging is supposed to be asynchronous

While this is simple and decoupled, it is implicitly synchronous. Sure, .Notify could be considered a fire-and-forget message, but .QueryStatus definitely blocks. And if we wanted to communicate an error condition like not finding the recipient, we'd have to do it as an exception, moving errors into the synchronous pipeline as well. In order to retain the flexibility of a pure message architecture, we need a result handle that lets us handle results and/or errors via continuation.

My first pass at an API for this resulted in this calling convention:

public interface IMailbox {
    void Send<TRecipient>(string id, Expression<Action<TRecipient>> message);
    Result SendAndReceive<TRecipient>(string id, Expression<Action<TRecipient>> message);
    Result<TResponse> SendAndReceive<TRecipient, TResponse>(
        string id,
        Expression<Func<TRecipient, TResponse>> message
    );
}

transforming the messaging code to this:

_mailbox.Send<IXmppAgent>("foo@bar.com", a => a.Notify("hey", "how'd you like that?"));
var result = _mailbox.SendAndReceive<IXmppAgent, OnlineStatus>(
    "foo@bar.com",
    a => a.QueryStatus()
);

I'm using MindTouch Dream's Result<T> class here instead of Task<T>, primarily because it's battle-tested and I have not properly tested Task under mono yet, which is where this code has to run. In this API, .Send is meant for fire-and-forget style messaging, while .SendAndReceive provides a result handle -- and if Void were an actual type, we could have dispensed with the overload. The result handle has the benefit of letting us choose how we want to deal with the asynchronous response. We could simply block:

var status = _mailbox.SendAndReceive<IXmppAgent, OnlineStatus>(
        "foo@bar.com",
        a => a.QueryStatus())
    .Wait();
Console.WriteLine("foo@bar.com status: {0}", status);

or we could attach a continuation to handle it out of band of the current execution flow:

_mailbox.SendAndReceive<IXmppAgent, OnlineStatus>(
        "foo@bar.com",
        a => a.QueryStatus()
    )
    .WhenDone(r => {
        var status = r.Value;
        Console.WriteLine("foo@bar.com status: {0}", status);
    });

or we could simply suspend our current execution flow by invoking it from a coroutine:

var status = OnlineStatus.Offline;
yield return _mailbox.SendAndReceive<IXmppAgent, OnlineStatus>(
        "foo@bar.com",
        a => a.QueryStatus()
    )
    .Set(x => status = x);
Console.WriteLine("foo@bar.com status: {0}", status);

Regardless of completion strategy, we have decoupled the handling of the result and error conditions from the message recipient's behavior, which is the true goal of the message-passing decoupling in an actor system.

Improving usability

Looking at the signatures there are two things we can still improve:

  1. If we send a lot of messages to the same recipient, the syntax is a bit repetitive and verbose
  2. Because we need to specify the recipient type, we also have to specify the return value type, even though it should be inferable

We can address both of these by providing a factory method for a typed mailbox:

public interface IMailbox {
    IMailbox<TRecipient> To<TRecipient>(string id);
}

public interface IMailbox<TRecipient> {
    void Send(Expression<Action<TRecipient>> message);
    Result SendAndReceive(Expression<Action<TRecipient>> message);
    Result<TResponse> SendAndReceive<TResponse>(
        Expression<Func<TRecipient, TResponse>> message
    );
}

which lets us change our messaging to:

var actorMailbox = _mailbox.To<IXmppAgent>("foo@bar.com");
actorMailbox.Send(a => a.Notify("hey", "how'd you like that?"));
var result2 = actorMailbox.SendAndReceive(a => a.QueryStatus());

// or inline
_mailbox.To<IXmppAgent>("foo@bar.com")
    .Send(a => a.Notify("hey", "how'd you like that?"));
var result3 = _mailbox.To<IXmppAgent>("foo@bar.com")
    .SendAndReceive(a => a.QueryStatus());

I've included the inline version because it is still more compact than the explicit version, since it can infer the result type.

Supporting Remote Actors

The reason the mailbox uses Expression instead of raw Action and Func is that at any point an actor we're sending a message to could be remote. The moment we cross process boundaries, we need to serialize the message. That means we need to be able to programmatically inspect the expression and build a serializable AST, as well as serialize the captured data members used in the expression.

Since we're serializing anyway, inspecting the expression also allows us to verify that all members are immutable. For value types this is easy enough, but DTOs would need to be prevented from changing, so that local vs. remote invocation won't end up with different results just because the sender changed its copy. We could handle this via serialization at message send time, although this looks like a perfect place to see how well the Freezable pattern works.
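
To make the inspection part concrete, here's a minimal sketch of how a message expression might be decomposed before serialization -- MessageEnvelope and Capture are hypothetical names for illustration, not part of the actual mailbox:

using System;
using System.Linq;
using System.Linq.Expressions;

// hypothetical wire format for a captured message
public class MessageEnvelope {
    public string MethodName;
    public object[] Arguments;
}

public static class MessageCapture {

    // decompose a single method call expression into the method name and
    // the evaluated argument values, ready for serialization
    public static MessageEnvelope Capture<TRecipient>(Expression<Action<TRecipient>> message) {
        var call = message.Body as MethodCallExpression;
        if(call == null) {
            throw new ArgumentException("message must be a single method call");
        }

        // compile and invoke each argument expression to capture the data it closes over
        var arguments = call.Arguments
            .Select(a => Expression.Lambda(a).Compile().DynamicInvoke())
            .ToArray();
        return new MessageEnvelope {
            MethodName = call.Method.Name,
            Arguments = arguments
        };
    }
}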

Porting ASP.NET MVC to Ruby on Rails

This isn't yet another .NET developer defecting to Ruby. I have very little interest in making Ruby my primary language. I've done a couple of RoR projects over the years, nothing serious I admit, but I just don't seem to enjoy it the way so many of my peers do. That said, RoR does hit a sweet spot for websites. The site I'm porting has very little in terms of business logic -- it's primarily HTML templating with navigation -- so this was an exercise in circumventing my mod_mono issues.

I'm a huge C# fanboy, but having worked with ASP.NET MVC for a while, I have to admit that the amount of cruft one has to assemble to stay DRY in ASP.NET templating is just not worthwhile. While views can be strongly typed, it's an exercise in frustration trying to write templates generically. Maybe this becomes easier with dynamic usage in MVC3, but I haven't checked it out. What certainly doesn't help is that the MVC team decided to make TemplateHelper internal, turning the addition of helpers in the vein of .DisplayFor or .EditorFor into a major task that still ends up being a pile of hacks. Now, I'm not an ASP.NET MVC expert and there are probably a lot of extension points I just don't know about. But the articles on extending it that I have found are usually pages of code. I shouldn't have to become a framework internals expert just to add some generic templating extensibility.

Ok, enough ranting. ASP.NET MVC is still a huge improvement over webforms, but right now I'm watching Manos de Mono and OWIN to see what develops for websites in .NET land. The ASP.NET stack, in my opinion, is just too heavy for something that should be simple.

So, why RoR instead of node.js, since I claimed I was going to get serious about javascript this year? Mostly because this port has a deadline, so "use what you know" applies, and it's a production site, so "use known stable tech" applies. Another benefit is that RoR uses the same <% %> syntax as webforms views, and MVC was clearly heavily inspired by RoR.

I ported the site over 3 nights, maybe 10 hours of cumulative seat time, which feels like time well spent. Strategic search and replace got me 80% there, faking Html.* for my custom extensions in RoR got me another 10%, leaving only 10% for actual new business logic written in ruby. Once I get to more complex business logic for the site I may stick with Ruby, although I know I'll be sorely tempted to write it as REST services in C# on top of Dream.

More on .ToList() vs. .ToArray()

Like my last post, "Materializing an Enumerable", this may be a bit academic, but as a linq geek, whether I should use .ToList() or .ToArray() is something that piques my curiosity. Most of the time when I return IEnumerable<T>, I want to do so in a threadsafe manner, i.e. I don't want the list to change underneath the iterator, so I return a unique copy. For this I have always used .ToArray(), since it's immutable and I figured it was leaner.

Finally having challenged this assumption, it turns out that .ToList() is theoretically faster for sources that are not ICollection<T>. When the final count is known, as is the case with ICollection<T>, both .ToList() and .ToArray() allocate a sufficiently large array under the hood and copy the source into it. When the count isn't known, however, both allocate an array and write to it, copying the contents to a larger array any time the size is exceeded. So far, both are nearly identical in execution. However, once the end of the source is reached, .ToList() is done, while .ToArray() does one more copy to return a properly sized array. Of course, the overhead of iterating over that source, which is more than likely hitting some I/O or computation barrier, means that in terms of measurable performance difference, both are again identical.
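
Not that you should take my word for it; here's a quick measurement sketch. The Where clause hides the source count, so neither method can pre-size its buffer, and the absolute numbers will of course vary by machine:

using System;
using System.Diagnostics;
using System.Linq;

class ToListVsToArray {
    static void Main() {

        // Where() hides the count, forcing both methods to grow their buffers
        var source = Enumerable.Range(0, 1000000).Where(i => i % 2 == 0);
        var stopwatch = Stopwatch.StartNew();
        for(var i = 0; i < 100; i++) { source.ToList(); }
        Console.WriteLine("ToList:  {0}ms", stopwatch.ElapsedMilliseconds);
        stopwatch.Restart();
        for(var i = 0; i < 100; i++) { source.ToArray(); }
        Console.WriteLine("ToArray: {0}ms", stopwatch.ElapsedMilliseconds);
    }
}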

It is still true that a List<T> object uses more memory than a T[], but even that difference is almost always going to be irrelevant, as the collection's size is insignificant compared to the items it contains. That means that using .ToList() or .ToArray() to create an IEnumerable<T> is really a matter of personal preference.

Materializing an Enumerable

Yesterday I posted the question "Is there a way to Memorize or Materialize an IEnumerable?" on stackoverflow, hoping there was already a built-in way in the BCL. The answers and comments showed that there wasn't, but they also challenged my existing assumptions and illustrated that materializing and/or memorizing could be interpreted in a number of ways. I figured that amount of ambiguity required a deeper dive into the subject.

What's this for anyway?

I try to use IEnumerable<T> as the return value for any method that is supposed to return a sequence meant purely for consumption. I choose IEnumerable<T> over an array or list because T[] exposes an unneeded implementation detail, while returning IList<T> or ICollection<T> allows modification of the sequence, which is almost always undesirable behavior. And that doesn't even address the fact that the enumerable might be a stream of items coming from an external source like a database cursor, a file stream or the execution of a linq AST.

The drawback is that making multiple calls on an IEnumerable<T> that enumerate it under the hood may either incur a large cost, in the case of executing a linq AST repeatedly, or fail, in the case of a stream or cursor. In order to be able to do something like the below, you really want to be certain that you have a finite sequence to query:

if(enumerable.Any()) {
  foreach(var item in enumerable) {
    ...
  }
} else {
  ...
}

.Any() has to get an enumerator and call .MoveNext() once to see if it returns true and foreach, of course, gets the enumerator and iterates over it until the end. In order to safely write the above code, you really want the IEnumerable<T> converted into a computed collection.

The usual solution is to just call either .ToList() or .ToArray() and be done with it. But both have undesirable side-effects. Both will always create a new copy of the collection, which may come at a not insignificant cost. And both change the type from IEnumerable<T>. Sure, you can cast it back, but because neither call is idempotent, casting back to IEnumerable<T> hides the only clue that you don't want to call .ToList()/.ToArray() again. In addition, .ToList() also produces a mutable collection.

Most of the time, none of these side-effects are significant detractors, but should you return the memorized version from a method, you probably would want to cast it back to IEnumerable<T>, and then the cost of this behavior can start to add up. Having a method that lets you memorize or materialize in an idempotent fashion would be useful.

Memorize()

What is the expected behavior of .Memorize()? It should capture the state of the sequence at the time of the call, return an immutable sequence, and force that sequence into memory so that multiple enumerations are relatively cheap. This one is fairly simple to implement:

public static IEnumerable<T> Memorize<T>(this IEnumerable<T> enumerable) {
    return enumerable.GetType().IsArray ? enumerable : enumerable.ToArray();
}

Arrays are already immutable sequences, so we can use them reliably as our memorized collection. And if the source already is an array, we can safely return it unmodified. Now we can pass the resulting enumerable around without concern that someone else calling .Memorize() again will needlessly copy it.

Materialize()

Unlike .Memorize(), .Materialize() does not imply that the enumerable becomes a private, immutable copy. It only wants to make certain that the type can be safely enumerated. This lesser requirement actually complicates the idempotency scenario, requiring an intermediate collection class to be created:

public static class LinqEx {

    public static IEnumerable<T> Materialize<T>(this IEnumerable<T> enumerable) {
        if(enumerable is MaterializedEnumerable<T> || enumerable.GetType().IsArray) {
            return enumerable;
        }
        return new MaterializedEnumerable<T>(enumerable);
    }

    private class MaterializedEnumerable<T> : IEnumerable<T> {
        private readonly ICollection<T> _collection;
        public MaterializedEnumerable(IEnumerable<T> enumerable) {
            _collection = enumerable as ICollection<T> ?? enumerable.ToArray();
        }

        public IEnumerator<T> GetEnumerator() {
            return _collection.GetEnumerator();
        }

        IEnumerator IEnumerable.GetEnumerator() {
            return GetEnumerator();
        }
    }
}

The purpose of MaterializedEnumerable<T> is to act as a marker for a previous materialization that can wrap or coerce a collection, so that no unnecessary copying is done.
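
In practice it looks like this (GetRecords standing in for any method that returns a lazy IEnumerable<T>):

// materialize once, then enumerate as often as needed
var records = GetRecords().Materialize();
if(records.Any()) {
    foreach(var record in records) {
        Console.WriteLine(record);
    }
}

// idempotent: this returns the same instance without copying
var same = records.Materialize();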

A word on the use of .ToArray() instead of .ToList(): I've always leaned towards .ToArray(), both because it creates an immutable collection and because I thought arrays to be more lightweight than lists. After cracking them both open in Reflector, it became apparent that they should be about the same, which some simple tests confirmed: there is no significant difference.

While memorize and materialize have subtly different meanings, both intending to optimize access to an enumerable idempotently, in day-to-day use simply calling .ToArray() will usually be just fine.

Func/Action vs. Delegate

A while back I wrote that you really never have to write another delegate again, since any delegate can easily be expressed as an Action or Func. After all, what's preferable? This:

var work = worker.ProcessTaskWithUser(delegate(Task t, User u) {
  // define the work callback
});

or this:

var work = worker.ProcessTaskWithUser((t, u) => {
  // define the work callback
});

I know I prefer lambdas over delegates. But this is just on the consuming end. The signature for the above could be either:

delegate Task TaskUserDelegate(Task inputTask, User contextUser);
IEnumerable<Task> ProcessTaskWithUser( TaskUserDelegate processCallback );

or:

IEnumerable<Task> ProcessTaskWithUser( Func<Task,User,Task> processCallback );

Either one can be used with the same lambda, so using the delegate doesn't inconvenience us in consumption. But writing the Func version is certainly more concise, so it seems like the winner once again. In terms of consuming that API, though, we've lost the signature of the method that would explain what each parameter is used for. Sure, .Where(Func<T,bool> filter) is pretty self-explanatory, but .WhenDone(Func<T,V,string,T> callback) really doesn't tell us much of anything.

So there seems to be a straightforward usability rule of thumb: use a delegate if the parameters' meaning isn't obvious from the usage of the lambda. But if the goal here is to make it easier for the consumer of the API, unfortunately it's not that simple, since the primary tool for communicating the API's documentation, intellisense, actually makes things worse.

Usability of delegate

For maximum usability, let's document the API so its meaning is discoverable:

/// <summary>
/// The task user delegate is meant to transform a given task into a new one in the context of a user.
/// </summary>
/// <param name="inputTask">The task to transform.</param>
/// <param name="activeUser">The user context to use for the transform.</param>
/// <returns>A new task in the user's context.</returns>
delegate Task TaskUserDelegate(Task inputTask, User activeUser);

/// <summary>
/// Transform all tasks for a set of users.
/// </summary>
/// <param name="processCallback">Callback for transforming each task for a specific user</param>
/// <returns>Sequence of transformed tasks</returns>
IEnumerable<Task> ProcessTaskWithUser(TaskUserDelegate processCallback) {
    //...
}

And this is what it looks like on code completion:

While TaskUserDelegate is well documented, this does not get exposed via intellisense. Worse, this signature tells us nothing about the arguments for our lambda. So, yeah, we created a better documented API, but made its discovery worse.

Usability of func

Now, let's do the same for the func signature:

/// <summary>
/// Transform all tasks for a set of users.
/// </summary>
/// <param name="processCallback">Callback for transforming each task for a specific user</param>
/// <returns>Sequence of transformed tasks</returns>
IEnumerable<Task> ProcessTaskWithUser(Func<Task, User, Task> processCallback) {
    //...
}

which gives us this completion:

Now we at least know the exact signature of the lambda we're creating, even if we don't know what the purpose of the arguments is.

Best usability: Plain Old Documentation

In both cases, the best discoverability ends up being plain old textual documentation of the parameter. And even though the delegate provides extra documentation possibilities, access to it is inconvenient enough that for expediency I'd still have to vote for the Func signature.

The one exception to the rule would be a lambda that is meant as a dependency, i.e. a class or method that has a callback that you attach for later use rather than immediate dispatch. In that case the lambda really functions as a single-method interface and should be treated like any other dependency and be as explicit as possible.
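
To illustrate that exception, here's a hypothetical callback held as a dependency (RetryPolicy and Worker are made-up names), where the explicit delegate type documents the contract far better than a bare Func<int, Exception, bool> would:

/// <summary>Decides whether a failed operation should be attempted again.</summary>
/// <param name="attempt">The number of attempts made so far.</param>
/// <param name="error">The exception that caused the last failure.</param>
/// <returns>true to retry, false to give up.</returns>
public delegate bool RetryPolicy(int attempt, Exception error);

public class Worker {
    private readonly RetryPolicy _shouldRetry;

    // the callback is stored for later use, not dispatched immediately
    public Worker(RetryPolicy shouldRetry) {
        _shouldRetry = shouldRetry;
    }
}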

I need a pattern for extending a class by an unknown other

I'm currently splitting up some MindTouch Dream functionality into client and server assemblies and running into a consistent pattern where certain classes have augmented behavior when running in the server context instead of the client context. Since the client assembly is used by the server assembly, and these classes are part of the client assembly, they do not even know about the concept of a server context. Which leads me to the dilemma: how do I inject different behavior into these classes when they run under the server context?

Ok, the short answer to this is probably "you're doing it wrong", but let's just go with it and accept that this is what's happening. Let's also assume that the server context can be discovered by static means (i.e. a static singleton accessor) which also lives in the server assembly.

Here's what I've come up with: extract the desired functionality into an interface and chain implementations together. Each class that needs this facility creates its own interface and default implementation. It looks something like this:

public interface IFooHandler {
  ushort Priority { get; }
  bool TryFoo(string input, out string output);
}

TryFoo gives the handler implementation a chance to look at the inputs and decide whether to handle them or pass. Using the collection of handlers takes the following form:

public string Foo(string input) {
  string output = null;
  _handlers.Where(x => x.TryFoo(input, out output)).First();
  return output;
}

This assumes that _handlers is sorted by priority. It returns the result of the first handler to report true on invocation. Building up _handlers happens in the static constructor:

static Bar() {
  _handlers = typeof(IFooHandler)
    .DiscoverImplementors()
    .Instantiate()
    .Cast<IFooHandler>()
    .OrderBy(x => x.Priority)
    .ToArray();
}

where DiscoverImplementors and Instantiate are extension methods I'll leave as an exercise to the reader.
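
(If you'd rather not do the exercise, a minimal sketch of what those two might look like, scanning the currently loaded assemblies via reflection:)

public static class TypeEx {

  // find all concrete types in loaded assemblies that implement the given interface
  public static IEnumerable<Type> DiscoverImplementors(this Type interfaceType) {
    return AppDomain.CurrentDomain.GetAssemblies()
      .SelectMany(assembly => assembly.GetTypes())
      .Where(t => !t.IsAbstract && !t.IsInterface && interfaceType.IsAssignableFrom(t));
  }

  // create an instance of each type via its default constructor
  public static IEnumerable<object> Instantiate(this IEnumerable<Type> types) {
    return types.Select(t => Activator.CreateInstance(t));
  }
}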

Now the server assembly simply creates its implementation of IFooHandler, gives it a higher priority, and on invocation checks its static accessor to see whether it's running in the server context; if not, it lets the chain fall through to the next (likely the default) implementation.

This works just fine. I don't really like the static discovery of handlers, and if it weren't for backwards compatibility, I'd move the injection of handlers into a factory class and leave all that wire-up to IoC. Since that's not an option, I think this is the most elegant and performant solution.

It still feels clunky, tho. Anyone have a better solution for changing the behavior of a method in an existing class that doesn't require the change to be in a subclass (since the instance will be of the base type)? Is there a pattern I'm overlooking?

libBeanstalk.NET, a Beanstalkd client for .NET/mono

A couple of years back I wrote a store-and-forward message queue called simpleMQ for vmix. A year later, Vmix was kind enough to let me open source it and I put it up on sourceforge (cause that was the place back in the day). But it never got any documentation or promotion love from me, since I was much too busy building notify.me and using simpleMQ as its messaging backbone. Over the last couple of years, simpleMQ has served us incredibly well at notify.me, passing what must be billions of messages by now. But it does have warts, and I have a backlog of fixes/features I've been meaning to implement.

Beanstalkd: simple & fast workqueue

Rather than invest more time in simpleMQ, I've been watching other message/work queues to see whether there is another product I could use instead. I've yet to find one that I truly like, but Beanstalkd is my favorite of the bunch. Very simple, fast, and with optional persistence, it addresses most of my needs.

Beanstalkd's protocol is inspired by memcached. It uses a persistent TCP connection, but relies on connection state only to determine which "tube" (read: workqueue) you are using and to act as a work timeout safeguard. The protocol is ASCII verbs with binary payloads and uses yaml for structured responses.

Tubes are created on demand and destroyed once empty. By default beanstalkd is in-memory only, but it can use a binary log to store items and recover the in-memory state by log playback. I had briefly looked at zeromq, but after finding out that its speed relies on having no persistence, I decided to give it a skip. zeromq might be web scale, but I prefer a queue that doesn't degrade to behaving like /dev/null :) Maybe my transactional RDBMS roots are showing, but I have a soft spot for at least journaling data to disk.

One beanstalkd concept I'm still conflicted about is that work is given a processing timeout (time-to-run) by the producer of the work, rather than having the consumer of the work declare its intended responsiveness. Since the producer doesn't know how soon the work gets picked up, i.e. time-to-run is not a measure of work item staleness, I don't see a great benefit to having the producer dictate the work terms.

The other aspect of work distribution beanstalkd lacks for my taste is the ability to produce work in one instance and have it forwarded to another instance when that instance is available, i.e. store-and-forward queues. I like to keep my queues on the current host so I can produce work without having to rely on the uptime of the consumer or some central facility. However, store-and-forward is an implementation detail I can easily fake with a daemon on each machine that acts as a local consumer and distributor of work items, so it's not something I hold against beanstalkd.

libBeanstalk.NET

Notify.me being a mix of perl and C#, I needed a .NET client. Since no protocol-complete one existed, and given the simplicity of the Beanstalkd protocol, I opted to write my own, which I have released under Apache 2.0 on github.

I've not put DLLs up for download, since the API is still somewhat in flux as I continue to add features, but the current release supports the entire 1.3 protocol. By default it treats all payloads as binary streams, but I've included extension methods to handle simple string payloads:

  // connect to beanstalkd
  using(var client = new BeanstalkClient(host, port)) {

    // put some data
    var put = client.Put("foo");

    // reserve data from queue
    var reserve = client.Reserve();
    Console.WriteLine("data: {0}", reserve.Data);

    // delete reserved data from queue
    client.Delete(reserve.JobId);
  }

The binary surface is just as simple:

  // connect to beanstalkd
  using(var client = new BeanstalkClient(host, port)) {

    // put some data
    var put = client.Put(100, 0, 120, data, data.Length);

    // reserve data from queue
    var reserve = client.Reserve();

    // delete reserved data from queue
    client.Delete(reserve.JobId);
  }

I've tried to keep the IBeanstalkClient interface as close as possible to the protocol verb signatures and to rely on extension methods to create simpler versions on top of that interface. To facilitate extensions that provide smart defaults, the client also exposes a Defaults member that can be used to initialize those values.
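
As an illustration of that layering, here's roughly how the string Put extension from the earlier example might sit on top of the binary interface -- the Defaults property names here are my guess, not necessarily the library's actual API:

using System.IO;
using System.Text;

public static class BeanstalkClientStringEx {

    // encode the string and delegate to the binary Put(priority, delay, ttr, stream, length)
    public static PutResponse Put(this IBeanstalkClient client, string payload) {
        var bytes = Encoding.UTF8.GetBytes(payload);
        using(var stream = new MemoryStream(bytes)) {
            return client.Put(
                client.Defaults.Priority,
                client.Defaults.Delay,
                client.Defaults.TimeToRun,
                stream,
                bytes.Length
            );
        }
    }
}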

The main deviation from the protocol is how I handle producer and consumer tubes. Rather than have a separate getter and setter for the tube that put will enter work into, I simply have a settable property, CurrentTube. And rather than surfacing watch, ignore and listing of consumer tubes, the client includes a special collection, WatchedTubes, with the following interface:

interface IWatchedTubeCollection : IEnumerable<string> {
    int Count { get; }
    void Add(string tube);
    bool Remove(string tube);
    bool Contains(string tube);
    void CopyTo(string[] array, int arrayIndex);
    void Refresh();
}

I was originally going to use ICollection<string>, but Clear() did not make sense, and I wanted a manual method to reload the list from the server, which is exposed via Refresh(). Under the hood, watched tubes are kept in a hashset, so adding the same tube multiple times has no effect, nor is the order of tubes in the collection guaranteed.
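
In use it looks like this (the tube names are made up):

// watch an additional tube; the duplicate add is silently ignored
client.WatchedTubes.Add("notifications");
client.WatchedTubes.Add("notifications");

// stop watching the default tube and re-sync the list from the server
client.WatchedTubes.Remove("default");
client.WatchedTubes.Refresh();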

Future work

The client is functional and can do everything that Beanstalkd offers, but it's really just a wire protocol, akin to dealing with files as streams. To make this a useful API, it really needs to take the 90% use cases and remove any friction and repetition they would encounter.

Connection pooling

BeanstalkClient isn't, nor is it meant to be, thread-safe. It assumes you create a client when you need it and govern access to it, rather than sharing a single instance. This was motivated by Beanstalkd's behavior of storing tube state as part of the connection. Given that I encourage clients to be created on the fly to enqueue work, it makes sense that under the hood clients should use a connection pool, both to re-use existing connections rather than constantly opening and closing sockets, and to limit the maximum number of sockets a single process tries to open to Beanstalkd. Pooling wouldn't mean sharing a connection between clients, but handing off connections to new clients and, once a client is disposed, putting them back in a pool to be closed on idle timeout.

Most of this work is complete and on a branch, but I want to put it through some more testing before merging it back to master, especially since it will introduce client API changes.

Distributed servers

The Beanstalkd FAQ has this to say about distribution:

Does beanstalk inherently support distributed servers?

Yes, although this is handled by the clients, just as with memcached. The beanstalkd server doesn't know anything about other beanstalkd instances that are running.

I need to take a look at the clients that do implement this and determine what it means for me, i.e. do they use some kind of consistent hashing to determine which node to use for a particular tube, etc. But I do want to have parity with other clients on this.

POCO Producers and Consumers

For me, the 90% use case for a work queue is producing work on some threads/processes/machines and consuming that work on a number of workers. Generally an item will have some structured fields describing the work to be done, and producers and consumers will use designated tubes for specific types of work. These use cases imply that producers and consumers are separate user stories, that they are tied to specific tubes, and that they deal with structured data. My current plan is to address these user stories with two new interfaces that will look similar to these:

public interface IBeanstalkProducer<T> {
  BeanstalkProducerDefaults Defaults { get; }
  PutResponse Put(T item);
}

public interface IBeanstalkConsumer<T> {
  BeanstalkConsumerDefaults Defaults { get; }
  Job<T> Reserve();
  bool Delete(Job<T> job);
  void Release(Job<T> job);
}

The idea with each is that it's tied to a tube (or tubes, for the consumer) at construction time and that the implementation will have a simple way of associating a serializer with the entity T (protobuf and MetSys.Little support will be provided out of the box).

Rx support via IObservable<Job<T>>

Once there is the concept of a Job<T>, it makes sense that reservation of jobs should be exposed as a stream of work that can be processed via linq. But since items should only be reserved when the subscriber accepts the work, reservation should probably be encapsulated in something like this:

public interface Event<T> {
  Job<T> Take();
  Job<T> Job { get; }
}

This way, multiple subscribers can try to reserve an item, and items not reserved by anyone are released automatically.
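
A hypothetical subscriber under that contract might then look like this -- jobStream and Process are stand-ins, and I'm assuming Take() returns null when another subscriber already won the job, which is the intent but by no means settled API:

// only process jobs this subscriber actually wins the reservation for
jobStream.Subscribe(evt => {
    var job = evt.Take();
    if(job != null) {
        Process(job);
    }
});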

As I work on the future work items, I will also use the library in production, so I can get better educated about the real-world behavior of Beanstalkd and what uncovered scenarios the client runs into. There is ok test coverage of the provided behavior, but I certainly want to increase that significantly as I keep working on it.

For the time being, I hope the library proves useful to other .NET developers and would love to get feedback, contributions and issues you may encounter.

Loading per solution settings with Visual Studio 2008

If you ever work on a number of projects (company, oss, contracting), you're likely familiar with coding style issues. At MindTouch we use a number of naming and formatting styles that are different from the Visual Studio defaults, so when I work on github and other OSS projects, my settings usually cause formatting issues. One way to address this is to use ReSharper and its ability to store per-solution settings. But I still run into formatting issues with Visual Studio settings that are not overridden by ReSharper, especially when using Ctrl-K, Ctrl-D, which has become a muscle memory keystroke for me.

While Visual Studio has only per-install global settings, it does at least let you import and export them. You'd figure that could be a per-solution setting, but after looking around a bit and getting the usual re-affirmation that Microsoft Connect exists purely to poke developers in the eye, I only found manual or macro solutions. So I decided to write a Visual Studio 2008 Add-in to automate this behavior.

Introducing: SolutionVSSettings Add-in

The goal of the Add-in is to be able to ship your formatting rules along with your solution (which is why I also highly recommend using ReSharper, since it lets you set up naming conventions and much more). I wanted to avoid any dialogs or required user interaction in the process, but leave open options for overriding settings, in case you use the Add-in but have one or two projects you don't want to accept settings for, or have a default setup you want to fall back on without a per-solution setting. The configuration options are listed in order of precedence:

Use the per-solution solutionsettings.config xml file

The only option in the config file right now is an absolute or relative path (relative to the solution) to the .vssettings file to load as the solution is loaded. You can check the config file in with your solution, or keep it as a local file ignored by source control and point it to a settings file in the solution or a common one somewhere else on your system. Currently the entirety of the configuration looks like this:

<config>
  <settingsfile>{absolute or relative path to settingsfile}</settingsfile>
</config>

The purpose of this method, even though it has the highest precedence, is to easily set up an override for a project that already has a settings file that the Add-in would otherwise load.

Use a solution item named 'solution__.vssettings'

If no solutionsettings.config is found, the Add-in will look for a solution item named 'solution__.vssettings' and load it as the solution settings. Since this file is part of the solution and will be checked in with the code, this is the recommended default for sharing settings.

Use environment variable 'solutionsettings.config'

Finally, if no settings are found by the other methods, the Add-in will look for an environment variable 'solutionsettings.config' containing an absolute or relative path (relative to the solution) to a config file (same format as above) from which to get the settings file path. This is particularly useful if you have a local standard that you don't include in your own solutions, but need to make sure it is always loaded, even if another solution previously loaded its own settings.

How does it work?

The workhorse is simply a call to:

DTE2.ExecuteCommand("Tools.ImportandExportSettings", "/import:{settingfile}");

The rest is basic plumbing to set up the Add-in and find the applicable settings file.

The Add-In subclasses IDTExtensibility2 and during initialization subscribes itself to receive all solution loaded events:

public void OnConnection(object application, ext_ConnectMode connectMode, object addInInst, ref Array custom) {
    _applicationObject = (DTE2)application;
    _addInInstance = (AddIn)addInInst;
    _debug = _applicationObject.ToolWindows.OutputWindow.OutputWindowPanes.Add("Solution Settings Loader");
    Output("loaded...");
    _applicationObject.Events.SolutionEvents.Opened += SolutionEvents_Opened;
    Output("listening for solution load...");
}

I also set up an OutputWindow to report what the Add-in is doing. OutputWindows are a nice, unobtrusive way in Visual Studio to report status without being seen unless the user cares.

The event handler for solutions being opened does the actual work of looking for possible settings and, if any are found, loading them:

void SolutionEvents_Opened() {
    var solution = _applicationObject.Solution;
    Output("loaded solution '{0}'", solution.FileName);

    // check for solution directory override
    var configFile = Path.Combine(Path.GetDirectoryName(solution.FileName), "solutionsettings.config");
    string settingsFile = null;
    if(File.Exists(configFile)) {
        Output("trying to load config from '{0}'", configFile);
        settingsFile = GetSettingsFile(configFile, settingsFile);
        if(string.IsNullOrEmpty(settingsFile)) {
            Output("unable to find override '{0}'", configFile);
        } else {
            Output("using solutionsettings.config override");
        }
    }

    // check for settings in solution
    if(string.IsNullOrEmpty(settingsFile)) {
        var item = _applicationObject.Solution.FindProjectItem(SETTINGS_KEY);
        if(item != null) {
            settingsFile = item.get_FileNames(1);
            Output("using solution file '{0}'", settingsFile);
        }
    }

    // check for environment override
    if(string.IsNullOrEmpty(settingsFile)) {
        configFile = Environment.GetEnvironmentVariable("solutionsettings.config");
        if(!string.IsNullOrEmpty(configFile)) {
            settingsFile = GetSettingsFile(configFile, settingsFile);
            if(string.IsNullOrEmpty(settingsFile)) {
                Output("unable to find environment override '{0}'", settingsFile);
            } else {
                Output("using environment config override");
            }
        }
    }
    if(string.IsNullOrEmpty(settingsFile)) {
        Output("no custom settings for solution.");
        return;
    }
    var importCommand = string.Format("/import:\"{0}\"", settingsFile);
    try {
        _applicationObject.ExecuteCommand("Tools.ImportandExportSettings", importCommand);
        Output("loaded custom settings\\r\\n");
    } catch(Exception e) {
        Output("unable to load '{0}': {1}", settingsFile, e.Message);
    }
}

And that's all there is to it.

More work went into figuring out how to build an installer than into building the Add-in.... I hate MSIs. At least I was able to write the installer logic in C# rather than in one of the more tedious extensibility methods found in InstallShield (the least value for your money I've yet to find in any product) or Wix (a vast improvement over other installers, but still a victim of MSIs).

Installation, source and disclaimer

The source can be found under the Apache license at GitHub, which also hosts an MSI for those just wishing to install it.

NOTE: The MSI and source code come with no express or implied guarantees. It may screw things up in your environment. Consider yourself warned!!

This Add-in does blow away your Visual Studio settings with whatever settings file is discovered via the above methods. So before installing it, you should definitely back up your settings. I've only tested it on my own personal setup, so it may well misbehave on someone else's. It's certainly possible it screws up your settings or even your Visual Studio install. I don't think it will, but I can't call this well tested across environments, so back up and use at your own risk.

IronRuby isn't dead, it's been set free on a farm upstate

Yesterday Jimmy Schementi posted his farewell to Microsoft, and with it his thoughts on the future of IronRuby. This made it onto twitter almost immediately as "IronRuby is dead", shortly followed by a number of voices countering with, "No it's not, it's OSS, step up community!". I think that's rather missing the point.

Microsoft needs Ruby more than Ruby needs Microsoft

I'll even go as far as saying Ruby has no need for Microsoft. But Microsoft has a clear problem attracting a community of passionate developers building the next generation of apps. There is a large, deeply entrenched community building enterprise apps that is going to stay loyal for a long time, but just as Office will not stay dominant as a monolithic desktop app, the future of development for Microsoft needs to be web based, and there isn't a whole lot of fresh blood coming in that way.

Fresh blood that is passionate and vocal is flocking to Ruby, Scala, Clojure, node.js, etc. Even the vocal developers on the MS stack are pretty much all playing with Ruby or another dynamic language in some capacity. Maybe you think those people are just a squeaky-wheel minority, and maybe you are right. But minority or not, they are the people who shape the impressions the next wave of newcomers sees first. It's the bleeding-edge guys that pave the road others will follow.

Startup technology decisions are done via peer networks, not by evaluating vendor marketing messages. Instead of trying to attract and keep alpha geeks, Microsoft is pushing technologies like WebMatrix and Lightswitch, as if drag-n-drop/no-code development wasn't an already reviled stereotype of the MS ecosystem.

Some have said that IronRuby is not in Microsoft's interest, that it would just be a gateway drug making it even easier to jump the MS ship. Sorry, but that cat's long out of the bag. Ruby's simplicity and ecosystem already make jumping ship as easy as could be. And right now, once they jump ship, with no integration story, they will quickly lose any desire to hold on to their legacy stack.

IronRuby, while no panacea for these problems, at least offered a way for people considering .NET to not have to choose between Ruby and .NET. And it had the potential to expose already devoted Ruby fans to what .NET could offer on top (I know that one is a much harder sell). If you look at the Ruby space, Ruby often benefits from having another language backing it up, which is what makes a Ruby front-end with a Scala back-end popular. And if IronRuby were competitive, being able to bridge a Ruby front-end with a C# or F# back-end, with the option to stay in-process, is a story worth trying to sell.

Let the community foster IronRuby, that's what OSS is all about!

Ok, that sounds idealistic and lovely, but, um, what community are you talking about? The Ruby community? They're doing fine without IronRuby. The .NET community? Well, OSS on .NET is tiny compared to all other stacks.

Most OSS projects are 99% consumers with 1% actual committers. And that's fine: those 99% consumers grow the community, and with it the pool of one-percenters increases as well. But that only works if the project has the appeal to grow the 99% pool. It is virtually impossible to reach the bulk of the .NET pool without a strong push from Microsoft. I'm always amazed how many .NET developers I meet are completely oblivious to technology not emanating from Redmond. And these are not just people who stumbled into VB because their Excel macros didn't cut it anymore. There are good, smart developers out there who have been well served by Microsoft and have not felt the need to look outside. For a vendor that's a great captive audience, worth a lot of money, so it's been in Microsoft's interest to keep it that way. But it also means that for a project to get momentum beyond the alpha geeks on the MS stack, it's gotta be pushed by Microsoft. It needs to be a first-class citizen on the platform, built into Visual Studio, etc.

The IronRuby catch-22

IronRuby has the potential to draw fresh blood into the existing ecosystem, but it won't unless it already has large momentum inside the .NET ecosystem. Nobody is going to use IronRuby instead of Ruby because it's almost as good. You can't get that momentum without leveraging the captive audience, and you can't leverage that captive audience without real support from within Microsoft. The ball's been in Microsoft's court, but apparently it rolled under a table and has been forgotten.

When cloning isn't faster than copy via serializer

Yesterday I removed Subzero's dependency on Metsys.Little. This wasn't because I have any problem with it. On the contrary, it's a great library and makes serialization and deserialization really easy. But since the whole point of Subzero is to provide very lightweight data passing, I figured cloning would be preferable.

My first pass was purely a "get it working" attempt, and I knew that I left a lot of performance on the table by reflecting over the type each time I cloned. I didn't think it would be this bad, tho. I used two classes for my test, a Simple and a Complex object:

public class Simple {
    public int Id { get; set; }
    public string Name { get; set; }
}

public class Complex {
    public Simple Owner { get; set; }
    public IList<Simple> Friends { get; set; }
}

The results were sobering:

Simple Object:
  Incubator.Clone: 142k/sec
  BinaryFormatter:  50k/sec
  Metsys.Little:   306k/sec
  ProtoBuf:        236k/sec

Complex Object:
  Incubator.Clone: 37k/sec
  BinaryFormatter: 15k/sec
  Metsys.Little:   44k/sec
  ProtoBuf:        80k/sec

As expected, BinaryFormatter is the worst, and my Incubator.Clone beats it. But clearly I either need to do a lot of optimizing, or not bother with cloning, because Metsys.Little and ProtoBuf are far superior. My problem was reflection, so I took a look at MetSys.Little's code, since it has to do the same things as clone, plus reading/writing binary. This led me to "Reflection : Fast Object Creation" by Ziad Elmalki and "Dodge Common Performance Pitfalls to Craft Speedy Applications" by Joel Pobar, both of which provide great insight on how to avoid performance problems with Reflection.
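
The gist of the fix, sketched below with my own naming rather than the actual Incubator internals: reflect over each type only once and cache compiled delegates, so each clone pays for a dictionary lookup and delegate invocations instead of repeated PropertyInfo access:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class GetterCache {

    private static readonly Dictionary<Type, Func<object, object>[]> _cache
        = new Dictionary<Type, Func<object, object>[]>();

    public static Func<object, object>[] GetGetters(Type type) {
        Func<object, object>[] getters;
        if(!_cache.TryGetValue(type, out getters)) {

            // build one compiled delegate per property: obj => (object)((T)obj).Property
            getters = type.GetProperties()
                .Select(property => {
                    var instance = Expression.Parameter(typeof(object), "instance");
                    var body = Expression.Convert(
                        Expression.Property(Expression.Convert(instance, type), property),
                        typeof(object));
                    return Expression.Lambda<Func<object, object>>(body, instance).Compile();
                })
                .ToArray();
            _cache[type] = getters;
        }
        return getters;
    }
}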

The resulting values are:

Simple Object:
  Incubator.Clone: 458k/sec

Complex Object:
  Incubator.Clone: 123k/sec

That's more like it. 1.5x over MetSys.Little on simple and 1.5x over Protobuf on complex. I still have to optimize the collection cloning logic, which should improve complex objects, since a complex, collection-less graph resulted in this:

Complex Object:
  Incubator.Clone: 226k/sec
  BinaryFormatter:  20k/sec
  Metsys.Little:   113k/sec
  ProtoBuf:         82k/sec

Metsys.Little pulled back ahead of ProtoBuf, but Incubator pulled ahead enough that its overall ratio is now 2x. So I should have some more performance to squeeze out of Incubator.

The common lesson from all this is that you really need to measure. Things that "ought to be fast" may turn out to be disappointing. But equally important, imho, is to get things working simply first, then measure to see if optimizing is needed. Just as bad as assuming something is going to be fast is assuming it's going to be slow and prematurely optimizing.