libBeanstalk.NET, a Beanstalkd client for .NET/mono

Image curtesy of jepeters74A couple of years back I wrote a store-and-forward message queue called simpleMQ for vmix. A year later, Vmix was kind enough to let me open source it and I put it up on sourceforge (cause that was the place back in the day). But it never got any documentation or promotion love from me, being much too busy building notify.me and using simpleMQ as its messaging backbone. Over those last couple of years, simpleMQ was served us incredibly well at notify.me, passing what must be billions of messages by now. But it does have warts and I have a backlog of fixes/features i've been meaning to implement.

Beanstalkd: simple & fast workqueue

Rather than invest more time on simpleMQ, i've been watching other message/work queues to see whether there is another product i could use instead. I've yet to find a product that i truly like, but Beanstalkd is my favorite of the bunch. Very simple, fast and with optional persistence, it addresses most of my needs.

Beanstalkd's protocol is inspired by memcached. It uses a persistent TCP connection, but relies on connection state only to determine which "tube" (read: workqueue) you are using and to act as a work timeout safeguard. The protocol is ASCII verbs with binary payloads and uses yaml for structured responses.

Tubes are created on demand and destroyed once empty. By default beanstalkd is in-memory only, but can use a binary log to store items and recover the in-memory state by log playback. I had briefly looked at zeromq, but after finding out that its speed relies on no persistence, I decided to give it a skip. zeromq might be web scale, but i prefer a queue that doesn't degrade to behaving like /dev/null :) Maybe my transactional RDBMS roots are showing, but I have a soft spot for at least journaling data to disk.

One concept of beanstalkd that i'm still conflicted about is that work is given a processing time-out (time-to-run) by the producer of the work, rather than having the consumer of the work declare its intended responsiveness. Since the producer doesn't know how soon the work gets picked up, i.e. time-to-run is not a measure of work item staleness, I don't see a great benefit for having the producer dictate the work terms.

The other aspect of work distribution beanstalkd lacks for my taste is the idea of being able to produce work in one instance and have it forwarded to another instance when that instance is available, i.e. store-and-forward queues. I like to keep my queues on the current host so i can produce work without having to rely on the uptime of the consumer or some central facility. However, store-and-forward is an implementation detail I can easily fake with a daemon on each machine that acts as a local consumer and distributor of work items, so it's not something i hold against beanstalkd.

libBeanstalk.NET

Notify.me being a mix of perl and C#, i needed a .NET client. A protocol complete one not existing and given the simplicity of the Beanstalkd protocol, I opted to write my own and have released it under Apache 2.0 on github.

I've not put DLLs up for downlad, since the API is still somewhat in flux as I continue to add features, but the current release supports the entire 1.3 protocol. By default it considers all payloads as binary streams, but I've included extension methods to handle simple string payloads:

  // connect to beanstalkd
  using(var client = new BeanstalkClient(host, port)) {

    // put some data
    var put = client.Put("foo");
 
    // reserve data from queue
    var reserve = client.Reserve();
    Console.Writeline("data: {0}", reserve.Data);
    
    // delete reserved data from queue
    client.Delete(reserve.JobId);
  }

The binary surface is just as simple:

  // connect to beanstalkd
  using(var client = new BeanstalkClient(host, port)) {

    // put some data
    var put = client.Put(100, 0, 120, data, data.Length);
 
    // reserve data from queue
    var reserve = client.Reserve();
    
    // delete reserved data from queue
    client.Delete(reserve.JobId);
  }

I've tried to keep the interface IBeanstalkClient to be a close as possible to the protocol verb signatures and rely on extension methods to create simpler versions on top of that interface. To facilitate extensions that provide smart defaults, the client also has an instance of a Defaults member that can be used to initialize those values.

The main deviation from the protocol is how I handle producer and consumer tubes. Rather than have a separate getter and setter for the tube that put will enter work into, I simply have a settable property CurrentTube. And rather than surfacing watch, ignore and listing of consumer tubes, the client includes a special collection, WatchedTubes, with the following interface:

interface IWatchedTubeCollection : IEnumerable<string> {
    int Count { get; }
    void Add(string tube);
    bool Remove(string tube);
    bool Contains(string tube);
    void CopyTo(string[] array, int arrayIndex);
    void Refresh();
}

I was originally going to use ICollection<string>, but Clear() did not make sense and I wanted to have a manual method to reload the list from the server, which is exposed via Refresh(). Under the hood, watched tubes is a hashset, so adding the same tube multiple times has no effect, neither is order of tubes in the collection guaranteed.

Future work

The client is functional and can do everything that Beanstalkd offers, but it's really just a wire protocol, akin to dealing with files as stream. To make this a useful API, it really needs to take the 90% use cases and remove any friction and repetition they would encounter.

Connection pooling

BeanstalkClient isn't, nor is it meant to be, thread-safe. It assumes you create a client when you need it and govern access to it, rather than sharing a single instance. This was motivated by Beanstalkd's behavior of storing tube state as part of the connection. Given that I encourage clients to be created on the fly to enqueue work, it makes sense that under the hood clients should use a connection pool both to re-use existing connections rather than constantly open and close sockets and to limit the maximum sockets a single process tries to open to Beanstalkd. Pooling wouldn't mean sharing of a connection by clients, but handing off connections to new clients and putting them in a pool to be closed on idle timeout once the client is disposed.

Most of this work is complete and on a branch, but i want to put it through some more testing before merging it back to master, especially since it will introduce client API changes.

Distributed servers

The Beanstalkd FAQ has this to say about distribution:

Does beanstalk inherently support distributed servers?

Yes, although this is handled by the clients, just as with memcached. The beanstalkd server doesn’t know anything about other beanstalkd instances that are running.

I need to take a look at the clients that do implement this and determine what that means for me. I.e. do they use some kind of consistent hashing to determine which node to use for a particular tube, etc. But I do want to have parity with other clients on this.

POCO Producers and Consumers

For me, the 90% use case for a work queue is produce work on some threads/processes/machines and consume that work on a number of workers. Generally that item will have some structured fields describing the work to be done and producers and consumers will use designated tubes for specific types of work. These use cases imply that producers and consumers are separate user stories, that they are tied to specific tubes and deal with structured data. My current plan is to address these user stories with two new interfaces that will look similar to these:

public interface IBeanstalkProducer<T> {
  BeanstalkProducerDefaults Defaults { get; }
  PutResponse Put(T);
}

public interface IBeanstalkConsumer<T> {
  BeanstalkConsumerDefaults Defaults { get; }
  Job<T> Reserve();
  bool Delete(Job<T> job);
  Release(Job<T> job);
}

The idea with each is that it's tied to a tube (or tubes for the consumer) at construction time and that the implementation will have a simple way of associating a serializer for the entity T (will provide protobuf and MetSys.Little support out of the box).

Rx support via IObservable<Job<T>>

Once there is the concept of a Job<T>, it makes sense that reservation of jobs should be exposed as a stream of work that can be processed via link. Although since items should only be reserved when the subscriber accepts the work, it should probably be encapsulated in something like this:

public interface Event<T> {
  Job<T> Take();
  Job<T> Job { get; }
}

This way, multiple subscribers can try to reserve an item and items not reserved by anyone are released automatically.

As I work on the future work items, I will also use the library in production so i can get better educated about the real world behavior of Beanstalkd and what uncovered scenarios the client runs into. There is ok test coverage over the provided behavior but I certainly want to increase that signficantly as i keep working on it.

For the time being, I hope the library proves useful to other .NET developers and would love to get feedback, contributions and issues you may encounter.

By arne on | .net, geek, mono | A comment?
Tags: , , ,

Maybe it’s time to stop pretending we buy software?

There's been a lot of noise about comments made by THQ's Cory Ledesma about used games. Namely,

"I don't think we really care whether used game buyers are upset because new game buyers get everything. So if used game buyers are upset they don't get the online feature set I don't really have much sympathy for them." -cvg

Well, this has gotten a lot of gamers upset, and my immediate reaction was something like "dude, you are just pissing off your customers." And while Cory may have been the one to say it out loud, actions by EA and others in providing free DLC only to the original buyer and similar original buyer incentives show that the industry in general agrees with his sentiments.

Holding steadfast to my first-sale doctrine rights, I, like most gamers, software and media purchasers, strongly believe that we can sell those bits we bought. Of course, EULAs have said nu-uh to that belief for just as long. We purchasers of bits only own a license to those bits, we don't own a product. But just as nobody reads an EULA, everybody believes those EULAs to be unenforcable. I own those bits, man!

So I continue to believe that when I purchase a product, let's say some bits on a DVD, i can sell it again or buy such a product from someone else. It wasn't until I read Penny Arcade earlier this week, that I had to admit that, first-sale doctrine notwithstanding, I am not their customer.

Penny Arcade - Words And Their Meanings
Penny Arcade – Words And Their Meanings

But, I thought, just like buying CDs used, I am actually contributing to a secondary market that promotes the brand of the artist. Buying that old CD used makes it more likely that I will buy the next one new, or that I will go to their show when they come to town, etc. Put aside whether this secondary market really has the magical future revenue effects i ascribe to it, for games there is no such secondary market. As Tycho said in his post accompanying the strip:

"If I am purchasing games in order to reward their creators, and to ensure that more of these ingenious contraptions are produced, I honestly can't figure out how buying a used game was any better than piracy. From the the perspective of a developer, they are almost certainly synonymous." – tycho, penny arcade

Ok, maybe you think the secondary market is sequels that you will buy new because you bought the original used. Never mind that most sequels are farmed out to another development house by the publisher, buying used games, at best, actively encourages the endless milking of sequels rather than new IP. But it's even worse for games, because virtually all games now include some multi-player component and keeping that running costs real money. You paying for Xbox Live doesn't mean the publisher isn't still paying more cash to Microsoft to run those servers. So every used Modern Warfare player costs the publisher money while only Gamestop made any cash on the sale. So, sure, you own that disk, but you're insane if you think that the developer/publisher owes you anything.

Now, let's extend this to the rest of the software market. Here you can argue a bit more for a secondary market, since software regularily comes out with new versions, encouraging you to pugrade. If you look at that boxed software revenue cycle it becomes clear that the added features and version revving just exist to extend a product into a recurring revenue stream. And if that's the motivation, it also means we're encouraging developers to spend less on quality and bug fixes (because nobody wants to pay for those), and more on bells and whistles, cause those justify the version rev and with it the upgrade price. In reality, if you use Photoshop professionally you've long ago stopped being a purchaser of boxed software and are instead a subscriber to the upgrade path.

This fickle revenue stream also has an effect on pricing. You may only use Powerpoint once in a while, but you paid to use it 24/7. Or maybe because you don't use it enough you've rationalized pirating it, which only serves to justify a high price tag, since the paying customers are subsidizing the pirates. Either way, the developer inflates the price to smooth out the revenue stream.

The sooner we stop pretending that we buy software and just admit that really we just want to rent it, the better. Being addicted to high retail prices, some publishers certainly will try to keep the same pricing as they move to the cloud, but the smart ones will adjust their pricing to attract those buyers who would never have bought the boxed version. Buying metered or by subscription has the potential for concentrating on excellence rather than bloat and the responsiveness and frequent updates of existing services seem to bear that promise out already. It's really in our favor to let go of idea of wanting a boxed product with a resale value.

By arne on | geek, rant | 2 comments
Tags: , , ,

Loading per solution settings with Visual Studio 2008

If you ever work on a number of projects (company, oss, contracting), you're likely familiar with coding style issues. At MindTouch we use a number of naming and formatting styles that are different from the Visual Studio defaults, so when I work on github and other OSS projects my settings usually cause formatting issues. One way to address this is to use ReSharper and its ability to store per solution settings. But I still run into some formatting issues with Visual Studio settings that are not being overriden by ReSharper, especially when using Ctrl-K, Ctrl-D, which has become a muscle memory keystroke for me.

While Visual Studio has only per install global settings, it does at least let you import and export them. You'd figure that that could be a per solution setting, but after looking around a bit and getting the usual re-affirmation that Microsoft Connect exists purely to poke developers in the eye, i only found manual or macro solutions. So i decided to write a Visual Studio 2008 Add-in to automate this behavior.

Introducing: SolutionVSSettings Add-in

The goal of the Add-in is to be able to ship your formatting rules along with your solution (which is why I also highly recommend using ReSharper, since you can set up naming conventions and much more). I wanted to avoid any dialogs or user interaction requirements with the process, but wanted to leave open options for overriding settings in case you do use the Add-in but have one or two projects you don't want accept settings for or want to have a default setup you want to use without a per solution setting. The configuration options are listed in order of precedence:

Use the per solution solutionsettings.config config xml file

The only option for the config file right now is an absolute or relative path (relative to the solution) to the .vssettings file to load as the solution is loaded. You can check the config file in with your solution, or keep it as a local file ignored by source control and point to a settings file in the solution or a common one somewhere else on your system. Currently the entirely of configuration looks like this:

<config>
  <settingsfile>{absolute or relative path to settingsfile}</settingfile>
</config>

The purpose of this method, even though it is the highest precendence, is to easily set up an override for a project that already has a settings file that the Add-In would otherwise load.

Recommended default: Include 'solution.vssettings' in Solution

If no solutionsettings.config is found, the Add-in will look for a solution item named 'solution.vssettings' and load it as the solution settings. Since this file is part of the solution and will be checked in with the code, this is the recommended default for sharing settings.

Use environment variable 'solutionsettings.config'

Finally, if no settings are found by the other methods, the Add-in will look for an environment variable 'solutionsettings.config' to find an absolute or relative path (relative to the solution) to a config file (same as above) from which to get the settings file path. This is particularily useful if you have a local standard and don't include it in your own solutions, but need to make sure that local standard is always loaded, even if another solution previously loaded its own settings.

How does it work?

The workhorse is simply a call to:

DTE2.ExecuteCommand("Tools.ImportandExportSettings", "/import:{settingfile}");

The rest is basic plumbing to set up the Add-in and find the applicable settings file.

The Add-In subclasses IDTExtensibility2 and during initialization subscribes itself to receive all solution loaded events:

public void OnConnection(object application, ext_ConnectMode connectMode, object addInInst, ref Array custom) {
    _applicationObject = (DTE2)application;
    _addInInstance = (AddIn)addInInst;
    _debug = _applicationObject.ToolWindows.OutputWindow.OutputWindowPanes.Add("Solution Settings Loader");
    Output("loaded...");
    _applicationObject.Events.SolutionEvents.Opened += SolutionEvents_Opened;
    Output("listening for solution load...");
}

I also setup an OutputWindow to report what the Add-In is doing. OutputWindows are a nice, unobtrusive ways in Visual studio to report status and not be seen unless the user cares.

The event handler for solutions being opened does the actual work of looking for possible settings and if one is found to load it:

void SolutionEvents_Opened() {
    var solution = _applicationObject.Solution;
    Output("loaded solution '{0}'", solution.FileName);

    // check for solution directory override
    var configFile = Path.Combine(Path.GetDirectoryName(solution.FileName), "solutionsettings.config");
    string settingsFile = null;
    if(File.Exists(configFile)) {
        Output("trying to load config from '{0}'", configFile);
        settingsFile = GetSettingsFile(configFile, settingsFile);
        if(!string.IsNullOrEmpty(settingsFile)) {
            Output("unable to find override '{0}'", settingsFile);
        } else {
            Output("using solutionsettings.config override");
        }
    }

    // check for settings in solution
    if(string.IsNullOrEmpty(settingsFile)) {
        var item = _applicationObject.Solution.FindProjectItem(SETTINGS_KEY);
        if(item != null) {
            settingsFile = item.get_FileNames(1);
            Output("using solution file '{0}'", settingsFile);
        }
    }

    // check for environment override
    if(string.IsNullOrEmpty(settingsFile)) {
        configFile = Environment.GetEnvironmentVariable("solutionsettings.config");
        if(!string.IsNullOrEmpty(configFile)) {
            settingsFile = GetSettingsFile(configFile, settingsFile);
            if(string.IsNullOrEmpty(settingsFile)) {
                Output("unable to find environment override '{0}'", settingsFile);
            } else {
                Output("using environment config override");
            }
        }
    }
    if(string.IsNullOrEmpty(settingsFile)) {
        Output("no custom settings for solution.");
        return;
    }
    var importCommand = string.Format("/import:\"{0}\"", settingsFile);
    try {
        _applicationObject.ExecuteCommand("Tools.ImportandExportSettings", importCommand);
        Output("loaded custom settings\r\n");
    } catch(Exception e) {
        Output("unable to load '{0}': {1}", settingsFile, e.Message);
    }
}

And that's all there is to it.

More work went into figuring out how to build an installer than building the Add-In…. I hate MSIs. At least i was able to write the installer logic in C# rather than one of the more tedious extensibility methods found in InstallShield (the least value for your money I've yet to find in any product) or Wix (a vast improvment over other installers, but it's still the victim of MSIs.)

Installation, source and disclaimer

The source can be found under Apache license at GitHub, which also has an MSI for those just wishing to install it.

NOTE: The MSI and source code come with no express or implied guarantees. It may screw things up in your environment. Consider yourself warned!!

This Add-In does blow away your Visual Studio settings with whatever settings file is discovered via the above methods. So, before installing this you should definitely back up your settings. I've only tested it on my own personal setup, so it may certainly misbehave on someone else's setup. It's certainly possible it screws up your settings or even Visual Studio install. I don't think it will, but I certainly can't call this well tested across environment, so backup and use at your own risk.

By arne on | .net, geek | A comment?
Tags: , , , ,

Promise: Method slots and operators

Before getting into method slots, here's a quick review of the Promise lambda grammar:

lambda:     [<signature>] <expression>;

signature:  (<arg1>, ... <argN>[|<return-type>])

arg:        [<type>] <argName>[=<init-expression>]

expression: <statement> | { <statement1>; ... <statementN>; }

A lambda can be called with positional arguments either with the parentheses-comma convention ( foo(x,y) ) or the space-separated convention ( foo x y ), or with a JSON object as argument ( foo{ bar: x, baz: y} ).

Method Overload (revised)

When i decided to use slots that you assign lambdas as methods, I thought I'd be clever and make those slots polymorphic to get around shortcomings i perceived in the javascript model of just attaching functions to named fields. After listening to Rob Pike talk about Go at OSCON, I decided this bit of cleverness did not serve a useful purpose. In Go there are no overloads, because a different signature denotes different behavior and the method name should reflect that difference. Besides, even if you want overload type behavior in Promise, you can get it via the JSON calling convention:

class Index {
  Search:(|SearchResult) {
     foreach(var keyvaluepair in $_) {
       // handle undeclared named parameters
     }
     ...
  };
}

Basically the lambda signature is used to declare an explicit call contract, but using a JSON object argument, undeclared parameters can just as easily be passed in.

If a method is called with positional arguments instead of a JSON object, the default JSON object will contain a field called args with an array value

class Index {
  Search: {
    ...
  };
}

Index.Search('foo','documents',10);

// $_ => { args: ['foo','documents',10] }

The above signature shows a method assigned a lambda without any signature, i.e. it accepts any input and returns an untyped object. Receiving $_.args is not contingent on that signature, it will always be populated, regardless of the lambda signature.

Wildcard Method

A class can also contain a wildcard method to catch all method calls that don't have an assigned slot.

class Index

  *: {
    var (searchType) = $_._methodname./^Find_(.*)$/;
    if(searchType.IsNil) {
       throw new MethodMissingException();
    }
    ...
  };
}

The wild card method is a slot named *. Retrieving the call arguments is the same as with any other method without declared signature, i.e. $_ is used. In addition, the methodname used in the call is stuffed into $_ as the field _methodname.

The above example shows a method that accepts and call that starts with Find_ and takes the remainder of the name as the document type to find, such as Find_Images, Find_Pages, etc. This is done by using the built in regex syntax, i.e. you can use ./<regex>/ and ./<regex>/<substitution>/ on any string (or the string an object converts to), similar to perl's m// and s///. Like perl, the call returns a list of captures, so using var with a list of fields, in this case one field called searchType, receives the captures, if there is a match.

When a method is called that cannot be found on the Type, it throws a MethodMissingException. A wildcard method is simply a hook that catches that exception. By throwing it ourselves, our wildcard reverts to the default behavior for any method that doesn't match the desired pattern. This also gives parent classes or mix-ins the opportunity to fire their own wildcard methods.

Wildcard methods can only declared in classes and mix-ins, not on Types. Types are supposed to be concrete contracts. The existence of a wildcard does mean that the class can satisfy any Type contract and can be used to dynamically implement type contracts without having to declare each method (think mocks).

Operators

Operators are really just methods called with the whitespace list syntax

var x = 3;
var y = x + 5;  // => 8
var z = x.+(5); // => 8

// most operators implements polish notation as appropriate
var v = x.+(5,6); // => 14

Operators are just methods, which means you can assign them yourselves as well

class Query {
 List<Query> _compound;
 op+: {
    var q = Query();
    q._compound.AddRange(_compound);
    q._compound.AddRange($_args);
    q;
  };
}

The only difference between a normal method slot and an operator slot is that the operator slot has the op prefix for disambiguation.

And now for the hard part

That concludes the overview of the things I think make Promise unique. There's certainly tons more to define for a functioning language, but most of that is going to be very much common syntax. So now it's time to buckle down dig into antlr and the DLR to see what it will take to get some semblance of Promise functioning.

More about Promise

This is a post in an ongoing series of posts about designing a language. It may stay theoretical, it may become a prototype in implementation or it might become a full language. You can get a list of all posts about Promise, via the Promise category link at the top.

By arne on | Promise, geek | A comment?
Tags: , ,

Promise: Object notation and serialization

I thought I had only one syntax post left before diving into posts about attempting to implement the language. But starting on a post about method slots and operators, I decided that there was something else i needed to cover in more detail first: The illustrious JSON object.

I've alluded to JSON objects more than a couple of times in previous posts, generally as an argument for lambda calls. Since everything in Promise is a class, JSON objects are bit of an anomaly. Simply, they are the serialization format of Promise, i.e. any object can be reduced to a JSON graph. As such it exists outside the normal class regime. It is also closer to BSON, as it will retain type information unless serialized to text, and can be serialized on the wire either as JSON or BSON. So really it looks like javascript object notation (JSON) but it's really Promise object notation. For simplicity, i'm going to keep calling it JSON tho.

Initialization

Creating a JSON object is the same as in javascript:

var a = {};
var b = [];
var c = { foo: ["bar","baz"] };
var d = { song: Song{name: "ShopVac"} };

The notation accepts hash and array initializers and their nesting, as well as object instances as values. Fields are always strings.

Serialization

The last example shows that you can put Promise objects into a JSON graph, and the object initializer itself takes another JSON object. I explained in "Promise: IoC Type/Class mapping" that passing a JSON object to the Type allows the mapped class constructor to intercept it, but in the default case, it's simply a mapping of fields:

class Song {
  _name;
  Artist _artist;

  Artist:(){ _artist; }
  ...
}

class Artist {
  _name;
  ...
}

var song = Song{ name: "The Future Soon", artist: { name: "Johnathan Coulton" } };

// get the Artist object
var artist = song.Artist;

//serialize the object graph back to JSON
print song.Serialize();
// => { name: "The Future Soon", artist: { name: "Johnathan Coulton" } };

Lacking any intercepts and maps, the initializer will assign the value name to _name, and when it maps artist to _artist, the typed nature of _artist invokes its initializer with the JSON object from the artist field. Once .Serialize() is called, the graph is reduced to the most basic types possible, i.e. the Artist object is serialized as well. Since the serialization format is meant for passing DTOs, not Types, the type information (beyond fundamental types like String, Num, etc.) is lost at this stage. Circular references in the graph would be dropped–any object already encountered in serialization causes the field to be omitted. It is omitted rather than set to nil so that its use as an initializer does not set the slot to nil, but allows the default initializer to execute.

Above I mentioned that JSON field values are typed and showed the variable d set to have an object as the value of field song. This setting does not cause Song to be serialized. When assigning values into a JSON object, they retain their type until they are used as arguments for something that requires serialization or are manually serialized.

var d = { song: Song{name: "ShopVac"} };

// this works, since song is a Song object
d.song.Play(); 

var e = d.Serialize(); // { song: { name: "ShopVac" } }

// this will throw an exception
e.song.Play();

// this clones e
var f = e.Serialize();

Serialization can be called as many times as you want and acts as a clone operation for graphs lacking anything further to serialize. The clone is a lazy operation, making it very cheap. Basically a pointer to the original json is returned and it is only fully cloned if either the original or the clone are modified. This means, the penalty for calling .Serialize() on a fully serialized object is minimal and is an ideal way to propagate data that is considered immutable.

Access and modification

JSON objects are fully dynamic and can be access and modified at will.

var x = {foo: "bar"};

// access by dot notation
print x.foo; // => "bar"

// access by name (for programatic access or access of non-symbolic names)
print x["foo"]; // => "bar"

x.foo = ["bar","baz"]; // {foo: ["bar","baz"]}
x.bar = "baz"; // {bar: "baz", foo: ["bar", "baz"]};

// delete a field via self-reference
x.foo.Delete();
// or by name
x["foo"].Delete();

The reason JSON objects exist as entities distinct from class defined objects is to provide a clear separation between objects with behavior and data only objects. Attaching functionality to data should be an explicit conversion from a data object to a classed object, rather mixing the two, javascript style.

Of course, this dichotomy could theoretically be abused with something like this:

var x = {};
var x.foo = (x) { x*x; };
print x.foo(3); // => 9

I am considering disallowing the assignment of lambdas as field values, since they cannot be serialized, thus voiding this approach. I'll punt on the decision until implementation. If lambdas end up as first class objects, the above would have to be explictly prohibited, which may lead me to leave it in. If however, I'd have to manually support this use case, i'm going to leave it out for sure.

JSON objects exist as a convenient data format internally and for getting data in and out of Promise. The ubiquity of JSON-like syntax in most dynamic languages and it's easy mapping to object graphs makes it the ideal choice for Promise to foster simplicity and interop.

More about Promise

This is a post in an ongoing series of posts about designing a language. It may stay theoretical, it may become a prototype in implementation or it might become a full language. You can get a list of all posts about Promise, via the Promise category link at the top.

By arne on | Promise, geek | A comment?
Tags: , ,

IronRuby isn’t dead, it’s been set free on a farm upstate

Yesterday Jimmy Schementi posted his farewell to Microsoft and with it his thoughts on the future of IronRuby. This made it onto twitter almost immediately as "IronRuby is dead". This was shortly followed by a number of voices countering with, "No it's not, it's OSS, step up community!". I think that's rather missing the point.

Microsoft needs Ruby more than Ruby needs Microsoft

I'll even go as far as saying Ruby has no need for Microsoft. But Microsoft has a clear problem with attracting a community of passionate developers building the next generation of apps. There is a large, deeply entrenched community building enterprise apps that is going to stay loyal for a long time, but just as Office will not stay dominant as a monolithic desktop app, the future of development for Microsoft needs to be web based and there's isn't a whole lot of fresh blood coming in that way.

Fresh blood that is passionate and vocal is flocking to Ruby, Scala, Clojure, node.js, etc.Even the vocal developers on the MS stack are pretty much all playing with Ruby or another dynamic language in some capacity or other. Maybe you think that those people are just a squeaky wheel minority, and maybe you are right. But minority or not, they are people who shape the impressions that the next wave of newcomers sees first. It's the bleeding edge guys that pave the road others will follow.

Startup technology decisions are done via peer networks, not by evaluating vendor marketing messages. Instead of trying to attract and keep alpha geeks, Microsoft is pushing technologies like WebMatrix and Lightswitch, as if drag-n-drop/no-code development wasn't an already reviled stereotype of the MS ecosystem.

Some have said that IronRuby is not in the interest of Microsoft. It would just be a gateway drug that makes it even easier to jump the MS ship. Sorry, but that cat's long out of the bag. Ruby's simplicity and ecosystem make jumping ship already as easy as could be. And right now, once they jump ship, with no integration story, they will quickly loose any desire to hold on to their legacy stack.

IronRuby, while no panacea to these problems, at least offered a way for people considering .NET to not have to choose Ruby or .NET. And it had the potential to expose already devoted Ruby fans to what .NET could offer on top (I know that one is a much harder sell). If you look at the Ruby space, Ruby often benefits from having another language backing it up, which makes Ruby front-end with Scala back-end popular. And if IronRuby were competitive, being able to bridge a Ruby front-end with a C# or F# back-end and have the option to stay in-process is a story worth trying to sell.

Let the community foster IronRuby, that's what OSS is all about!

Ok, that sounds idealistic and lovely, but, um, what community are you talking about? The Ruby community? They're doing fine without IronRuby. The .NET community? Well, OSS on .NET is tiny compared to all other stacks.

Most OSS projects are 99% consumers with 1% actual comitters. And that's fine, those 99% consumers grow the community and with it the pool of one percenters increases as well. But that only works if the project has an appeal to grow the 99% pool. It is virtually impossible to reach the bulk of the .NET pool without a strong push from Microsoft. I'm always amazed how many .NET developers that I meet are completely oblivious to technology not eminating from Redmond. And these are not just people that stumbled into VB because their Excel macros didn't cut it anymore. There are good, smart developers out there that have been well served by Microsoft and have not felt the need to look outside. For a vendor that's a great captive audience, worth a lot of money, so it's been in the interest of Microsoft to keep it that way. But that also means that for a project to get momentum beyond alpha geeks on the MS stack, it's gotta be pushed by Microsoft. It needs to be a first-class citizen on the platform, built into Visual Studio, etc.

The IronRuby catch-22

IronRuby has the potential to draw fresh blood into the existing ecosystem, but won't unless it's already got a large momentum inside of the .NET ecosystem. Nobody is going to use IronRuby instead of Ruby because it's almost as good. You can't get that momemtum without leveraging the captive audience. And you can't leverage that captive audience without real support from within Microsoft. The ball's been in Microsoft's court, but apparently it rolled under a table and has been forgotten.

By arne on | .net, geek | 3 comments
Tags: , ,

When cloning isn’t faster than copy via serializer

Yesterday I removed Subzero's dependency on Metsys.Little. This wasn't because I have any problem with it. On the contrary, it's a great library and provides really easy serialization and deserialization. But since the whole point of Subzero is to provide very lightweight data passing, i figured cloning is preferable.

My first pass was purely a "get it working" attempt and I knew that I left a lot of performance on the table by reflecting over the type each time i cloned. Didn't think it would be this bad tho. I used two classes for my test, a Simple and a Complex object:

public class Simple {
    public int Id { get; set; }
    public string Name { get; set; }
}

public class Complex {
    public Simple Owner { get; set; }
    public IList<Simple> Friends { get; set; }
}

The results were sobering:

Simple Object:
  Incubator.Clone: 142k/sec
  BinaryFormatter:  50k/sec
  Metsys.Little:   306k/sec
  ProtoBuf:        236k/sec

Complext Object:
  Incubator.Clone: 37k/sec
  BinaryFormatter: 15k/sec
  Metsys.Little:   44k/sec
  ProtoBuf:        80k/sec

As expected, BinaryFormatter is the worst and my Incubator.Clone beats it. But clearly, I either need to do a lot of optimizing, or not bother with cloning, because Metsys.Little and ProtoBuf are far superior. My problem was reflection, so I took a look at MetSys.Litte's code, since it had to do the same things as clone + reading/writing binary. This lead me to "Reflection : Fast Object Creation" by Dodge Common Performance Pitfalls to Craft Speedy ApplicationsJoel Pobar, both of which provide great insight on how to avoid performance problems with Reflection.

The resulting values are

Simple Object:
  Incubator.Clone: 458k/sec

Complext Object:
  Incubator.Clone: 123k/sec

That's more like it. 1.5x over MetSys.Little on simple and 1.5x over Protobuf on complex. I still have to optimize the collection cloning logic, which should improve complex objects, since a complex, collection-less graph resulted in this:

Complext Object:
  Incubator.Clone: 226k/sec
  BinaryFormatter:  20k/sec
  Metsys.Little:   113k/sec
  ProtoBuf:         82k/sec

Metsys.Little pulled back ahead of ProtoBuf, but Incubator pulled ahead enough that it's overall ratio is now 2x. So, i should have some more performance to squeeze out of Incubator.

The common lesson from all this is that you really need to measure. Things that "ought to be fast" may turn out to be disappointing. But equally important, imho, is get things working simply first, then measure to see if optimizing is needed. Just as bad as assuming it's going to be fast is assuming that it's going to be slow and prematurely optimizing.

Reproducing Promise IoC in C#

Another diversion before getting back to actual Promise language syntax description, this time trying to reproduce the Promise IoC syntax in C#. Using generics gets us a good ways there, but we do have to use a static method on a class as the registrar giving us this syntax:

$#[Catalog].In(:foo).Use<DbCatalog>.ContextScoped;
// becomes
Context._<ICatalog>().In("foo").Use<DbCatalog>().ContextScoped();

Not too bad, but certainly more syntax noise. Using a method named _ is rather arbitrary, i know, but it at least kept it more concise. Implementation-wise there's a lot of assumptions here: This approach forces the use of interfaces for Promise types, which can't be enforced by generic constraints. It would also be fairly simple to pre-initialize the Context with registrations that look for all interfaces IFoo and then find the implementor Foo and register that as the default map, mimicking the default Promise behavior by naming convention instead of Type/Class name shadowing.

Next up, instance resolution:

var foo = Foo();
// becomes
var foo = Context.Get<IFoo>();

This is where the appeal of the syntax falls down, imho. At this point you might as well just go to constructor injection, as with normal IoC. Although you do need that syntax for just plain inline resolution.

And the whole thing uses a static class, so that seems rather hardcoded. Well, at least that part we can take care of: If we follow the Promise assumption that a context is always bound to a thread, we can use [ThreadStatic] to chain context hierarchies together so that what looks like a static accessor is really just a wrapper around the context thread state. Given the following Promise syntax:

context(:inner) {
  $#[Catalog].In(:inner).ContextScoped;
  $#[Cart].ContextScoped;

  var catalogA = Catalog();
  var cartA = Cart();

  context {
    var catalogB = Catalog(); // same instance as catalogA
    var catalogC = Catalog(); // same instance as A and B
    var cartB = Cart(); // different instance from cartA
    var cartC = Cart(); // same instance as cartB
  }
}

we can write it in C# like this:

using(new Context("inner")) {
  Context._<ICatalog>().In("inner").ContextScoped();
  Context._<ICart>().ContextScoped();

  var catalogA = Context.Get<ICatalog>();
  var cartA = Context.Get<ICart>();

  using(new Context()) {
    var catalogB = Context.Get<ICatalog>(); // same instance as catalogA
    var catalogC = Context.Get<ICatalog>(); // same instance as A and B
    var cartB = Context.Get<ICart>(); // different instance from cartA
    var cartC = Context.Get<ICart>(); // same instance as cartB
  }
}

This works because Context is IDisposable. When we new up an instance, it takes the current threadstatic and stores it as it's parent and sets itself as the current. Once we leave the using() block, Dispose() is called, at which time, we set the current context's parent back as current, allowing us to build up and un-roll the context hierarchy:

public class Context : IDisposable {

  ...

  [ThreadStatic]
  private static Context _current;

  private Context _parent;

  public Context() {
    if(_current != null) {
      _parent = _current;
    }
    _current = this;
  }

  public void Dispose() {
    _current = _parent;
  }
}

I'm currently playing around with this syntax a bit more and using Autofac inside the Context to do the heavy lifting. If I find the syntax more convenient than plain Autofac, i'll post the code on github.

By arne on | .net, geek | A comment?
Tags: , , ,

Freezing DTOs by proxy

In my previous post about sharing data without sharing data state, I proposed an interface for governing the transformation of mutable to immutable and back for data transfer objects (DTOs) that are shared via a cache or message pipeline. The intent was to avoid copying the object when not needed. The interface I came up with was this:

public interface IFreezable<T> where T : class {
    bool IsFrozen { get; }

    void Freeze();
    T FreezeDry();
    T Thaw();
}

The idea being that a mutable instance could be made immutable by freezing it and receivers of the data could create their own mutable copy in case they needed to track local state.

  • Freeze() freezes a mutable instance and is a no-op on a frozen instance
  • FreezeDry() returns a frozen clone of a mutable instance or the current already frozen instance. This method would be called by the container when data is submitted to make sure the data is immutable. If the originator froze it's instance, no copying is required to put the data into the container
  • Thaw() will always clone the instance, whether its frozen or not, and return a frozen instance. This is done so that no two threads accidentally get a reference to the same mutable instance.

Straight forward behavior but annoying to implement on objects that really should only be data containers and not contain functionality. You either have to create a common base class or roll the behavior for each implementation. Either is annoying and a distraction.

Freezable by proxy

What we really want is a mix-in that we can use to attach shared functionality to DTOs without their having to take on a base class or implementation burden. Since .NET doesn't have mix-ins, we do the next best thing: We wrap the instance in question with a Proxy that takes on the implementation burden.

Consider this DTO, implementing IFreezable:

public class FreezableData : IFreezable<FreezableData> {

    public virtual int Id { get; set; }
    public virtual string Name { get; set; }

    #region Implementation of IFreezable<Data>
    public virtual void Freeze() { throw new NotImplementedException(); }
    public virtual FreezableData FreezeDry() { throw new NotImplementedException(); }

    public virtual bool IsFrozen { get { return false; } }
    public virtual FreezableData Thaw() { throw new NotImplementedException(); }
    #endregion
}

This is a DTO that supports the interface, but lacks the implementation. The implementation we can provide by transparently wrapping it with a proxy.

var freezable = Freezer.AsFreezable(new FreezableData { Id = 42, Name = "Everything" });
Assert.isTrue(Freezer.IsFreezable(data);

// freeze
freezable.Freeze();
Assert.IsTrue(freezable.IsFrozen);

// this will throw an exception
freezable.Name = "Bob";

Under the hood, Freezer uses Castle's DynamicProxy to wrap the instance and handle the contract. Now we have an instance of FreezableData that supports the IFreezable contract. Simply plug the Freezer call into your IoC or whatever factory you use to create your DTOs and you get the behavior injected.

In order to provide the cloning capabilities required for FreezeDry() and Thaw(), I'm using MetSys.Little and perform a serialize/deserialize to clone the instance. This is currently hardcoded and should probably have some pluggable way of providing serializers. I also look for a method with signature T Clone() and will use that instead of serialize/deserialize to clone the DTO. It would be relatively simple to write a generic memberwise deep-clone helper that works in the same way as the MetSys.Little serializer, except that it never goes to a byte representation.

But why do I need IFreezable?

With the implementation injected and its contract really doing nothing but ugly up the DTO, why do we have to implement IFreezable at all? Well, you really don't. Freezer.AsFreezable() will work on any DTO and create a proxy that implements IFreezable.

public class Data {
    public virtual int Id { get; set; }
    public virtual string Name { get; set; }
}

var data = Freezer.AsFreezable(new Data{ Id = 42, Name = "Everything" });
Assert.IsTrue(Freezer.IsFreezable(data);

// freeze
var freezable = data as IFreezable<Data>;
freezable.Freeze();
Assert.IsTrue(freezable.IsFrozen);

No more unsightly methods on our DTO. But in order to take advantage of the capabilties, we need to cast it to IFreezable, which from the point of view of a container, cache or pipeline, is just fine, but for general usage, might be just another form of ugly. Luckily Freezer also provides helpers to avoid the casts:

// freeze
Freezer.Freeze(data);
Assert.IsTrue(Freezer.IsFrozen(data));

// thaw, which creates a clone
var thawed = Freezer.Thaw(data);
Assert.AreEqual(42, thawed.Id);

Everything that can be done via the interface also exists as a helper method on Freezer. FreezeDry() and Thaw() will both work on an unwrapped instance, resulting in wrapped instances. That means that a container could easily take non-wrapped instances in and just perform FreezeDry() on them, which is the same as receiving an unfrozen IFreezable.

Explicit vs. Implicit Behavior

Whether to use the explicit interface implementation or just wrap the DTO with the implementation comes down to personal preference. Either methods works the same, as long as the DTO satisfies a couple of conditions:

  • All data must be exposed as Properties
  • All properties must be virtual
  • Cannot contain any methods that change data (since it can't be intercepted and prevented on a frozen instance)
  • Collections are not yet supported as Property types (working on it)
  • DTOs must have a no-argument constructor (does not have to be public)

If the DTO represents a graph, the same must be true for all child DTOs.

The code is functional but still a work in progress. It's released under the Apache license and can be found over at github.

By arne on | .net, geek | A comment?
Tags: , ,

Sharing data without sharing data state

I'm taking a break from Promise for a post or two to jot down some stuff that I've been thinking about while discussing future enhancements to MindTouch Dream with @bjorg. In Dream all service to service communication is done via HTTP (although the traffic may never hit the wire). This is very powerful and flexible, but also has performance drawbacks, which have led to many data sharing discussions.

Whether you are using data as a message payload or even just putting data in a cache, you want sender and receiver to be unable to see each others interaction with that data, which would happen if the data was a shared, mutable instance. If you were to allow shared modification on purpose or on accident can have very problematic consequences:

  1. Data corruption: Unless you wrap the data with a lock, two threads could try to modify the data at the same time
  2. Difference in distributed behavior: As soon as the payload crosses a process boundary, it ceases to be shared so changing topology, changes data behavior

There are a number of different approaches for dealing with this, each a trade-off in performance and/or usability. I'll use caching as the use case, since it's a bit more universal than message passing, but the same patterns applies.

Cloning

A naive implementation of a cache might just be a dictionary. Sure, you've wrapped the dictionary access with a mutex, so that you don't get corruption accessing the data. But multiple threads would still have access to the same instance. If you aren't aware of this sharing, expect to spend lots of time trying to debug this behavior. If you are unlucky it's not causing crashes but causes strange data corruption that you won't even know about until your data is in shambles. If you are lucky the program crashes because of an access violation of some sort.

Easy, we'll just clone the data going into the cache. Hrm, but now two threads getting the value are still messing with each other. Ok, fine, we'll clone it coming out of the cache. Ah, but if the orignal thread is still manipulating its copy data while others are getting the data, the cache keeps changing. That kind of invalidates the purpose of caching data.

So, with cloning we have to copy the data going in and coming back out. That's quite a bit of copying and in the case that the data goes into the cache and expires before someone uses it, it's a wasted copy to boot.

Immutability

If you've paid any attention to concurrency discussions you've heard the refrain from the functional camp that data should be immutable. Every modfication of the data should be a new copy with the orginal unchanged. This is certainly ideal for sharing data without sharing state. It's also a degenerative version of the cloning approach above, in that we are constantly cloning, whether we need to or not.

Unless your language supports immutable objects at a fundamental level, you are likely to be building this by hand. There's certainly ways of mitigating its cost, using lazy cloning, journaling, etc. i.e. figuring out when to copy what in order to stay immutable. But likely you are going to be building a lot of plumbing.

But if the facilities exist and if the performance characteristics are acceptable, Immutability is the safest solution.

Serialization

So far I've ignored the distributed case, i.e. sending a message across process boundaries or sharing a cache between processes. Both Cloning and Immutability rely on manipulating process memory. The moment the data needs to cross process boundaries, you need to convert it into a format that can be re-assembled into the same graph, i.e. you need to serialize and deserialize the data.

Serialization is another form of Immutability, since you've captured the data state and can re-assemble it into the original state with no ties to the original instance. So Serialization/Deserialization is a form of Cloning and can be used as an engine for immutability as well. And it goes across the wire? Sign me up, it's all i need!

Just like Immutability, if the performance characteristics are acceptable, it's a great solution. And of course, all serializers are not equal. .NET's default serializer, i believe, exists as a schoolbook example of how not to do it. It's by far the slowest, biggest and least flexible ones. On other end of scale, google's protobuf is the fastest and most compact I've worked with, but there are some flexibility concessions to be made. BSON is a decent compromise when more flexibility is needed. A simple, fast and small enough serializer for .NET that i like is @karlseguin's Metsys.Little. Regardless of serializer, even the best serializer is still a lot slower than copying in-process memory, never mind not even having to copy that memory.

Freeze

It would be nice to avoid the implicit copies and only copy or serialize/deserialize when we need to. What we need is for a way for the originator to be able to declare that no more changes will be made to the data and for the receivers of the data to declare whether they intend to modify the retrieved data, providing the folowing usage scenarios:

  • Originator and receiver won't change the data: same instance can be used
  • Originator will change data, receiver won't: need to copy in, but not coming out
  • Originator won't change the data, receiver will: can put instance in, but need to copy on the way out

In Ruby, freeze is a core language concept (although I profess my ignorance of not knowing how to get a mutable instance back again or whether this works on object graphs as well.) To let the originator and receiver declare their intended use of data in .NET, we could require data payloads to implement an interface, such as this:

public interface IFreezable<T> {
  bool IsFrozen { get; }

  void Freeze(); // freeze instance (no-op on frozen instance)
  T FreezeDry(); // return a frozen clone or if frozen, the current instance
  T Thaw();      // return an unfrozen clone (regardless whether instance is frozen)
}

On submitting the data, the container (cache or message pipeline) will always call FreezeDry() and store the returned instance. If the originator does not intend to modify the instance submitted further, it can Freeze() it first, turning the FreezeDry() that the container does into a no-op.

On receipt of the data, the instance is always frozen, which is fine for any reference use. But should the receiver need to change it for local state tracking, or submitting the changed version, it can always call Thaw() to get a mutable instance.

While IFreezable certainly offers some benefits, it'd be a pain to add to every data payload we want to send. This kind of plumbing is a perfect scenario for AOP, since its a concern of the data consumer not of the data. In my next post, I'll talk about some approaches to avoid the plumbing. In the meantime, the WIP code for that post can be found on github.