I thought I had only one syntax post left before diving into posts about attempting to implement the language. But starting on a post about method slots and operators, I decided that there was something else I needed to cover in more detail first: the illustrious JSON object.
I've alluded to JSON objects more than a couple of times in previous posts, generally as an argument for lambda calls. Since everything in Promise is a class, JSON objects are a bit of an anomaly. Simply put, they are the serialization format of Promise, i.e. any object can be reduced to a JSON graph. As such, they exist outside the normal class regime. The format is also closer to BSON, as it will retain type information unless serialized to text, and can be serialized on the wire either as JSON or BSON. So it looks like javascript object notation (JSON), but it's really Promise object notation. For simplicity, I'm going to keep calling it JSON tho.
Creating a JSON object is the same as in javascript:
var a = {};
var b = [];
var c = { foo: ["bar","baz"] };
var d = { song: Song{name: "ShopVac"} };
The notation accepts hash and array initializers and their nesting, as well as object instances as values. Field names are always strings.
The last example shows that you can put Promise objects into a JSON graph, and the object initializer itself takes another JSON object. I explained in "Promise: IoC Type/Class mapping" that passing a JSON object to the Type allows the mapped class constructor to intercept it, but in the default case, it's simply a mapping of fields:
class Song {
_name;
Artist _artist;
Artist:(){ _artist; }
...
}
class Artist {
_name;
...
}
var song = Song{ name: "The Future Soon", artist: { name: "Jonathan Coulton" } };
// get the Artist object
var artist = song.Artist;
// serialize the object graph back to JSON
print song.Serialize();
// => { name: "The Future Soon", artist: { name: "Jonathan Coulton" } };
Lacking any intercepts and maps, the initializer will assign the value name to _name, and when it maps artist to _artist, the typed nature of _artist invokes its initializer with the JSON object from the artist field. Once .Serialize() is called, the graph is reduced to the most basic types possible, i.e. the Artist object is serialized as well. Since the serialization format is meant for passing DTOs, not Types, the type information (beyond fundamental types like String, Num, etc.) is lost at this stage. Circular references in the graph would be dropped: any object already encountered in serialization causes the field to be omitted. It is omitted rather than set to nil so that using the result as an initializer does not set the slot to nil, but allows the default initializer to execute.
Above I mentioned that JSON field values are typed and showed the variable d set to have an object as the value of field song. This setting does not cause Song to be serialized. When assigning values into a JSON object, they retain their type until they are used as arguments for something that requires serialization or are manually serialized.
var d = { song: Song{name: "ShopVac"} };
// this works, since song is a Song object
d.song.Play();
var e = d.Serialize(); // { song: { name: "ShopVac" } }
// this will throw an exception
e.song.Play();
// this clones e
var f = e.Serialize();
Serialization can be called as many times as you want and acts as a clone operation for graphs lacking anything further to serialize. The clone is a lazy operation, making it very cheap. Basically a pointer to the original JSON is returned and it is only fully cloned if either the original or the clone is modified. This means the penalty for calling .Serialize() on a fully serialized object is minimal, which makes it an ideal way to propagate data that is considered immutable.
JSON objects are fully dynamic and can be accessed and modified at will.
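The lazy clone described here is essentially copy-on-write. Since Promise has no implementation yet, here is a rough C# sketch of the idea, not how Promise does it. The `LazyGraph` wrapper, its `Store` and the `SharesStorageWith` helper are all hypothetical names, the graph is flattened to a single dictionary of already serialized values, and nothing here is thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical copy-on-write wrapper; illustrates the lazy clone idea only.
public sealed class LazyGraph {

    // Field storage that may be shared by several wrappers.
    private sealed class Store {
        public readonly Dictionary<string, object> Fields;
        public int Owners = 1; // number of wrappers pointing at this store
        public Store(Dictionary<string, object> fields) { Fields = fields; }
    }

    private Store _store;

    public LazyGraph() { _store = new Store(new Dictionary<string, object>()); }
    private LazyGraph(Store shared) { _store = shared; }

    public object this[string name] {
        get { return _store.Fields[name]; }
        set {
            EnsureOwnCopy(); // the real copy happens here, on first write
            _store.Fields[name] = value;
        }
    }

    // Cheap "clone": hands back a second wrapper around the same store.
    public LazyGraph Serialize() {
        _store.Owners++;
        return new LazyGraph(_store);
    }

    // Copy the fields only if another wrapper still shares them.
    private void EnsureOwnCopy() {
        if (_store.Owners == 1) return;
        _store.Owners--;
        _store = new Store(new Dictionary<string, object>(_store.Fields));
    }

    public bool SharesStorageWith(LazyGraph other) {
        return ReferenceEquals(_store, other._store);
    }
}

public static class LazyGraphDemo {
    public static void Main() {
        var e = new LazyGraph();
        e["name"] = "ShopVac";
        var f = e.Serialize();                        // no copying yet
        Console.WriteLine(e.SharesStorageWith(f));    // True
        f["name"] = "changed";                        // f copies before writing
        Console.WriteLine(e["name"]);                 // ShopVac
        Console.WriteLine(e.SharesStorageWith(f));    // False
    }
}
```

Whichever side writes first pays for the copy; the other keeps the original storage. A wrapper that is dropped without ever writing leaves the owner count too high, which costs at most one unnecessary copy later but never shares a modification.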
var x = {foo: "bar"};
// access by dot notation
print x.foo; // => "bar"
// access by name (for programmatic access or access of non-symbolic names)
print x["foo"]; // => "bar"
x.foo = ["bar","baz"]; // {foo: ["bar","baz"]}
x.bar = "baz"; // {bar: "baz", foo: ["bar", "baz"]};
// delete a field via self-reference
x.foo.Delete();
// or by name
x["foo"].Delete();
The reason JSON objects exist as entities distinct from class-defined objects is to provide a clear separation between objects with behavior and data-only objects. Attaching functionality to data should be an explicit conversion from a data object to a classed object, rather than mixing the two, javascript style.
Of course, this dichotomy could theoretically be abused with something like this:
var x = {};
x.foo = (x) { x*x; };
print x.foo(3); // => 9
I am considering disallowing the assignment of lambdas as field values, since they cannot be serialized, thus voiding this approach. I'll punt on the decision until implementation. If lambdas end up as first class objects, the above would have to be explicitly prohibited, which may lead me to leave it in. If, however, I'd have to manually support this use case, I'm going to leave it out for sure.
JSON objects exist as a convenient data format internally and for getting data in and out of Promise. The ubiquity of JSON-like syntax in most dynamic languages and its easy mapping to object graphs make it the ideal choice for Promise to foster simplicity and interop.
This is a post in an ongoing series of posts about designing a language. It may stay theoretical, it may become a prototype in implementation or it might become a full language. You can get a list of all posts about Promise, via the Promise category link at the top.
Yesterday I removed Subzero's dependency on Metsys.Little. This wasn't because I have any problem with it. On the contrary, it's a great library and provides really easy serialization and deserialization. But since the whole point of Subzero is to provide very lightweight data passing, I figured cloning is preferable.
My first pass was purely a "get it working" attempt and I knew that I left a lot of performance on the table by reflecting over the type each time I cloned. Didn't think it would be this bad tho. I used two classes for my test, a Simple and a Complex object:
public class Simple {
public int Id { get; set; }
public string Name { get; set; }
}
public class Complex {
public Simple Owner { get; set; }
public IList<Simple> Friends { get; set; }
}
The results were sobering:
Simple Object:
Incubator.Clone: 142k/sec
BinaryFormatter: 50k/sec
Metsys.Little: 306k/sec
ProtoBuf: 236k/sec
Complex Object:
Incubator.Clone: 37k/sec
BinaryFormatter: 15k/sec
Metsys.Little: 44k/sec
ProtoBuf: 80k/sec
As expected, BinaryFormatter is the worst and my Incubator.Clone beats it. But clearly, I either need to do a lot of optimizing, or not bother with cloning, because Metsys.Little and ProtoBuf are far superior. My problem was reflection, so I took a look at Metsys.Little's code, since it had to do the same things as clone + reading/writing binary. This led me to "Reflection : Fast Object Creation" and to Joel Pobar's "Dodge Common Performance Pitfalls to Craft Speedy Applications", both of which provide great insight on how to avoid performance problems with Reflection.
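The general recipe for avoiding this cost is to reflect over a type once, turn what you found into a compiled delegate and cache it. The following is a minimal sketch of that idea using expression trees. It is not the actual Incubator code: `FastCloner` is a hypothetical name, it only handles public read/write properties, and it makes a shallow copy, so the references held by something like Complex would be shared rather than cloned.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;

// Same shape as the Simple test class above.
public class Simple {
    public int Id { get; set; }
    public string Name { get; set; }
}

// Hypothetical helper: reflection runs once per T, clones use the cached delegate.
public static class FastCloner<T> where T : new() {

    private static readonly Func<T, T> _clone = Build();

    private static Func<T, T> Build() {
        var source = Expression.Parameter(typeof(T), "source");

        // The only place that reflects over the type.
        var bindings = typeof(T)
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => p.GetGetMethod() != null
                     && p.GetSetMethod() != null
                     && p.GetIndexParameters().Length == 0)
            .Select(p => (MemberBinding)Expression.Bind(p, Expression.Property(source, p)))
            .ToArray();

        // Equivalent to: source => new T { Id = source.Id, Name = source.Name, ... }
        var body = Expression.MemberInit(Expression.New(typeof(T)), bindings);
        return Expression.Lambda<Func<T, T>>(body, source).Compile();
    }

    public static T Clone(T original) {
        return _clone(original);
    }
}

public static class FastClonerDemo {
    public static void Main() {
        var original = new Simple { Id = 42, Name = "bob" };
        var clone = FastCloner<Simple>.Clone(original);
        Console.WriteLine(clone.Id + " " + clone.Name);          // 42 bob
        Console.WriteLine(ReferenceEquals(original, clone));     // False
    }
}
```

The static field initializer runs once per closed generic type, so every call to Clone after the first is a plain delegate invocation.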
The resulting values are
Simple Object:
Incubator.Clone: 458k/sec
Complex Object:
Incubator.Clone: 123k/sec
That's more like it. 1.5x over Metsys.Little on simple and 1.5x over ProtoBuf on complex. I still have to optimize the collection cloning logic, which should improve complex objects, since a complex, collection-less graph resulted in this:
Complex Object:
Incubator.Clone: 226k/sec
BinaryFormatter: 20k/sec
Metsys.Little: 113k/sec
ProtoBuf: 82k/sec
Metsys.Little pulled back ahead of ProtoBuf, but Incubator pulled ahead enough that its overall ratio is now 2x. So, I should have some more performance to squeeze out of Incubator.
The common lesson from all this is that you really need to measure. Things that "ought to be fast" may turn out to be disappointing. But equally important, imho, is to get things working simply first, then measure to see if optimizing is needed. Just as bad as assuming it's going to be fast is assuming that it's going to be slow and prematurely optimizing.
I'm taking a break from Promise for a post or two to jot down some stuff that I've been thinking about while discussing future enhancements to MindTouch Dream with @bjorg. In Dream all service to service communication is done via HTTP (although the traffic may never hit the wire). This is very powerful and flexible, but also has performance drawbacks, which have led to many data sharing discussions.
Whether you are using data as a message payload or even just putting data in a cache, you want sender and receiver to be unable to see each other's interactions with that data, which would happen if the data was a shared, mutable instance. Allowing shared modification, whether on purpose or by accident, can have very problematic consequences.
There are a number of different approaches for dealing with this, each a trade-off in performance and/or usability. I'll use caching as the use case, since it's a bit more universal than message passing, but the same patterns apply.
A naive implementation of a cache might just be a dictionary. Sure, you've wrapped the dictionary access with a mutex, so that you don't get corruption accessing the data. But multiple threads would still have access to the same instance. If you aren't aware of this sharing, expect to spend lots of time trying to debug this behavior. If you are unlucky it's not causing crashes but causes strange data corruption that you won't even know about until your data is in shambles. If you are lucky the program crashes because of an access violation of some sort.
Easy, we'll just clone the data going into the cache. Hrm, but now two threads getting the value are still messing with each other. Ok, fine, we'll clone it coming out of the cache. Ah, but if the original thread is still manipulating its copy of the data while others are getting the data, the cache keeps changing. That kind of invalidates the purpose of caching data.
So, with cloning we have to copy the data going in and coming back out. That's quite a bit of copying and in the case that the data goes into the cache and expires before someone uses it, it's a wasted copy to boot.
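As a sketch of what "copy in, copy out" means in practice, here is a small cache that clones in both directions. `CloningCache` is a hypothetical name and the clone function is supplied by the caller, since how to copy a value depends entirely on its type.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cache that never shares an instance with its callers.
public sealed class CloningCache<TKey, TValue> {

    private readonly Dictionary<TKey, TValue> _items = new Dictionary<TKey, TValue>();
    private readonly object _sync = new object();
    private readonly Func<TValue, TValue> _clone;

    public CloningCache(Func<TValue, TValue> clone) {
        _clone = clone;
    }

    public void Set(TKey key, TValue value) {
        // Copy going in: later changes by the originator don't reach the cache.
        TValue copy = _clone(value);
        lock (_sync) {
            _items[key] = copy;
        }
    }

    public bool TryGet(TKey key, out TValue value) {
        TValue cached;
        lock (_sync) {
            if (!_items.TryGetValue(key, out cached)) {
                value = default(TValue);
                return false;
            }
        }
        // Copy coming out: each reader gets a private instance.
        value = _clone(cached);
        return true;
    }
}

public static class CloningCacheDemo {
    public static void Main() {
        var cache = new CloningCache<string, List<string>>(l => new List<string>(l));
        var friends = new List<string> { "alice" };
        cache.Set("friends", friends);
        friends.Add("bob");                      // not visible to the cache

        List<string> first, second;
        cache.TryGet("friends", out first);
        first.Add("carol");                      // not visible to other readers
        cache.TryGet("friends", out second);
        Console.WriteLine(second.Count);         // 1
    }
}
```

The cached instance itself is never handed out and never modified, which is why the outgoing clone can happen outside the lock, assuming the clone function only reads its argument.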
If you've paid any attention to concurrency discussions you've heard the refrain from the functional camp that data should be immutable. Every modification of the data should be a new copy with the original unchanged. This is certainly ideal for sharing data without sharing state. It's also a degenerate case of the cloning approach above, in that we are constantly cloning, whether we need to or not.
Unless your language supports immutable objects at a fundamental level, you are likely to be building this by hand. There are certainly ways of mitigating its cost, such as lazy cloning and journaling, i.e. figuring out when to copy what in order to stay immutable. But likely you are going to be building a lot of plumbing.
But if the facilities exist and if the performance characteristics are acceptable, Immutability is the safest solution.
So far I've ignored the distributed case, i.e. sending a message across process boundaries or sharing a cache between processes. Both Cloning and Immutability rely on manipulating process memory. The moment the data needs to cross process boundaries, you need to convert it into a format that can be re-assembled into the same graph, i.e. you need to serialize and deserialize the data.
Serialization is another form of Immutability, since you've captured the data state and can re-assemble it into the original state with no ties to the original instance. So Serialization/Deserialization is a form of Cloning and can be used as an engine for immutability as well. And it goes across the wire? Sign me up, it's all I need!
Just like Immutability, if the performance characteristics are acceptable, it's a great solution. And of course, not all serializers are equal. .NET's default serializer, I believe, exists as a schoolbook example of how not to do it: it's by far the slowest, biggest and least flexible one. On the other end of the scale, Google's protobuf is the fastest and most compact I've worked with, but there are some flexibility concessions to be made. BSON is a decent compromise when more flexibility is needed. A simple, fast and small enough serializer for .NET that I like is @karlseguin's Metsys.Little. But even the best serializer is still a lot slower than copying in-process memory, never mind not having to copy that memory at all.
It would be nice to avoid the implicit copies and only copy or serialize/deserialize when we need to. What we need is a way for the originator to declare that no more changes will be made to the data and for the receivers of the data to declare whether they intend to modify the retrieved data, providing the following usage scenarios:
In Ruby, freeze is a core language concept (although I confess I don't know how to get a mutable instance back again, or whether this works on object graphs as well). To let the originator and receiver declare their intended use of data in .NET, we could require data payloads to implement an interface, such as this:
public interface IFreezable<T> {
bool IsFrozen { get; }
void Freeze(); // freeze instance (no-op on frozen instance)
T FreezeDry(); // return a frozen clone or if frozen, the current instance
T Thaw(); // return an unfrozen clone (regardless whether instance is frozen)
}
On submitting the data, the container (cache or message pipeline) will always call FreezeDry() and store the returned instance. If the originator does not intend to modify the instance submitted further, it can Freeze() it first, turning the FreezeDry() that the container does into a no-op.
On receipt of the data, the instance is always frozen, which is fine for any reference use. But should the receiver need to change it for local state tracking, or submitting the changed version, it can always call Thaw() to get a mutable instance.
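To make the flow concrete, here is a minimal hand-written sketch of a payload and a container following these rules. The interface is repeated so the sample compiles on its own; `Note` and `Mailbox` are hypothetical names standing in for a real payload and a real cache or message pipeline.

```csharp
using System;

public interface IFreezable<T> {
    bool IsFrozen { get; }
    void Freeze();
    T FreezeDry();
    T Thaw();
}

// Hypothetical payload with a single mutable field.
public sealed class Note : IFreezable<Note> {
    private string _text;

    public bool IsFrozen { get; private set; }

    public string Text {
        get { return _text; }
        set {
            if (IsFrozen) throw new InvalidOperationException("instance is frozen");
            _text = value;
        }
    }

    public void Freeze() { IsFrozen = true; }

    public Note FreezeDry() {
        if (IsFrozen) return this;              // already safe to share
        var copy = new Note { Text = _text };
        copy.Freeze();
        return copy;
    }

    public Note Thaw() {
        return new Note { Text = _text };       // always a fresh, mutable copy
    }
}

// Hypothetical single-slot container standing in for a cache or pipeline.
public sealed class Mailbox<T> where T : class, IFreezable<T> {
    private T _stored;
    public void Put(T item) { _stored = item.FreezeDry(); }
    public T Get() { return _stored; }
}

public static class FreezeDemo {
    public static void Main() {
        var box = new Mailbox<Note>();

        var note = new Note { Text = "draft" };
        note.Freeze();                          // originator is done with it
        box.Put(note);                          // FreezeDry() returns the same instance
        Console.WriteLine(ReferenceEquals(note, box.Get()));   // True

        var mine = box.Get().Thaw();            // receiver wants to modify
        mine.Text = "edited";
        Console.WriteLine(box.Get().Text);      // draft
    }
}
```

The only copies made are the ones somebody asked for: one on submission if the originator did not freeze, and one per receiver that calls Thaw().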
While IFreezable certainly offers some benefits, it'd be a pain to add to every data payload we want to send. This kind of plumbing is a perfect scenario for AOP, since it's a concern of the data consumer, not of the data. In my next post, I'll talk about some approaches to avoid the plumbing. In the meantime, the WIP code for that post can be found on GitHub.
Ok, there really isn’t a need for doing this, but since I’m already stuck on creating compact language independent binary representations, here’s a quick struct with an int, a fixed sized string and a variable data field that implements a quick and dirty serialization to go along with it:
using System;
using System.Runtime.InteropServices;
using System.Text;

/// <summary>
/// Example of completely manual serialization of a struct containing a fixed
/// sized string and a variable sized data block
/// </summary>
struct FixedStringAndVariableData {

    // the fixed size of the valueString
    const int VALUE_SIZE = 10;

    Int32 id;
    string valueString;
    byte[] data;

    public int Id {
        get { return id; }
        set { id = value; }
    }

    public int ValueSize {
        get { return VALUE_SIZE; }
    }

    public string Value {
        get {
            if( valueString == null ) {
                valueString = "".PadRight(VALUE_SIZE,' ');
            }
            return valueString;
        }
        set {
            if( value == null ) {
                valueString = "".PadRight(VALUE_SIZE,' ');
            } else if( value.Length > VALUE_SIZE ) {
                // too long: truncate to the fixed size
                valueString = value.Substring(0,VALUE_SIZE);
            } else {
                // too short: pad with spaces (a no-op for exact-length values)
                valueString = value.PadRight(VALUE_SIZE,' ');
            }
        }
    }

    public byte[] Data {
        get { return data; }
        set { data = value; }
    }

    // rebuild the struct from bytes laid out as: id | fixed string | data
    public FixedStringAndVariableData(byte[] pRaw) {
        id = BitConverter.ToInt32(pRaw,0);
        Int32 offset = Marshal.SizeOf(id.GetType());
        valueString = Encoding.ASCII.GetString(pRaw,offset,VALUE_SIZE);
        offset += VALUE_SIZE;
        Int32 remainder = pRaw.Length - offset;
        data = new byte[remainder];
        for(int i=0;i<remainder;i++) {
            data[i] = pRaw[offset+i];
        }
    }

    public byte[] Serialize() {
        // treat a missing data block as empty
        byte[] payload = data ?? new byte[0];
        Int32 size = Marshal.SizeOf(id.GetType())+VALUE_SIZE+payload.Length;
        byte[] serialized = new byte[size];
        int position = 0;
        byte[] buffer = BitConverter.GetBytes(id);
        buffer.CopyTo(serialized,position);
        position += buffer.Length;
        buffer = Encoding.ASCII.GetBytes(this.Value);
        buffer.CopyTo(serialized,position);
        position += buffer.Length;
        payload.CopyTo(serialized,position);
        return serialized;
    }
}
Ok, I’ll get off this subject now.
Forgot about putting this up as promised. Of course, this is a very simplistic example with only one variable sized field. If you had more, you’d have to also serialize those fields’ sizes and reconstruction becomes a bit more complicated.
But the important part about this is really the whole concept of pinning some memory down so that you can manipulate it directly and then releasing it back to the control of the garbage collector. Pretty cool, really.
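using System;
using System.Runtime.InteropServices;
using System.Text;

[StructLayout(LayoutKind.Sequential)]
struct DIYSerialize {

    public Int32 Id;
    public byte[] Data;

    public DIYSerialize(Int32 id, string data) {
        this.Id = id;
        this.Data = Encoding.ASCII.GetBytes(data);
    }

    public DIYSerialize(byte[] Raw) {
        // pin the managed array so the garbage collector can't move it
        // while we read from it through a raw pointer
        GCHandle pin = GCHandle.Alloc(Raw,GCHandleType.Pinned);
        try {
            IntPtr pnt = pin.AddrOfPinnedObject();
            Int32 offset = sizeof(Int32);
            byte[] payload = new byte[Raw.Length - offset];
            this.Id = Marshal.ReadInt32(pnt);
            Marshal.Copy(IntPtr.Add(pnt,offset),payload,0,payload.Length);
            this.Data = payload;
        } finally {
            // release the array back to the control of the garbage collector
            pin.Free();
        }
    }

    public byte[] Serialize() {
        Int32 offset = sizeof(Int32);
        byte[] d = new byte[offset + Data.Length];
        // pin the output array and write into it directly
        GCHandle pin = GCHandle.Alloc(d,GCHandleType.Pinned);
        try {
            IntPtr pnt = pin.AddrOfPinnedObject();
            Marshal.WriteInt32(pnt,Id);
            Marshal.Copy(Data,0,IntPtr.Add(pnt,offset),Data.Length);
        } finally {
            pin.Free();
        }
        return d;
    }
}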
I was this close to just using ISerializable for my binary representation for networking. But then talking to n8, I decided that leaving the door open for java interop was important enough in Enterprise computing that I couldn’t ignore it.
So, I’m back to coming up with my packet structure in a manner that I can easily serialize by hand. Right now the plan is something along the lines of a fixed sized header followed by a variable sized gzip-Xml chunk and a variable sized raw data chunk. Both of the variable chunks can be 0 length. The header is used for expressing the purpose of the packet and describing the offsets of the variable chunks. The gzip-Xml is for data that the recipient is supposed to parse and act on, while the raw data chunk will be just that, raw data that is used to construct a complete file.
I need to dig up the test code I did for doing manual serialization in .NET. Really, just what would be done in C with a struct and memcpy, but of course, since in C# you don’t have pointers and the memory is managed for you, it’s a bit more of a hoop jumping exercise. The code is on another machine, so I’ll leave that for another post.
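To give a feel for what such a packet could look like, here is a rough sketch. The actual header layout isn't settled, so everything specific here is an assumption made for illustration: a 12 byte header holding three 32-bit integers (purpose, length of the gzip-Xml chunk, length of the raw chunk), written big-endian since that is what Java's DataInputStream reads. Lengths are stored instead of offsets because the offsets follow directly from them. `Packet`, `Build` and `Parse` are hypothetical names.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical layout: [purpose:4][xmlLength:4][rawLength:4][gzip-Xml][raw]
public static class Packet {

    public const int HeaderSize = 12;

    public static byte[] Build(int purpose, string xml, byte[] raw) {
        byte[] xmlChunk = string.IsNullOrEmpty(xml)
            ? new byte[0]
            : Gzip(Encoding.UTF8.GetBytes(xml));
        byte[] rawChunk = raw ?? new byte[0];

        byte[] packet = new byte[HeaderSize + xmlChunk.Length + rawChunk.Length];
        WriteInt32(packet, 0, purpose);
        WriteInt32(packet, 4, xmlChunk.Length);
        WriteInt32(packet, 8, rawChunk.Length);
        Buffer.BlockCopy(xmlChunk, 0, packet, HeaderSize, xmlChunk.Length);
        Buffer.BlockCopy(rawChunk, 0, packet, HeaderSize + xmlChunk.Length, rawChunk.Length);
        return packet;
    }

    public static void Parse(byte[] packet, out int purpose, out string xml, out byte[] raw) {
        purpose = ReadInt32(packet, 0);
        int xmlLength = ReadInt32(packet, 4);
        int rawLength = ReadInt32(packet, 8);
        xml = xmlLength == 0
            ? ""
            : Encoding.UTF8.GetString(Gunzip(packet, HeaderSize, xmlLength));
        raw = new byte[rawLength];
        Buffer.BlockCopy(packet, HeaderSize + xmlLength, raw, 0, rawLength);
    }

    // Big-endian, independent of the machine's native byte order.
    private static void WriteInt32(byte[] buffer, int offset, int value) {
        unchecked {
            buffer[offset]     = (byte)(value >> 24);
            buffer[offset + 1] = (byte)(value >> 16);
            buffer[offset + 2] = (byte)(value >> 8);
            buffer[offset + 3] = (byte)value;
        }
    }

    private static int ReadInt32(byte[] buffer, int offset) {
        return (buffer[offset] << 24) | (buffer[offset + 1] << 16)
             | (buffer[offset + 2] << 8) | buffer[offset + 3];
    }

    private static byte[] Gzip(byte[] data) {
        using (var output = new MemoryStream()) {
            using (var gzip = new GZipStream(output, CompressionMode.Compress)) {
                gzip.Write(data, 0, data.Length);
            }
            // ToArray() still works after the gzip stream closed the MemoryStream
            return output.ToArray();
        }
    }

    private static byte[] Gunzip(byte[] buffer, int offset, int count) {
        using (var input = new MemoryStream(buffer, offset, count))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream()) {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

A round trip would be `Packet.Build(1, "<file name=\"a.txt\"/>", bytes)` followed by `Packet.Parse(...)` on the receiving side. Either variable chunk can be empty, in which case its length in the header is simply 0.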