Towards a decentralized, federated status network

So everyone is talking about join.app.net and I agree with almost everything they say. I'm all for promoting services in which the user is once again the customer rather than, as in ad-supported systems, the product. But it seems that in the end, everyone wants to build a platform. From the user's perspective, though, that just pushes the ball down the court: you are still beholden to a (hopefully benevolent) overlord, even though the premise that such an overlord is required to create a good experience is faulty. RSS, email and the web itself are just a few of the technologies we rely on that have no central control, are massively accepted, work pretty damn well and continue to enable lots of successful and profitable businesses without anyone owning "the platform". So can't we have a status network truly as infrastructure, built on the same principles and enabling companies to build successful services on top?

Such a system would have to be decentralized, distributed and federated, so that there is no barrier for anyone to start producing or consuming data in the network. As long as there is a single gatekeeper owning the pipeline, the best everyone else can hope to do is innovate on the API surface that owner has had the foresight to expose.

The Components

Thinking about a simple spec for such a status network, I started on a more technical spec over at GitHub, but wanted to cover the motivations behind the concepts separately here. I'm eschewing defining the network on obviously sympathetic technologies such as Atom and PubSubHubbub, because they carry a lot of legacy that is not needed while still not being a proper fit.

I'm also defining the system as a composition of independent components to avoid requiring the development of a large stack -- or platform -- to participate. Rather, implementers should be able to pick only the parts of the ecosystem for which they believe they can offer competitive value.

Identity

A crucial feature of centralized systems is that of being the guardian of identity: You have an arbiter of people's identity and can therefore trust the origin of an update (at least to a degree). But conversely, the closed system also controls that Identity, making it vulnerable to changes in business models.

A decentralized ecosystem must solve the problems of addressing, locating and authenticating data. It should also be flexible enough that it does not suffer from excessive lock-in with any identity provider.

The internet already provides a perfect mechanism for creating unique names in the user@host scheme most commonly used by email. The host covers locating the authority, while the user identifies the author. Using a combination of convention and DNS SRV records, a simple REST API can provide a universal whois look-up for any author, recipient or mentioned user in status feeds:

GET:/who/joe

{
  _sig: '366e58644911fea255c44e7ab6468c6a5ec6b4d9d700a5ed3a810f56527b127e',
  id: '[email protected]',
  name: 'Joe Smith',
  status_uri: 'http://droogindustries.com/status',
  feed_uri: 'http://droogindustries.com/status/feed'
}

That solves the addressing.
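As a rough sketch of that look-up, a name can be split at the @ and turned into a whois URL. Only the /who/<user> path comes from the example above. The helper names, the example.com addresses, the plain http default and the idea of passing in an already-resolved SRV target as { host, port } are assumptions made for illustration, not part of the spec.

```javascript
// Sketch only: parseName and whoisUrl are hypothetical helpers.

// Split "user@host" into its two parts.
function parseName(name) {
  const at = name.lastIndexOf('@');
  if (at <= 0 || at === name.length - 1) {
    throw new Error('expected a name of the form user@host: ' + name);
  }
  return { user: name.slice(0, at), host: name.slice(at + 1).toLowerCase() };
}

// Build the whois URL for a name. `service` stands in for the result of a
// DNS SRV lookup ({ host, port }). Without one, fall back by convention to
// the host part of the name itself.
function whoisUrl(name, service) {
  const { user, host } = parseName(name);
  const targetHost = (service && service.host) || host;
  const targetPort = (service && service.port) || 80;
  const authority = targetPort === 80 ? targetHost : `${targetHost}:${targetPort}`;
  return `http://${authority}/who/${encodeURIComponent(user)}`;
}

console.log(whoisUrl('joe@example.com'));
console.log(whoisUrl('joe@example.com', { host: 'names.example.net', port: 8080 }));
```

Keeping the SRV result as an optional argument means the name service can live on a different machine than the host in the name, which is the point of separating name resolution from everything else.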

As shown above, the identity document includes the uri to the canonical status location (as well as the feed location -- more on that difference later). That solves the locating.

Finally, public/private key encryption is used to sign the identity document (as well as any status updates), allowing the authenticity of the document to be verified and allowing all status updates to also be authenticated. The public key (or keys -- allowing for creating new keys without losing authentication ability for old messages) is part of the feed, while the nameservice record is signed by it. And that solves authentication.
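To make the signing step concrete, here is a minimal sketch using Node's built-in crypto module. The choice of Ed25519, the sorted-key serialization and the hex encoding of _sig are all assumptions on my part (the _sig values in the examples are shorter than an Ed25519 signature, so the real scheme evidently differs). The helper names and the example.com address are hypothetical too.

```javascript
const crypto = require('crypto');

// Serialize with sorted keys so signer and verifier sign the same bytes.
// (This canonical form is an assumption, not something the spec defines.)
function canonicalize(value) {
  if (Array.isArray(value)) return '[' + value.map(canonicalize).join(',') + ']';
  if (value !== null && typeof value === 'object') {
    const keys = Object.keys(value).sort();
    return '{' + keys.map((k) => JSON.stringify(k) + ':' + canonicalize(value[k])).join(',') + '}';
  }
  return JSON.stringify(value);
}

// Return a copy of the document with a fresh _sig over everything else.
function sign(doc, privateKey) {
  const { _sig, ...unsigned } = doc;
  const signature = crypto.sign(null, Buffer.from(canonicalize(unsigned)), privateKey);
  return { ...unsigned, _sig: signature.toString('hex') };
}

// A document is authentic if any of the author's published keys verifies it.
function verify(doc, publicKeys) {
  const { _sig, ...unsigned } = doc;
  if (typeof _sig !== 'string') return false;
  const data = Buffer.from(canonicalize(unsigned));
  const signature = Buffer.from(_sig, 'hex');
  return publicKeys.some((key) => crypto.verify(null, data, key, signature));
}

const oldKey = crypto.generateKeyPairSync('ed25519');
const newKey = crypto.generateKeyPairSync('ed25519');
const identity = sign({ id: 'joe@example.com', name: 'Joe Smith' }, oldKey.privateKey);

// The feed lists both keys, so a document signed with the old key still verifies.
console.log(verify(identity, [newKey.publicKey, oldKey.publicKey]));
// Any change to the signed fields invalidates the signature.
console.log(verify({ ...identity, name: 'Mallory' }, [oldKey.publicKey]));
```

Accepting a list of public keys is what allows key rotation: old messages keep verifying as long as the old public key stays in the feed.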

Note that name resolution, name lookup and feed hosting are independent pieces. They certainly could be implemented in a single system by any vendor, but they do not have to be, and separately implemented pieces interoperate transparently. This separation has the side effect of allowing the user control over their data.

Status Feed

The actual status uri, much like a blog uri, is just an HTML page, not the feed document itself. The feed is identified by a link tag in the status page:

<link rel="alternate" type="application/hsf+json" href="http://droogindustries.com/status/feed"/>

The reason for using an HTML page is that it allows you to separate your status page from the place where the data lives. You could plop it on a plain home page that just serves static files and point to a third party that manages your actual feed. It also allows your status page to provide a user friendly representation of your feed.
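A consumer therefore has to find the feed link in the status page first. The following is only a sketch of that discovery step: the function names are made up, and a regex scan is a shortcut that works for well-formed link tags like the one above, where a real consumer would probably use a proper HTML parser.

```javascript
const FEED_TYPE = 'application/hsf+json';

// Collect the quoted attributes of a single tag into a plain object.
function attributes(tag) {
  const attrs = {};
  const pattern = /([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  let match;
  while ((match = pattern.exec(tag)) !== null) {
    attrs[match[1].toLowerCase()] = match[2] !== undefined ? match[2] : match[3];
  }
  return attrs;
}

// Return the href of the first alternate link with the feed content type,
// or null if the page does not advertise a feed.
function findFeedUri(html) {
  const tags = html.match(/<link\b[^>]*>/gi) || [];
  for (const tag of tags) {
    const attrs = attributes(tag);
    if (attrs.rel === 'alternate' && attrs.type === FEED_TYPE && attrs.href) {
      return attrs.href;
    }
  }
  return null;
}

const page = `<html><head>
  <link rel="stylesheet" href="/site.css"/>
  <link rel="alternate" type="application/hsf+json" href="http://droogindustries.com/status/feed"/>
</head><body>Joe's status</body></html>`;

console.log(findFeedUri(page));
```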

The feed document, as the content-type betrays, is simply JSON. While XML is more flexible, that same flexibility (especially via namespaces) has made XML rather broadly reviled by developers. In contrast, JSON maps more easily to commonly used storage engines, is easier to read and write by hand or with tooling, and is readily consumed by JavaScript, making browser consumption a snap.

While the feed provides meta data about the author and the feeds the author follows, the focus is of course status update entries, which have a minimal form of:

{
  _sig: 'aadefbd0d0bc39a062f87107de126293d85347775152328bf464908430712789',
  id: '4AQlP4lP0xGaDAMF6CwzAQ',
  href: 'http://droogindustries.com/joe/status/feed/4AQlP4lP0xGaDAMF6CwzAQ',
  created_at: '2012-07-30T11:31:00Z',
  author: {
    id: '[email protected]',
    name: 'Joe Smith',
    profile_image_uri: 'http://droogindustries.com/joe.jpg',
    status_uri: 'http://droogindustries.com/joe/status',
    feed_uri: 'http://droogindustries.com/joe/status/feed',
  },
  text: 'Hey #{bob}, current status #{beach} #{vacation}',
  entities: {
    beach: 'http://droogindustries.com/images/beach.jpg',
    bob: '[email protected]',
    vacation: '#vacation'
  }
}

In essence, the status network works by massively denormalized propagation of JSON messages. Even though each message has a canonical uri, the network will be storing many copies, so updates are basically read-only. There is a mechanism for advisory updates and deletes, but of course there is no guarantee that such messages will be respected.

As for the message itself, I fully subscribe to the philosophy that status updates need to be short and contain little to no mark-up. Status feeds are a crappy way to have a conversation, and trying to extend them to allow it (looking at you, g+) is a mess. In addition, since everybody is storing copies, arbitrary length can become an adoption issue with unreasonable storage requirements. For these reasons, I'm currently setting the size of the message body to be limited to 1000 characters (not bowing to SMS here). For the content format, I'm leaning towards a template format for entity substitution, but am also considering a severely limited subset of HTML.
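If the template format wins out, rendering an entry could look roughly like the sketch below. The #{key} syntax and the 1000 character limit come from the text above. How each kind of entity is displayed, applying the limit to the raw template text, and the example.org address are assumptions for illustration.

```javascript
const MAX_LENGTH = 1000; // the limit currently proposed for the message body

// Replace each #{key} placeholder with a display form of the matching entity.
function render(entry) {
  // Note: .length counts UTF-16 code units, a simplification of "characters".
  if (entry.text.length > MAX_LENGTH) {
    throw new Error(`status text exceeds ${MAX_LENGTH} characters`);
  }
  const entities = entry.entities || {};
  return entry.text.replace(/#\{(\w+)\}/g, (placeholder, key) => {
    const known = Object.prototype.hasOwnProperty.call(entities, key);
    const value = known ? entities[key] : undefined;
    if (typeof value !== 'string') return placeholder;    // unknown key: leave untouched
    if (value.startsWith('#')) return value;               // hashtag
    if (/^https?:\/\//.test(value)) return `<${value}>`;   // link
    return `@${value.split('@')[0]}`;                      // user@host mention
  });
}

const entry = {
  text: 'Hey #{bob}, current status #{beach} #{vacation}',
  entities: {
    beach: 'http://droogindustries.com/images/beach.jpg',
    bob: 'bob@example.org',
    vacation: '#vacation',
  },
};

console.log(render(entry));
```

Because the entities travel with the message, a client can choose its own presentation (inline image, profile link, hashtag search) without re-parsing free text.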

While the feed itself is simply JSON, there needs to exist an authoring tool, if only to be able to push updates to PubSub servers. Since the only publicly visible part of the feed is the JSON document, implementers have complete freedom over the authoring experience, just as WordPress, LiveJournal, Tumblr, etc. all author RSS.

Aggregation

If the status feed is about the publishing of one's updates, the aggregators are the maintainers of one's timeline. The aggregator service provides two public functions: receiving updates from subscriptions (described below) and receiving mentions. Both are handled by a single required REST endpoint, which must be published in one's feed.
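What the handler behind that single endpoint does with an incoming entry might look like the following in-memory sketch. The class, its method names and the de-duplication key are hypothetical, and the HTTP layer is left out entirely.

```javascript
// Sketch only: an in-memory stand-in for one user's aggregator.
class Aggregator {
  constructor(owner) {
    this.owner = owner;        // the user@host this timeline belongs to
    this.entries = new Map();  // "author/id" -> { entry, mention }
  }

  // Handle one POSTed entry, whether it arrived as a subscription update
  // or as a mention. Returns false for a copy that was already stored.
  receive(entry) {
    const key = `${entry.author.id}/${entry.id}`;
    if (this.entries.has(key)) return false;
    const mention = Object.values(entry.entities || {}).includes(this.owner);
    this.entries.set(key, { entry, mention });
    return true;
  }

  // Newest first. ISO 8601 UTC timestamps in one format sort as plain strings.
  timeline() {
    return [...this.entries.values()]
      .sort((a, b) => {
        if (a.entry.created_at === b.entry.created_at) return 0;
        return a.entry.created_at < b.entry.created_at ? 1 : -1;
      })
      .map(({ entry, mention }) => ({ id: entry.id, text: entry.text, mention }));
  }
}

const aggregator = new Aggregator('bob@example.org');
const first = {
  id: 'a1', created_at: '2012-07-30T11:31:00Z', author: { id: 'joe@example.com' },
  text: 'Hey #{bob}', entities: { bob: 'bob@example.org' },
};
const second = {
  id: 'c7', created_at: '2012-07-30T12:00:00Z', author: { id: 'alice@example.net' },
  text: 'Lunch', entities: {},
};
aggregator.receive(first);
aggregator.receive(second);
aggregator.receive(first); // the same entry delivered twice is stored once
console.log(aggregator.timeline());
```

Ignoring repeat deliveries matters here because an entry that both mentions a user and comes from a feed they follow may reach the same endpoint twice.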

Aggregators are the most likely component to grow into applications, as they are the basis for consumption of status. For example, if someone wanted to create mobile applications for status updates, they would need to rely on an aggregator as the backend. Likely an aggregation service provider would integrate the status feed authoring as well and maybe even take on the naming services, since combining them allows for significant economies of scale. The important thing is that these parts do not have to be a single system.

This separation also allows application authors to use third parties to act as their aggregator. Since aggregators will receive a lot of real-time traffic and are likely to be the arbiters of what is spam, being able to offload this work to a dedicated service provider may be desirable for many status feed applications.

PubSub

A status network relies on timely dissemination of updates, so a push system to followers is a must. Taking a page from PubSubHubbub, I propose subscription hubs that feed publishers push their updates into. These hubs in turn push the updates to subscribers and mentioned users.

The specification assumes these hubs to be publicly accessible, at least for the subscriber, and only requires a rudimentary authentication scheme. It is likely that the ecosystem will require different usage and pricing models with more stringent authorization and rate-limiting, but at the very least the subscription part will need to be standardized so that aggregators can easily subscribe. It is, however, too early to try to further specialize that part of the spec, and the public subscription mechanism should be sufficient for now.

In addition to the REST spec for publish and subscribe, PubSub services are ideal candidates for other transports, such as RabbitMQ, XMPP, etc. Again, as the traffic needs of the system grow, so will the delivery options, and extension should be fairly painless, as long as the base REST pattern remains as a fallback.

In addition to delivering updates to subscribers, subscription hubs are also responsible for the distribution of mentions. Since each message posted by a publisher already includes the parsed entities, and since a user's feed (and thereby their aggregator) can be resolved from a name, the hub can deliver mentions in the same way as subscriptions: as POSTs against the aggregator uri.

Since PubSub services know about the subscribers, they are also the repository of followers for a user -- although only available to the publisher, who may opt to share this information on their status page.
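The hub's fan-out can be sketched as follows. Everything here is hypothetical scaffolding: the class, the callback for resolving a name to an aggregator uri (standing in for the whois and feed look-ups), the example.* addresses, and a deliver callback that stands in for the HTTP POST.

```javascript
// Sketch only: an in-memory hub with delivery reduced to a callback.
class Hub {
  constructor(resolveAggregator, deliver) {
    this.resolveAggregator = resolveAggregator; // name -> aggregator uri, or null
    this.deliver = deliver;                     // (aggregatorUri, entry) -> void
    this.subscribers = new Map();               // feed uri -> Set of aggregator uris
  }

  subscribe(feedUri, aggregatorUri) {
    if (!this.subscribers.has(feedUri)) this.subscribers.set(feedUri, new Set());
    this.subscribers.get(feedUri).add(aggregatorUri);
  }

  // The follower list a publisher could choose to show on their status page.
  followers(feedUri) {
    return [...(this.subscribers.get(feedUri) || [])];
  }

  // Push one entry to every subscriber of the author's feed and to the
  // aggregator of every mentioned user. Returns the number of deliveries.
  publish(entry) {
    const targets = new Set(this.subscribers.get(entry.author.feed_uri) || []);
    for (const value of Object.values(entry.entities || {})) {
      // Treat any entity shaped like user@host as a mention.
      if (typeof value === 'string' && /^[^@\s\/]+@[^@\s\/]+$/.test(value)) {
        const uri = this.resolveAggregator(value);
        if (uri) targets.add(uri);
      }
    }
    for (const uri of targets) this.deliver(uri, entry);
    return targets.size;
  }
}

const sent = [];
const hub = new Hub(
  (name) => (name === 'bob@example.org' ? 'http://agg.example.org/bob' : null),
  (uri, entry) => sent.push(`${uri} <- ${entry.id}`)
);
hub.subscribe('http://droogindustries.com/joe/status/feed', 'http://agg.example.net/alice');
hub.publish({
  id: '4AQlP4lP0xGaDAMF6CwzAQ',
  author: { feed_uri: 'http://droogindustries.com/joe/status/feed' },
  text: 'Hey #{bob}, current status #{vacation}',
  entities: { bob: 'bob@example.org', vacation: '#vacation' },
});
console.log(sent);
```

Collecting targets in a set means a follower who is also mentioned receives the entry once rather than twice.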

Discovery

Discovery in existing status networks generally happens in one of the following ways: reshared entries received from one of the subscribed feeds, browsing the subscriptions of someone you are subscribed to, or search (usually search for hashtags).

The first is implicitly implemented by subscribing and aggregating the subscribed feeds.

The second can be exposed in any timeline UI, given the mechanisms of name resolution and feed location.

The last is the only part that is not covered in the existing infrastructure. It comes down to the collection and indexing of users and feeds, which are already publicly available and can even be pushed directly to the indexer.

Discovery of feeds happens by following mentions and re-posts in existing feeds, but that does mean there is no easy way to advertise one's existence to the ecosystem.

All of these indexing/search use cases are opportunities for realtime search services offered by new or existing players. There is little point in formalizing APIs for these, as they will be organic in emergence and either be human digestible or APIs for application developers to integrate with the specific features of the provider.

To name just a few of the unmet needs:

Deep search of only your followers
Indexing of hashtags or other topic identifiers with or without trending
Indexing of name services and feed meta data for feed and user discovery
Registry of feeds with topic meta data
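To give a feel for how little is needed to get started on one of these, here is a toy hashtag index with a naive trending list. It is purely illustrative: the class and its methods are made up, everything lives in memory, and "trending" is just a count with no time window.

```javascript
// Sketch only: index entries by the hashtags found in their entities.
class HashtagIndex {
  constructor() {
    this.byTag = new Map(); // lower-cased tag -> array of entry hrefs
  }

  add(entry) {
    for (const value of Object.values(entry.entities || {})) {
      if (typeof value === 'string' && value.startsWith('#')) {
        const tag = value.slice(1).toLowerCase();
        if (!this.byTag.has(tag)) this.byTag.set(tag, []);
        this.byTag.get(tag).push(entry.href);
      }
    }
  }

  // Look up entries for a tag, with or without the leading '#'.
  search(tag) {
    return this.byTag.get(tag.replace(/^#/, '').toLowerCase()) || [];
  }

  // Tags ordered by number of indexed entries, most used first.
  trending(limit = 10) {
    return [...this.byTag.entries()]
      .map(([tag, hrefs]) => ({ tag, count: hrefs.length }))
      .sort((a, b) => b.count - a.count)
      .slice(0, limit);
  }
}

const index = new HashtagIndex();
index.add({ href: 'http://example.com/feed/1', entities: { vacation: '#vacation', beach: '#beach' } });
index.add({ href: 'http://example.net/feed/9', entities: { v: '#Vacation' } });
console.log(index.search('#vacation'));
console.log(index.trending(1));
```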

As part of my reference implementation I will set up a registry and world feed (consuming all feeds registered) as a public timeline to give the ecosystem something to get started with.

How it all fits together

As outlined above there are 5 components that in combination create the following workflow:

Users are created by registering a name with a name service and creating a feed
Users create feed entries in their Status Feed
The Status Feed pushes entries to PubSub
PubSub pushes entries to recipients -- subscribers and mentions
Aggregation services collect entries into user timelines
Discovery services also collect entries and expose mechanisms to explore the ecosystem's data

It's possible to implement all of the above as a single platform that interfaces with the ecosystem via the public endpoints of name service, subscription and aggregation, or to just implement one component and offer it as a service to others that also implement a portion. Either way, the ecosystem as a whole can function wholly decentralized, and by adhering to the basic specification put forth herein, implementers can benefit from the rest of the ecosystem.

I believe that sharing the ecosystem will have cumulative benefits for all providers and should discourage walled garden communities. Sure, someone could set up private networks, but I'm sure an RBL-like service for such bad actors would quickly come into existence to prevent them from leeching off the system as a whole.

Who is gonna use it?

The worth of a social network is measured by its size. Clearly there is an adoption barrier to creating a new one. For this reason, I've tried to define a specification as only the simplest thing that could possibly work, given the goals of basic feature parity with existing networks of this type while remaining decentralized, federated and distributed. I hope that by allowing many independent implementations without the burden of a large, complicated stack or any IP restrictions, the specification is easy enough for people to see value in implementing while providing enough opportunities for implementers to find viable niches.

Of course, simplicity of implementation is meaningless to end users. While all this composability provides freedom to grow the network, end users will demand applications (web/mobile/desktop) that bring the pieces together and offer a single, unified experience, hiding the nature of the ecosystem. In the end I see all this as a lot like XMPP or SMTP/POP/IMAP. Most people don't know anything about them, but use them every day when they use gmail/gtalk, etc.

I'm currently working on a reference implementation in node.js, less to provide infrastructure than to show how simple implementation should be. I was hesitant to post this before having the implementation ready to go, since code talks a lot louder than specs, but I also wanted the opportunity to get feedback on the spec before a de facto implementation enshrines it. I also don't want my implementation to be mistaken as "the platform". We'll see how that approach works out.

In the meantime, the repo, where the code will be found and where the spec is being documented in a less prosaic form, lives at https://github.com/sdether/happenstance.