Guest post: Introducing the Open Data Definition

The Open Data Definition is a new format for the import and export of data from within social applications.

This is a guest post by Ben Werdmuller, CTO of Curverider and one of the leads behind the Elgg open source social application engine.

The Open Data Definition is a new format for the import and export of data from within social applications. Rather than an academic exercise, it's driven by necessity; hopefully, software companies and individual developers who feel the same pressures will join the conversation and build it with us. Data portability is an important issue, and it needs to be solved with practical solutions that work in the real world.

Last September, Marc Canter held the first Data Sharing Summit in an office on an industrial estate in Richmond. The two-day event was an effort to cajole social application providers into making their applications talk to each other in a standard way. In fact, the event served mostly to illustrate how difficult that would be to achieve. On the desktop, a file saved by one application can be opened by another application that does a similar thing, but it has taken decades of software development and competition to get there, and by comparison, the Web is an infant.

Users are waking up to the underlying issues. Facebook recently caught some flak from the New York Times for its closed policy on user data. When the videoblogger Robert Scoble publicly had his account deleted for breaching the site's terms of service, he was flooded with requests for instructions from people eager to copy him. Getting yourself banned in this way is the only sure-fire method of wiping your data from the service, and even then, there's still no good way of exporting your data before you kill your profile.

Chris Saad, the Australian entrepreneur behind Particls, has given data portability a focal point (and logo) at DataPortability.org. The site suggests a set of simple formats that application developers should standardise upon in order to make their software work together. Some are now familiar names (RSS, OpenID) while others are newcomers (Saad's own APML), but each covers a particular base: RSS allows for simple syndication, OpenID standardises authentication, and so on.

Although each format is limited in scope, together they serve as a useful benchmark for openness. Certainly, we intend to support all of them with our Elgg social application engine when we relaunch it this summer, so that social applications built on top of its core don't need to worry about building in compatibility. Alas, with the exception of RSS, support for the listed formats is still very rare. You can subscribe to content, but you can't export it.

It's almost impossible to export your data from one application and import it into another, for example to move your profile to another service, or to ensure that your data is preserved if the site you created it on goes out of business. Service closures are commonplace in our era of advertising-based business models, and they will doubtless become more so as the economy takes a turn for the worse. Chances are, when your favourite photo service goes down, so do your photos.

A number of half-solutions are available. For years, many services have had interfaces that provide access to your data through third-party tools. However, these are proprietary – they vary from service to service – and a tool written for one web application will most likely not work with another. The enterprise market has a standard called SOAP, but this is much too heavyweight for most needs and too cumbersome for most web coders to support. There are services that attempt to mediate between proprietary APIs, but this again leaves you reliant on a single point of failure.

The semantic web community has RDF, a format designed for the purpose that is potentially powerful but – as one might expect from the semantic web community – prone to ambiguity and overcomplicated implementation. In small doses, it works (FOAF is based on a subset of RDF), but for more abstract data, it becomes much harder to build for. Adding new data fields requires doing contortions in XML, which makes it harder to generate dynamically. RDF parsers are also not widely available, and it seems unlikely that most web coders would bother to read through the specification, let alone sit down and actually write compliant software.

This winter, we were faced with a dilemma. The markets our products are designed for require import/export functionality (or at least, we believe it should be a feature), but no suitable format existed. With this in mind, and only after exploring the alternatives, we built the Open Data Definition (ODD): an extremely simple format that allows for the import, export, syndication and streaming of just about any kind of data. The specification is a couple of pages long, and implementation takes about forty-five minutes. We built it into Elgg, and although our software depends on plugins that add completely new types of functionality (a blog, a CRM tool), the engine will export to ODD without any further work.
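
To give a feel for how little code a simple, generic import/export format can need, here is a minimal Python sketch. It is not the ODD specification: the element and attribute names (`odd`, `entity`, `metadata`, `relationship`, `uuid1`, `uuid2`), the function names and the example.org identifiers are all assumptions made up for illustration, and the real format should be taken from the specification itself. The idea shown is simply that things, facts about those things, and links between them can be written out and read back without the format knowing what kind of application produced them.

```python
import xml.etree.ElementTree as ET


def export_records(entities, metadata, relationships):
    """Serialise generic records to XML (hypothetical layout, not the ODD spec)."""
    root = ET.Element("odd", {"version": "1.0"})
    for e in entities:
        # An entity is just an identifier plus a free-form type label.
        ET.SubElement(root, "entity", {"uuid": e["uuid"], "class": e["class"]})
    for m in metadata:
        # Metadata attaches a named value to an entity.
        node = ET.SubElement(
            root, "metadata", {"entity_uuid": m["entity_uuid"], "name": m["name"]}
        )
        node.text = m["value"]
    for subject, verb, obj in relationships:
        # A relationship links two entities with a verb.
        ET.SubElement(
            root, "relationship", {"uuid1": subject, "type": verb, "uuid2": obj}
        )
    return ET.tostring(root, encoding="unicode")


def import_records(xml_text):
    """Parse XML produced by export_records back into plain Python values."""
    root = ET.fromstring(xml_text)
    entities = [
        {"uuid": n.get("uuid"), "class": n.get("class")}
        for n in root.findall("entity")
    ]
    metadata = [
        {
            "entity_uuid": n.get("entity_uuid"),
            "name": n.get("name"),
            "value": n.text or "",
        }
        for n in root.findall("metadata")
    ]
    relationships = [
        (n.get("uuid1"), n.get("type"), n.get("uuid2"))
        for n in root.findall("relationship")
    ]
    return entities, metadata, relationships


if __name__ == "__main__":
    # Hypothetical data: a user, a blog post, and the link between them.
    entities = [
        {"uuid": "http://example.org/user/1", "class": "user"},
        {"uuid": "http://example.org/post/7", "class": "blog"},
    ]
    metadata = [
        {"entity_uuid": "http://example.org/post/7", "name": "title", "value": "Hello"},
    ]
    relationships = [
        ("http://example.org/user/1", "owns", "http://example.org/post/7"),
    ]
    xml_text = export_records(entities, metadata, relationships)
    print(xml_text)
    # Importing the exported document should give back the same records.
    assert import_records(xml_text) == (entities, metadata, relationships)
```

Because nothing in the sketch is specific to blogs or users, a plugin that adds a new kind of data only has to describe it as entities, metadata and relationships for it to be exported alongside everything else.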

You can find details of the specification over at OpenDD.net, as well as a mailing list. It takes wide support before a format can become a standard, and in order to gain that, it needs to meet as many people's needs as possible. We'd like to invite you to join in and make sure it meets yours.