A Semantic view of the Wikipedia for Data idea

Last week CNet's Dan Farber picked up on a post by ex-Googler Bret Taylor, entitled 'We need a Wikipedia for data.' Sarah Perez followed up on ReadWriteWeb with a useful roundup in 'Where to find Open Data on the Web,' and the usual flurry of interested individuals commented on each.

The notion of a 'Wikipedia for data' is nothing new, and commenters were quick to point to existing examples such as Metaweb's Freebase.

However, as we see ever more mainstream implementation of Semantic Web ideas, the need for usable, addressable, linkable, persistent data becomes ever more pressing. Semantic Web technologies can be deployed inside the firewall with useful results, but the network effects should really begin to resonate at network scale, out on the open Web.

Bret's post does a pretty good job, up front, of summarising the problem:

"I have come to realize how hard it is for an everyday programmer to get access to even the most basic factual data. If you want to experiment with a new driving directions algorithm, it is infinitely more difficult than coming up with an algorithm; you have to hire a lawyer and sign a contract with a company that collects that data in the country you are developing for. If you want to write an open source TiVo competitor, you need television listings data for every cable provider in the country, but your options are tenuous at best."

Data can be hard to obtain. It's a legal minefield. Comprehensiveness is necessary... but virtually impossible. He goes on to highlight the dubious tactics of current data owners, many of whom make it prohibitively expensive to access commercial data or almost (and it's an important almost) criminally difficult to access public domain data in useful form.

Bret continues, tellingly:

"I think all of these barriers to data are holding back innovation at a scale that few people realize. The most important part of an environment that encourages innovation is low barriers to entry. The moment a contract and lawyers are involved, you inherently restrict the set of people who can work on a problem to well-funded companies with a profitable product. Likewise, companies that sell data have to protect their investments, so permitted uses for the data are almost always explicitly enumerated in contracts. The entire system is designed to restrict the data to be used in product categories that already exist."

Give that man a standing ovation. Exactly.

In a 2006 report on the Commercial Use of Public Information, the UK Government's Office of Fair Trading suggested that:

"more competition in public sector information could benefit the UK economy by around £1 billion [almost $2bn] a year.

"The study found that raw information is not as easily available as it should be, licensing arrangements are restrictive, prices are not always linked to costs and PSIHs [public sector information holders] may be charging higher prices to competing businesses and giving them less attractive terms than their own value-added operations."

Semantic Web applications thrive on data, and assertions about those data in the form of provenanced links from one resource to another. By locking data away, or by exposing crippled subsets of the whole via web interfaces that only a human might traverse, we miss these opportunities.

Yes, (some) businesses would suffer irreparable harm if they opened access to their money tree without also rethinking their Victorian business model. But the UK Government figures (and others) clearly suggest that business (and society) benefits from increased access to this contextual data, even if individual businesses might not.

Wealthy players such as Microsoft and the incumbent search engines might do much here (as they have begun to do with map data) to force a widespread shift in business model, away from enforced scarcity of supply toward plentiful supply and more innovative monetisation of value-added services atop the basic and increasingly commoditised data.

I can - and do - see value in the sort of approach taken by Freebase, in which they set out to become the canonical source of knowledge within a wide range of subjects, and their recent release of data dumps strengthens their case in my eyes.

Personally I am rather more persuaded by the aspirations of the Linked Data projects, which freely expose data on the Web, and actively encourage third parties to use and reuse their data, and to link to it, through it, and from it in an ever-richer web of relationships. As I argued in SemanticReport last year, ready access to data permits the Internet to move inside our next generation of applications in compelling and transformative ways.

Although Powerset is not itself a Linked Data project, the relationships that it is finding and manipulating in data sets such as those from Wikipedia, Freebase and WordNet are closer to this ideal... and more on Powerset soon.

I am persuaded that a single canonical space cannot succeed, except for a very short time or in a very narrow niche. Instead, we need resilient and distributed mechanisms that enable data to be made available, to be found, and to be enmeshed with other resources to create new and unanticipated applications beyond the ken of the data's original curators.
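To make the idea of enmeshing distributed data a little more concrete, here is a minimal Python sketch. It assumes two independent publishers, each exposing simple subject/predicate/object statements, with one publisher linking to a URI minted by the other. Every URI, vocabulary term, and function name below is hypothetical and chosen purely for illustration; real Linked Data would be fetched over HTTP and expressed in RDF rather than held in Python lists.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate URI, object URI or literal)

# Hypothetical publisher 1: a conference listing that links OUT to a gazetteer URI.
CONFERENCES: List[Triple] = [
    ("http://conf.example/event/www2008", "http://vocab.example/label", "WWW2008"),
    ("http://conf.example/event/www2008", "http://vocab.example/heldIn",
     "http://geo.example/place/beijing"),
]

# Hypothetical publisher 2: a gazetteer that knows nothing about conferences.
GAZETTEER: List[Triple] = [
    ("http://geo.example/place/beijing", "http://vocab.example/label", "Beijing"),
    ("http://geo.example/place/beijing", "http://vocab.example/country", "China"),
]


def merge(*sources: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index statements from any number of sources by their subject URI."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for source in sources:
        for subject, predicate, obj in source:
            index[subject].append((predicate, obj))
    return index


def follow(index: Dict[str, List[Tuple[str, str]]], start: str,
           max_hops: int = 1) -> List[Triple]:
    """Collect statements about `start`, then about anything it links to,
    following links for at most `max_hops` steps."""
    seen = set()
    frontier = [start]
    facts: List[Triple] = []
    for _ in range(max_hops + 1):
        next_frontier = []
        for uri in frontier:
            if uri in seen:
                continue
            seen.add(uri)
            for predicate, obj in index.get(uri, []):
                facts.append((uri, predicate, obj))
                if obj in index:  # the object is itself a described resource
                    next_frontier.append(obj)
        frontier = next_frontier
    return facts


if __name__ == "__main__":
    web = merge(CONFERENCES, GAZETTEER)
    for fact in follow(web, "http://conf.example/event/www2008"):
        print(fact)
```

Neither publisher had to move its data anywhere or coordinate with the other. The shared URI alone lets a third party start from the conference and arrive at facts that only the gazetteer holds.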

We do, of course, need appropriate protections to ensure that any explosion of usable data does not see those data abused. For this, we turn to efforts such as the Open Data Commons, whose Open Data license my employer was involved in developing and financing.

Many of these topics are exactly the sort of thing with which the Linked Data projects have been grappling, and I shall be reporting (and speaking) from next week's Linked Data on the Web workshop in Beijing, ahead of the main WWW2008 conference.

It's a pity that Bret will not be with us, as I expect there to be a room full of people who would applaud his desire to see the data, whilst questioning the utility of the (one) DataWiki. The Web is, fundamentally, a distributed creature. It is predicated upon the link. So whilst there is utility in hosting data for those unwilling or unable to do so themselves, why require data to go anywhere before it can be used?

Link to data where it sits, link to it again, and put it to work. The result will be amazing.