Global information behemoth Thomson Reuters today announces the latest version of its Calais web service, delivering on earlier promises with respect to 'Linked Data' and firmly staking out the company's intention to be a significant player in the shifting market for timely and authoritative information.
I'll take a more in-depth look at the importance of authoritative sources in the emerging Linked Data ecosystem in this related post, and concentrate on the specifics of the Calais 4.0 release here.
Thomson Reuters' Tom Tague describes version 4.0 as
"a fundamental change to the underlying service; it's basically a new service."
This re-engineering of Calais will deliver the functionality that users have come to rely upon, whilst ensuring Thomson Reuters' ability to continue to scale in a timely and cost-effective manner on the back of Amazon's Web Services offering.
Tague describes the service released today as a technology preview to run alongside the existing Calais service for a period, but he is confident that it is at production strength from Day 1. Developers, Tague suggested, would
"try it and stay."
In addition to this strengthening of the core offering, Calais 4.0 includes five substantive developments.
First, the company has followed through on earlier talk about 'Linked Data,' ensuring that any of around 25 entity types (company names, geographic areas, album titles, etc.) discovered in content submitted to Calais will now be returned to the submitter with a 'dereferenceable URI' that may be followed by either people or software in order to discover further information. The URI resolves to a Calais-hosted page of RDF with pointers to the Linked Data community's usual suspects: DBpedia, MusicBrainz, GeoNames, the CIA Factbook, etc.
More unusually, and importantly, the second development sees that same RDF document include pointers to Thomson Reuters' own content, such as the (current) stock ticker, Board membership data, etc.
As the Press Release notes,
"In keeping with its commitment to the Linked Data standard, Thomson Reuters has also made a subset of its core data assets available for public use on the Web. The collection of business information represents the first contribution to the 'Linked Data cloud' made by a major publisher. It enables developers to programmatically query and use fundamental facts on hundreds of thousands of publically-traded companies, including company descriptions, stock tickers, management teams, locations, boards of directors and more."
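For developers wondering what "following" a dereferenceable URI might look like in practice, here is a minimal sketch in Python. It is illustrative only: the entity URI, the sample RDF document and the use of `owl:sameAs` as the linking property are all assumptions made for the example, not details confirmed by Thomson Reuters, and the function names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL_NS = "http://www.w3.org/2002/07/owl#"


def build_rdf_request(entity_uri):
    """Prepare (but do not send) a request asking for RDF/XML rather than HTML."""
    return urllib.request.Request(
        entity_uri, headers={"Accept": "application/rdf+xml"}
    )


def same_as_links(rdf_xml):
    """Return the targets of every owl:sameAs statement in an RDF/XML document.

    Assumption: outbound pointers are expressed as owl:sameAs with an
    rdf:resource attribute; a real service may use other properties.
    """
    root = ET.fromstring(rdf_xml)
    links = []
    for element in root.iter(f"{{{OWL_NS}}}sameAs"):
        target = element.get(f"{{{RDF_NS}}}resource")
        if target:
            links.append(target)
    return links


# Hypothetical document standing in for what an entity URI might resolve to.
SAMPLE_RDF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <rdf:Description rdf:about="http://example.org/entity/hypothetical-company">
    <owl:sameAs rdf:resource="http://example.org/encyclopedia/Hypothetical_Company"/>
    <owl:sameAs rdf:resource="http://example.org/gazetteer/12345"/>
  </rdf:Description>
</rdf:RDF>"""

if __name__ == "__main__":
    request = build_rdf_request("http://example.org/entity/hypothetical-company")
    print(request.get_header("Accept"))
    for link in same_as_links(SAMPLE_RDF):
        print(link)
```

In a real application the prepared request would be passed to `urllib.request.urlopen` and the response body handed to `same_as_links`; a dedicated RDF library would be a more robust choice than hand-parsing XML for anything beyond a quick experiment.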
Third, Calais 4.0 includes a 'metadata transport layer' to simplify the process of exposing and sharing large bodies of semantically rich data. Tague suggested that 200 to 300 million persistent and dereferenceable URIs are available today (and capable of servicing tens or hundreds of millions of hits per day) for content previously submitted to Calais, with many more to come as the service scales.
Fourth, Calais is making its first move beyond English language content, and version 4.0 now supports entity extraction in French. French-language relationship and event extraction will follow shortly, as will other languages. Tague suggested that Hebrew, Arabic and Chinese will be amongst those rolled out during 2009. Behind the scenes, the team are also experimenting with automated translation services, which Tague reports to be 'working very well' in the lab.
Fifth, and finally, the Calais team is publishing an RDFS version of their schema, giving developers far more flexibility as to the ways in which they integrate the Calais web service into their own applications.
All in all, a welcome set of incremental improvements to Calais that also serves to raise an interesting set of questions about the role of 'professional' data in the Linked Data ecosystem.
Thomson Reuters' Tom Tague is a regular member of the Semantic Web Gang, and should be discussing the release of Calais 4.0 in more depth on this month's show, due to be recorded on 15 January.