Calais 2.0 unveiled by Thomson Reuters

In a press release to coincide with this week's Semantic Technology Conference in San Jose, Thomson Reuters subsidiary ClearForest has announced a major upgrade to its OpenCalais web service: Calais 2.0.

Calais was originally launched in January of 2008, and there have been two interim releases in recent months. Calais 2.0, launched today, includes what Calais' Tom Tague describes as

"significantly more new functionality."

As well as an upgrade to the API and a complete overhaul of the OpenCalais website, the announcement includes news of three new user-facing capabilities in the form of integration with the WordPress blogging platform, the Drupal content management system, and the SearchMonkey tools that Yahoo! released last week.

Tague described a 'problem' with earlier releases of Calais, pointing to the fact that,

"it's a web service."

End users, Tague argues, ultimately want applications that solve their problems. Web services such as those from Calais are a step toward those applications, but if real value is to be demonstrated then there is certainly a need for visible end-user benefits, especially in the early days of a service.

The WordPress plugin, Tagaroo,

"is designed to make your WordPress blog better for you, better for your readers and more accessible to search engines. As you’re writing, Tagaroo analyzes the text in your post and suggests intelligent tags for the things and events you’re writing about."

The plugin passes the text of your draft post to the Calais web service, and uses the results to suggest tags that are likely to be relevant. The final decision remains in the hands of the blog author at this stage, and no attempt is made to retrospectively tag historical posts.
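As a rough illustration of that suggest-then-approve flow, the Python sketch below keeps the author in control of the final tag list. It is not Tagaroo's code and makes no call to Calais; the function names are invented for this example, and fake_extract is a hypothetical stand-in for whatever the remote service would return.

```python
from __future__ import annotations

from typing import Callable, Iterable


def suggest_tags(draft: str, extract: Callable[[str], Iterable[str]]) -> list[str]:
    """Return de-duplicated tag suggestions for a draft, in first-seen order."""
    seen: set[str] = set()
    suggestions: list[str] = []
    for tag in extract(draft):
        cleaned = tag.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            suggestions.append(cleaned)
    return suggestions


def apply_author_choice(suggestions: list[str], accepted: set[str]) -> list[str]:
    """Keep only the suggestions that the author explicitly accepted."""
    return [tag for tag in suggestions if tag in accepted]


def fake_extract(text: str) -> list[str]:
    # Hypothetical stand-in for the remote tagging call:
    # it simply treats capitalised words as candidate tags.
    return [word.strip(".,;") for word in text.split() if word[:1].isupper()]


draft = "Thomson Reuters announced Calais in San Jose. Calais is a web service."
suggestions = suggest_tags(draft, fake_extract)
tags = apply_author_choice(suggestions, accepted={"Calais", "Reuters"})
print(suggestions)
print(tags)
```

Passing the extractor in as a parameter keeps the approval logic independent of any particular service, which is an assumption of this sketch rather than a description of how the plugin is built.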

In addition to generating tags on the basis of the text, Tagaroo is also able to suggest appropriate (and appropriately licensed) images from Flickr that might most effectively illustrate a post.

The open source Drupal content management system has been gaining in popularity for some time, and there were recent indications that the development team behind the next version of this flexible product are working actively to incorporate Semantic Web specifications such as RDF into the core of the upcoming release. In advance of that, though, Phase2 Technology has produced a complete production-strength Drupal module to integrate Calais web services into the current Drupal release.

The module,

"makes it easy for Drupal users to automatically tag their content [whatever form it may take within Drupal], generating rich Semantic metadata that can be shared via a simple key for integration into the larger content universe."

Tom Tague mentioned that a growing number of smaller media sites, such as local and regional newspapers, make extensive use of Drupal already, making this an obvious and potentially powerful way to allow these organisations to leverage the value of Calais in their existing operations.

Tague also confirmed that the Calais team are in active discussion with the team involved in developing the next version of Drupal, creating opportunities for further integration in future.

The third example of visible inclusion of Calais capabilities is Calais Marmoset:

"a simple yet powerful plugin that makes it easy for publishers to generate and embed metadata in their content in preparation for Yahoo!’s new SearchMonkey service as well as other metacrawlers and semantic applications."

Tague noted that Yahoo! SearchMonkey, although powerful, is today focussed upon manipulation of data that is already structured. Marmoset operates across unstructured text, submitting it to Calais and returning structured microformats that encapsulate recognisable tags from the page. That microformatted information may then be passed automatically to Yahoo! for handling inside SearchMonkey.
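To give a flavour of what embedding tags as microformats can look like, here is a minimal Python sketch that renders a list of tags as rel-tag links, a widely used microformat in which the last path segment of the link names the tag. Which microformats Marmoset actually emits is not stated in the announcement, so the choice of rel-tag, the function name and the example.com tag space are all assumptions made for illustration.

```python
from html import escape
from urllib.parse import quote


def to_rel_tag_links(tags, tag_base_url):
    """Render each tag as an HTML link carrying rel="tag"."""
    base = tag_base_url.rstrip("/")
    links = []
    for tag in tags:
        # The tag becomes the final path segment, percent-encoded for the URL.
        href = base + "/" + quote(tag, safe="")
        links.append(
            '<a href="{}" rel="tag">{}</a>'.format(escape(href, quote=True), escape(tag))
        )
    return "\n".join(links)


# Hypothetical tag space on example.com, used purely for illustration.
markup = to_rel_tag_links(["Calais", "San Jose"], "http://example.com/tags/")
print(markup)
```

A crawler that understands the format can then read the tags straight from the page markup, without having to analyse the surrounding prose.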

Behind the scenes, the core API has also been enhanced:

"Calais version 2.0 features a dozen new semantic entity types, improving its utility for pop-culture publishers and bloggers covering media, music, entertainment and sports, as well as those covering pharmaceuticals, medicine and healthcare. It also adds code samples and libraries for accessing Calais, and offers two new output types – the Simple Tags format and Microformats – alongside the standard Resource Description Framework (RDF) option."

Although Tague stressed that RDF remained core to the future of Calais, he recognised that its richness and flexibility could present a barrier to some classes of developer. Inclusion of microformats and Simple Tags was therefore seen as a pragmatic step to bootstrap wider adoption.

The Natural Language Processing (NLP) engine behind early versions of Calais did little with external data that might enhance knowledge of a particular domain or area of interest. With Calais 2.0, the team has started to include information from external 'lexicons' that can then be integrated with the underlying text and passed together through Calais' NLP system. Tague reckons that this process maintains their intended minimum accuracy level of 90%. Initial offerings include information from 12 areas, including recording artists and pharmaceuticals. With the initial steps already taken, Tague estimates that the number of 'open data assets' utilised in this way will grow rapidly, with Linked Data Project participants, Freebase and similar services offering obvious early inclusions.

Tom tells me that there is now a whole team within the Calais project devoted to acquiring these open data assets, in order to embed them within Calais and enhance its knowledge of specific domains. He was also pleased to report that the newly amalgamated Thomson Reuters management has validated and approved the direction being taken by Calais, a journey begun in the pre-merger days of Reuters.

Commenting ahead of the announcement, IDC's Susan Feldman said:

"Calais plugs a hole in the Semantic Web. The problem with the concept of the Semantic Web is that language, in all its variety, makes it difficult to pin down meaning. Established media companies like Thomson Reuters have a great deal of experience in establishing standard vocabularies to help people find information. By uniting the many ways of stating an idea in a single term or group of terms, Calais can begin to spread standard approaches to how we express ideas. Today, if you ask for documents on 'high blood pressure' in a standard keyword matching search engine, you won't get documents about 'hypertension'. If we start to use standard vocabularies, even behind the scenes, we can get users to all the information that is pertinent to their queries."

The notion of 'behind the scenes' to which Susan refers is important here. It is perhaps infeasible to expect everyone to conform to consistent linguistic usage across the Web. It becomes achievable - and valuable - when tools such as Calais are able to perform that standardising activity on our behalf, without modifying the text that we actually create in the first place.
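The idea can be made concrete with a small Python sketch: documents are indexed under a standard term while their text is left exactly as written, so a query for 'high blood pressure' also finds a document that only mentions 'hypertension'. The vocabulary, function names and sample documents are invented for this illustration, and the naive substring matching stands in for the far richer analysis a service such as Calais would perform.

```python
from collections import defaultdict

# Hypothetical vocabulary mapping surface phrases to one standard term.
VOCABULARY = {
    "high blood pressure": "hypertension",
    "hypertension": "hypertension",
}


def standard_terms(text, vocabulary):
    """Return the standard terms whose phrases occur in the text."""
    lowered = text.lower()
    return {term for phrase, term in vocabulary.items() if phrase in lowered}


def build_index(documents, vocabulary):
    """Map each standard term to the ids of documents that mention it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in standard_terms(text, vocabulary):
            index[term].add(doc_id)
    return index


def search(query, index, vocabulary):
    """Normalise the query to standard terms, then look those up."""
    results = set()
    for term in standard_terms(query, vocabulary):
        results |= index.get(term, set())
    return sorted(results)


documents = {
    "doc1": "New guidance on treating high blood pressure in adults.",
    "doc2": "Hypertension drugs compared in a recent trial.",
    "doc3": "Local football results.",
}
index = build_index(documents, VOCABULARY)
hits = search("high blood pressure", index, VOCABULARY)
print(hits)
```

The standardisation lives entirely in the index and the query handling; the documents themselves are never rewritten.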

Susan continued,

"Calais, and other text analytics tools, can sharpen search by identifying the information elements that people might be looking for, and tagging documents with them. These tools extract names of people, places, things, and types of events. They tag a document so that a search engine will incorporate these standard tags in their indices. When someone searches for IBM, they will get 'Big Blue' or 'International Business Machines', no matter how that name is expressed. Documents that discuss mergers and acquisitions will be found and grouped together for users to explore in an organized fashion. Search engines like Yahoo!, with their new Search Monkey platform are prepared to take advantage of these extractions. The upshot is that search results will be more precise. That's good news for searchers. Announcements like Search Monkey and Calais should kick off a new wave of semantic innovation and improved interaction."

Calais is an important example of web-scale infrastructure upon which a growing number of third parties could and should rely. With this release, the back-end web service becomes even more capable. More importantly (for now), the three externally-facing announcements will go a long way toward showing more people how semantic technologies can add value: easily, flexibly, and inside applications and user experiences with which they are already comfortable.