Back at Sun Microsystems' JavaOne 2009 conference last year, I first heard about the archive.org Internet Archive idea. Fellow journos and I nestled into a back room deep in the Moscone Center's underbelly to hear of plans to capture every page on the web and store it in an ISO container filled with 60 of the company's Sun Fire X4500 Open Storage Systems.
The concept here is as follows: while we might have the Book of Kells, the Domesday Book and (ahem) the Encyclopaedia Britannica to refer back to for some idea of what the last 2,000 years have entailed, modern history runs the risk of being lost as it is increasingly chronicled on ephemeral websites that by their very nature change constantly.
Out of interest, I asked the project's manager whether they keep ALL the content that's out there, including the 'fruity' pages featuring 'artistic pictures' and (god forbid) the bad stuff – and yes, they do.
In a similar vein, today brings news of IBM embarking on a comparable effort.
Big Blue is working with the British Library on a project that will preserve and analyse terabytes of information on the web before it is lost forever. A new analytics software project, called IBM BigSheets, will help extract, annotate and visually analyse meaty chunks of web data via a browser.
IBM cites recent research estimating that the average life expectancy of a website is just 44 to 75 days, and that every six months ten percent of web pages on the UK domain are lost.
Essentially this story is about a) retaining web data and b) unlocking the information embedded in it. The latter seems like a good idea, as the web doesn't really have an "index" so to speak. Search engines do a damn fine job, but they can be influenced.
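To make the "index" point concrete, here is a minimal sketch of the inverted index that underpins any web search or archive lookup: each term maps to the set of pages containing it, and a query is just a set intersection. The URLs and page text are made-up placeholders, not anything from the British Library's archive.

```python
from collections import defaultdict

# Hypothetical mini-corpus standing in for archived pages (illustrative only).
pages = {
    "bl.uk/history": "the library archives maps and manuscripts",
    "example.org/web": "the web changes constantly and pages are lost",
    "example.org/data": "archives preserve web pages for the future",
}

# Build an inverted index: each term maps to the pages that contain it.
index = defaultdict(set)
for url, text in pages.items():
    for term in text.split():
        index[term].add(url)

# A multi-term query is then an intersection over the posting sets.
def search(*terms):
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(search("web", "pages")))
```

The point of the sketch is that an index is a deliberate artefact built over a corpus; the live web never had one, which is exactly the gap both the Internet Archive and the British Library projects are filling.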
Along with the 150 million maps, manuscripts, musical scores, newspapers and magazines that it must archive every year, the British Library has already been archiving selected web pages from the UK domain since 2004. It is hoped that BigSheets will allow library goers to access decades of archived web pages in the future.
IBM says that this year the amount of digital information is expected to reach 988 exabytes, the equivalent of a stack of books reaching from the Sun to Pluto and back. Within this volume of data, both structured and unstructured, there is a constant need to search, and that is what the company is aiming to address.
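Out of curiosity, the book-stack comparison roughly checks out on the back of an envelope. The figures below are my own assumptions (a ~2 cm book spine, a mean Sun-Pluto distance of about 5.9 billion km), not numbers from IBM:

```python
# Back-of-envelope check of the 988-exabyte book-stack comparison.
# All physical figures here are assumptions for illustration.
EXABYTE = 10**18
total_bytes = 988 * EXABYTE

SUN_TO_PLUTO_M = 5.9e12      # assumed mean Sun-Pluto distance, ~5.9 billion km
round_trip_m = 2 * SUN_TO_PLUTO_M

BOOK_THICKNESS_M = 0.02      # assume a ~2 cm spine per book
books_in_stack = round_trip_m / BOOK_THICKNESS_M

bytes_per_book = total_bytes / books_in_stack
print(f"{books_in_stack:.2e} books, ~{bytes_per_book / 1e6:.1f} MB each")
```

Under those assumptions the stack holds roughly 6 × 10^14 books at a little under 2 MB apiece, which is a plausible size for a digitised book, so the comparison is in the right ballpark.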
The company says: "By building on top of the Apache Hadoop framework, IBM BigSheets is able to process large amounts of data quickly and efficiently. IBM BigSheets is a new technology prototype. BigSheets is an extension of the mashup paradigm that integrates gigabytes, terabytes, or petabytes of unstructured data from web-based repositories; collects a wide range of unstructured web data stemming from user-defined seed URLs; extracts and enriches that data using an unstructured information management architecture; and lets the user explore and visualize this data in specific, user-defined contexts."
So whether it's Sun or IBM doing this kind of thing, it is pretty interesting stuff. I suppose it's worth trying to remain technologically agnostic if you are engaging in this kind of project, so that you ensure open standards are adhered to. Microsoft has also done some work with the British Library, but apparently took a rather more proprietary approach. What could be worse than preserving cherished information forever, only for future generations to come to it and find they don't have the key to open the box?