Regulatory compliance and business intelligence systems have opened the eyes of many companies to a new way of thinking about managing data. But keeping data organised and accessible for a few quarters is one thing — what will happen to it 20 years from now?
As the digital world continues to mature, with digital information reaching a critical mass in all areas of life, that's a question organisations are starting to ask. Although digital information has been around for decades, there is still no tried-and-tested way of keeping data intact beyond the next time a medium or file format becomes obsolete — much less of dealing with the surprisingly short physical lifespan of the media.
This year marks a turning point in the digital world, IDC argued in a recent white paper: for the first time, the amount of information created — around 260 exabytes — will surpass the storage capacity available. The figure is symbolic, since much of the information generated doesn't need to be stored, but it underscores that the digital world has matured, something that has far-reaching implications for how companies manage and store their data.
On the management side, the past few years have seen a carrot-and-stick approach to change. Regulatory compliance has forced companies to come up with strategies for dealing with particular types of data — overall, 20 percent of the digital universe is subject to compliance rules and standards, according to IDC's estimates. And, meanwhile, business intelligence (BI) systems have shown companies that, if they are organised enough with their data, it can pay off.
"Companies are perceiving a higher value in their information," says IDC analyst Marcel Warmerdam. "The idea is you can capture everything, and then, within the numbers, could be found the solution to profitability, if you can just grab it. BI systems can do that."
The longer-term issues, however, remain more of a mystery. A paradox of the digital world is that, as the ability to store bits increases, the ability to store them over time decreases, something that can be seen in the worryingly short expected lifespans of digital media.
The design life of a low-cost hard drive is five years, while the usable lifespan for magnetic tape could be as short as 10 years, and optical media such as CDs and DVDs may become unusable in just 20 years.
Where digital information is concerned, physical degradation is the least of the conservation problems. The more pressing issues are to do with obsolescence at all levels, including the media, the file formats and the software used to read the files.
All this has been talked about for years, but it's only now that serious efforts are finally getting underway to come up with large-scale, practical answers. Some initiatives are focusing on standards and best practices that can simplify long-term storage, while others, notably those of large university or national libraries, are putting trial systems in place.
"This is happening now partly because of the realisation that the digital world is really upon us now, in a big way," says Richard Masters, programme manager of the British Library's Digital Object Management scheme. "Until 2002 or 2003, a lot of our digital material was digitised — you could always go back to the original. Now we've reached a critical mass of material that exists only in digital form."
The outcome of all this experimentation should be that, somewhere down the line, there will be tools and a body of knowledge for companies to draw on in dealing with their own stacks of mouldering disk drives and dusty reels of magnetic tape.
In a famous January 1995 Scientific American article, RAND Corporation computer scientist Jeff Rothenberg noted a disheartening fact about digital objects: the things that make them difficult to preserve are precisely those aspects that make them interesting and attractive in the first place.
In the article, Ensuring the Longevity of Digital Information, an expanded version of which can be found online, Rothenberg argues against the notion that standardised formats can be a solution to preservation problems — a concept underlying, for instance, the current debate over the standardisation of Microsoft's XML-based document formats, and that of the OpenDocument Format (ODF).
Rothenberg says it's an illusion to think that even something as simple as a word-processing document format can be encapsulated in a long-term standard. "The incompatibility of word-processing file formats is a notorious example — nor is this simply an artefact of market differentiation or competition among proprietary products," he wrote. "Rather it is a direct outgrowth of the natural evolution of information technology as it adapts itself to the emerging needs of users."
The same goes for every other type of file format, Rothenberg argues. From the point of view of preservation, this means standards can't save the day — formats will continue to evolve. Moreover, they'll undergo "paradigm shifts" in which the old ways of thinking are, as often as not, swept away.
One answer to this continual change is to continually migrate, or translate, documents into current formats — an approach adopted by the British DOM programme, for one. But paradigm shifts mean that translation can be difficult and expensive, and may deliver something substantially different from the original — as with old static databases that had to be redesigned to fit the relational database model.
In effect, preserving digital objects by migrating them to current formats isn't really preserving them at all, Rothenberg argues — it could be compared to translating a poem into a different language, and then destroying the original.
"Translation is attractive because it avoids the need to retain knowledge of the text's original language, yet few scholars would praise their ancestors for taking this approach," Rothenberg wrote. "Not only does each translation lose information, but translation makes it impossible to determine whether information has been lost, because the original is discarded."
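Rothenberg's objection to migration loses much of its force if the original bitstream is never discarded. A minimal sketch of that idea, in Python, keeps the untranslated original alongside every migrated copy and records a checksum so later loss or corruption stays detectable. The `convert` argument here is a placeholder for a real format converter, not an actual library call.

```python
# Sketch: migrate a document to a current format without discarding the
# original, so the "translation" can always be redone or checked.
import hashlib

def migrate(original: bytes, convert) -> dict:
    """Translate to a current format while keeping the original intact."""
    return {
        "original": original,                                   # never discarded
        "original_sha256": hashlib.sha256(original).hexdigest(),  # fixity record
        "migrated": convert(original),                          # current-format copy
    }

# Illustrative converter: a trivial stand-in for a real format translator
record = migrate(
    b"WordPerfect 5.1 bytes",
    convert=lambda b: b.decode("latin-1").upper().encode(),
)
assert record["original"] == b"WordPerfect 5.1 bytes"  # nothing lost
```

The point of the sketch is only that migration and preservation of the original are not mutually exclusive; the real cost, as Rothenberg notes, lies in repeating the exercise every time formats shift.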
An alternative is emulation, in which the hardware, operating system and applications needed to view an original document are all simulated using current technology, an approach Rothenberg favours, and one pioneered in practice by fans of obsolete video games in the 1990s. This has its own complications, but at least it is a way of keeping documents accessible in their original state.
Besides the issues around obsolescence of file formats, applications, operating systems and hardware, there is the more basic question of how to deal with the fact that media physically degrade or become obsolete.
How long will various media types last? There's considerable controversy around the issue, with Kodak claiming in one report that its writeable CDs would last 217 years under certain conditions, while others observe that such media start to degrade after only a couple of years. Rothenberg estimates that optical media have a practical physical lifetime of five to 59 years, digital tape two to 30 years and magnetic disk five to 10 years.
There's just one problem with such estimates, though — they're all academic, because, with the fast pace of change in the IT industry, any given medium will be obsolete in about five years. Even if it continues to function, modern hardware may not be able to read its contents or even connect to it.
"Digital information lasts forever — or five years, whichever comes first," Rothenberg quipped.
That means any organisation that wants to keep its data accessible will have to look forward to an unbroken chain of migrations within a time cycle short enough to prevent the media from becoming physically unreadable or obsolete before they are copied. "A single break in this chain can render digital information inaccessible — short of heroic effort," Rothenberg wrote.
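The arithmetic behind that unbroken chain is simple enough to sketch: the interval between copies must be shorter than both the medium's physical lifetime and the roughly five-year obsolescence horizon. The figures below are the lower-bound estimates quoted above, and the safety margin is an illustrative assumption.

```python
# Sketch of Rothenberg's "unbroken chain": data must be recopied before
# its medium either physically fails or becomes obsolete.
OBSOLESCENCE_YEARS = 5  # any given medium is obsolete in about five years

def next_migration_due(medium_life_years: int, safety_margin_years: int = 1) -> int:
    """Years until data on this medium must be copied onward."""
    horizon = min(medium_life_years, OBSOLESCENCE_YEARS)
    return max(horizon - safety_margin_years, 1)

# Lower-bound physical lifetimes from Rothenberg's estimates above
media = {"optical": 5, "digital tape": 2, "magnetic disk": 5}
schedule = {name: next_migration_due(life) for name, life in media.items()}
# Digital tape, at the pessimistic end, would need recopying within a year
```

Whatever the exact numbers, the conclusion is the same: the schedule is driven by obsolescence at least as much as by decay.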
Things look quite different from the point of view of the archivists who deal with questions of preservation on a practical level. The daunting prospect of future paradigm shifts, for instance, is nothing new — archivists and records managers are trained with the understanding that future generations may well disagree with their choices about what to keep and what not to keep, and how objects are preserved. "As a records manager, you have to accept that whatever you do will be wrong," says Anna Riggs, an archivist with Birmingham City Council.
A number of institutions are now putting long-term digital preservation programmes into place, including the British Library, the Library of Congress, the National Library of the Netherlands (the KB) and the California Digital Library, among others.
Other organisations are working on infrastructure and standards designed to back up such programmes. The EU-funded Planets (Preservation and Long-term Access through Networked Services) project, for instance, is co-ordinating European national libraries and archives, research institutions and IT companies to address digital preservation issues. The Digital Preservation Coalition is doing similar work at a UK level. Meanwhile the Storage Networking Industry Association (SNIA) has established the 100 Year Archive Task Force, which is aiming to come up with best practices for long-term data retention.
The SNIA is also working with the storage industry on the eXtensible Access Method (XAM), which is expected to define interfaces between applications and storage systems that co-ordinate metadata, supporting interoperability, storage transparency and automation for what's known as information lifecycle management (ILM), sometimes called data lifecycle management.
This all sounds very organised, but it masks the absolute uncertainty that underlies any long-term storage project. The British Library, for instance, found that storage-industry concepts such as ILM were quite unsuited to the type of archive it is establishing with DOM. ILM establishes practices for migrating data from fast, high-performance storage to lower-performance media as the value and use of the information decreases, but "this view of storage is at odds with our own view", wrote the British Library's Richard Masters in a white paper. That's because the Library doesn't judge the value of its objects, and doesn't intend ever to delete them.
The Library has gone through several attempts at building a long-term digital archive since the late 1990s, including calling in IBM to build a complete system from its specifications — the approach used by the KB, although on a smaller scale. None of the projects came to fruition.
"Then we realised as an organisation that the big-bang approach was never going to work. Nobody knows what the requirements are," says Masters. "That's why we are building DOM in a component fashion and learning as we go. That way we don't have a huge risk — we aren't building an expensive application that doesn't meet our needs."
The library's key requirement is for its digital objects to be available forever, but at a very low, though undetermined, rate of access. That means the system has to be durable, flexible and affordable to maintain, but doesn't have to offer the high speed required by enterprise storage systems.
The initial system is built in two redundant sites, each growing to about 300 terabytes, using commodity magnetic disk drives on the relatively new Serial ATA standard. That means the hardware system is independent of any one vendor; the library plans to simply replace drives with newer ones as they reach their end of warranty. The initial tender went to VSPL, which proposed a solution using JetStor disk arrays. The software layer is designed to be independent of the technical properties of the physical storage itself.
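The design principle behind that last point can be sketched in a few lines: if applications address objects only through an abstract interface, the backend can be swapped as hardware leaves warranty. The class and method names below are illustrative assumptions, not the British Library's actual design.

```python
# Sketch: a software layer independent of the physical storage beneath it.
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Abstract interface the archive software codes against."""
    @abstractmethod
    def put(self, object_id: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, object_id: str) -> bytes: ...

class InMemoryStore(ObjectStore):
    """Stand-in backend; a real one might wrap a SATA disk array."""
    def __init__(self):
        self._blobs = {}
    def put(self, object_id, data):
        self._blobs[object_id] = data
    def get(self, object_id):
        return self._blobs[object_id]

def replicate(source: ObjectStore, target: ObjectStore, ids):
    """Copy objects from one backend to another, e.g. retiring hardware."""
    for oid in ids:
        target.put(oid, source.get(oid))
```

Because `replicate` knows nothing about the media on either side, replacing end-of-warranty drives becomes a routine copy rather than an application rewrite — which is the substance of the library's vendor-independence claim.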
Aside from the two main sites, there will also be a third "dark archive", designed as a fallback in the event of total failure of the two main sites. The details of this are still being worked out, but the idea is for it to be a completely separate repository using a totally different technology.
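One thing an independent third copy makes possible is a fixity audit: if checksums across the replicas disagree, silent corruption at one site can be detected and repaired from the others. The function below is a generic sketch of that check, not part of the DOM design.

```python
# Sketch: detect silent corruption by comparing digests across replicas.
import hashlib

def replicas_agree(copies: dict) -> bool:
    """True if every replica of an object has an identical SHA-256 digest."""
    digests = {hashlib.sha256(data).hexdigest() for data in copies.values()}
    return len(digests) == 1

# Three independent copies of the same object pass the audit
assert replicas_agree({"site_a": b"object", "site_b": b"object", "dark": b"object"})
```

With only two sites, a mismatch tells you something is wrong but not which copy to trust; a third, technologically separate copy is what lets a majority vote break the tie.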
The British Library's choices in some key areas underscore the degree to which the field is divided over best practices. For instance, the library has decided that the migration approach — translating from old formats to new formats — is most appropriate for its archive. "Emulation versus migration is one of those religious wars in the archives community," Masters concedes.
Work is also underway to turn Microsoft's Office Open XML file formats into open standards, a move that has brought Microsoft into conflict with supporters of ODF and with those who believe Office Open XML will extend the company's control over the creation of documents. "There are billions of Word documents out there, and, if those were opened up, it would be a huge resource," says Masters.
He argues that the only thing organisations really know about digital preservation at the moment is how little they know. "It's a learning curve. We've put together the best thing we can for now, and we'll run with it for a time and accept we're going to make changes," he says. "Openness is important on this — we've got to learn from our experiences and share that with others. Experience is the only thing that will get us moving forward on this."
While few companies currently have to deal with the issues the library is tackling, they are likely to have to do so at some point, adds Masters.
"This will become mainstream. The technologies we are developing may end up being built into some storage products as standard. A lot of tools will be made available through the work that's going on now," he says. "If there's one thing that's certain, it's that digital records will keep increasing. They aren't going away."