While in no way unique in facing the challenge of how to preserve digital documents, the UK's National Archives certainly faces the issue on a larger scale than most organisations.
The National Archives (NA) is the UK government's official archive and contains around 900 years of historical material. It's not easy to put an exact figure on the amount of information the organisation is tasked with preserving, but in an average year the government produces around 150km of paper-based documents, and the archive deals with around 1.5km of this. When you include the huge amounts of digital documents that government departments are now mandated to produce where possible, and a plethora of different websites, it's fair to say that the archive's chief information officer, David Thomas, has some job on his hands.
Thomas, a former archivist who has been at the organisation for most of his career, is charged with a very important function: to oversee the IT aspects of safeguarding government documents for future generations, for historical and legal purposes. This ranges from helping with an increasing number of Freedom of Information requests, to overseeing the preservation of reports from the Bloody Sunday inquiry.
"We take government records when they are 30 years old and make them available to the public. In the electronic world, it's more recent than that. Records of royal commissions or public inquiries we take pretty much as soon as the report is released — inquiries into sunken ships, for instance," he says.
Despite a relatively low profile, the NA attracted some attention recently when it issued a joint press release with Microsoft detailing how the software maker was providing the NA with access to its Virtual PC 2007, which allows previous versions of Windows and Office to be used side by side on a single PC. This is a process known as emulation, and allows documents to be viewed on modern desktop platforms, while remaining in the same format in which they were created.
Microsoft UK managing director Gordon Frazer was keen to point out that his company was working hard to avert a "digital dark age" caused by the incompatibility of old electronic documents with new formats. However, critics would point out that Microsoft's aggressive and proprietary upgrade policy for its Windows and Office platforms is part of the problem.
An emotive issue
Rather than making Microsoft appear heroic about a problem it had contributed to, the release highlighted a bigger issue that goes to the heart of digital preservation. As well as using the Virtual PC 2007 software to emulate older versions of Microsoft software, the NA also commented on its ongoing work to convert documents into open file formats. Because this work was mentioned in a Microsoft press release, some concluded that the NA planned to adopt Microsoft's Open XML document format, which has been criticised by open-source advocates for not being very open at all.
"If it were, Microsoft wouldn't need to make Novell and Xandros and Linspire sign NDAs and then write translators for them," wrote Pamela Jones, open-source expert and editor of the Groklaw blog.
When pushed on whether the NA has plans to use Microsoft's Open XML format exclusively, Thomas is keen to point out that his organisation is not tied to any one technology, but will use the best tools available to do the job at hand.
"All things are open on the table at the moment. What I think we are going to have to do is look at what is available to people on their desktop at a particular time and we will migrate to a format that they can read," he says. "Whether it's Google Docs, whether it's open document format or Microsoft Word, we will have to make judgements. The crucial thing is that the information is going to be readable using the standard tools you find on the desktop — we are not rigidly bound to one approach."
However, Thomas says he is aware that there is a lot of support for open formats such as the OpenDocument Format (ODF), and for the idea of not locking public documents into a format such as Open XML that is mainly championed by one vendor.
"For people involved in the debate it can be a very emotive issue — particularly the opponents of the Microsoft approach. We are neutral; we welcome open-source software because it makes our lives easier," he claims.
But although he supports an open approach to digital data formats, Thomas does not think it's his place to mandate the use of open source within the NA or the government as a whole. Critics of the Microsoft approach argue that open source offers a fundamentally more future-proof data format because of the sheer number of organisations involved and the information-sharing that results.
"We are not, ourselves, involved in that area. Clearly there is a debate about which open document format people choose, and frankly we are agnostic about that. We will take what people give us but we are not wedded to one format," he says. "We take the files that people send us and it's not for us to get involved in debates about where the best way forward is. In a way, the market will decide, whatever the rights and wrongs are. It's like VHS and Betamax — whatever the best technology is, the market will decide. I don't think us expressing strong views either way is relevant."
But the controversy around translating documents to open file formats doesn't end there. Outside the open source versus proprietary debate, there are arguments within the archive community about whether documents should be translated at all. Some archiving purists claim translating documents is a crude approach to preservation, likening it to translating a poem into a different language and then destroying the original. Computer scientists such as the RAND Corporation's Jeff Rothenberg argue instead that emulation, the approach behind the NA's Virtual PC 2007 project, is a much more sensitive strategy.
"Not only does each translation lose information, but translation makes it impossible to determine whether information has been lost, because the original is discarded," wrote Rothenberg in a 1995 Scientific American article entitled "Ensuring the Longevity of Digital Documents".
However, Thomas argues that for the huge volumes of information the NA has to deal with, translation is the only practical approach. "Generally in the digital-preservation world, I think it is accepted that for bulk operations, migration is the most practical, the most cheap and the most robust approach, and also, crucially, it means you can read the stuff at home," he says. "If you want to read an old WordStar document, we can migrate it to the latest version of Word or whatever and you can read it on your browser at home."
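The bulk-migration approach Thomas describes boils down to scanning holdings for legacy formats and deciding what each should become. The following is an illustrative sketch only, not the NA's actual system: the format mappings and file names are hypothetical, and a real migration would hand the conversion itself to a dedicated tool rather than just planning target names.

```python
from pathlib import Path

# Hypothetical mapping from legacy formats to modern target formats.
# The actual conversion would be performed by a separate tool; this
# sketch only builds the migration plan.
MIGRATION_MAP = {
    ".ws": ".docx",   # WordStar -> modern word-processor format
    ".wpd": ".docx",  # WordPerfect
    ".123": ".ods",   # Lotus 1-2-3 -> open spreadsheet format
}

def plan_migration(paths):
    """Return (source, target) pairs for files needing conversion,
    leaving already-readable formats untouched."""
    plan = []
    for p in paths:
        target_ext = MIGRATION_MAP.get(Path(p).suffix.lower())
        if target_ext:
            plan.append((p, str(Path(p).with_suffix(target_ext))))
    return plan

print(plan_migration(["report.WS", "minutes.docx", "budget.123"]))
# [('report.WS', 'report.docx'), ('budget.123', 'budget.ods')]
```

The point of the mapping table is the one Thomas makes: migration is a bulk operation, so the per-format decision is made once and applied mechanically to everything that matches.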
As well as electronic and digitised paper documents, an increasing part of the work done by the NA is around storing video which, according to Thomas, has its own problems. "We get odd video formats and they have to be a bit more hand-crafted. There are a few oddities where we have to figure out what to do with them," he says. "It is still a very tiny percentage of the documents we have to deal with, but we have some very tricky problems. Public inquiries into the loss of ships involve building these 3D virtual-reality models of how the ship sank, and they pose quite a problem for us."
The challenge of changing formats
The issue of how to deal with awkward formats such as video is related to a bigger challenge being faced by the NA. As the organisation only really gets its hands on some electronic documents after 30 years, there is a long period where important documents are out of its control, explains Thomas.
"The big issue facing government records is not how long they will survive in the National Archives, because that is pretty well-managed, but how long they will survive in government departments before they even come to the archive," he says. "The issue government departments have is that they generate a vast bulk of electronic records, and after two years or so you don't have to consult them on a day-to-day basis, but there are things that you may need to consult some day down the road, and if you don't have some way of preserving them you won't be able to read them."
The police service is a good example of how crucial data preservation is to some areas of government. If there is an unsolved crime, the police have to keep the records for 75 years, and if it's a serious crime where someone is convicted, they have to keep the records for the life of that person. "Back in the days when that was a few paper files and a bit of DNA, that was OK. Now the police have discovered video cameras, and they have video cameras by the side of the road, in patrol cars, and even police dogs have their own video cameras now — and that is a huge preservation problem," says Thomas.
To answer this problem of storing intermediate documents, the NA has been allocated some money from government to set up a shared service for digital preservation across government, not run by the NA but contracted out. What the business model will be and how the system will work exactly is still unclear, but Thomas claims departments will have to be judicious about what they choose to archive.
"The vast bulk [of documents] will not survive because we are not interested in video films of the M1, although they might prove to be the most interesting in the long term — police car chases and dogs etc might be the most interesting," he quips.
As well as the pressure to keep up with video and the huge swathes of electronic documents, the NA is also charged with archiving government websites, a massive challenge on its own given the explosion in online public services in the last few years.
To take the burden off Thomas's relatively small IS team of around 50 staff, the NA has a contract with an organisation called the European Archive to capture about 65 government websites. Some are sampled every six months, others every month, and a snapshot is always taken of any government website that is due to be shut down.
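The sampling regime described here — some sites captured monthly, others every six months — amounts to a simple due-date check over each site's last snapshot. A minimal sketch, with entirely hypothetical site names and intervals:

```python
from datetime import date, timedelta

# Hypothetical capture intervals per site, in days.
INTERVALS = {
    "example-department.gov.uk": 30,   # sampled monthly
    "example-inquiry.gov.uk": 182,     # sampled every six months
}

def sites_due(last_captured, today):
    """Return sites whose last snapshot is older than their interval."""
    due = []
    for site, interval in INTERVALS.items():
        if today - last_captured[site] >= timedelta(days=interval):
            due.append(site)
    return due

last = {
    "example-department.gov.uk": date(2007, 5, 1),
    "example-inquiry.gov.uk": date(2007, 3, 1),
}
print(sites_due(last, date(2007, 6, 15)))
# ['example-department.gov.uk']
```

A site scheduled for shutdown would simply be forced onto the due list regardless of interval, matching the NA's always-snapshot-before-closure rule.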
A daunting task
But even with outside help, Thomas concedes that keeping up with the sheer volume of information being added to government sites is daunting.
"There are lots of problems with websites and we are working very hard to deal with them. One problem, and it's a conceptual one, is that years ago you had government departments and they created paper records and we would select some of those and they would come here, but now you have government departments and outside of that you have things that are government funded — things like Theyworkforyou.com — the website about how MPs behave. The space in which we operate has expanded — it's not as clear cut as it used to be in the old days," he claims.
The flexible and interactive nature of the web makes sites easy to update, but that has repercussions when it comes to data preservation, says Thomas. "Things get lost on websites, URLs change, and people delete things and move things in a casual and random fashion. It's not a UK government problem, it's a worldwide problem that things get deleted," he says.
The European Archive, which gives the NA a fighting chance here, is an offshoot of the Internet Archive, founded by MIT graduate Brewster Kahle as part of his plan to capture all human knowledge. "They started in 1996 and sort of expanded since then. What is scary is that very little has survived from before 1996. When it was the tenth anniversary of our website I thought we would do a little exhibition, but we have lost the first three or four years of our website," says Thomas.
Aside from simply capturing websites, the next big thing in web archiving is being able to search them, says Thomas. "At the moment you have to know the URL and what year you want, but what we are going to do next year some time is use some kind of search tool to search this archive of government websites — maybe we will use Autonomy, or maybe we will use Google but what we have to do is begin to search them," he says.
Top tips for data preservation
What advice would Thomas give to other organisations facing up to their own data-preservation challenges? He suggests three steps as a good starting point.
"Firstly, metadata is crucial — you have to have metadata that identifies the documents that you want to keep. Selection is also very important. You need to put your preservation resources into a small number of things. You should only try and keep what you want to keep," he says. Finally he claims that a live approach to preserving documents has to underpin the entire strategy. "It is much better to keep things on live systems on servers that are backed up, than it is to put things on CDs and put CDs in drawers."
The National Archives is well funded — though maybe not to the same degree as its US equivalent, the National Archives in Washington — but Thomas seems confident about the challenges ahead. The next big step for the organisation is to fully embrace the web, and to deliver a 99.9999 per cent service level to the information-curious public perusing its site. This is a massive undertaking and will include setting up a mirrored hosting centre off-site next year to ensure continuity of its web operations.
Archiving the nation's most important documents has occupied most of Thomas's career to date, but it's only in the past three to four years that digital documents have begun to take the majority of his time. And given the predictions for the growth in digital media over the next decade, he isn't likely to be out of work any time soon.