A 350-year-old copy of Shakespeare is about as readable as a new one. But a 35-year-old floppy? Preserving data is essential to digital civilization, but how? Here's a new approach.
I'm at the Storage Networking Industry Association's Storage Developer Conference in Silicon Valley. Sam Fineberg, HP Distinguished Technologist, gave a talk on long-term digital data preservation. These are my notes.
SNIA surveyed businesses about their data retention requirements. 68% of organizations needed to preserve data for 100 years or longer.
Data is fragile. Threats include:
- Media/hardware obsolescence. Even if you have an 8-inch floppy drive, there may be no hardware capable of running the software required to read it, let alone the application to open the files on the floppy.
- Software/format obsolescence. Remember WordStar?
- Lost context/metadata. A document's contents may appear mundane, but if it is from the President to the Secretary of State, its context makes it important.
- Human error
- Media fault
Preserving bits is hard
Saving 1 PB for 50 years with only a 50% chance of any damage requires a bit half-life on the order of 10^17 years. That isn't achievable for large data sets.
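The arithmetic behind that figure can be sketched in a few lines. This is my own back-of-envelope reconstruction, not something shown in the talk: it assumes every bit decays independently with the same half-life, and that 1 PB means 10^15 bytes. The function name `required_bit_half_life_years` is made up for illustration.

```python
import math


def required_bit_half_life_years(num_bytes, years, p_no_damage=0.5):
    """Half-life each bit needs so that all bits survive with probability p_no_damage.

    Assumption (mine, not from the talk): bits fail independently, each with
    the same exponential half-life H.
    """
    bits = 8 * num_bytes
    # One bit survives `years` with probability 0.5 ** (years / H).
    # All bits survive with probability 0.5 ** (bits * years / H).
    # Setting that equal to p_no_damage and solving for H gives:
    return bits * years * math.log(0.5) / math.log(p_no_damage)


PETABYTE = 10**15  # bytes, decimal petabyte assumed

half_life = required_bit_half_life_years(PETABYTE, 50)
print(f"{half_life:.1e} years")  # 4.0e+17 years
```

Under those assumptions the answer is about 4 x 10^17 years, which matches the order of magnitude quoted above.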
There is no simple technical fix: we can't predict change but know it will occur. Processes are key. Processes for data preservation must evolve to get us to the next step. Standards make it easier, but aren't the whole answer.
What to preserve?
Bits? Applications? Context?
Is it even possible to preserve everything? For example, with an old book: the content? Paper wear? Political context? Bookplate? Where it falls open?
We will lose information moving from physical to digital. And we can't know what future generations will consider valuable. For example, scientists collect old hollow metal buttons because they contain air samples from when the buttons were made. Who dreamed 150 years ago that would be valuable?
A preservation format must facilitate storage of objects, map to a wide variety of devices and technologies, and be resilient.
SIRF: Self-contained Information Retention Format. SIRF is the digital equivalent of a physical container that archivists already know how to manage. SIRF containers hold preservation objects, a catalog and an object that labels the SIRF container.
SIRF maintains referential integrity, links between objects and context. Any SIRF compliant app can read and interpret the objects. Objects are migrated easily.
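To make the container idea concrete, here is a toy sketch of the three pieces described above: a label for the container, a catalog, and the preservation objects, plus a check for broken links between objects. It is not the SIRF specification or any real SIRF API. The class names, fields, and the `dangling_links` helper are all hypothetical, chosen only to illustrate the structure.

```python
from dataclasses import dataclass, field


@dataclass
class PreservationObject:
    """One preserved item (hypothetical fields, not taken from the SIRF spec)."""
    object_id: str
    payload: bytes
    related_ids: list[str] = field(default_factory=list)  # links to other objects or context


@dataclass
class Container:
    """Toy stand-in for a SIRF-like container: label, catalog, objects."""
    label: dict[str, str]  # describes the container itself
    objects: dict[str, PreservationObject] = field(default_factory=dict)
    catalog: dict[str, dict[str, str]] = field(default_factory=dict)  # per-object metadata

    def add(self, obj, **metadata):
        """Store the object and record its metadata in the catalog."""
        self.objects[obj.object_id] = obj
        self.catalog[obj.object_id] = dict(metadata)

    def dangling_links(self):
        """Return (source_id, target_id) pairs whose target is not in the container."""
        return [
            (obj.object_id, target)
            for obj in self.objects.values()
            for target in obj.related_ids
            if target not in self.objects
        ]


box = Container(label={"name": "example-archive"})
box.add(PreservationObject("memo-1", b"...", related_ids=["context-1"]), format="text/plain")
print(box.dangling_links())  # the memo points at context that is not stored yet

box.add(PreservationObject("context-1", b"From the President to the Secretary of State"),
        format="text/plain")
print(box.dangling_links())  # every link now resolves inside the container
```

The point of the sketch is that a self-contained container can verify its own links: if the memo travels without its context, the check reports it.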
A couple of use cases show some of the problems:
- Legal holds and e-discovery. In civil suits the parties are required to preserve all requested documents, known as a legal hold, under threat of severe penalties. But not all documents have to be disclosed: attorney-client emails, for example, are excluded. How can all documents be preserved and the right ones selected for disclosure?
- Biomedical info. Medical images are needed for patient history. But what if the patient was 12 years old when the images were taken and is now an adult? How do we protect their privacy and ensure that only the "right" adults now get access to the images?
The Storage Bits take
Massive data loss can threaten civilization. The burning of the ancient Library of Alexandria, destroying hundreds of thousands of handwritten books, contributed to Europe's Dark Ages as knowledge of ancient art, science and math was lost. The little recovered through Muslim scholars helped create the Enlightenment, but how much more was lost?
But the threat of digital data loss is far larger. Cheap storage and sophisticated data mining allow us to derive value from datasets that once we couldn't even afford to collect, let alone analyze.
This is important work.
Comments welcome, of course.