A 350-year-old copy of Shakespeare is about as readable as a new one. But a 35-year-old floppy? Preserving data is essential to digital civilization, but how? Here's a new approach.
I'm at the Storage Networking Industry Association's Storage Developers Conference in Silicon Valley. Sam Fineberg, HP Distinguished Technologist, gave a talk on long-term digital data preservation. These are my notes.
The problem
SNIA surveyed businesses about their data retention requirements. 68% of organizations needed to preserve data for 100 years or longer.
Data is fragile. Threats include:
Preserving bits is hard
Saving 1 PB for 50 years with a 50% chance of damage gives a bit half-life of 10^17 years. That isn't achievable for large data sets.
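The half-life figure can be checked with a back-of-the-envelope calculation. The sketch below is my own reconstruction, not something shown in the talk. It assumes that 1 PB is 10^15 bytes, that every bit decays independently with the same half-life, and that "a 50% chance of damage" means a 50% chance that every bit survives. The function name and its parameters are hypothetical.

```python
import math

PETABYTE_BITS = 8 * 10**15  # assumption: 1 PB = 10^15 bytes of 8 bits each


def required_bit_half_life(num_bits: int, years: float, p_undamaged: float) -> float:
    """Half-life (in years) each bit needs so that all bits survive
    `years` with probability `p_undamaged`, assuming independent decay."""
    # P(one bit survives) = 2 ** (-years / half_life)
    # P(all bits survive) = 2 ** (-num_bits * years / half_life)
    # Setting the second expression equal to p_undamaged and solving:
    return num_bits * years / -math.log2(p_undamaged)


half_life = required_bit_half_life(PETABYTE_BITS, years=50, p_undamaged=0.5)
print(f"required bit half-life: {half_life:.1e} years")
```

Under these assumptions the result is 4 x 10^17 years, the same order of magnitude as the figure quoted above and tens of millions of times the age of the universe.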
There is no simple technical fix: we can't predict change but know it will occur. Processes are key. Processes for data preservation must evolve to get us to the next step. Standards make it easier, but aren't the whole answer.
What to preserve? Bits? Applications? Context?
Is it even possible to preserve everything? For example, with an old book: the content? Paper wear? Political context? Bookplate? Where it falls open?
We will lose information moving from physical to digital. And we can't know what future generations will consider valuable. For example, scientists collect old hollow metal buttons because they contain air samples from when the buttons were made. Who dreamed 150 years ago that would be valuable?
A preservation format must facilitate storage of objects, map to a wide variety of devices and technologies, and be resilient.
SIRF's up
SIRF: Self-contained Information Retention Format. SIRF is the digital equivalent of a physical container that archivists already know how to manage. SIRF containers hold preservation objects, a catalog and an object that labels the SIRF container.
SIRF maintains referential integrity: the links between objects and their context. Any SIRF-compliant app can read and interpret the objects. Objects are migrated easily.
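To make the container idea concrete, here is a toy sketch of the three parts described above: a label, a catalog and the preservation objects, plus a check that every link in the catalog points at an object actually held in the container. This is not the SIRF serialization or its API. Every class, field and method name here is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CatalogEntry:
    object_id: str
    description: str
    related_ids: List[str] = field(default_factory=list)  # links to context objects


@dataclass
class Container:
    label: Dict[str, str]  # stands in for the object that labels the container
    catalog: Dict[str, CatalogEntry] = field(default_factory=dict)
    objects: Dict[str, bytes] = field(default_factory=dict)

    def add(self, entry: CatalogEntry, payload: bytes) -> None:
        """Store a preservation object together with its catalog entry."""
        self.catalog[entry.object_id] = entry
        self.objects[entry.object_id] = payload

    def broken_links(self) -> List[Tuple[str, str]]:
        """Return (source, target) pairs whose target is not in the container."""
        return [
            (entry.object_id, target)
            for entry in self.catalog.values()
            for target in entry.related_ids
            if target not in self.objects
        ]


box = Container(label={"format": "toy-container", "version": "0"})
box.add(CatalogEntry("scan-001", "page scan", related_ids=["meta-001"]), b"scan bytes")
missing_before = box.broken_links()  # the context object is not stored yet
box.add(CatalogEntry("meta-001", "provenance notes"), b"notes bytes")
missing_after = box.broken_links()   # all links now resolve inside the container
print(missing_before, missing_after)
```

Keeping the catalog inside the container is what makes it self-describing in this toy version: a reader needs nothing beyond the container itself to find the objects and see how they relate.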
Use cases
A couple of use cases show some of the problems:
The Storage Bits take
Massive data loss can threaten civilization. The burning of the ancient Library of Alexandria, destroying hundreds of thousands of handwritten books, contributed to Europe's Dark Ages as knowledge of ancient art, science and math was lost. The little recovered through Muslim scholars helped create the Enlightenment, but how much more was lost?
But the threat of digital data loss is far larger. Cheap storage and sophisticated data mining allow us to derive value from datasets that we once couldn't even afford to collect, let alone analyze.
This is important work.
Comments welcome, of course.