Serious cloud users know the vendor story: multiple datacenters, geographically distributed; advanced erasure coding that is better than RAID 6 (which I've discussed); multiple version retention; checksums to ensure data integrity; and synchronization across devices. What could possibly go wrong?
As has been documented, client-side corruption is all too common, so the cloud will carefully preserve and spread corrupted data. If you crash during an upload, the data may be inconsistent - but the cloud doesn't know that - or the cloud may fail to sync changed files.
Worse, clients cannot typically preserve dependencies between files since uploads are not point-in-time snapshots, creating unexpected and unwanted application (mis)behavior. A group of linked databases - say, between CRM, ERP and distribution systems - could end up inconsistent due to piecemeal uploads of changes at different times.
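To make the dependency problem concrete, here is a toy sketch in Python. Everything in it is hypothetical: the file names, the two sync functions and the `transaction` helper are illustrations I made up, not how any real sync client works. It assumes two files that an application always updates together, and a sync pass that happens to be interrupted by such an update.

```python
import copy


def transaction(local):
    """Hypothetical application update: both files must change together."""
    local["crm.db"] = "v2"
    local["erp.db"] = "v2"


def piecemeal_sync(local, cloud, writes_between):
    """Upload files one at a time, reading live local state.

    writes_between maps an upload index to a function that mutates the
    local files right after that upload, simulating an application that
    keeps writing while the sync is in progress.
    """
    for i, name in enumerate(sorted(local)):
        cloud[name] = local[name]
        if i in writes_between:
            writes_between[i](local)


def snapshot_sync(local, cloud, writes_between):
    """Freeze a point-in-time copy first, then upload from the copy."""
    frozen = copy.deepcopy(local)
    for i, name in enumerate(sorted(frozen)):
        cloud[name] = frozen[name]
        if i in writes_between:
            writes_between[i](local)


def is_consistent(cloud):
    """In this toy model, consistent means every file has the same version."""
    return len(set(cloud.values())) == 1


local_a = {"crm.db": "v1", "erp.db": "v1"}
cloud_a = {}
piecemeal_sync(local_a, cloud_a, {0: transaction})

local_b = {"crm.db": "v1", "erp.db": "v1"}
cloud_b = {}
snapshot_sync(local_b, cloud_b, {0: transaction})

print(cloud_a, is_consistent(cloud_a))
print(cloud_b, is_consistent(cloud_b))
```

In the piecemeal run the cloud ends up holding the old CRM file next to the new ERP file, a combination that never existed on the local disk. The snapshot run uploads a state that is older but internally consistent.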
The basic issue is that the loose coupling between the local and cloud file systems leaves data less protected than users - or cloud vendors - like to admit. Like most problems it is fixable, once we admit we have a problem.
In a not-yet-online paper to be presented at the FAST - File and Storage Technologies - conference tomorrow, researchers from NetApp and the University of Wisconsin-Madison present a solution they call ViewBox.
Built on the popular ext4 file system, ViewBox has three key components:
Checksumming that detects corrupt and inconsistent data
A view manager that creates and exposes views to the synchronization client
A damaged-data recovery daemon that handles the server backend independently of the client
The team integrated ViewBox with Dropbox and Seafile, two popular sync services. ViewBox ensures that the local file system and the cloud services cooperate to detect and recover from these failure modes, at a runtime speed penalty of 5% or less.
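The checksumming idea can be sketched in a few lines of Python. This is not ViewBox's code, which works inside ext4; it is a minimal user-space illustration under my own assumptions. The `ChecksummedStore` class, its `corrupt` method and the `sync` function are all hypothetical names. The point is only that a checksum recorded at write time lets the sync step refuse to upload data that changed behind the application's back.

```python
import hashlib


class ChecksummedStore:
    """Toy in-memory 'file system' that records a SHA-256 per file on write."""

    def __init__(self):
        self._data = {}
        self._sums = {}

    def write(self, name, payload):
        # A legitimate write updates both the data and its checksum.
        self._data[name] = payload
        self._sums[name] = hashlib.sha256(payload).hexdigest()

    def corrupt(self, name, payload):
        # Simulated silent corruption: data changes, checksum does not.
        self._data[name] = payload

    def verify(self, name):
        actual = hashlib.sha256(self._data[name]).hexdigest()
        return actual == self._sums[name]

    def read(self, name):
        return self._data[name]

    def names(self):
        return list(self._data)


def sync(store, cloud):
    """Upload only files whose checksum still matches; report the rest."""
    skipped = []
    for name in sorted(store.names()):
        if store.verify(name):
            cloud[name] = store.read(name)
        else:
            skipped.append(name)
    return skipped


store = ChecksummedStore()
cloud = {}
store.write("a.txt", b"good data")
store.write("b.txt", b"first draft")
first_skipped = sync(store, cloud)

store.corrupt("a.txt", b"g00d dat@")      # bit rot on the client
store.write("b.txt", b"second draft")     # a normal edit
second_skipped = sync(store, cloud)

print(first_skipped, second_skipped, cloud)
```

After the second pass the cloud still holds the good copy of `a.txt`, which is exactly the copy a recovery step would want to pull back down, while the legitimate edit to `b.txt` goes through.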
The Storage Bits take
Obviously today's file systems were not built to handle backend cloud storage. How could they have been?
But now the low cost and resiliency of cloud storage have made it a go-to resource for many IT pros. That is not a problem for archiving, but the more timely data is passed into or through the cloud, the greater the chance of problems.
Linux users will probably get a solution like ViewBox sooner than either Windows or OS X users. But the real problem will be convincing users that there is a problem that will cost them. Even today Apple fans often refuse to recognize HFS+ data integrity problems.
But research like this will help focus OS teams on the problem, hopefully to speed a solution to market.
Comments welcome, please. The name of the paper is ViewBox: Integrating Local File Systems with Cloud Storage Services, by Yupu Zhang, Chris Dragga, Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau.