Since the advent of disk mirroring over 35 years ago, data redundancy has been the basic strategy against data loss. That redundancy was extended in the replicated state machine (RSM) clusters popularized by cloud vendors in the early aughts and now widely used in scale-out systems of all types.
The idea behind RSM is that the same program, running on many servers with the same initial state and the same sequence of inputs, will produce the same outputs. That output will always be correct and available as long as a majority of the servers are functional. A consensus algorithm, such as Paxos, ensures that the state machine logs are kept in sync.
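To make the idea concrete, here is a minimal Python sketch. It is not taken from any real system: the `KVStateMachine` class and the `run_replica` and `majority_output` helpers are hypothetical names, and the consensus step is left out entirely by assuming every replica already holds the same agreed log.

```python
from collections import Counter


class KVStateMachine:
    """Toy deterministic state machine: a key-value store (illustrative only)."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        # Each command is ("set", key, value) or ("get", key).
        op, key, *rest = command
        if op == "set":
            self.state[key] = rest[0]
            return "ok"
        if op == "get":
            return self.state.get(key)
        raise ValueError(f"unknown op: {op}")


def run_replica(log):
    """Replay the whole log on a fresh replica and return its outputs."""
    machine = KVStateMachine()
    return tuple(machine.apply(cmd) for cmd in log)


def majority_output(outputs, cluster_size):
    """Return the output reported by a strict majority of the cluster, or None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > cluster_size // 2 else None


log = [("set", "x", 1), ("set", "y", 2), ("get", "x")]

# A five-node cluster in which two nodes are down and report nothing.
outputs = [run_replica(log) for _ in range(3)]
result = majority_output(outputs, cluster_size=5)
```

Because the state machine is deterministic, the three live replicas produce identical outputs, and three out of five is enough for a majority. With only two live replicas, `majority_output` would return `None`.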
But there's no free lunch. When there's a failure - whether read, disk, or server - the system has to figure out how to retrieve and replicate the correct data. Which isn't always easy.
RSM uses several persistent data structures - logs, snapshots, metadata - to manage the data needed to maintain the cluster. And like any data, these structures are subject to the errors of storage, such as block errors, firmware and driver bugs, and read disturb errors. Block errors are common in both hard drives and SSDs.
Today a common approach to handling storage faults is to crash the node that detects a checksum or I/O error. That protects the data, but decreases availability. And if there's silent corruption on some of the remaining servers, data can easily be lost.
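As a rough illustration of how such a fault gets detected in the first place, the sketch below stores a CRC32 checksum alongside each log entry and verifies it on read. The record layout, the `CorruptEntry` exception, and the function names are assumptions made for this example, not the format of any particular system.

```python
import struct
import zlib

# Hypothetical on-disk layout: 4-byte payload length, 4-byte CRC32, then payload.
HEADER = struct.Struct("<II")


class CorruptEntry(Exception):
    """Raised when a stored entry fails its integrity check."""


def encode_entry(payload: bytes) -> bytes:
    """Prefix the payload with its length and CRC32 checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_entry(record: bytes) -> bytes:
    """Return the payload, or raise CorruptEntry if the record is damaged."""
    if len(record) < HEADER.size:
        raise CorruptEntry("truncated header")
    length, stored_crc = HEADER.unpack_from(record)
    payload = record[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != stored_crc:
        raise CorruptEntry("checksum mismatch")
    return payload


record = encode_entry(b"set x=1")

# Simulate a storage fault by flipping a single bit in the payload.
damaged = bytearray(record)
damaged[-1] ^= 0x01
```

In the crash-on-error approach described above, a node would simply exit when `decode_entry` raises. The data is never served corrupted, but the cluster is down one node.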
What to do?
At this month's Usenix FAST '18 conference, Ramnatthan Alagappan et al. presented the paper Protocol-Aware Recovery for Consensus-Based Storage, which introduced a new approach to correctly recover from RSM storage faults. They call it corruption-tolerant replication, or CTRL.
CTRL constitutes two components: a local storage layer and a distributed recovery protocol; while the storage layer reliably detects faults, the distributed protocol recovers faulty data from redundant copies. Both the components carefully exploit RSM-specific knowledge to ensure safety (e.g., no data loss) and high availability.
CTRL offers several novel strategies to ensure data protection and high availability.
- CTRL distinguishes between corruption caused by a crash and corruption caused by the disk.
- There's a global-commitment determination protocol to separate committed items from uncommitted.
- Finally, a leader-initiated snapshotting subsystem provides identical snapshots across nodes to simplify and speed recovery.
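The paper's actual protocol is more involved, but the reasoning behind separating committed from uncommitted items can be sketched in a few lines. The version below is a simplified assumption rather than the authors' algorithm: a node with a damaged log entry asks its peers about that entry, and each peer answers `"have"` (with an intact copy), `"dont_have"`, or `"corrupt"`. The function name, the response format, and the three outcomes are all hypothetical.

```python
def recover_entry(responses, cluster_size):
    """Decide what to do about a locally corrupted log entry.

    responses maps peer id -> (status, payload), where status is
    "have", "dont_have", or "corrupt". Returns (action, payload).
    """
    # Any intact copy on a peer can be used to repair the local entry.
    for status, payload in responses.values():
        if status == "have":
            return ("repair", payload)

    # An entry is committed only if a majority of the cluster stored it.
    # If a majority report never having it, it cannot have been committed,
    # so discarding the damaged local copy loses no committed data.
    dont_have = sum(1 for status, _ in responses.values() if status == "dont_have")
    if dont_have > cluster_size // 2:
        return ("discard", None)

    # Otherwise the entry may be committed but no good copy is reachable yet.
    return ("wait", None)


# Five-node cluster; the node running this code is the one with the bad entry.
repair = recover_entry(
    {"b": ("dont_have", None), "c": ("have", b"set x=1")}, cluster_size=5
)
discard = recover_entry(
    {"b": ("dont_have", None), "c": ("dont_have", None), "d": ("dont_have", None)},
    cluster_size=5,
)
wait = recover_entry(
    {"b": ("dont_have", None), "c": ("corrupt", None)}, cluster_size=5
)
```

The point of the sketch is that a corrupted entry does not force a crash: depending on what the rest of the cluster knows, the node can repair the entry, safely drop it, or wait for more peers to respond.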
The researchers tested these mechanisms against real-world systems like LogCabin and ZooKeeper and found that they successfully hardened RSM storage against bugs that current recovery mechanisms did not handle. But at what cost?
Their testing showed that the storage layer that distinguishes between disk and crash corruptions added 8-10 percent overhead for disks and minimal overhead (<4 percent) for SSDs. Log recovery, however, is significantly faster, with recovery time dropping from over 1 second to just over 1 millisecond.
Considering the cost of corrupted data, that seems a fair price to pay.
The Storage Bits take
Reliable data storage is essential for our emerging digital civilization. While availability has improved markedly in the last 50 years, much remains to be done. After all, a well produced book will still be readable in 500 years, but other than M-discs, there are no commercial digital media that can meet that standard today.
But that's just one side of the problem. As stored data continues to grow exponentially, we need to keep driving costs down, including the costs of management and recovery. Minimizing the time to recover from corruption is a key step.
One of the great benefits of cloud storage is that it pays for the hyperscale vendors to invest in this kind of research, as Microsoft did. If they can shave 1 percent off their storage costs, that's tens of millions of dollars. And that ultimately benefits us consumers as well.
Previous and Related Coverage:
Fail-slow at scale: When the cloud stops working
Computer systems fail. Most failures are well-behaved: the system stops working. But there are bad failures too, where the system works, but really s-l-o-w-l-y. What components are most likely to fail-slow? The answers may surprise you.
How the cloud will save -- and change -- disk drives
Google has changed many aspects of computer infrastructure, including power supplies and scale-out architectures. Now they're asking vendors to redesign disks for cloud use. How will that affect you?
Can Samsung's 31TB SSD challenge hard drives in the data center?
Hard drives are struggling to get to 14TB. Now Samsung is kicking sand - literally - into their faces with a new, 2.5-inch 31TB SSD. You and I will never buy one, but still, it's pretty amazing.
Courteous comments welcome, of course. BTW, Protocol-Aware Recovery for Consensus-Based Storage won the FAST Best Paper award.