How efficient can storage be?

We want our data protected from failures. After a failure we want our data back quickly. And we want to pay as little as possible. How?

Recent research by hyper-scale system managers – mostly Microsoft and Facebook engineers and scientists – has tried to answer that question. And the answers are way better than what we have today.

In XORing Elephants: Novel Erasure Codes for Big Data, authors Maheswaran Sathiamoorthy, Alexandros G. Dimakis, Megasthenis Asteris, Dhruba Borthakur and Dimitris Papailiopoulos of USC and Ramkumar Vadali and Scott Chen of Facebook delve deeply into the issue. Technically it is related to work that Microsoft presented last year.

RAID repair problem
Standard Reed-Solomon erasure codes – the kind used in almost every RAID array today – aren't well suited to today's hyperscale demands. Back when RAID was new the alternative was disk mirroring, and compared to that RAID looked like magic.

Reed-Solomon codes suffer from an efficiency versus repair trade-off. In order to make a RAID5 or RAID6 reasonably efficient you need a wide stripe across at least 8 drives - and more is better.

But a wide stripe makes the time to repair a failed disk much longer. Data has to be transferred from every other disk in the stripe, using up scarce disk I/Os and internal storage bandwidth.
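The repair-traffic gap is easy to see with a little arithmetic. A quick sketch, assuming a hypothetical (10, 4) code and 4 TB disks - my numbers, not the paper's:

```python
# Back-of-the-envelope repair traffic: classic Reed-Solomon vs. replication.
# The (10, 4) code and 4 TB disks are illustrative assumptions, not
# figures from the paper.

def rs_repair_traffic_tb(stripe_data_disks: int, disk_tb: float) -> float:
    """Classic RS repair reads a full disk's worth of data from every
    other data disk in the stripe to rebuild one lost disk."""
    return stripe_data_disks * disk_tb

def replica_repair_traffic_tb(disk_tb: float) -> float:
    """Replication just copies the surviving replica."""
    return disk_tb

print(rs_repair_traffic_tb(10, 4.0))   # 40.0 TB moved to rebuild one 4 TB disk
print(replica_repair_traffic_tb(4.0))  # 4.0 TB
```

Ten times the traffic to rebuild one disk - and every byte of it competes with production I/O.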

Replication nation
To avoid this problem early scale-out storage dispensed with Reed-Solomon codes in favor of double or triple replication. Using inexpensive disks rather than expensive arrays made this economical.

But exponential data growth has overwhelmed the ability of even the wealthiest web companies – Google, Microsoft, Amazon, Facebook and others – to build infrastructure fast enough and large enough to handle the tsunami of data. Something had to give and triple replication was it.

Major Web storage infrastructures are settling on a level of redundancy that standard RAID6 cannot approach. These systems can now typically survive the loss of up to four storage elements – disks, servers, nodes or even entire data centers – without losing any data. What is even more remarkable is that, as this paper demonstrates, these codes achieve this reliability with an overhead of only 40%.
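How does 40% overhead square with surviving four failures? One way to get there - a sketch assuming a (10, 4)-style code, parameters that are my assumption rather than the article's:

```python
# Overhead vs. replication, in round numbers. The (10, 4) parameters are
# an assumption on my part; the article only states the 40% figure.

def overhead_pct(data_blocks: int, redundant_blocks: int) -> float:
    """Extra storage as a percentage of the data stored."""
    return 100.0 * redundant_blocks / data_blocks

# A (10, 4) erasure code: 14 blocks stored per 10 blocks of data,
# and any 4 losses are survivable.
print(overhead_pct(10, 4))    # 40.0

# Triple replication: 30 blocks per 10 blocks of data, 2 losses survivable.
print(overhead_pct(10, 20))   # 200.0
```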

Xorbas the Geek
Facebook has enormous storage problems. Their large analytics clusters can have 3000 nodes, each storing approximately 15 TB of data.

That's 45 PB of raw capacity. For analytics. Whoa!

The data is stored in a highly scalable Hadoop cluster. The authors of the paper developed a new family of erasure codes called Locally Repairable Codes, or LRCs, that can be repaired using far less disk I/O and network bandwidth.

They implemented these codes in a new Hadoop module called HDFS-Xorbas. They then tested this system in clusters on Amazon and within Facebook.

While the codes are slightly less than optimal in storage capacity, they recover quickly from failures while using much less network and I/O. Reduced network use makes them even more suitable for the geographic distribution of data, the next step in increasing availability.
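The core trick behind local repair can be sketched with plain XOR parity. This toy example is my own illustration, not the paper's actual code construction:

```python
# Toy illustration of "local repair": split a stripe into small groups,
# give each group its own XOR parity, and rebuild a lost block from its
# group alone. Real LRCs are more sophisticated, but the I/O story is
# the same: a single-block repair touches a handful of blocks, not the
# whole stripe.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# A stripe of 10 data blocks, split into two local groups of 5.
stripe = [bytes([i]) * 4 for i in range(10)]
groups = [stripe[:5], stripe[5:]]
local_parity = [xor_blocks(g) for g in groups]

# Lose block 2. Repair reads its 4 group-mates plus one local parity
# (5 reads) instead of the 10 surviving blocks a wide RS stripe needs.
survivors = [b for i, b in enumerate(groups[0]) if i != 2]
rebuilt = xor_blocks(survivors + [local_parity[0]])
assert rebuilt == stripe[2]
```

The price is those extra local parity blocks - which is why the codes are "slightly less than optimal" in capacity while winning big on repair traffic.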

The Storage Bits take
Good news: Facebook will be able to store massive amounts of data more efficiently than ever before. Bad news: so will the NSA and everyone else.

These codes aren't directly applicable to home and small office storage today since they require dozens of nodes to be efficient. But over time we could see small-scale, highly resilient storage that uses these codes as processors get more powerful and more people recognize the importance of failure-proofing their data stores.

Comments welcome, as always. Do you think codes like these will ever trickle down to the storage you use?