Cloud storage: Minimizing repair costs

Advanced error correction codes (ECCs) at the heart of modern storage are way beyond the RAID codes of the late 1980s. But today's codes force tradeoffs between reliability and repair costs. Can we have both reliability and low-cost data repairs?
Written by Robin Harris, Contributor

As the Big Data revolution continues its exponential growth, the problems of processing and protecting that data also grow. Advanced ECCs - along with hardware and geographic redundancy - can make data loss less likely than an asteroid strike, at only a 40 percent capacity overhead, much lower than the triple redundancy used in some cloud storage.
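To put the overhead figures side by side, here is a minimal sketch of the arithmetic. The (14, 10) layout, 10 data blocks plus 4 parity blocks, is an assumed example that happens to give 40 percent overhead; the article does not say which code is behind that number.

```python
def storage_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Extra storage as a fraction of the original data size."""
    return parity_blocks / data_blocks

# Assumed example: an erasure code with 10 data blocks and 4 parity blocks.
erasure_overhead = storage_overhead(data_blocks=10, parity_blocks=4)

# Triple redundancy: every block is stored three times, i.e. 2 extra copies.
replication_overhead = storage_overhead(data_blocks=1, parity_blocks=2)

print(f"erasure code:       {erasure_overhead:.0%} overhead")
print(f"triple replication: {replication_overhead:.0%} overhead")
```

Under these assumptions the erasure code stores 1.4 bytes per byte of user data, while triple replication stores 3.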

But when the data redundancy is compromised by a hardware failure - disk, server, or data center - redundancy must be restored. As the size of data restoration grows, the cost of repair becomes vital.

The major repair costs in a distributed storage system are bandwidth and compute. Bandwidth, because data has to travel across interconnects from the surviving nodes to the node being rebuilt. Compute, because the lost data was protected mathematically and has to be reconstructed by calculation.

We have codes, such as MDS (Maximum Distance Separable) codes, that are optimal for fault tolerance and capacity overhead. But these codes have a high repair bandwidth cost.

On the other hand, Pyramid codes are an example of non-MDS codes optimized to minimize the number of nodes contacted to reconstruct data, which reduces the bandwidth requirement. Researchers have derived codes that minimize bandwidth or maximize storage efficiency.
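A rough sketch of why the number of nodes contacted matters. It assumes conventional repair of a single lost block: an MDS code with k data blocks reads k surviving blocks, while a code with local parity groups reads only the rest of the affected group. The values of k, the group size, and the block size are illustrative and not taken from any particular system.

```python
def repair_reads_mds(k: int) -> int:
    """Conventional MDS repair: read k surviving blocks to rebuild one."""
    return k

def repair_reads_local(group_size: int) -> int:
    """Local-group repair: read the other data blocks in the group
    plus that group's local parity block."""
    return (group_size - 1) + 1

# Illustrative parameters (assumptions, not from the article).
k = 10            # data blocks per stripe
group_size = 5    # data blocks per local parity group
block_mb = 256    # size of one block in MB

mds_traffic_mb = repair_reads_mds(k) * block_mb
local_traffic_mb = repair_reads_local(group_size) * block_mb

print(f"MDS repair:   {mds_traffic_mb} MB transferred")
print(f"local repair: {local_traffic_mb} MB transferred")
```

The price of the local groups is extra parity blocks, which is the reliability-versus-repair-cost tradeoff described above.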

Improving codes

In a recent paper, Code Constructions for Distributed Storage With Low Repair Bandwidth and Low Repair Complexity, researchers in Sweden, Norway, and France present a solution to the problem of combining efficient storage with low-cost repair:

. . . we propose a family of non-MDS ECCs that achieve low repair bandwidth and low repair complexity while keeping the field size relatively small and having variable fault tolerance. In particular, we propose a systematic code construction based on two classes of parity symbols.

They propose two classes of parity nodes. The first class is constructed from an MDS code with added "piggybacks" on some of its code symbols, and is intended to provide the error correction capability.

The second class of parity nodes uses a block code whose parity symbols are created with simple addition. This class is meant to reduce repair bandwidth and complexity by repairing failed symbols in the node.
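To show why addition-only parity is cheap to repair, here is a minimal sketch that uses bytewise XOR (addition over GF(2)) as the "simple addition". This is a simplification for illustration, not the paper's actual construction, and the block contents and the helper name xor_blocks are made up.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR (addition over GF(2)) of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three equal-length data blocks and one parity block formed by addition.
data = [b"node", b"pair", b"code"]
parity = xor_blocks(data)

# Simulate losing data[1]; rebuild it from the survivors plus the parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)
```

Repair here needs only additions over the surviving blocks, with no field multiplications or matrix inversion, which is the kind of low repair complexity the second class is after.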


Testing these codes, the researchers found a reduction in repair bandwidth of 30 to 64 percent compared to MDS codes. Given that network bandwidth is typically the most expensive part of the infrastructure, this is a significant economic advantage.

The Storage Bits take

Data centers are a significant consumer of electricity, and are growing faster than most sectors because of the growth of mobile services and big data. Making distributed storage systems more efficient benefits us all, environmentally and economically, and improves service quality.

It also is a reminder that we are very much in the early days of warehouse-scale computing. If the industry is to mature rapidly, we need more research into improving efficiency at every level of the stack. If you design ECCs, I recommend this paper.

Courteous comments welcome, of course.
