Amazon's experience: fault tolerance and fault finding

Fault-tolerant architectures have risks that are magnified by natural human tendencies. A new Google paper hints at the larger issues.

Fault tolerance breeds complacency In 1997 Sun introduced its first fully redundant storage system: redundant I/O channels, power supplies, cooling, and data. But within weeks we got field reports of total system failures.

The problem was simple: a component failed, but because the system kept running and the error messages were ignored, a second fault brought the whole system down. Oops.

We fixed it not with triple redundancy but by making the faults harder to ignore. Compare that solution to the 5X redundancy of the computer systems on board the Apollo moon shots.
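The actual fix lived in the product's firmware and service tooling; this is only a toy sketch, with hypothetical names (`RedundantPair`, `alert`), of the idea that a degraded system should escalate loudly rather than log quietly and carry on:

```python
# Hypothetical sketch: a redundant pair that refuses to fail silently.
# The class and callback names are illustrative, not from any real product.

class RedundantPair:
    def __init__(self, alert):
        # alert is a callback that pages a human (here, just collects messages)
        self.healthy = {"primary": True, "secondary": True}
        self.alert = alert

    def component_failed(self, name):
        self.healthy[name] = False
        survivors = [n for n, ok in self.healthy.items() if ok]
        if survivors:
            # Still running, but now one fault away from total failure:
            # escalate loudly instead of burying it in a log file.
            self.alert(f"DEGRADED: {name} down, no redundancy left")
        else:
            self.alert(f"OUTAGE: both components down")

alerts = []
pair = RedundantPair(alerts.append)
pair.component_failed("primary")    # system survives, but operators are paged
pair.component_failed("secondary")  # the second fault takes the system down
```

The point is the shape of the fix: the first failure produces an unmissable signal while there is still time to act, instead of waiting for the second failure to make itself known.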

Google's experience A Google engineering team commented on this in a recent paper on their Megastore system. They found that fault tolerance

. . . often hides persistent underlying problems. We have a saying in the group: “Fault tolerance is fault masking”. Too often the resilience of our system coupled with insufficient vigilance in tracking the underlying faults leads to unexpected problems: small transient errors on top of persistent uncorrected problems cause significantly larger problems.

This is a bigger issue because of the way it intersects with human psychology: we want to believe that the multibillion-dollar infrastructures we've built are strong and safe; we get bored looking for things that probably aren't there; and the intersection of a transient error with an underlying fault is very difficult for engineers to imagine or guard against.

Bottom line: humans aren't built to handle this class of problem, so we need our machines to help.

Tolerate or amputate? The Google team makes another important observation about fault-tolerant infrastructures: a system that tolerates faulty participants can miss their larger impact on the total infrastructure.

For example, if an algorithm tolerates slow participants, and a faulty system merely looks slow, the algorithm can throttle the entire group as it "handles" the fault by waiting for the slow machine. A chain gang is only as fast as its slowest member.

Sometimes the right answer is not to tolerate but to amputate a "fault" and let the rest of the system get on with the work. But how do you tell the difference?
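The chain-gang arithmetic is easy to sketch. This toy model, with an illustrative speed cutoff I've made up, shows a lock-step group advancing at the pace of its slowest member, and what amputation buys:

```python
# Toy model of the "chain gang" effect in a lock-step group.
# The numbers and the cutoff rule are illustrative, not from the paper.

def group_throughput(node_speeds, amputate_below=None):
    """Ops/sec of a lock-step group; optionally drop nodes below a speed cutoff."""
    speeds = [s for s in node_speeds
              if amputate_below is None or s >= amputate_below]
    # Every round waits for the slowest surviving member.
    return min(speeds) if speeds else 0.0

speeds = [100, 100, 100, 5]   # one node is 20x slower than its peers
print(group_throughput(speeds))                      # 5   -- tolerated: the group crawls
print(group_throughput(speeds, amputate_below=50))   # 100 -- amputated: full speed
```

The model also shows why the question is hard: set the cutoff too aggressively and you amputate healthy-but-busy nodes, shrinking the group until nothing is left to do the work.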

The Storage Bits take All systems have faults. The interesting question is: what happens then?

In the case of Amazon's outage I expect the root cause analysis will show an unexpected interaction of a transient fault with an underlying fault. And the solution will be improved instrumentation and notification to admins. And maybe an "amputation" feature in the software.

But at bottom we live in a statistical universe: a larger user base finds more bugs than a smaller user base. And Amazon's user base is growing rapidly.

This particular problem is unlikely to recur. But other nasty bugs are out there and will bite us.

Is this a reason to run screaming from cloud technology? No. But know that there will be other failures even as the failures grow rarer each year.

Comments welcome, of course. I wrote about and linked to the Google paper on StorageMojo.