In the paper Gray Failure: The Achilles' Heel of Cloud-Scale Systems, computer scientists Peng Huang, Chuanxiong Guo, Lidong Zhou, and Jacob R. Lorch of Microsoft Research, and Yingnong Dang, Murali Chintalapati, and Randolph Yao of Microsoft Azure, banded together to explore the gray failure problem.
The downside of hyperscale
They define gray failures as
. . . component failures whose manifestations are fairly subtle and thus defy quick and definitive detection.
These subtle failures can lead to bad performance, lost packets, faulty I/O, memory thrashing, and non-fatal exceptions.
Naturally, as the number of infrastructure components increases, so does the number of gray failures. This is hyperscale's dark side.
While occasional slowdowns may seem a small price to pay for the benefits of cloud services, the danger of gray failures is far greater. As gray failures accumulate, the stress on healthy systems grows, which can lead to a cascading, headline-grabbing outage.
Gray failure's deep roots
Fault-tolerant systems rest on three pillars: redundancy, failure detection, and failure recovery. Redundancy is a given in cloud infrastructure. The problems come in failure detection and recovery.
The coders who write the software layer are rarely expert in the hardware that makes up the infrastructure. Often they make simplistic assumptions about how it fails and what needs to be detected.
But as any hardware engineer can tell you, there are many ways hardware can go wrong without crashing or smoking. Intermittent hardware glitches, memory leaks, buffer overflows, and background jobs can all lead to reduced performance or an intermittent gray failure, without an overt symptom that would trigger a system reboot or hardware replacement.
The key symptom of a gray failure is what the authors call differential observability. If a server has slowed to a crawl, but its heartbeat is regular, an observing system won't see a problem, but a client system will. That's differential observability.
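To make the idea concrete, here is a minimal Python sketch of differential observability. It is not code from the paper: the `Observation` fields, the `classify` function, and the 500 ms threshold are all assumptions made for illustration. The point is only that the same component is judged from two vantage points, and a gray failure is the case where they disagree.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Observation:
    """One component seen from two vantage points (hypothetical fields)."""
    heartbeat_ok: bool                 # what the failure detector sees
    client_latencies_ms: List[float]   # what the applications using it see


def classify(obs: Observation, slow_ms: float = 500.0) -> str:
    """Return 'crashed', 'gray', or 'healthy' for a component.

    slow_ms is an assumed threshold, not a value taken from the paper.
    """
    if not obs.heartbeat_ok:
        # Both observers agree: an ordinary, easily detected failure.
        return "crashed"
    if obs.client_latencies_ms and median(obs.client_latencies_ms) > slow_ms:
        # The detector says "fine" while clients see a crawl.
        return "gray"
    return "healthy"
```

A server with a steady heartbeat and client latencies around one second would come back as "gray" here, which is exactly the case a heartbeat-only monitor would miss.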
That leads the authors to make several recommendations to better detect and correct gray failures.
- Don't rely on a single indicator, such as heartbeat, for system health.
- Try to take an application view, rather than a hardware view, to detect gray failures.
- Leverage scale for detection. For difficult gray failures you may need to collect observations from thousands of servers and use statistical inference to find the gray-failed components.
- Temporal analysis. Tracking significant failures back in time to understand the small faults that led to the outage helps sharpen up the detection process.
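The "leverage scale" recommendation can be sketched in a few lines of Python. This is one simple statistical approach, not the authors' method: it compares each server's latency against the fleet median and flags those that sit far above it, using the median absolute deviation (MAD) as the measure of spread. The function name, the `k` multiplier, and the `min_spread_ms` floor are assumptions chosen for the example.

```python
from statistics import median
from typing import Dict, List


def find_slow_outliers(
    latency_ms_by_server: Dict[str, float],
    k: float = 5.0,
    min_spread_ms: float = 1.0,
) -> List[str]:
    """Return names of servers whose latency is far above the fleet median.

    A server is flagged when it exceeds the median by more than k times
    the median absolute deviation. min_spread_ms is a floor on the spread
    so a very uniform fleet does not flag trivial differences.
    """
    values = list(latency_ms_by_server.values())
    if len(values) < 3:
        return []  # too few observations to say what "normal" looks like
    center = median(values)
    mad = median(abs(v - center) for v in values)
    spread = max(mad, min_spread_ms)
    return sorted(
        name
        for name, v in latency_ms_by_server.items()
        if v - center > k * spread
    )
```

The median and MAD are used instead of the mean and standard deviation because a handful of badly degraded servers would otherwise drag the baseline toward themselves and hide in it.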
The Storage Bits take
Gray failures are an extension of a class of bugs that the late, great Jim Gray called "Heisenbugs": transient errors that disappear when you try to observe them, due to subtle differences in initial conditions. Because of their transitory nature, no single tool or metric will capture them.
Does this mean that cloud infrastructures are doomed to fail under the weight of their increasing size and complexity? No. But it does mean that the tools used to manage them must become more sophisticated.
And infrastructure architects must become mindful of the subtleties of how gray failures interact with system design, as discussed in the paper. One example is the counter-intuitive finding that greater redundancy can lead to lower availability.
Courteous comments welcome, of course. Bravo to Microsoft Research and the Azure folks for publishing this paper. It's nice to know that MS has some very smart people minding the store.