How clouds fail

The massive redundancy of cloud infrastructures makes them seem virtually bulletproof. Not true. In fact, so-called "gray failures" are the real problem for cloud apps - and they often go undetected by today's management systems. Here's what you need to know.

Written by Robin Harris, Contributor July 24, 2017 at 5:23 a.m. PT

In the paper Gray Failure: The Achilles' Heel of Cloud-Scale Systems, computer scientists Peng Huang, Chuanxiong Guo, Lidong Zhou, and Jacob R. Lorch, of Microsoft Research, and Yingnong Dang, Murali Chintalapati, and Randolph Yao, of Microsoft Azure, banded together to explore the gray failure problem.

The downside of hyperscale

They define gray failures as

. . . component failures whose manifestations are fairly subtle and thus defy quick and definitive detection.

These subtle failures can lead to bad performance, lost packets, faulty I/O, memory thrashing, and non-fatal exceptions.

Naturally, as the number of infrastructure components increases, so does the number of gray failures. This is hyperscale's dark side.

While occasionally slower performance may seem a small price to pay for the benefits of cloud services, the danger of gray failures is far greater. As gray failures accumulate, the stress on healthy systems grows, and can lead to a cascading, headline-grabbing massive outage.

Gray failure's deep roots

Fault-tolerant systems rest on three pillars: redundancy, failure detection, and failure recovery. Redundancy is a given in cloud infrastructure. The problems come in failure detection and recovery.

The coders who write the software layer are rarely expert in the hardware that makes up the infrastructure. Often they make simplistic assumptions about how it fails and what needs to be detected.

But as any hardware engineer can tell you, there are many places hardware can go wrong without crashing or smoking. Intermittent hardware glitches, memory leaks, buffer overflows, and background jobs can all lead to reduced performance or an intermittent gray failure, with no overt symptom that would trigger a system reboot or hardware replacement.

Differential observability

The key symptom of a gray failure is what the authors call differential observability. If a server has slowed to a crawl, but its heartbeat is regular, an observing system won't see a problem, but a client system will. That's differential observability.
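To make the idea concrete, here is a minimal Python sketch of the slow-server case described above. It is an illustration, not code from the paper: the ServerView type, the differential_observability function, and the 200 ms latency target are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class ServerView:
    """Two views of the same server (hypothetical structure for illustration)."""
    heartbeat_ok: bool         # what the monitoring system observes
    client_latency_ms: float   # what a client application observes


def differential_observability(view: ServerView, latency_target_ms: float = 200.0) -> bool:
    """Return True when the monitor sees a healthy server but clients do not.

    The 200 ms default is an assumed latency target, not a value from the paper.
    """
    client_ok = view.client_latency_ms <= latency_target_ms
    return view.heartbeat_ok and not client_ok


healthy = ServerView(heartbeat_ok=True, client_latency_ms=35.0)
gray = ServerView(heartbeat_ok=True, client_latency_ms=4200.0)
crashed = ServerView(heartbeat_ok=False, client_latency_ms=float("inf"))

for name, view in [("healthy", healthy), ("gray", gray), ("crashed", crashed)]:
    print(name, differential_observability(view))
```

Only the middle case is flagged. The crashed server is not, because the monitor and the client agree that it is down, so ordinary failure detection already covers it.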

That leads the authors to make several recommendations to better detect and correct gray failures.

Don't rely on a single indicator, such as heartbeat, for system health.
Try to take an application view, rather than a hardware view, to detect gray failures.
Leverage scale for detection. For difficult gray failures you may need to collect observations from thousands of servers and use statistical inference to find the gray-failed components.
Temporal analysis. Tracking significant failures back in time to understand the small faults that led to the outage helps sharpen up the detection process.
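
The "leverage scale" recommendation can be sketched in a few lines of Python. This is one simple way to do statistical inference over a fleet, assuming each server reports a client-observed error rate; it is not the method the paper's authors use. The function name find_gray_suspects, the server names, and the sample numbers are invented for illustration. The outlier test is the common median-based "modified z-score" with its conventional 3.5 cutoff.

```python
import statistics


def find_gray_suspects(error_rates: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Flag servers whose error rate is far above the fleet median.

    Uses the modified z-score, 0.6745 * (x - median) / MAD, which is less
    distorted by the outliers themselves than a mean/stddev test would be.
    """
    values = list(error_rates.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # No measurable spread in the fleet; this simple test cannot rank anyone.
        return []
    return sorted(
        name
        for name, rate in error_rates.items()
        if 0.6745 * (rate - median) / mad > threshold
    )


# Hypothetical fleet: five servers near a 1% error rate, one at 15%.
fleet = {
    "s1": 0.010, "s2": 0.012, "s3": 0.011,
    "s4": 0.009, "s5": 0.010, "s6": 0.150,
}
print(find_gray_suspects(fleet))
```

With real fleets the input would be thousands of servers and several metrics, but the principle is the same: a server that looks fine on its own stands out once it is compared with its peers.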

The Storage Bits take

Gray failures are an extension of a class of bugs that the late, great Jim Gray called "Heisenbugs": transient errors that disappear when you try to observe them, due to subtle differences in initial conditions. Because of their transitory nature, no single tool or metric will capture them.

Does this mean that cloud infrastructures are doomed to fail under the weight of their increasing size and complexity? No. But it does mean that the tools used to manage them must become more sophisticated.

And infrastructure architects must become mindful of the subtleties of gray failure interactions with system design as discussed in the paper. One example is the counter-intuitive finding that greater redundancy can lead to lower availability.

Courteous comments welcome, of course. Bravo to Microsoft Research and the Azure folks for publishing this paper. It's nice to know that MS has some very smart people minding the store.
