Yesterday, Gmail went down. For nearly two hours, users of the world's most visible cloud service were stuck with a very old-fashioned "502 – unable to reach server" error message. That's apt: it was a very old-fashioned problem. Google had been hit by cascading failure.
The company is being unusually open about what went wrong: an upgrade to part of the system rendered that part unable to handle its usual traffic. It signalled this to the rest of the Gmail cloud, and handed over the requests it couldn't serve. Unfortunately, that triggered similar problems in the parts of the system handling the overflow: they in turn shut down and passed on the original overflow, plus their own traffic — and so on, and so forth.
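The mechanism described above can be sketched as a toy simulation: each live node carries an equal share of the total load, and a node pushed past its capacity shuts down and hands its traffic to the survivors. The node counts and capacities here are illustrative, not Google's actual topology.

```python
# Toy model of cascading failure, loosely following the incident
# description above. All numbers are illustrative.

def simulate(capacities, load_per_node):
    """Each live node carries an equal share of total load. A node whose
    share exceeds its capacity shuts down and hands its traffic to the
    survivors, which may push them over the edge in turn."""
    total_load = load_per_node * len(capacities)
    live = dict(enumerate(capacities))
    while live:
        share = total_load / len(live)
        failed = [n for n, cap in live.items() if share > cap]
        if not failed:
            break
        for n in failed:
            del live[n]
    return live

# Ten servers, each rated for 120 units, carrying 100 each: healthy.
assert len(simulate([120] * 10, 100)) == 10

# An "upgrade" cripples two servers: they drop out, their load spills
# onto the eight survivors (125 units each, above their 120 rating),
# and the whole system collapses round by round.
assert len(simulate([60, 60] + [120] * 8, 100)) == 0
```

Note how abruptly the model tips: with all ten nodes up, every node has 20 units of headroom; lose two, and the redistributed load exceeds every remaining node's capacity at once.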
Cascading failure is well known. In the past, it has brought down telephone systems, power distribution grids and, most recently, came close to toppling the global financial system. It happens in human biology. It is, in short, a classic problem, characterised by the speed at which it develops — Google knew about it within seconds of it kicking off — and the totality of its consequences.
In the case of Gmail, it even leapt the species barrier to completely independent services. For a while Twitter, now the world's back channel for cloud error reporting, was overloaded by people asking "Is Gmail down?".
Expect more of the same, as we build ever more complex and interconnected systems. The irony is that the major cause of the problem is good engineering, as Google admits: the upgrade that triggered the meltdown was designed to improve the very thing that went wrong.
Efficiency, normally a touchstone of proper design, is the enemy. With large systems, any over-engineering is very expensive, so the tendency is to plan for the worst case and build for that and no more. But any worst case can be made more so if the system itself starts to fail. There is no slack to soak up the sudden increase in demand on the remaining components, and the cosmos falls apart.
There is no cure, but there is a smart way to prepare for the problem. Have classically inefficient systems that are relaxed about overloads. Don't try to offer everything to everybody — Gmail's alternative non-web access methods, far less popular, carried on working. Have separately engineered control and monitoring pathways that keep on going when the core functions are broken. Diversity, flexibility, inefficiency and the expectation of failure: these are the hallmarks of reliable distributed systems.
For those of us who look to the cloud for the next generation of computing, these lessons are essential. Fortunately, we have a good example that's been working for 40 years — the internet itself — and we still live in a heterogeneous world, where a diversity of options is bolstered by open standards that ensure flexibility.
It is essential that in every step we take into the cloud we ask ourselves "What happens when this goes wrong?" and expect a sensible reply before going further. Those who claim to have all the answers will end up being the universal problem.