Gmail: When efficiency equals death

Summary: Google emphasises cloud reliability, but Gmail has gone down again. A very old problem is new again.

Yesterday, Gmail went down. For nearly two hours, users of the world's most visible cloud service were stuck with a very old-fashioned "502 – unable to reach server" error message. That's apt: it was a very old-fashioned problem. Google had been hit by cascading failure.

The company is being unusually open about what went wrong: an upgrade to part of the system rendered that part unable to handle its usual traffic. It signalled this to the rest of the Gmail cloud, and handed over the requests it couldn't serve. Unfortunately, that triggered similar problems in the parts of the system handling the overflow: they in turn shut down and passed on the original overflow, plus their own traffic — and so on, and so forth.
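
The chain reaction described above can be sketched as a toy simulation. This is only an illustration, not Google's architecture: the function name, the rule that a failed server's whole load is split evenly among the survivors, and all the numbers are assumptions made for the example.

```python
def simulate_cascade(loads, capacities):
    """Toy model of a cascading failure (hypothetical, for illustration only).

    Any server whose load exceeds its capacity shuts down and hands its
    entire load, split evenly, to the servers still running.
    Returns (rounds, survivors): the server indices that failed in each
    round, and the indices still running at the end.
    """
    loads = [float(x) for x in loads]
    alive = set(range(len(loads)))
    rounds = []
    while alive:
        overloaded = sorted(i for i in alive if loads[i] > capacities[i])
        if not overloaded:
            break  # every remaining server can cope: the cascade has stopped
        rounds.append(overloaded)
        spilled = sum(loads[i] for i in overloaded)
        for i in overloaded:
            alive.discard(i)
            loads[i] = 0.0
        if alive:
            share = spilled / len(alive)
            for i in alive:
                loads[i] += share
    return rounds, sorted(alive)


# Five servers each carrying 85 units. An "upgrade" has cut server 0's
# capacity to 50, and the others have varying amounts of headroom.
rounds, survivors = simulate_cascade([85] * 5, [50, 100, 110, 130, 200])
print(rounds)     # [[0], [1], [2, 3], [4]]
print(survivors)  # []
```

In this made-up run, one under-provisioned server is enough to take out the whole pool in four rounds, because each failure adds to the load the survivors must carry.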

Cascading failure is well known. In the past, it has brought down telephone systems and power distribution grids and, most recently, it came close to toppling the global financial system. It happens in human biology. It is, in short, a classic problem, characterised by the speed at which it develops (Google knew about it within seconds of it kicking off) and the totality of its consequences.

In the case of Gmail, it even leaped the species barrier to completely independent services. For a while Twitter, now the world's back channel for cloud error reporting, was overloaded by people asking "Is Gmail down?".

Expect more of the same, as we build ever more complex and interconnected systems. The irony is that the major cause of the problem is good engineering, as Google admits: the upgrade that triggered the meltdown was designed to improve the very thing that went wrong.

Efficiency, normally a touchstone of proper design, is the enemy. With large systems, any over-engineering is very expensive, so the tendency is to plan for the worst case and build for that and no more. But any worst case can be made more so if the system itself starts to fail. There is no slack to soak up the sudden increase in demand on the remaining components, and the cosmos falls apart.
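
The cost of running without slack can be put in rough numbers. The sketch below assumes a pool of identical servers sharing the load evenly, and asks how many can fail before the rest are pushed past capacity. The function name and the figures are hypothetical, and real systems are rarely this uniform.

```python
def failures_absorbed(n_servers, load_per_server, capacity_per_server):
    """How many servers can be lost before the survivors are overloaded?

    Assumes identical servers and that the total load is always spread
    evenly across whichever servers remain (a simplifying assumption).
    """
    total = n_servers * load_per_server
    k = 0
    # Losing one more server is tolerable only if the remaining servers
    # can still carry the whole load between them.
    while k < n_servers and total <= (n_servers - (k + 1)) * capacity_per_server:
        k += 1
    return k


# Ten servers, each with a capacity of 100 units.
print(failures_absorbed(10, 50, 100))  # 5: running half-empty, "inefficient"
print(failures_absorbed(10, 80, 100))  # 2
print(failures_absorbed(10, 95, 100))  # 0: efficient, and one failure from collapse
```

Under these assumptions, the pool tuned to 95 percent utilisation cannot survive the loss of a single server, which is the sense in which efficiency removes the margin that would otherwise absorb a failure.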

There is no cure, but there is a smart way to prepare for the problem. Have classically inefficient systems that are relaxed about overloads. Don't try to offer everything to everybody through a single route: Gmail's alternative, non-web access methods, far less popular, carried on working. Have separately engineered control and monitoring pathways that keep on going when the core functions are broken. Diversity, flexibility, inefficiency and the expectation of failure: these are the hallmarks of reliable distributed systems.
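
One reading of "relaxed about overloads" is load shedding: a server that is over capacity serves what it can and turns away the rest, rather than shutting down and passing everything on. The sketch below is a minimal illustration under that assumption. The names and numbers are invented, and it does not describe how Gmail works.

```python
def serve_with_shedding(loads, capacities):
    """Each server serves up to its capacity and drops the excess.

    Nothing is handed on to other servers, so one overloaded server
    cannot push its neighbours over the edge (illustrative model only).
    """
    served = [min(load, cap) for load, cap in zip(loads, capacities)]
    dropped = [load - s for load, s in zip(loads, served)]
    return served, dropped


# Five servers each offered 85 units; server 0 can only handle 50.
loads = [85, 85, 85, 85, 85]
capacities = [50, 100, 110, 130, 200]
served, dropped = serve_with_shedding(loads, capacities)
print(served)   # [50, 85, 85, 85, 85]
print(dropped)  # [35, 0, 0, 0, 0]
print(f"{sum(served) / sum(loads):.0%} of requests served")
```

Some users are still turned away in this model, but the damage stays confined to the degraded server instead of spreading through the pool.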

For those of us who look to the cloud for the next generation of computing, these lessons are essential. Fortunately, we have a good example that's been working for 40 years, the internet itself, and we still live in a heterogeneous world, where a diversity of options is bolstered by open standards that ensure flexibility.

It is essential that in every step we take into the cloud we ask ourselves "What happens when this goes wrong?" and expect a sensible reply before going further. Those who claim to have all the answers will end up being the universal problem.


  • Spot On

    I'm pleased you posed the question "What happens WHEN this goes wrong?" rather than "What happens IF this goes wrong?" However, I would go further and say the mark of a true engineer (hardware or software) would be, "What do we do when SEVERAL things go wrong?"

    I think this has actually happened at a good time. Enough people have been inconvenienced for it to be a real wake-up call, but not enough for it to be a disaster. We might not be so lucky next time.
  • What if Google Health Crashes?

    From the graphic on this piece and the headline, I thought you might extrapolate into the implications for the Conservatives' recent Electronic Medical Record plan, which looks like it could be based around Google Health or Microsoft's HealthVault offering.

    Gmail going down is one thing, but health records crashing? Obviously I am being alarmist, as there are a lot of hoops to jump through before the Conservatives have the chance to try out the plan, which should give the IT industry the time to make cloud apps rigorous enough to do the job.
    Andrew Donoghue
  • You get what you pay for...

    As an occasional Gmail user who accesses it over IMAP, I can't say I noticed. Goodness, that sounds smug. Perhaps it is.

    But the lesson surely is that you cannot afford to rely absolutely on a system that's free. Paid-for systems are not necessarily more reliable - it'd be a fool who claimed that, I feel, although some research might not go amiss here - but at least users have some form of redress in the form of an SLA. And if service providers know their mortgages are resting on the service's continuity, it concentrates the mind wonderfully...
    Manek Dubash