Google gets a cold and the world gets pneumonia

Well, maybe not pneumonia, but at least a nasty case of bronchitis.
Written by Christopher Dawson, Contributor

Gmail went down on Monday. Not for a particularly long time: 33 minutes from outage to complete resolution, in fact. Late risers on the West Coast probably wouldn't even have known about it if not for panicking tech pundits on the East Coast. To hear Wired talk about it, this portends the end of the world as we know it. OK, they weren't quite that over-the-top, but they, like many news outlets, had some very dramatic sound bites about the issue.

I'm not dismissing this outage, by the way. I live, eat, and breathe Google and the Gmail outage (caused by a bad update to their load-balancing software) had ripple effects across many related services (including the Chrome browser for users who, like me, choose to sync data across their various services). This isn't a small thing and, in fact, leads to the title of this article.

For Google, it was a hiccup. A bit of bad software rolls out, doesn't work, and gets rolled back. For the millions of people who rely on Google to get their jobs done, to enable important (and sometimes critical) business and personal communications, to write and calculate and advertise and sell, even a minor blip is cause for concern. As one analyst put it in the aforementioned Wired article,

“Imagine a scenario where you can’t even open your Android phone or you can’t get phone calls on Google Voice. It’s not just your browser.”

Given the market penetration of Android and projected domination of the mobile space, this sounds like a nightmare scenario. One wrong move from Google and all of our phones, tablets, Chromebooks, browsers, and communication tools go dead, assuming we've bought into the whole Google ecosystem (and many of us have). Doctors don't get urgent messages, stocks don't get traded, teenagers around the world stop texting for half an hour...you get the idea.

In reality, it's also a pretty damned unlikely scenario. For one thing, problems like those encountered Monday are rare. For another, Google's business model relies on the trust of its users, so Google has the ultimate vested interest in ensuring problems like these don't happen.

Let's also keep in mind that Google detected the problem via its own monitoring software within 21 minutes and took action 7 minutes later. Just a few minutes after that, the bad update was rolled back from its production servers. There aren't many IT departments that can claim that sort of response time for on-premises communication and collaboration software. All users had to do was tweet about the Gmail outage for half an hour and they were back up and running.

Yes, there are risks involved in putting all of your IT eggs in one basket, whether that basket is in Mountain View, Redmond, Seattle, or somewhere else. What's the alternative, though? Several disparate systems from several vendors, requiring either separate federation systems or countless user logins? Or expensive, highly redundant on-premises solutions? Even Microsoft and its partners are doing a healthy business selling hosted solutions because they generally save time and money.

Whether your system of choice comes from Google, Microsoft, Amazon, Apple, or sits in your own datacenter, someday it's going to go down. Service providers strive for "five nines" or 99.999% uptime. That's a great goal, but even that goal (a stretch for many) implies that some downtime is inevitable.
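To put a number on "some downtime is inevitable," here is a rough back-of-the-envelope sketch of what different uptime targets allow over a year. The function name and the choice of a 365-day year are my own assumptions for illustration, not anything a provider publishes.

```python
# Rough sketch: how much downtime per year a given uptime target permits.
# Assumes a 365-day year (leap years ignored); names here are illustrative only.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows roughly {minutes:.1f} minutes of downtime per year")
```

By this arithmetic, five nines leaves a budget of only about 5.26 minutes a year, so a single 33-minute outage would blow through it several times over. That is a measure of how demanding the target is, not of how badly any one provider performed.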

Google's success means that even that tiny amount of downtime has wide-ranging, worldwide effects and commensurate headlines and Twitter outrage. However, it's important to keep this in perspective. When a plane crashes, it makes headlines for days. Hundreds of people might die at once. And yet 3000 people die every day worldwide in car accidents, very few of which we ever hear about. It's a matter of scale that makes front-page news.

Is the sheer scale of Google or Amazon reason enough to avoid the cloud? Not at all. The conveniences and cost savings make occasional downtime an extremely reasonable risk for the majority of businesses and individuals. The key is managing panic when things do go wrong, as well as demanding that cloud providers (the big guns in particular) continue to innovate and offer better reliability at better prices than we can achieve ourselves.
