
Why you should be glad about Gmail failures

The way I look at it, every Gmail outage is a small investment I'm willing to make towards a future when I'll be able to take its reliability utterly for granted.

Gmail is having problems again today and some users are squirming while others aren't worried.

Of course it's a hassle when Gmail isn't available. I found my work rhythm was interrupted: instead of writing and sending some emails as I'd planned, I had to switch to another task, and those emails are still sitting on my to-do list now. But the way I look at it, every Gmail outage is a small investment I'm willing to make towards a future when I'll be able to take its reliability utterly for granted.

With every Gmail fail, Google learns more about operating a cloud-scale, enterprise-class email infrastructure. While it may be true that Hotmail and Yahoo! Mail have more registered users and traffic, neither of them is trying to attract enterprise customers as Google is with its Google Apps suite (of which Gmail is the flagship application). That means no one has ever attempted what Gmail is now doing, and with each slip-up along the way, it learns how to do it better.

Remember the big outage that affected the Gmail web interface on the 1st of this month? Here's what the Gmail team posted about it later that day:

"This morning (Pacific Time) we took a small fraction of Gmail's servers offline to perform routine upgrades. This isn't in itself a problem — we do this all the time, and Gmail's web interface runs in many locations and just sends traffic to other locations when one is offline.

"However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!'. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded.

"We've turned our full attention to helping ensure this kind of event doesn't happen again. Some of the actions are straightforward and are already done — for example, increasing request router capacity well beyond peak demand to provide headroom. Some of the actions are more subtle — for example, we have concluded that request routers don't have sufficient failure isolation (i.e. if there's a problem in one datacenter, it shouldn't affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load). We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements ..."

You see what I mean? Learning, learning, learning from every glitch, and as soon as the solution is found it's implemented to the benefit of every one of Gmail's millions of users. As my friend and fellow Enterprise Irregular Anshu Sharma wrote a while back, this is one of the unsung benefits of multi-tenancy. The disasters may be high-profile, but that just incents the provider even more to avoid them in the future. By contrast, a software vendor of on-premise, single-tenant applications has little incentive to fix problems that only affect one customer at a time, even if the aggregate outage time is far more severe once you add up the results of each individual failure.
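
To make the failure mode in the Gmail team's note a little more concrete, here is a toy model of it in Python. It is only a sketch under my own assumptions, not a description of how Google's request routers actually work: the function names, the even split of traffic and the capacity figures are all invented for illustration. It contrasts routers that refuse traffic when overloaded (so their load shifts to the others) with routers that keep accepting traffic and simply get slower.

```python
"""Toy model of a request-router cascade; all numbers are made up."""


def shed_load(total_load, capacities):
    """Routers that refuse traffic when overloaded.

    Load is split evenly across the active routers. Any router whose share
    exceeds its capacity drops out and its traffic moves to the rest.
    Returns (routers still serving, rounds of drop-outs).
    """
    active = list(capacities)
    rounds = 0
    while active:
        share = total_load / len(active)
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # nobody is overloaded, so the system is stable
        active = survivors
        rounds += 1
    return len(active), rounds


def degrade(total_load, capacities):
    """Routers that keep accepting traffic and just get slower.

    Returns a slowdown factor per router (1.0 means full speed).
    """
    share = total_load / len(capacities)
    return [max(1.0, share / c) for c in capacities]


# Hypothetical fleet: ten routers with slightly different capacities.
CAPACITIES = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]

if __name__ == "__main__":
    print(shed_load(950, CAPACITIES))      # load comfortably within capacity
    print(shed_load(1020, CAPACITIES))     # load slightly underestimated
    print(max(degrade(1020, CAPACITIES)))  # worst slowdown if routers degrade
```

In this made-up fleet, a load of 950 leaves all ten routers serving. Nudge it to 1020 and the weakest router drops out, its traffic tips two more over the edge, and by the third round nothing is left standing. With the same load and routers that degrade instead of refusing traffic, the worst-hit router runs about 2% slower and everyone stays online.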

I realize it would be better still if Gmail didn't fail at all, ever. But think of each small outage as one more step along the path to that ultimate nirvana.