Autopsy of the post-mortem: If it can happen to Google, it will happen to you

Google's App Engine outage at the end of February provides a graphic example of why datacenter managers need to have accurate and tested disaster recovery and business continuity plans in place.
Written by David Chernicoff, Contributor

On February 24th, in the middle of the business day on the east coast, Google's App Engine went away. This is old news now; the problem was caused by a power outage at their primary data center, and they were back up and running in roughly two hours (7:48 AM to 10:19 AM PST). Google outages aren't new; any user of Gmail has noticed that every now and then the app runs slowly or not at all (...still loading...), and even the search engine has had its issues. In the past, though, those slowdowns were caused by network or routing issues. In the case of the February App Engine failure, it was an actual datacenter power failure.

On March 4th the Google App Engine team posted a post-mortem summary in the Google Group dedicated to App Engine. It is an excellent, well-presented document that clearly explains what happened and why the (relatively) simple task of recovering from a power failure at a critical datacenter caused more problems than it should have.

The takeaway from Google's online mea culpa is this: we found the problem, we fixed the problem, we uncovered a potential future problem, and we fixed that one too. Future downtime should be significantly reduced thanks to new procedures and changes to the Datastore configuration.

But the bottom line is this: the basic building blocks of a comprehensive disaster recovery / business continuity plan are pretty simple. Once you have your components in place (people, products, and plans), you test, document, and train, preferably before you roll out the solution. This is especially critical in the datacenter, where an outage has such widespread impact.

Google wasn't bitten by an obscure type of failure; they lost 25% of their machines before they could transition to backup power, effectively causing a partial outage, a scenario most recovery plans would cover. But by their own admission, their staff wasn't properly trained to handle the problems a partial outage could cause, nor had they documented a procedure that would have allowed rapid resolution.

Take this to heart for your own datacenters: test equipment and procedures as you implement them and whenever you make changes to the datacenter. Document what works and what doesn't; outline the proper procedures for problem resolution and provide clear guidance on when problems get pushed up the chain of responsibility. Train your personnel, not just in dealing with common or obscure problems, but in what to do when they realize they are in over their heads.
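To make that advice concrete, here is a minimal sketch of what "document the procedure and know when to escalate" can look like when encoded as something testable rather than tribal knowledge. Everything here is hypothetical (the incident names, the `RUNBOOK` table, and the `next_action` helper are illustrations, not any real Google or vendor tooling):

```python
# Hypothetical sketch: an escalation runbook encoded as data, so the
# documented response and the escalation threshold live in one place
# and can be exercised during drills. All names are illustrative.

RUNBOOK = {
    # incident type -> (documented first response, minutes before escalating)
    "partial_power_loss": ("fail over affected jobs to healthy machines", 10),
    "network_routing":    ("reroute traffic via secondary links", 15),
}

def next_action(incident: str, minutes_elapsed: int) -> str:
    """Return the documented response, or an escalation order once the
    time budget for on-shift resolution is exhausted."""
    if incident not in RUNBOOK:
        # Undocumented territory: staff should escalate immediately
        # rather than improvise, per the post-mortem's lesson.
        return "escalate: no documented procedure"
    response, budget = RUNBOOK[incident]
    if minutes_elapsed >= budget:
        return "escalate: time budget exceeded"
    return response

print(next_action("partial_power_loss", 5))   # documented response
print(next_action("partial_power_loss", 12))  # past budget, escalate
print(next_action("disk_firmware_bug", 0))    # no procedure, escalate
```

The point of a structure like this isn't the code itself; it's that the procedure and the escalation trigger are written down, versioned, and cheap to rehearse in a test, which is exactly what was missing in the App Engine incident.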
