
Autopsy of the post-mortem: If it can happen to Google, it will happen to you

By David Chernicoff | March 9, 2010, 12:24pm PST

Summary: Google’s App Engine outage at the end of February provides a graphic example of why datacenter managers need to have accurate and tested disaster recovery and business continuity plans in place.

On February 24th, in the middle of the business day on the East Coast, Google’s App Engine went away. This is old news now: the problem was caused by a power outage at their primary datacenter, and they were back up and running in roughly two and a half hours (7:48 AM PST to 10:19 AM PST). Google outages aren’t new; any user of Gmail has noticed that every now and then the app runs slowly or not at all (…still loading…), and even the search engine has had its issues. But in the past, the service slowdowns have been caused by network or routing issues. The February App Engine failure was an actual datacenter power failure.

On March 4th the Google App Engine team posted a post-mortem summary in the Google Group dedicated to App Engine. It is an excellent, well-presented document that clearly explains what happened and why the (relatively) simple task of recovering from a power failure at a critical datacenter caused more problems than it should have.

The takeaway that Google gives us in their online mea culpa is this: we found the problem, we fixed the problem, we uncovered a potential future problem, and we fixed that one too. Potential future downtime should be significantly reduced by the new procedures and Datastore configuration.

But the bottom line is this: the basic building blocks of a comprehensive disaster recovery / business continuity plan are pretty simple. Once you have your components in place (people, products, and plans), you test, document, and train, preferably before you roll out the solution. This is especially critical in the datacenter, where an outage has widespread impact.

Google wasn’t bitten by an obscure type of failure. They lost 25% of their machines before they could transition to backup power, effectively causing a partial outage, a scenario that most recovery plans would cover. But by their own admission, their staff wasn’t properly trained to handle the problems caused by the partial outage, nor had they documented an effective procedure that would have allowed rapid resolution.

Take this to heart for your own datacenters: test equipment and procedures as you implement them and whenever you make changes to the datacenter. Document what works and what doesn’t; outline the proper procedures for problem resolution, and provide clear guidance on when problems get pushed up the chain of responsibility. Train your personnel, not just to deal with common or obscure problems, but in what to do when they realize they are in over their heads.


With more than 20 years of published writings about technology, as well as industry stints as everything from a database developer to CTO, David Chernicoff has earned the term "veteran" in the technology world.

Disclosure

David Chernicoff

David does not invest in the technology he covers. As a freelance author and technologist he has had contract work with many vendors in the industry. Beyond the term of these short-term contracts there is no business or fiduciary arrangement with any technology vendor. David does not enter into contracts that would limit his freedom of expression in any way, nor is he remunerated for discussing any vendor. All comments in his blog writings are solely the opinions of David Chernicoff.

Biography

David Chernicoff

With more than 20 years of published writings about technology, as well as industry stints as everything from a database developer to CTO, David Chernicoff has earned the term "veteran" in the technology world. Currently the principal of an independent consulting business and an active freelance writer, David has most recently been a Senior Contributing Editor for Windows IT Pro magazine, having also been the Lab Director for Windows NT Magazine, Technical Director of PC Week Labs, the author or co-author of a number of books on different versions of Windows, a plethora of eBooks on various technology topics, and of approximately 3000 magazine articles in print and on the web.

Comments

Welcome to ZDNet...!
can't subscribe to the feed

First, Kudos to Google for...
SonofaSailor 10th Mar 2010
recognizing, admitting to the public, and (hopefully) learning from their mistakes.

But what I was surprised with was the decision to changeover to the alternate data ctr @ 30 minutes...had they not had conflicting procedures and incorrect config, that would have been a 1 hour recovery instead of 2 hours.

But like you said, proper testing and training would have addressed the time and confusion issue - the outlier at the 9:35 time stamp: "An engineer with familiarity with the unplanned failover procedure is reached, and begins providing guidance about the failover procedure" seems a bit 'San Fran Admin' to me...more people should have been privy to any and all failover procedures, and at least one should have been on site.

It is good to see...
RobertFolkerts 23rd Mar 2010
that Google is still run like an engineering
company. The sort of detail in this report is
uncommon and quite refreshing.
As I was told many, many times when I was trained "Failure to plan is a plan to fail". Google had no plan so had to fail.
