Autopsy of the post-mortem: If it can happen to Google, it will happen to you

Summary: Google's App Engine outage at the end of February provides a graphic example of why datacenter managers need to have accurate and tested disaster recovery and business continuity plans in place.

On February 24th, in the middle of the business day on the east coast, Google's App Engine went away. This is old news now; the problem was caused by a power outage at their primary datacenter, and they were back up and running in roughly two and a half hours (7:48 AM PST to 10:19 AM PST). Google outages aren't new; any user of Gmail has noticed that every now and then the app runs slowly or not at all (...still loading...), and even the search engine has had its issues. But in the past those service slowdowns have been caused by network or routing issues. In the case of the February App Engine failure, it was an actual datacenter power failure.

On March 4th the Google App Engine team posted a post-mortem summary in the Google Group dedicated to App Engine. It is an excellent, well-presented document that clearly explains what happened and why the (relatively) simple task of recovering from a power failure at a critical datacenter caused more problems than it should have.

The takeaway Google gives us in their online mea culpa is this: we found the problem, we fixed the problem, we uncovered a potential future problem, and we fixed that one too. Future downtime should be significantly reduced thanks to new procedures and a new Datastore configuration.

But the bottom line is this: the basic building blocks of a comprehensive disaster recovery / business continuity plan are pretty simple. Once you have your components in place (people, products, and plans), you test, document, and train, preferably before you roll out the solution. This is especially critical in the datacenter, because of the widespread impact of an outage.

Google wasn't bitten by an obscure type of failure: they lost 25% of the machines in the affected datacenter before they could transition to backup power, causing a partial outage, a scenario most recovery plans would cover. But by their own admission, their staff wasn't properly trained to handle the problems a partial outage can cause, nor had they documented a procedure that would have allowed rapid resolution.
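
The right response to this kind of partial failure is exactly the sort of thing a runbook can spell out in advance. As a purely illustrative sketch (not Google's actual procedure, and with made-up thresholds), the decision to abandon the primary datacenter can be reduced to a simple, documented rule rather than a judgment call made under pressure:

    # Minimal sketch of a documented failover rule; the 75% capacity floor and
    # 10-minute grace period are illustrative assumptions, not Google's values.
    from dataclasses import dataclass

    @dataclass
    class SiteStatus:
        healthy_machines: int
        total_machines: int
        minutes_degraded: float

    def should_fail_over(status: SiteStatus,
                         min_capacity: float = 0.75,
                         grace_minutes: float = 10.0) -> bool:
        """Return True when the primary site should be abandoned for the backup."""
        capacity = status.healthy_machines / status.total_machines
        return capacity < min_capacity and status.minutes_degraded >= grace_minutes

    # Roughly the App Engine scenario: 25% of machines lost, outage dragging on.
    print(should_fail_over(SiteStatus(75, 100, minutes_degraded=30)))  # True

With a rule like this written down, tested, and trained on, nobody has to improvise the failover decision while the datacenter is dark.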

Take this to heart for your own datacenters: test equipment and procedures as you implement them and whenever you make changes to the datacenter. Document what works and what doesn't; outline the proper procedures for problem resolution and provide clear guidance on when problems get pushed up the chain of responsibility. Train your personnel, not just in dealing with common and obscure problems, but also in what to do when they realize they are in over their heads.
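
That last point, knowing when to escalate, is also something you can write down instead of leaving to instinct. Here is a minimal sketch, assuming a hypothetical four-tier on-call chain and made-up time limits, of how a documented escalation policy might look:

    # Hypothetical escalation chain; the roles and time limits are assumptions
    # for illustration, not a recommendation for any specific organization.
    ESCALATION_CHAIN = [
        ("primary on-call", 15),       # minutes this tier owns the incident
        ("secondary on-call", 15),
        ("site reliability lead", 30),
        ("incident commander", None),  # top of the chain
    ]

    def who_owns_incident(minutes_unresolved: float) -> str:
        """Return the role that should own the incident after a given duration."""
        elapsed = 0.0
        for role, limit in ESCALATION_CHAIN:
            if limit is None or minutes_unresolved < elapsed + limit:
                return role
            elapsed += limit
        return ESCALATION_CHAIN[-1][0]

    print(who_owns_incident(10))   # primary on-call
    print(who_owns_incident(40))   # site reliability lead
    print(who_owns_incident(120))  # incident commander

The exact numbers matter less than the fact that they exist, are documented, and have been rehearsed before the real outage arrives.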

Talkback

  • There's NO substitute for good common sense...

    Welcome to ZDNet...!
    Wolfie2K3
  • RE: Autopsy of the post-mortem: If it can happen to Google, it will happen to you

    can't subscribe to the feed
    g_keramidas@...
  • First, Kudos to Google for...

    recognizing, admitting to the public, and (hopefully) learning from their mistakes.

    But what I was surprised with was the decision to changeover to the alternate data ctr @ 30 minutes...had they not had conflicting procedures and incorrect config, that would have been a 1 hour recovery instead of 2 hours.

    But like you said, proper testing and training would have addressed the time and confusion issue - the outlier at the 9:35 time stamp: "An engineer with familiarity with the unplanned failover procedure is reached, and begins providing guidance about the failover procedure" seems a bit 'San Fran Admin' to me...more people should have been privy to any and all failover procedures, and at least one should have been on site.
    SonofaSailor
    • It is good to see...

      that Google is still run like an engineering company. The sort of detail in this report is uncommon and quite refreshing.
      RobertFolkerts
  • RE: Autopsy of the post-mortem: If it can happen to Google, it will happen to you

    As I was told many, many times when I was trained "Failure to plan is a plan to fail". Google had no plan so had to fail.
    Agnostic_OS