Seven lessons to learn from Amazon's outage

Summary: After a harrowing four days, the remaining few customers still affected by Amazon's major outage are gradually coming back online. Here are seven key lessons to learn from this episode.

As of the latest update this afternoon on Amazon's Service Health Dashboard, only a handful of customers are still waiting for their Elastic Block Store (EBS) volumes and Relational Database Service (RDS) instances to be restored after Thursday's outage. But for everyone involved (not least Amazon's own operations staff) it's been a very long four days (see the latest Techmeme discussion). What are the lessons to learn?

1. Read your cloud provider's SLA very carefully

Amazingly, this almost four-day outage has not breached Amazon's EC2 SLA, which, as its FAQ explains, "guarantees 99.95% availability of the service within a Region over a trailing 365-day period." Since it has been the EBS and RDS services rather than EC2 itself that have failed (and all the failures have been restricted to Availability Zones within a single Region), the SLA has not been breached, legally speaking. That's no consolation for those affected, of course, nor is it any excuse for the disruption they've suffered. But it certainly gives pause for thought.
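To put that percentage in perspective, here's a quick back-of-the-envelope calculation (my own arithmetic, not a figure from Amazon) of how much downtime a 99.95% guarantee over a trailing 365-day window actually permits:

```python
# Back-of-the-envelope: downtime allowed by a 99.95% availability SLA
# measured over a trailing 365-day window.
HOURS_IN_WINDOW = 365 * 24      # 8,760 hours
sla = 0.9995                    # 99.95% availability

allowed_downtime_hours = HOURS_IN_WINDOW * (1 - sla)
print(f"Permitted downtime: {allowed_downtime_hours:.2f} hours")
# Prints roughly 4.38 hours -- far less than this four-day outage, which is
# why the EC2-only wording (EBS and RDS are excluded) matters so much.
```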

2. Don't take your provider's assurances for granted

Many of the affected customers were paying extra to host their instances in more than one Availability Zone (AZ). Amazon actually recommends this course of action to ensure resilience against failure. Each AZ, according to Amazon's FAQ, "runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone." Unfortunately, this turned out to be a technical specification rather than a contractual guarantee. It will take Amazon quite some effort to repair the reputational damage this event has brought upon it.
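For reference, "hosting instances in more than one Availability Zone" in practice means explicitly placing duplicate instances in different zones. Here's a minimal sketch of the idea, written against today's boto3 SDK rather than the tooling available in 2011; the AMI ID, instance type and zone names are placeholders, not details from any affected deployment:

```python
# Sketch only: launch one copy of the same server image in each of several
# Availability Zones within a single region. The AMI ID, instance type and
# zone names are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for zone in ["us-east-1a", "us-east-1b", "us-east-1c"]:
    ec2.run_instances(
        ImageId="ami-12345678",        # placeholder image
        InstanceType="t3.small",       # placeholder size
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
```

That spreads risk across zones but not across regions, and in this incident the failure crossed zone boundaries within the US East region in a way customers had been told to expect it would not.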

Justin Santa Barbara, founder and CEO of FathomDB, was forthright in his blog post on Why the sky is falling:

"AWS broke their promises on the failure scenarios for Availability Zones ... The sites that are down were correctly designing to the 'contract'; the problem is that AWS didn't follow their own specifications. Whether that happened through incompetence or dishonesty or something a lot more forgivable entirely, we simply don't know at this point."

While it's easy to be wise after the event, Amazon's vulnerability to this type of failure may have been visible on a deep-enough due diligence exercise. As Amazon competitor Joyent's Chief Scientist Jason Hoffman notes on the company's blog, "This is not a 'speed bump' or a 'cloud failure' or 'growing pains', this is a foreseeable consequence of fundamental architectural decisions made by Amazon."

3. Most customers will still forgive Amazon its failings

However badly they've been affected, customers have sung Amazon's praises in recognition of how much it's helped them run powerful infrastructure at lower cost and effort. Many prefaced their criticisms with gratitude for what Amazon had made possible, such as BigDoor's CEO Keith Smith:

"AWS has allowed us to scale a complex system quickly, and extremely cost effectively. At any given point in time, we have 12 database servers, 45 app servers, six static servers and six analytics servers up and running. Our systems auto-scale when traffic or processing requirements spike, and auto-shrink when not needed in order to conserve dollars."

4. There are many ways you can supplement a cloud provider's resilience

As O'Reilly's George Reese points out, "if your systems failed in the Amazon cloud this week, it wasn't Amazon's fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon's cloud computing model." It's useful to review the techniques customers have used to minimize their exposure to failures at Amazon.

Twilio, for example, didn't go down. Although the company hasn't explained exactly what its exposure was to the affected Northern Virginia Availability Zones, it has described its architectural design principles in the first entry on its new engineering blog by co-founder and CTO Evan Cooke. These include decomposing resources into independent pools, building in support for quick timeouts and retries, and having idempotent interfaces that allow multiple retries of failed requests. Of course, all this is easier said than done if all your experience is in designing tightly coupled enterprise application stacks that assume a resilient local area network.

Cooke's post goes on to describe some of the characteristics that make Twilio's architecture capable of operating in this more fault-tolerant manner. To start with, "Separate business logic into small stateless services that can be organized in simple homogeneous pools." Another step is to partition the reading and writing of data: "if there is a large pool of data that is written infrequently, separate the reads and writes to that data ... For example, by writing to a database master and reading from database slaves, you can scale up the number of read slaves to improve availability and performance."
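As a rough illustration of the "quick timeouts, retries and idempotent interfaces" principle (this is my own sketch, not Twilio's code; the endpoint and timings are invented), a caller might wrap every request to a downstream service like this:

```python
# Sketch of the pattern described above: fail fast with a short timeout,
# then retry, relying on the endpoint being idempotent so that a repeated
# request cannot apply the same operation twice. URL and timings are
# illustrative placeholders.
import time
import requests

def call_idempotent_service(url, payload, attempts=3, timeout=2.0):
    last_error = None
    for attempt in range(attempts):
        try:
            response = requests.post(url, json=payload, timeout=timeout)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            last_error = exc
            time.sleep(0.5 * (attempt + 1))   # brief, increasing back-off
    raise RuntimeError(f"all {attempts} attempts failed") from last_error
```

The read/write split Cooke mentions works the same way at the data layer: writes go to one connection (the master), reads to a separate pool (the slaves), so read capacity can be scaled independently of write capacity.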

Another site that didn't go down is Netflix, which runs all its infrastructure in the Amazon cloud. Again, it's not clear how exposed its operations were to the affected Amazon resources, but a Hacker News thread usefully summarizes some of the principles employed.

5. Building in extra resilience comes at a cost

Bob Warfield describes how a previous company used Amazon.com infrastructure in a way that allowed it to "bring back the service in another region if the one we were in totally failed within 20 minutes and with no more than 5 minutes of data loss." As he goes on to say, the choices you make about the length of outage you're prepared to support have consequences for the cost your customers or enterprise must fund. "Smart users and PaaS vendors will look into packaging several options because you should be backed up to S3 regardless, so what you’re basically arguing about and paying extra for is how 'warm' the alternate site is and how much has to be spun up from scratch via S3."
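As a concrete (and hypothetical) illustration of the "backed up to S3 regardless" baseline Warfield describes, a nightly job might ship a database dump off-site with the boto3 SDK; the bucket name and dump path below are placeholders:

```python
# Sketch of the cheapest tier of resilience: ship regular backups to S3 so
# that a cold standby can be rebuilt elsewhere if the primary site fails.
# Bucket name and dump path are hypothetical placeholders.
import datetime
import boto3

def ship_backup_to_s3(dump_path="/backups/app-db.dump",
                      bucket="example-disaster-recovery"):
    s3 = boto3.client("s3")
    key = f"db/{datetime.date.today().isoformat()}/app-db.dump"
    s3.upload_file(dump_path, bucket, key)   # copy the dump off-site
    return key
```

What you then pay extra for, in Warfield's terms, is how "warm" the environment that restores from that bucket is kept, and therefore how closely you can approach figures like his 20 minutes of recovery and 5 minutes of data loss.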

6. Understanding the trade-offs helps you frame what to ask

There are questions you should be asking to satisfy yourself that a cloud service you rely on is not exposing you to a similar failure (or at least that, if it is, you understand the risk and are willing to bear the consequences in return for a lower cost). Referring to Netflix's practice of randomly killing resources and services in order to test its resilience, Bob Warfield adds this advice:

"That's likely another good question to ask your PaaS and Cloud vendors — "Do you take down production infrastructure to test your failover?" Of course you'd like to see that and not just take their word for it too."

7. Lack of transparency may be Amazon's 'Achilles heel'

Several affected customers have complained of the lack of useful information forthcoming from Amazon during the outage. BigDoor CEO Keith Smith wrote, "If Amazon had been more forthcoming with what they are experiencing, we would have been able to restore our systems sooner." GoodData's Roman Stanek called on Amazon to tear down its wall of secrecy:

"Our dev-ops people can't read from the tea-leaves how to organize our systems for performance, scalability and most importantly disaster recovery. The difference between 'reasonable' SLAs and 'five-9s' is the difference between improvisation and the complete alignment of our respective operational processes ... There should not be communication walls between IaaS, PaaS, SaaS and customer layers of the cloud infrastructure."

Amazon's challenge in the coming weeks is to show that it is prepared to give its customers the information they need to build in that resilience reliably. If it does not meet that need and allows others to do better, it may gradually start to lose the dominant position it holds today in IaaS provision.

Topics: CXO, Amazon, Outage

Phil Wainewright

About Phil Wainewright

Since 1998, Phil Wainewright has been a thought leader in cloud computing as a blogger, analyst and consultant.

Talkback

  • Relying on a "generic" solution for mission critical services ...

    .. is plain stupid.

    How come nobody is posting the damages to companies during the outage? Could it be because they don't want others to see that one single "cloud" outage can cause MILLIONS in losses for any company?
    wackoae
    • No one can claim damages as the outage was within the lines of the service terms

      @wackoae: but, the biggest lesson is to avoid moving IT infrastructure to the cloud.

      Responsible IT department of a company has to have control of its own data, and run its own servers.

      And the building "supporting", "complimenting", "just in case (when the clouds falls off to the ground as rain)" IT-infrastructure is a hassle not much easier than keeping actual servers.
      DDERSSS
      • RE: Seven lessons to learn from Amazon's outage

        @DeRSSS Indeed.

        While I cannot guarantee that the infrastructure under my control won't suffer a similar outage, I can guarantee two things - transparency, and that ours will be worked on before anyone else's.
        alec.wood@...
    • RE: Seven lessons to learn from Amazon's outage

      @wackoae It's really not hard to protect yourself against outages in an availability zone... It took my company all of 2 hours to recover and begin running our app again.
      snoop0x7b
    • RE: Seven lessons to learn from Amazon's outage

      @wackoae

      Uh, so your assumption is:

      Given the tools and opportunity to create full resiliency, they didn't take it; they built everything into one zone. However, if they had needed to buy a building and their own set of equipment in another state, they would have.

      ?
      tkejlboom
    • Pushing For The Clouds

      Cloud computing is the means by which a single strong armed entity will gain complete control over internet transmissions. People like Phil Wainewright and his "comrades?" pushing for the singular on off switch approach. "The Cloud" is perhaps the single greatest attack on the freedom of the internet. I'm seventy three and have been working on the internet since the bulletin board days. Once a large proportion of the global community succumbs to the "Cloud", you can all kiss the internet goodbye!
      ramjetski@...
  • RE: Seven lessons to learn from Amazon's outage

    Outages of this nature do not absolve CIOs of responsibility for their systems. Many of the affected sites were taking a "wasn't me" kind of position. See my comments on this, written during the outage, at http://cloud81.com
    Alan.Perkins@...
  • RE: Seven lessons to learn from Amazon's outage

    SLA? EBS? RDS? English please!
    ShowMeGrrl
    • RE: Seven lessons to learn from Amazon's outage

      @ShowMeGrrl

      this is English

      SLA = Service Level Agreement
      EBS = Elastic Block Storage
      RDS = Relational Database Service

      the above acronym expansions were done in about 20 seconds with Google (to confirm quick easy results), though SLA and RDS should be common knowledge for anyone in enterprise level IT infrastructure support
      erik.soderquist
    • RE: Seven lessons to learn from Amazon's outage

      @ShowMeGrrl

      ProTip: Google or Wikipedia

      This is "just a blog post" after all.
      awtripp
      • RE: Seven lessons to learn from Amazon's outage

        @awtripp When I went to school, they taught us that when writing an article about something that is expressed as an acronym, you expand the acronym the first time you use it. So you might write "Service Level Agreement (SLA)" the first time you use it, then just "SLA" through the rest of the article. Considering how much information we get from pages of text on the Web, it's astonishing how many of them don't follow basic writing techniques or bother to proof read before publishing. I've seen this in major sites like the New York Times (NYT) as well as less lofty sites such as this. Sad.
        JoeFoerster
      • RE: Seven lessons to learn from Amazon's outage

        It's true what erik.soderquist says, that it "should be common knowledge", as these articles are mostly aimed at techies and the like. But some not-so-technical people are trying to learn and follow what's going on. I agree with what JoeFoerster says then - "follow basic writing techniques".
        PEACE
        anonymous
      • Acronyms

        @awtripp @erik.soderquist
        The same acronym can be used for multiple things. I'm an Oracle guy and wondering if EBS might be referring to Oracle's E-Business Suite (which is called EBS all over the place), especially if RDS refers to Relational Database Service. And most of the time (again in the Oracle universe) I see databases referred to as RDBMS (for Relational DataBase Management System), so RDS was a little confusing to me without looking it up. It's obviously not the end of the world to have to look these up, but for me, it would have been nice to have these acronyms spelled out first.
        markh@...
  • RE: Seven lessons to learn from Amazon's outage

    No plan or system is foolproof.
    Take an example of Japans nuclear plant , earthquake damaged it , the backup systems failed and disaster happened.
    Can U guarantee nothing like this will happen to american plant?? Answer is No.

    On similar grounds things can go wrong with any network or server or anything for that matter. No human design is foolproof and if anyone assumed it that way they are to blame and no one else.
    Atleast amazon pioneered cloud computing. Probably because the disaster happened to them it got recovered in 3 /4 days .
    Had it been any other company ... dream on....

    Yes some customers were at loss. But imagine this as break from computer for couple days and as long weekend .... Geeks need break too. :)
    beepositive
    • Well now, not so fast

      @beepositive

      "Can U guarantee nothing like this will happen to american plant?? Answer is No."

      Hold your hive, there beepositive. First, this was one of the largest earthquakes in recorded history (9.0) that moved the *entire country* 8 feet!

      Second, there was a freaking *tsunami* that hit the plant (from said 9.0 earthquake) that overwhelmed the existing defenses.

      Third, these plants were 40 years old and near the end of their service life. We've learned a lot since then.

      So in general, *no* it couldn't happen here--at least for the same reasons.
      wolf_z
      • RE: Seven lessons to learn from Amazon's outage

        @wolf_z
        Well, the point here is no Human built system is ever foolproof and no one can ever plan for all possible risk factors or disasters that might hit them. Which is true for anyone and everyone.
        beepositive
      • RE: Seven lessons to learn from Amazon's outage

        @wolf_z And remember, despite all these really bad things happening, there has been, and probably will be no deaths. The only serious injury was a man being injured by a crane. And the result of the radiation leak will probably be an increase in cancer risk for people working on recovering the plant from 25% to 25.001% - i.e. negligible.
        rayand
    • RE: Seven lessons to learn from Amazon's outage

      @beepositive

      Also, it makes a strong case for nuclear. You could have a problem every 40 years with a 1/1000-year cataclysmic event.

      COMPARE:
      10s of thousands die in coal mines in China each year,
      $100s of billions of damage done to the Gulf economy by last years spill.
      $10s of billions of damage to fisheries and the environment from the Exxon Valdez,
      Entire cities wiped away by coal dam failures in the US leaving behind a toxic sludge that must be burned away at 2000 degrees before the region is once again habitable.
      Money goes to the Middle East to support terrorism(Gaddafi Lockerbie) and oppression(all of them) for oil.

      CONCLUSION:

      Nuclear, like the cloud, isn't perfect; it's just the best of all options before us.
      tkejlboom
  • Main lesson to learn

    Azure is the way to go
    hubivedder
    • RE: Seven lessons to learn from Amazon's outage

      @hubivedder
      What makes Azure any different?
      pacsguy