Amazon explains latest cloud outage: Blame the power

Summary: Amazon has explained why its cloud service failed on Thursday. Blame the power, but the company's transparency prevailed.

On Thursday, retail-turned-cloud giant Amazon suffered an outage of its Amazon Web Services in a Northern Virginia datacenter.

Many popular websites, including Quora, Hipchat, and Heroku (a division of Salesforce), were knocked offline for several hours on Thursday evening. Even Dropbox stumbled as a result of the outage.

Amazon was quick to detail what had gone wrong, when, and roughly why, in a show of transparency rarely seen from cloud providers, with the possible exception of Google.

Only a few days later, Amazon explained that the cause of the fault, which hit its Elastic Compute Cloud (EC2) service, was none other than a power failure.

For those whose browser doesn't speak RSS, Amazon explained:

"At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power."

And then an epic feat of bad luck kicked in, as one of the vital power generators checked out:

"At 8:53PM PDT, one of the generators overheated and powered off because of a defective cooling fan. At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity).

Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power."
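
To make that chain of events easier to follow, here is a minimal, purely illustrative sketch in Python of the failover order Amazon describes: utility feed, then generator, then a secondary circuit guarded by a breaker. Every name, threshold and load figure below is invented for illustration and is not Amazon's actual configuration.

    # Toy model of the failover chain described above. All figures are
    # hypothetical; the point is only to show how a breaker threshold set
    # below the transferred load leaves the equipment with no power source.

    def breaker_holds(load_kw: float, trip_threshold_kw: float) -> bool:
        """A breaker stays closed only while the load is at or below its trip threshold."""
        return load_kw <= trip_threshold_kw

    def available_power_source(load_kw: float) -> str:
        utility_ok = False            # lost: cable fault in the utility distribution system
        primary_generator_ok = False  # lost: generator overheated (defective cooling fan)
        secondary_breaker_threshold_kw = 500.0  # misconfigured: set too low (hypothetical value)

        if utility_ok:
            return "utility feed"
        if primary_generator_ok:
            return "primary back-up generator"
        if breaker_holds(load_kw, secondary_breaker_threshold_kw):
            return "secondary back-up circuit"
        return "no power: instances and volumes go dark"

    # With a transferred load above the breaker's (too low) threshold,
    # the last line of defence opens and nothing is left.
    print(available_power_source(load_kw=800.0))

In this sketch, fixing either the cooling fan or the breaker threshold keeps the lights on; it took both the bad luck and the misconfiguration to cause the outage, which is exactly the picture Amazon's account paints.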

Hacker News readers were quick to point out some of the flaws in the logic. One suggested that while Amazon had a "correct setup", with generator fallback rather than a battery solution, it failed in the testing department.

And then it got awfully geeky, terribly quickly.

The power failed: it's as simple as that, and that should be blame-game reason number one. But reasons two, three, four and onwards, to the nth degree, come down to poor testing and a failure to exercise the chain of backup power systems.

At least Amazon had the guts to flat-out admit it. One thing prevails over all others: Amazon kept its customers in the loop, which says a lot compared with other major cloud providers (naming no names).

Talkback

  • Testing breakers under load not as easy as it seems.

    There are settings on larger breakers for short-circuit delay and ground-fault delay times, as well as overcurrent and ground-fault sensitivity. For safety, you want these as low as possible, but for reliability, you don't. Testing under the actual load guarantees that there is a risk of failure while online.

    Generators are another story because they can be load tested, but that requires hooking them up to a load and keeping them on it. It's nice to do this on hot days to get a better picture of what real scenarios would look like because that's when power fails most often. This is tricky but doable.
    rp518
    • The choice is theirs

      "It is tricky but doable." The question is why not do it if it would provide more reliable service and its really not extremely difficult. And even if a generator catches fire the service can still go uninterrupted. This is a blog article, http://specialoffers.peakhosting.com/blog/ written by a hosting provider that gives a different point of view than most written by news media covering the outage. Amazon issued statements clarifying what happened during the outage but he goes on to say the decision behind these completely avoidable outages comes down to economics.
      ej281
  • Understandable

    I've been in a similar situation with our company. We had tested the generator (we usually run it for a full day of testing) and the failover capabilities. However, when a snow storm hit a few years ago, the power went out and something blew a battery in our UPS, even though the generator worked fine. The UPS that sat inline made everything power off, and since it was such a bad blizzard no one could get in to work to turn the equipment back on until a day later. The moral of the story: even if you test, something might still happen.
    Doink
  • "Cloud" proving its worth

    I would point and stare, but there will be plenty more incidents filling this space.
    Trusting stuff to "the cloud" is lunacy.....
    12312332123
    • Are you serious

      I have very little in "the cloud" - some email and a few documents. But even with this outage, Amazon's EC2 uptime trumps the uptime at my office by at least an order of magnitude, and at a very similar price point.
      kylehutson
    • So totally agree

      'Lunacy' puts it mildly...
      du.gtown
    • Don't you think that depends on the cloud provider?

      Failures will happen, there's no doubt of that; but Amazon's single point of failure here is that there was no backup location unaffected by the initial power outage. When you're handling national and even international customers, you can't afford a single point of failure.
      Vulpinemac
  • Testing can be worse than actual outage(s)

    At a former employer, I honestly suggested they stop testing the emergency backup systems, because the routine testing itself (on more than one occasion) caused more unexpected outages than occurred in practice. [They insisted on testing with a normal production setup.] I'm sure it could have been made to work, but the testing was killing the developers. The Amazon outage sounded like the same kind of not-quite-so-predictable cascaded failure. The nice news out of this is that Amazon is sharing what happened and that they do try to address failures in their own daily operations, which I have seen on several occasions.
    lorddarthpaul
  • Cloud a failure

    When will companies and people realize that when you give anyone the power to run your information, you will doom yourself to failure?
    trust2112@...
    • It depends...

      You also doom yourself to failure if you keep things in-house when the level of service provided by the cloud company is BETTER than what you can afford to provide internally. Are you saying it would be better for a company to choose an internal solution with 98% uptime vs. a cloud solution with 99.995% uptime? (The arithmetic behind that gap is spelled out after this thread.)
      tbuccelli
  • Live by the Cloud, Die by the Cloud

    Enough said.
    Shara8
  • Humanity's penchant for denying the need for backup/testing

    Good backup design and testing isn't glamorous ... unless you need it and it works like a charm ... then you seem quite the hero. The upper-ups usually cut such measures out of projects to save cost. But those breach-of-SLA fines can add up fast when your under-backed-up, under-tested systems stumble and your customers start sending you bills.

    I've worked on a lot of system implementations over the past 30 years. And it doesn't seem to matter whether the project is dirt and 'dozers, bricks and mortar, hardware, software or combinations of the four: there is an apparently universal tendency for the "powers that be" to assume that a) under-tested backups will always work and b) we don't really need THAT many of them. Yet I can tally up quite a long list of instances where the only reason the areas I was responsible for kept functioning was that I created my own backup and tested the blazes out of it before the "go live" date arrived. When everything else ground to a halt, my heavily tested backups hummed right along, in one instance for as long as a year.

    On one project in which I participated, there was a brand new computer system (hardware and software) going into a brand new physical facility. I saw the plans and strongly recommended that the office A/C unit be backed up by a simple wall unit, installed in the room in which the computers were to be located, fed by a separate power supply. I was told that this was ridiculous, as the office was inside a fan cooled building, and the office itself was served by a "state of the art" A/C unit. As a concession, they did agree to give me the separate power supply I asked for, just in case.

    Within the first month of operation, the office A/C failed (installation error), the room quickly overheated, and the computer system shut down. Crews scrambled to hack a rather ugly hole in the block wall, install the aforementioned wall-mounted A/C and plug it into the extra outlet I'd won. My utterly selfish sense of vindication was shameful. :) In a matter of hours we were back up. But the point was clear ... had the very inexpensive backup A/C unit been installed in the beginning, the outage never would have happened in the first place.
    justin.donie@...
    • RE: My utterly selfish sense of vindication was shameful.

      [b]IF[/b] you are of the spiteful type, then go over the head of the PHB that nixed your A/C unit to [b]his boss[/b] and point out (hopefully you got that denial [u]in writing[/u]) that [b]you foresaw this event[/b] and that, unfortunately, the PHB who denied your request exercised [b]poor judgment[/b] and ought to be fired.

      The only time middle level PHB's [i]get the message[/i] is when their ass is flying out the door with a higher up's foot closely behind it.

      PHB's that screw up big time need to be promoted to a position with another company; and [i]preferably with one of your competitors[/i].

      EDITS for typos, God do I hate this keyboard!!!!
      fatman65536
  • Redundancy is key....

    Speaking as a consumer, I treat the cloud as off-site backup FOR my backup. I have files duplicated across my PCs, plus they are backed up to two external drives. I then have the dearest of THOSE files (mostly family photos, certain important records) uploaded to the cloud, because if it's more serious than equipment failure (fire/tornado), I'm going to be too busy saving my husband and pets to worry about grabbing a hard drive or two.

    Redundancy plus off-site storage is your safest bet. For people without huge amounts of data, it can be as simple as keeping a thumb drive or DVD secured at the office or a trusted relative's home. The cloud should only be an EXTRA place to store data, not THE place.
    bengalcatlover
  • If you think Amazon is transparent

    I want to sell you a bridge in Brooklyn
    martinkimeldorf
  • There was no problem, since Amazon explained the outage,

    and people will be forgiving and will forget, and life will resume as normal until the next outage, when excuses and explanations for the outage will suffice again. Amazon and Google and Microsoft don't have any worries, since all they have to do is be up-front about their outages and problems. So, no need to worry about "the cloud". So what if your company had its work interrupted? Amazon explained it all, and that makes everything hunky-dory.
    adornoe
    • You sort of backed into a legitimate issue

      With Microsoft, your SLA provides you financial remuneration. With Google, you will be granted more time at the back end of the contract. I do not follow Amazon, so I could not tell you what its SLA allows for. But the key is to ensure that your SLA matches your business needs.

      Am I worried about the "cloud"? No, because I know that when Microsoft says it has a "generator", what it really means is a locomotive-engine-sized powerhouse. I know that our HP facility has a dedicated primary substation. Am I going to invest my company's money in locomotives? Ummm, no. But I am going to ensure that my SLA with Microsoft provides financial remuneration if I do experience an issue.
      Your Non Advocate
  • Why all the cloud negativity?

    Most non-IT-centric companies would probably still be down if this scenario had happened to their own datacenter. During such an outage, your best and brightest support staff get pulled away to perform DR duty, which means non-impacted LOBs suffer as well. Just because you can pick up the phone and yell at someone internally doesn't mean the failure will be resolved any faster. The major cloud providers have some of the best "propeller heads" working at their firms. If you attempted to duplicate what they provide, your IT budget would be 30-50% of operating expense, which is unrealistic even for financials and pharma.

    Trusting some of your applications to the cloud doesn't negate proper design and failover strategy. Who is to blame for not having your app instances running at more than one datacenter (a rough sketch of that idea follows this thread)? The tech companies who were bitten by this outage should have known better, yet it's easier to skimp on costs. Don't equate "cloud" with cheap, because it isn't. Cloud is just another tool in the arsenal, one that can be deployed poorly in the wrong hands.
    Tired Tech
  • Cyberattack?

    Someone took out power for a reason?
    (Anyone?)
    f0real
    • Excavation Activities

      Rumors of an Azure employee seen on the outskirts of town operating a backhoe have not been corroborated :)
      Tired Tech
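
For scale on the 98% vs. 99.995% uptime comparison raised above, here is the simple arithmetic, assuming nothing beyond a 365-day year:

    # Annual downtime implied by an uptime percentage.
    HOURS_PER_YEAR = 365 * 24  # 8,760 hours

    for uptime_percent in (98.0, 99.995):
        downtime_hours = HOURS_PER_YEAR * (1 - uptime_percent / 100)
        print(f"{uptime_percent}% uptime -> about {downtime_hours:.1f} hours of downtime per year")

    # 98%     -> about 175.2 hours per year (over a week)
    # 99.995% -> about 0.4 hours per year (roughly 26 minutes)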
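
Several commenters also raise the idea of running application instances in more than one datacenter. As a rough, hypothetical sketch of that idea (not anything Amazon or the affected sites actually do), here is a client-side fallback between two health-check endpoints; the URLs are invented, and a real deployment would more likely rely on DNS failover or a load balancer than on logic like this in the client.

    # Hypothetical two-region fallback: try the primary endpoint, then the
    # secondary if the primary is unreachable. Endpoint URLs are invented.
    from typing import Optional
    import urllib.error
    import urllib.request

    ENDPOINTS = [
        "https://app.primary-region.example.com/health",    # hypothetical primary
        "https://app.secondary-region.example.com/health",  # hypothetical secondary
    ]

    def first_healthy_endpoint(timeout_seconds: float = 2.0) -> Optional[str]:
        """Return the first endpoint whose health check answers with HTTP 200, or None."""
        for url in ENDPOINTS:
            try:
                with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
                    if response.status == 200:
                        return url
            except (urllib.error.URLError, OSError):
                continue  # this region is unreachable; try the next one
        return None

    if __name__ == "__main__":
        healthy = first_healthy_endpoint()
        print(f"Routing traffic to: {healthy}" if healthy else "All regions are down")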