Amazon explains its S3 outage

TOPICS: Amazon, Outage, Security

Amazon has issued a statement that adds a little more clarity to its Web services outage on Friday.

Here's Amazon's explanation of the S3 outage, which wreaked havoc on startups and other enterprises relying on Amazon's cloud.

Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.

Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.

As we said earlier today, though we're proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable. As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements. We are taking immediate action on the following: (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls. Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.

Sincerely, The Amazon Web Services Team
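
Amazon's first action item, monitoring the proportion of authenticated requests rather than just total volume, can be illustrated with a small sketch. This is not Amazon's monitoring code: the class name, the window size and the 50% alert threshold are all assumptions made for illustration. It shows how the authenticated share can jump while the overall request count stays flat.

```python
from collections import deque


class AuthShareMonitor:
    """Tracks the share of authenticated requests in a sliding window.

    Hypothetical helper for illustration; not Amazon's monitoring code.
    """

    def __init__(self, window_size: int = 1000, threshold: float = 0.5):
        self.recent = deque(maxlen=window_size)  # True = authenticated request
        self.threshold = threshold               # assumed alert level

    def record(self, authenticated: bool) -> None:
        """Add one request to the window, evicting the oldest if full."""
        self.recent.append(bool(authenticated))

    def share(self) -> float:
        """Fraction of requests in the window that were authenticated."""
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def should_alert(self) -> bool:
        """True when the authenticated share exceeds the threshold."""
        return self.share() > self.threshold


monitor = AuthShareMonitor(window_size=10, threshold=0.5)

# Normal mix: 2 of the last 10 requests are authenticated.
for authenticated in [False] * 8 + [True] * 2:
    monitor.record(authenticated)
normal_share = monitor.share()

# Same window size (same volume), but the next six requests are all
# authenticated, so the window now holds 8 authenticated out of 10.
for _ in range(6):
    monitor.record(True)
shifted_share = monitor.share()
```

A volume-only monitor would see ten requests in both windows; only the ratio reveals the shift toward the more expensive request type.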

Nick Carr has more. A few takeaways:

  • Amazon is creating an uptime dashboard. That's a positive development.
  • For now, you can chalk this outage up to growing pains at Amazon's Web services.
  • Amazon needs to work on its customer communication.
  • These cloud services are really an expectations game. As some talkbackers have noted, electric service also goes down from time to time. Do we hold computing power to a higher standard than the electric grid?

The customer reaction to Amazon's explanation confirmed those takeaways. This post summed up the customer perspective well:

Thanks for the update. As to your longer-term plans to handle this, your plan to provide a "service health dashboard" is a particularly good idea! However, to provide a truly excellent service health solution, Amazon needs to provide machine-readable management data that we can integrate into our infrastructure so that we in turn can tell our customers what is going on! Besides machine-readable info, AWS blog updates, RSS feeds and email notifications of major service health issues are a must! In addition, customers like us need to be able to set up an error page redirect for when EC2 is down (so that users trying to access a web server hosted on EC2 will get a decent error if your normal EC2 infrastructure is down).

BTW: Our company's biggest complaint is not that your servers were down but that we had to do quite some detective work to find out why (an error on our own servers? a recent code update we did? or Amazon AWS itself?). In this case, lack of information from Amazon cost us more trouble and money than the actual outage. It is simply NOT good enough that you require your customers to browse through forum threads to find out what is wrong. There was NO info on your front page, no email notifications, no updates on your blog.
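
The "machine-readable management data" this customer asks for might look something like the sketch below. No such feed is described in Amazon's statement, so the JSON layout, the field names, the location label and the page names here are all hypothetical. The point is only to show how a customer could use that kind of data to pick an error page automatically.

```python
import json

# Hypothetical status document; every field name here is an assumption.
SAMPLE_STATUS = """
{
  "services": [
    {"name": "s3", "location": "location-1", "status": "down"},
    {"name": "ec2", "location": "location-1", "status": "ok"}
  ]
}
"""


def affected_services(status_json: str) -> list[str]:
    """Return the names of services whose status is anything but "ok"."""
    doc = json.loads(status_json)
    return [s["name"] for s in doc["services"] if s["status"] != "ok"]


def page_for(service: str, status_json: str) -> str:
    """Pick the page to serve, depending on the health of one service."""
    if service in affected_services(status_json):
        return "error.html"   # hypothetical "provider is down" page
    return "index.html"       # hypothetical normal page
```

With the sample document above, `affected_services(SAMPLE_STATUS)` lists only `s3`, so a site that depends on S3 would serve its error page while one that depends only on EC2 would not.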

Talkback

8 comments
  • Power grid point response...

    >>Do we hold computing power to a higher standard than the electric grid?<<
    In short - Yes. If you need power reliability for your business, a simple generator and battery backup are easily attainable (and affordable) solutions. It is much less likely that said business can afford a "back-up" computing cluster for its data (otherwise highly available clusters would be pointless).
    jparrott@...
    • Still, for all but the very largest companies, they do NOT have emergency

      generators for electricity. For hosted web sites, Amazon is still more reliable than anything a small or mid-sized company could do. For a larger company, it is just as reliable. This is just like the people who won't fly because of the big spectacular accidents.
      DonnieBoy
      • ignorant assumptions

        "For hosted web sites, Amazon is still more reliable than anything a small or mid-sized company could do. For a larger company, it is just as reliable."

        You mean to say, as far as YOU know. As a well-trained system architect for companies such as Citibank, American Express and First Data, I can tell you that failure is not an option.

        When a CC transaction fails, or doesn't post, money is lost. So is someone's job. Your statements are ignorant assumptions that smack of a lack of experience, so let me help you.

        When it comes to hosted solutions, there are plenty of hosts for small businesses to choose from, and for medium and large orgs, even more colocation opportunities with very nice management facilities and all-hours access too.
        kckn4fun
  • SaaS

    And herein lies the proof as to why SaaS will *never* reach the kind of market saturation that some analysts think it will. Salesforce.com is another case study. There are companies that do bulletproofing and fault testing for a living. That's all they do. Big companies foot the bill for this. Then the knowledge filters through the cracks to smaller companies.

    I'd say that safely 7 out of 10 of my customers point to failures such as these as proof that they would never allow their infrastructure to be outsourced.

    Recently there was an article here on ZDNet about some CEO who was saying SaaS vendors needed to practice what they sell. If your living relied on something with a manageable failure rate, would you want someone else to manage that? I wouldn't.

    Better yet, if you had a trigger lock on a gun stored in your home, would you give the key (for storage and safe keeping) to some stranger (aka, the lowest bidder)?
    kckn4fun
    • Hmm..

      9 out of 10 of your customers have experienced more significant outages with in-house managed systems. Amazon will improve their process and learn from this.
      Wayzom
  • Do we hold computing power to a higher standard than the electric grid?

    Yes, yes we do. We expect 24/7 server uptime, and as a result professional data centers always have backup generators that turn on immediately after utility power is lost. It is absolutely critical to many people's businesses that their website and its contents always remain online, and as a result it is an absolute necessity that online property uptime be held to a much more stringent expectation than that of utility power uptime.

    - John Musbach
    John Musbach
  • RE: Amazon explains its S3 outage

    Amazon take note: Stick to selling dog food, not storage. I'm moving over to Nirvanix.
  • This sounds really fishy...

    From Amazon's "official explanation page" at http://status.aws.amazon.com/s3-20080720.html

    "More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect."

    A "single bit" corruption would be detected even by the UDP checksum mechanism (which is guaranteed to catch any single-bit error).

    So, either Amazon uses something even more primitive than UDP in their inter-server messages (which I don't believe), or they are flat-out lying.
    zaragatunga
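
The commenter's claim about single-bit errors can be checked against the 16-bit ones'-complement checksum that UDP uses (the arithmetic is described in RFC 1071). The sketch below is a minimal illustration, not a UDP implementation: it leaves out the pseudo-header that a real UDP checksum also covers, and the helper names are made up for this example. It flips every bit of a message in turn and confirms that the checksum changes each time.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum over data (RFC 1071 arithmetic)."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def all_single_bit_flips_detected(data: bytes) -> bool:
    """True if every possible one-bit corruption changes the checksum."""
    original = internet_checksum(data)
    return all(
        internet_checksum(flip_bit(data, i)) != original
        for i in range(len(data) * 8)
    )
```

A single flipped bit changes one 16-bit word by a power of two, which can never be a multiple of 65535, so the sum always changes. Note that a checksum only protects data after the checksum has been computed; it cannot flag a message whose contents were already wrong at that point.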