Amazon S3 web services down. Bad, bad news for customers.

Update 2/16/08, 2:30PM EST: Nick Carr wrote a good post-mortem. So did Larry Dignan.

Update 2/16/08, 11:00AM EST: Amazon reports the cause of the problem:

Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations.  While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests.  Importantly, these cryptographic requests consume more resources per call than other request types.

Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls.  The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place.  In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles.  This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST.

Do I interpret this to mean the service crashed because a few customers tried to use it in an unexpected manner?
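Amazon's explanation points to a gap in what was monitored: total request volume was tracked, but the mix of request types was not. A minimal sketch of that kind of check follows. The class name, window size, and alert threshold are all illustrative assumptions, not anything Amazon has described.

```python
from collections import deque


class AuthShareMonitor:
    """Track the share of authenticated requests over a sliding window."""

    def __init__(self, window_size=1000, max_auth_share=0.5):
        # Both defaults are illustrative assumptions, not Amazon's values.
        self.window = deque(maxlen=window_size)
        self.max_auth_share = max_auth_share

    def record(self, authenticated):
        """Record one request; True means it required authentication."""
        self.window.append(bool(authenticated))

    def auth_share(self):
        """Return the fraction of authenticated requests in the window."""
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def should_alert(self):
        """True when the authenticated share exceeds the threshold."""
        return self.auth_share() > self.max_auth_share


monitor = AuthShareMonitor(window_size=10, max_auth_share=0.5)
# Total volume stays flat at ten requests, but the mix shifts.
for authenticated in [False] * 4 + [True] * 6:
    monitor.record(authenticated)
print(monitor.auth_share(), monitor.should_alert())  # 0.6 True
```

A monitor that only counted requests would see nothing unusual here, because the volume never changes. Only the ratio moves.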

Update 2/15/08, 4:00PM EST: Back to normal. From the forum:

The team continues to be heads down focused on getting to root cause on this morning’s problem.   One of our three geographic locations for S3 was unreachable beginning at 4:31 a.m. PST and was back to near normal performance at 6:48 a.m. PST (a small number of customers experienced intermittent issues for a short period thereafter).
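To put the window quoted above in perspective, here is a rough back-of-the-envelope calculation. It assumes a 30-day month, treats the affected location as fully unavailable for the whole window, and compares the result against a hypothetical 99.9% target chosen only for illustration. It is not a reading of Amazon's SLA terms.

```python
from datetime import datetime

# Times as quoted in the forum post (PST, 15 February 2008).
start = datetime(2008, 2, 15, 4, 31)
end = datetime(2008, 2, 15, 6, 48)
outage_minutes = (end - start).total_seconds() / 60

# Assumptions: a 30-day month and a hypothetical "three nines" target.
MINUTES_IN_MONTH = 30 * 24 * 60
TARGET = 0.999

availability = 1 - outage_minutes / MINUTES_IN_MONTH
budget_minutes = MINUTES_IN_MONTH * (1 - TARGET)

print(f"Outage: {outage_minutes:.0f} minutes")
print(f"Availability that month: {availability:.3%}")
print(f"Downtime allowed at {TARGET:.1%}: {budget_minutes:.1f} minutes")
# Outage: 137 minutes
# Availability that month: 99.683%
# Downtime allowed at 99.9%: 43.2 minutes
```

Under these assumptions, a single outage of this length uses roughly three times the monthly downtime that a three-nines target would allow.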

UPDATE 2/15/08, 2:00PM EST: Isolated problems still being reported by users. Amazon blog remains silent. Blog swarm grows.

UPDATE 2/15/08, 1:15PM EST: Amazon blog remains silent on the issue. Guys, your silence is deafening.

UPDATE 2/15/08, 12:26PM EST: Some users still reporting problems on the forum

UPDATE 2/15/08, 11:26AM EST: Problems still being reported.

UPDATE 2/15/08, 10:26AM EST: Amazon's service has been restored. See below for more information.

==========

Amazon S3 web services is currently down. From a message on their forum:

Massive (500) Internal Server Error outage started 35 minutes ago. Sample response:

<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>InternalError</Code>
  <Message>We encountered an internal error. Please try again.</Message>
  <RequestId>A2A7E5395E27DFBB</RequestId>
  <HostId>f691zulHNsUqonsZkjhILnvWwD3ZnmOM4 ObM1wXTc6xuS3GzPmjArp8QC/sGsn6K</HostId>
</Error>
---> System.Net.WebException: The remote server returned an error: (500) Internal Server Error.
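The error message asks clients to "try again". A minimal sketch of how a client might read such an error body and retry with exponential backoff follows. The `request` callable, the attempt count, and the delays are hypothetical choices for illustration, not part of any Amazon client library.

```python
import time
import xml.etree.ElementTree as ET

# Trimmed copy of the error body quoted in the forum post.
SAMPLE_ERROR = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<Error><Code>InternalError</Code>"
    b"<Message>We encountered an internal error. Please try again.</Message>"
    b"<RequestId>A2A7E5395E27DFBB</RequestId></Error>"
)


def error_code(body):
    """Return the <Code> text from an S3-style error document."""
    return ET.fromstring(body).findtext("Code")


def call_with_retry(request, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call request() until it returns a non-5xx status or attempts run out.

    request is a hypothetical callable returning (status, body).
    """
    for attempt in range(max_attempts):
        status, body = request()
        if status < 500:
            return status, body
        if attempt < max_attempts - 1:
            # Double the wait after each failed attempt.
            sleep(base_delay * 2 ** attempt)
    return status, body


# Simulated service: two failures, then success. Delays are captured
# in a list instead of actually sleeping.
responses = iter([(500, SAMPLE_ERROR), (500, SAMPLE_ERROR), (200, b"ok")])
delays = []
status, body = call_with_retry(lambda: next(responses), sleep=delays.append)
print(error_code(SAMPLE_ERROR), status, delays)  # InternalError 200 [0.5, 1.0]
```

Retries like this only smooth over brief failures. They would not have carried an application through an outage lasting hours, which is why a fallback plan still matters.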

For users and companies who signed up hoping for mission critical service, this is bad news indeed.

It's made worse for Amazon by speculation that EMC will host Business ByDesign, SAP's forthcoming Software as a Service (SaaS) offering. If this news is correct, it's a vote of confidence in EMC's, rather than Amazon's, hosting abilities.

Updates: Here's a link to Amazon's SLA. ZDNet's Larry Dignan has posted about what the SLA actually means.

Two and a half hours into the problem, here's Amazon's response so far:

We're investigating.

And here's a user's response:

News from Amazon?? I need to say something to my clients.

Just after 2.5 hours down, some users (but not all) are reporting that service is returning:

It's back up for us.

Three hours into the problem, Amazon has finally deigned to respond:

We’ve resolved this issue, and performance is returning to normal levels for all Amazon Web Services that were impacted.  We apologize for the inconvenience.  Please stay tuned to this thread for more information about this issue.

In comparison to this lame response, read what Technorati did when their service had problems. Note the Amazon problem was far, far more severe than Technorati's. I'm glad the blogs have picked this up.

This must be very big or Amazon would have commented already. They will lose customers over this.

Topics: Hardware, Amazon, Cloud, Emerging Tech, Servers


Talkback

11 comments
  • And this is why

    Web 2.0 will die. Only a fool trusts sensitive data to a third party they have no control over.
    frgough
    • RE: And this is why

      [b]YOU SAID IT!!!!!!!!![/b]

      At least, when it is [i]your[/i] system; you know whose [b]ass[/b] to light a fire under!!!!!

      In my book - [b]RFD!!!![/b]
      fatman65535
    • Well, a small to midsize business has the same type of problem finding

      employees they can trust, and reliability problems with the infrastructure that they buy. You could be down more than three hours with a failed disk drive for instance.

      All in all, you are safer at Amazon, though Amazon has no excuse for this either.
      DonnieBoy
  • Service interruptions inevitable

    I don't care who it is. Amazon, Google, Salesforce.com, MSFT...everyone is going to have an outage. Any service that rides on the internet is subject to interruption. Sometimes all it takes is some bleary-eyed construction worker on a track hoe. You could be out for days if they trench across your fiber.

    That doesn't mean you shouldn't still consider outsource providers, just plan for something less than 100% uptime. Each 9 you add on the end of "99." adds geometrically to the costs involved. I think it's more practical to plan for an extended outage than spend increasing amounts of money on diminishing returns.

    In my experience it's more cost effective to have SOP's to switch over to local backup processes. I built a system for a customer that tracked application processing at their district offices. We spent weeks discussing contingency options in the event of a service outage. The optimal solution turned out to be using a white board as local backup. With that simple solution we were able to do away with mirroring, redundant hosting, and a raft of other expensive options. I think the white board option cost something like five grand, which included installation.

    Unfortunately we had a chance to test that plan when the RAID array on the DB server went south. Took three days to restore the DB server. They tottered along with the white boards and hired temps to update the data when the system returned. Worked fine. The second test happened when Hurricane Katrina wiped out their southern district office. The white board continued to operate within nominal parameters at the backup office we established and later shifted to Houston. ;)

    Try to get execs to see that, though. You might as well be trying to stop the wind. They'll spend 10's even 100's of thousands on redundant systems that operate a few minutes out of a year. I think it's crazy but they'd rather spend the bucks than explain a service interrupt.
    Chad_z
  • clouds are not magic ...

    Nice post Michael. This outage is a perfect example of why it will always remain the app owner's responsibility to ensure the SLA - whether you use clouds, classic-in-house, or a mix of infrastructure.

    More here: http://www.appistry.com/blogs/bob/amazon-s3-still-limping-limits-clouds

    Bob
    boblozano
  • RE: Amazon S3 web services down. Bad, bad news for customers.

    Dad, my cell phone broke!

    Last summer my 15 year old greeted me - “Dad! My cell phone is broke and I can’t text my friends!” “Mmm.. this is serious. What did you use to do before I bought you the phone?” “I wasn’t able to text back then, dah!!”

    Today’s uproar re: the Amazon S3 outage takes me back to that funny moment when my daughter finally got my point – that I enabled her to enjoy the world of texting. And, like many AWS bloggers today, she did not appreciate that I gave her this gift.

    So, to put a big picture perspective on today’s outage – most of us start ups, if not for AWS, would have burned thru our angel and round A funds to replicate AWS before we would have hit the tipping point and had the luxury of telling our customers that “we are experiencing an outage.”

    Looking back on my "old school" days of expensive networks, users running out of storage and the constant flow of cash to admin staff, I must admit to having a soft spot for the AWS team and service. In those days, a two hour outage was considered an opportunity for our users to chat with the cube neighbor or go down to the cafeteria for a donut. Fast forward to today’s demanding customers and an outage of minutes starts Armageddon. Now, imagine if by some miracle, these customers actually pay for the start up’s service.

    Today, I welcomed the outage as it reinforced my need for AWS. How would my small team respond to an outage? We don’t have the talented staff nor the passion the AWS team has. We forget that Amazon is in the small group of visionary “start-ups” who helped get the net to where we are today.

    Phil Easter
    CTO/AirMe
    phil.easter@...
  • RE: Amazon S3 web services down. Bad, bad news for customers.

    I think it is a little unfair for you to compare with Technorati's e-mail. T's e-mail was done after two days' analysis. Amazon's post was within hours of being hit. As they said, they'll provide further updates as the investigation proceeds. I see no reason to expect they will have anything but a comprehensive report of what happened shortly.
    cbsmith@...
    • Must disagree

      Although I understand your point, and considered it myself, there's an important difference between Technorati and AWS. AWS is a mission critical backbone for many businesses, which is simply not the case with Technorati. When Technorati goes down, it's inconvenient and a hassle. When AWS goes down, revenue for some companies can grind to a halt and customer relations can be negatively affected. In short, I think AWS should be held to a much higher standard of accountability.

      Thanks for commenting and highlighting an important issue.
      mkrigsman@...
  • No excuse, and they WILL have to look for ways to make more robust. But,

    if you roll it yourself, you can lose a hard drive, a database problem, etc, and be down a lot more than three hours, AND, it is YOUR problem, not Amazon's. So, again, no excuse, but still probably more reliable than a small to mid-size business can do for themselves.
    DonnieBoy
  • RE: Amazon S3 web services down. Bad, bad news for customers.

    Another business that needs to run to the library for "I.T. Wars: Managing the Business-Technology Weave in the New Millennium." The irony is, Amazon sells it. I urge every business person and IT person, management or staff, to get hold of a copy of this book. Our CEO has read it. Our project managers are on their second reading. Our vendors are required to read it (they can borrow our copies if they don't want to purchase it). Any agencies that wish to partner with us: We ask that they read it. Do yourself a favor and read this book - then ask your boss to read it - then ask your staff and co-workers to read it. If you get a chance, read the author's interview here: http://www.businessforum.com/DScott_02.html
    johnfranks999
  • Amazon S3 Alternative

    Connectria Hosting offers Amazon S3 compatible storage that’s more reliable, more secure & easier to use. There is a pricing calculator right on the page so you see the costs upfront.
    ConnectriaHosting