
Amazon S3 web services down. Bad, bad news for customers.

By | February 15, 2008, 6:38am PST


Update 2/16/08, 2:30PM EST: Nick Carr wrote a good post-mortem. So did Larry Dignan.

Update 2/16/08, 11:00AM EST: Amazon reports the cause of the problem:

Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations.  While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests.  Importantly, these cryptographic requests consume more resources per call than other request types.

Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls.  The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place.  In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles.  This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST.

Do I interpret this to mean the service crashed because a few customers tried to use it in an unexpected manner?
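The gap Amazon describes is that total request volume looked normal while the mix of requests shifted toward the more expensive authenticated kind. A minimal sketch of a check on that proportion follows. It is an illustration only, not Amazon's monitoring: the class name, window size, and 50% threshold are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    total: int          # all requests seen in one interval
    authenticated: int  # requests that required cryptographic authentication


class AuthShareMonitor:
    """Tracks the authenticated share of traffic over a sliding window.

    Hypothetical helper: the window size and threshold are illustrative.
    """

    def __init__(self, window: int = 5, max_share: float = 0.5):
        self.samples = deque(maxlen=window)  # keeps only the newest samples
        self.max_share = max_share           # assumed alert threshold

    def record(self, total: int, authenticated: int) -> None:
        self.samples.append(Sample(total, authenticated))

    def share(self) -> float:
        total = sum(s.total for s in self.samples)
        if total == 0:
            return 0.0
        return sum(s.authenticated for s in self.samples) / total

    def should_alert(self) -> bool:
        # Fires on the mix of requests, even when total volume is flat.
        return self.share() > self.max_share
```

With a flat 1,000 requests per interval, a monitor like this stays quiet while 200 of them are authenticated and alerts once the authenticated count climbs to 800, which is the kind of shift a volume-only check would miss.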

Update 2/15/08, 4:00PM EST: Back to normal. From the forum:

The team continues to be heads down focused on getting to root cause on this morning’s problem.   One of our three geographic locations for S3 was unreachable beginning at 4:31 a.m. PST and was back to near normal performance at 6:48 a.m. PST (a small number of customers experienced intermittent issues for a short period thereafter).

UPDATE 2/15/08, 2:00PM EST: Isolated problems still being reported by users. Amazon blog remains silent. Blog swarm grows.

UPDATE 2/15/08, 1:15PM EST: Amazon blog remains silent on the issue. Guys, your silence is deafening.

UPDATE 2/15/08, 12:26PM EST: Some users still reporting problems on the forum

UPDATE 2/15/08, 11:26AM EST: Problems still being reported.

UPDATE 2/15/08, 10:26AM EST: Amazon’s service has been restored. See below for more information.

==========

Amazon S3 web services is currently down. From a message on their forum:

 Massive (500) Internal Server Error.outage started 35 minutes ago.  Sample response:

<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InternalError</Code>
<Message>We encountered an internal error. Please try again.</Message>
<RequestId>A2A7E5395E27DFBB</RequestId>
<HostId>f691zulHNsUqonsZkjhILnvWwD3ZnmOM4
ObM1wXTc6xuS3GzPmjArp8QC/sGsn6K</HostId>
</Error>—> System.Net.WebException: The remote server returned an error:(500) Internal Server Error.
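The error body says "Please try again." For short blips, a client can do that automatically with a bounded retry and exponential backoff. The sketch below is an assumption-laden illustration: `send` is a hypothetical callable returning a status code and body, not part of any Amazon library, and the attempt count and delays are arbitrary. Retrying would not have helped during an outage lasting hours; it only smooths over brief failures.

```python
import random
import time
from typing import Callable, Tuple


def call_with_retries(
    send: Callable[[], Tuple[int, bytes]],  # hypothetical request function
    max_attempts: int = 5,                  # assumed limit, tune per application
    base_delay: float = 0.5,                # assumed base delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call send() until it returns a non-5xx status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status < 500:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter: wait up to 0.5s, 1s, 2s, ...
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    # Every attempt failed; hand the last error response to the caller.
    return status, body
```

Capping the attempts matters: once the limit is reached, the caller gets the final 500 response and can fall back to whatever contingency it has, instead of looping forever against a service that is down.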

For users and companies who signed up hoping for mission critical service, this is bad news indeed.

It’s made worse for Amazon by speculation that EMC will host Business ByDesign, SAP’s forthcoming Software as a Service (SaaS) offering. If this news is correct, it’s a vote of confidence in EMC’s, rather than Amazon’s, hosting abilities.

Updates: Here’s a link to Amazon’s SLA. ZDNet’s Larry Dignan has posted about what the SLA actually means.

Two and a half hours into the problem, here’s Amazon’s response so far:

We’re investigating.

And here’s a user’s response:

News from Amazon?? I need to say something to my clients.

Just after 2.5 hours down, some users (but not all) are reporting that service is returning:

Its back up for us.

Three hours into the problem, Amazon has finally deigned to respond:

We’ve resolved this issue, and performance is returning to normal levels for all Amazon Web Services that were impacted.  We apologize for the inconvenience.  Please stay tuned to this thread for more information about this issue.

In comparison to this lame response, read what Technorati did when their service had problems. Note the Amazon problem was far, far more severe than Technorati’s. I’m glad the blogs have picked this up.

This must be very big or Amazon would have commented already. They will lose customers over this.


Michael Krigsman is a recognized authority on the causes and prevention of IT failures.

Disclosure

Michael Krigsman

Michael Krigsman writes and speaks about technology in a manner that most observers consider to be fair and balanced. Michael believes that writing about IT failures, which often have complex causes, creates a unique obligation to be reasonable and accurate in both reporting and analysis.

Michael maintains active personal and professional relationships with enterprise technology buyers, vendors, analyst firms (or individual analysts), consultants, and system integrators. As CEO of Asuret, Michael sells and delivers paid services to members of these same groups.

Vendors regularly reimburse Michael's out-of-pocket travel expenses to attend industry conferences and events. Conference organizers frequently waive entry fees when Michael attends industry events. Michael often speaks at industry conferences and events.

He is a member of the Enterprise Irregulars, a loose association of consultants, investors, industry representatives, analysts, and users of enterprise software.

For daily updates on Michael's activities, follow him on Twitter.

Biography

Michael Krigsman

Michael Krigsman is CEO of Asuret, Inc., a consulting company dedicated to reducing technology implementation failures. Asuret's suite of software tools improves the success rate of enterprise software deployments by quantifying and measuring governance issues that cause most project failures. Michael led the research effort underlying Asuret's model of collective intelligence and its practical application to reducing IT failures in consulting environments. He is a recognized authority on the causes and prevention of IT failures and is frequently quoted in the press on IT project and related CIO issues. He is considered an enterprise software industry "influencer" and provides advice to technology buyers, vendors, and services firms.

Previously, Michael served as CEO of Cambridge Publications, which develops tools and processes for software implementations and related business practice automation projects. Michael has been involved with hundreds of software development projects, for companies ranging from small startups to Fortune 500 organizations. Michael graduated with an M.B.A. from Boston University and a B.A. from Bard College. He is a Board member of the America's Cup Hall of Fame and the Herreshoff Marine Museum in Bristol, RI.

Comments


RE: Amazon S3 web services down. Bad, bad news for customers.
johnfranks999 6th Jun 2008
Another business that needs to run to the library for "I.T. Wars: Managing the Business-Technology Weave in the New Millennium." The irony is, Amazon sells it. I urge every business person and IT person, management or staff, to get hold of a copy of this book. Our CEO has read it. Our project managers are on their second reading. Our vendors are required to read it (they can borrow our copies if they don't want to purchase it). Any agencies that wish to partner with us: We ask that they read it. Do yourself a favor and read this book - then ask your boss to read it - then ask your staff and co-workers to read it. If you get a chance, read the author's interview here: http://www.businessforum.com/DScott_02.html
And this is why
frgough 15th Feb 2008
Web 2.0 will die. Only a fool trusts sensitive data to a third party they have no control over.
RE: And this is why
fatman65535 15th Feb 2008
YOU SAID IT!!!!!!!!!

At least, when it is your system; you know whose ass to light a fire under!!!!!

In my book - RFD!!!!
employees they can trust, and reliability problems with the infrastructure that they buy. You could be down more than three hours with a failed disk drive for instance.

All in all, you are safer at Amazon, though Amazon has no excuse for this either.
Service interruptions inevitable
Chad_z 15th Feb 2008
I don't care who it is. Amazon, Google, Salesforce.com, MSFT...everyone is going to have an outage. Any service that rides on the internet is subject to interruption. Sometimes all it takes is some bleary-eyed construction worker on a track hoe. You could be out for days if they trench across your fiber.

That doesn't mean you shouldn't still consider outsource providers, just plan for something less than 100% uptime. Each 9 you add on the end of "99." adds geometrically to the costs involved. I think it's more practical to plan for an extended outage than spend increasing amounts of money on diminishing returns.

In my experience it's more cost effective to have SOP's to switch over to local backup processes. I built a system for a customer that tracked application processing at their district offices. We spent weeks discussing contingency options in the event of a service outage. The optimal solution turned out to be using a white board as local backup. With that simple solution we were able to do away with mirroring, redundant hosting, and a raft of other expensive options. I think the white board option cost something like five grand, which included installation.

Unfortunately we had a chance to test that plan when the RAID array on the DB server went south. Took three days to restore the db server. They tottered along with the white boards and hired temps to update the data when the system returned. Worked fine. The second test happened when Hurricane Katrina wiped out their southern district office. The white board continued to operate within nominal parameters at the backup office we established and later shifted to Houston.

Try to get execs to see that, though. You might as well be trying to stop the wind. They'll spend 10's even 100's of thousands on redundant systems that operate a few minutes out of a year. I think it's crazy but they'd rather spend the bucks than explain a service interrupt.
clouds are not magic ...
boblozano 15th Feb 2008
Nice post Michael. This outage is a perfect example of why it will always remain the app owner's responsibility to ensure the SLA - whether you use clouds, classic-in-house, or a mix of infrastructure.

More here: http://www.appistry.com/blogs/bob/amazon-s3-still-limping-limits-clouds

Bob
Dad, my cell phone broke!

Last summer my 15 year old greeted me - "Dad! My cell phone is broke and I can't text my friends!" "mmm.. this is serious. What did you use to do before I bought you the phone?" "I wasn't able to text back then, dah!!"

Today's uproar re: the Amazon S3 outage takes me back to that funny moment when my daughter finally got my point - that I enabled her to enjoy the world of texting. And, like many AWS bloggers today, she did not appreciate that I gave her this gift.

So, to put a big picture perspective on today's outage - most of us start ups, if not for AWS, would have burned thru our angel and round A funds to replicate AWS before we would have hit the tipping point and had the luxury of telling our customers that "we are experiencing an outage."

Looking back on my "old school" days of expensive networks, users running out of storage and the constant flow of cash to admin staff, I must admit to having a soft spot for the AWS team and service. In those days, a two hour outage was considered an opportunity for our users to chat with the cube neighbor or go down to the cafeteria for a donut. Fast forward to today's demanding customers and an outage of minutes starts Armageddon. Now, imagine if by some miracle, these customers actually pay for the start up's service.

Today, I welcomed the outage as it reinforced my need for AWS. How would my small team respond to an outage? We don't have the talented staff nor the passion the AWS team has. We forget that Amazon is in the small group of visionary "start-ups" who helped get the net to where we are today.

Phil Easter
CTO/AirMe
I think it is a little unfair for you to compare with Technorati's e-mail. T's e-mail was done after two days' analysis. Amazon's post was within hours of being hit. As they said, they'll provide further updates as the investigation proceeds. I see no reason to expect they will have anything but a comprehensive report of what happened shortly.
Must disagree
mkrigsman@... 15th Feb 2008
Although I understand your point, and considered it myself, there's an important difference between Technorati and AWS. AWS is a mission critical backbone for many businesses, which is simply not the case with Technorati. When Technorati goes down, it's inconvenient and a hassle. When AWS goes down, revenue for some companies can grind to a halt and customer relations can be negatively affected. In short, I think AWS should be held to a much higher standard of accountability.

Thanks for commenting and highlighting an important issue.
if you roll it yourself, you can lose a hard drive, a database problem, etc, and be down a lot more than three hours, AND, it is YOUR problem, not Amazon's. So, again, no excuse, but still probably more reliable than a small to mid-size business can do for themselves.