Update 2/16/08, 2:30PM EST: Nick Carr wrote a good post-mortem. So did Larry Dignan.
Update 2/16/08, 11:00AM EST: Amazon reports the cause of the problem:
Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.
Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST.
Should I interpret this to mean the service crashed because a few customers used it in an unexpected manner?
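Amazon's explanation comes down to a monitoring gap: total request volume looked normal while the share of expensive authenticated requests climbed. Here is a minimal sketch of what tracking that proportion could look like. It is not Amazon's implementation; the class name, window size, and alert threshold are all hypothetical choices made for illustration.

```python
from collections import deque


class AuthRatioMonitor:
    """Tracks the share of authenticated requests over the most recent requests.

    Hypothetical illustration only: window_size and alert_ratio are made-up
    defaults, not values from Amazon's post-mortem.
    """

    def __init__(self, window_size=1000, alert_ratio=0.5):
        self.window = deque(maxlen=window_size)
        self.alert_ratio = alert_ratio
        self.auth_count = 0

    def record(self, authenticated):
        # If the window is full, the oldest entry is about to be evicted,
        # so remove it from the running count first.
        if len(self.window) == self.window.maxlen and self.window[0]:
            self.auth_count -= 1
        self.window.append(authenticated)
        if authenticated:
            self.auth_count += 1

    def ratio(self):
        # Fraction of requests in the window that were authenticated.
        return self.auth_count / len(self.window) if self.window else 0.0

    def should_alert(self):
        # Fires on request mix, independent of overall request volume.
        return self.ratio() > self.alert_ratio
```

The point of the design is that the alert depends on the mix of requests, so it can fire even when the overall volume stays within its normal range, which is the situation Amazon describes.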
Update 2/15/08, 4:00PM EST: Back to normal. From the forum:
The team continues to be heads down focused on getting to root cause on this morning’s problem. One of our three geographic locations for S3 was unreachable beginning at 4:31 a.m. PST and was back to near normal performance at 6:48 a.m. PST (a small number of customers experienced intermittent issues for a short period thereafter).
UPDATE 2/15/08, 2:00PM EST: Isolated problems still being reported by users. Amazon blog remains silent. Blog swarm grows.
UPDATE 2/15/08, 1:15PM EST: Amazon blog remains silent on the issue. Guys, your silence is deafening.
UPDATE 2/15/08, 12:26PM EST: Some users still reporting problems on the forum.
UPDATE 2/15/08, 11:26AM EST: Problems still being reported.
UPDATE 2/15/08, 10:26AM EST: Amazon's service has been restored. See below for more information.
Amazon's S3 web service is currently down. From a message on their forum:
Massive (500) Internal Server Error outage started 35 minutes ago. Sample response:
<?xml version="1.0" encoding="UTF-8"?>
<Message>We encountered an internal error. Please try again.</Message>
</Error>---> System.Net.WebException: The remote server returned an error:(500) Internal Server Error.
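The error body tells callers to "try again." For brief, transient 500s, a client can do that with exponential backoff. The following is a minimal sketch under stated assumptions: `call_with_retries` and its parameters are hypothetical names, and `operation` stands in for whatever function actually issues the request and returns a `(status_code, body)` pair. It is not part of any Amazon client library.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry a request on 5xx responses with exponential backoff and jitter.

    `operation` is a hypothetical zero-argument callable returning
    (status_code, body). All defaults here are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = operation()
        if status < 500:
            # Success or a client-side error: retrying will not help.
            return status, body
        if attempt == max_attempts:
            break
        # Delay doubles each attempt; random jitter spreads out retries
        # so many clients do not hit the service at the same instant.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay))
    raise RuntimeError(
        f"gave up after {max_attempts} attempts; last status {status}"
    )
```

Retries like this only smooth over short-lived failures. In an outage lasting hours, as this one did, the call still fails once the attempts run out, and the application needs its own fallback.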
For users and companies who signed up hoping for mission critical service, this is bad news indeed.
It's made worse for Amazon by speculation that EMC will host Business ByDesign, SAP's forthcoming Software as a Service (SaaS) offering. If this news is correct, it's a vote of confidence in EMC's, rather than Amazon's, hosting abilities.
Updates: Here's a link to Amazon's SLA. ZDNet's Larry Dignan has posted about what the SLA actually means.
Two and a half hours into the problem, here's Amazon's response so far:
And here's a user's response:
News from Amazon?? I need to say something to my clients.
Just after 2.5 hours down, some users (but not all) are reporting that service is returning:
Its back up for us.
Three hours into the problem, Amazon has finally deigned to respond:
We’ve resolved this issue, and performance is returning to normal levels for all Amazon Web Services that were impacted. We apologize for the inconvenience. Please stay tuned to this thread for more information about this issue.
In comparison to this lame response, read what Technorati did when their service had problems. Note the Amazon problem was far, far more severe than Technorati's. I'm glad the blogs have picked this up.
This must be very big or Amazon would have commented already. They will lose customers over this.