Amazon has issued a statement that adds a little more clarity to its Web services outage on Friday.
Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.
Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.
As we said earlier today, though we're proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable. As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements. We are taking immediate action on the following: (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls. Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.
Sincerely, The Amazon Web Services Team
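Amazon's first action item — monitoring the proportion of authenticated requests rather than just raw volume — comes down to tracking a ratio over a sliding window and alerting when the expensive request type grows out of line. Here is a minimal sketch of that idea; the class name, window size, and threshold are illustrative assumptions, not anything Amazon has described:

```python
from collections import deque


class AuthRequestRatioMonitor:
    """Tracks the fraction of authenticated (crypto-heavy) requests over a
    sliding window and flags when that fraction exceeds a threshold."""

    def __init__(self, window_size=1000, alert_ratio=0.5):
        self.window = deque(maxlen=window_size)  # 1 = authenticated, 0 = other
        self.alert_ratio = alert_ratio

    def record(self, authenticated):
        self.window.append(1 if authenticated else 0)

    @property
    def ratio(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def should_alert(self):
        # Total request volume can stay within normal ranges while the
        # authenticated share -- which costs more resources per call --
        # quietly climbs. That is exactly the blind spot Amazon describes.
        return len(self.window) == self.window.maxlen and self.ratio > self.alert_ratio
```

The point is that a volume-only alarm would never have fired here; only a metric on the *mix* of request types catches this failure mode.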
Nick Carr has more. A few takeaways:
- Amazon is creating an uptime dashboard. That's a positive development.
- For now, you can chalk this outage up to growing pains at Amazon's Web services.
- Amazon needs to work on its customer communication.
- These cloud services are really an expectations game. As some talkbackers have noted, electric service goes down from time to time too. Do we hold computing power to a higher standard than the electric grid?
The customer reaction to Amazon's explanation bore out those takeaways. This post summed up the customer perspective well.
Thanks for the update. As for your longer-term plans, the "service health dashboard" is a particularly good idea! However, to provide a truly excellent service-health solution, Amazon needs to provide machine-readable management data that we can integrate into our own infrastructure, so that we in turn can tell our customers what is going on. Besides machine-readable info, AWS blog updates, RSS feeds, and email notifications for major service health issues are a must. In addition, customers like us need to be able to set up an error-page redirect for when EC2 is down (so that users trying to reach a Web server hosted on EC2 get a decent error page if your normal EC2 infrastructure is unavailable).
Incidentally, our company's biggest complaint is not that your servers were down but that we had to do quite a bit of detective work to find out why (an error on our own servers? a recent code update we made? or AWS itself?). In this case, the lack of information from Amazon cost us more trouble and money than the actual outage did. It is simply NOT good enough to require your customers to browse forum threads to find out what is wrong. There was no information on your front page, no email notifications, and no updates on your blog.
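What the commenter is asking for — machine-readable status that customers can poll and act on (for example, switching to their own error page) — might look something like the following. The endpoint URL and the JSON response shape are invented for illustration; Amazon published no such feed at the time:

```python
import json
import urllib.request

# Hypothetical endpoint and payload -- purely illustrative.
STATUS_URL = "https://status.example.com/s3.json"


def parse_health(payload):
    """Parse a status payload like '{"status": "ok"}'.
    Returns (healthy, detail); an absent status field reads as unknown."""
    data = json.loads(payload)
    status = data.get("status", "unknown")
    return status == "ok", status


def check_service_health(url=STATUS_URL, timeout=5):
    """Fetch and parse the status feed. Network failures count as
    unhealthy so callers can fail over to their own error page."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health(resp.read())
    except OSError as exc:
        return False, f"status feed unreachable: {exc}"
```

A customer-side cron job could call `check_service_health()` every minute and flip a load balancer to a static "upstream provider is down" page — turning hours of detective work into an automated check.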