Amazon Web Services on Friday published its post-mortem on the outage that stretched on for more than a day. While the apology and recap of events are notable, the biggest takeaway is that Amazon promises to communicate better.
If we've learned anything from the recent spate of outages, it's that communication is everything.
The nut of AWS' communications and transparency policy boils down to this excerpt:
In addition to the technical insights and improvements that will result from this event, we also identified improvements that need to be made in our customer communications. We would like our communications to be more frequent and contain more information. We understand that during an outage, customers want to know as many details as possible about what’s going on, how long it will take to fix, and what we are doing so that it doesn’t happen again. Most of the AWS team, including the entire senior leadership team, was directly involved in helping to coordinate, troubleshoot and resolve the event. Initially, our primary focus was on thinking through how to solve the operational problems for customers rather than on identifying root causes. We felt that focusing our efforts on a solution and not the problem was the right thing to do for our customers, and that it helped us to return the services and our customers back to health more quickly. We updated customers when we had new information that we felt confident was accurate and refrained from speculating, knowing that once we had returned the services back to health that we would quickly transition to the data collection and analysis stage that would drive this post mortem.
That said, we think we can improve in this area. We switched to more regular updates part of the way through this event and plan to continue with similar frequency of updates in the future. In addition, we are already working on how we can staff our developer support team more expansively in an event such as this, and organize to provide early and meaningful information, while still avoiding speculation.
Amazon is also providing a 10-day credit for 100 percent of usage for Elastic Block Store (EBS).
As for the post-mortem itself, AWS has provided a lot of detail. Its account of the cascading effects of the network outage is worth a read.
- Amazon Web Services outage: 'Detailed post mortem' coming
- Cloud talk: How Okta stayed running during AWS outage
- Lessons from Amazon’s outage
- Amazon’s N. Virginia EC2 cluster down, ‘networking event’ triggered problems
- Amazon’s Web Services outage: End of cloud innocence?
- Whether it’s Amazon or Microsoft, there’s (still) no foolproof cloud