Amazon Web Services (AWS) experienced a substantial service outage on July 20, disrupting customers across the web. Its post-mortem demonstrates an unusual level of organizational maturity for an Enterprise 2.0 company.
Here's the technical description of the cause:
We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
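The fix Amazon describes, checksumming internal state messages so a flipped bit is detected and the message rejected rather than propagated, can be sketched in a few lines. This is an illustrative example only, not Amazon's implementation; the message format and field names are invented:

```python
import hashlib
import json

def wrap_message(state: dict) -> bytes:
    """Serialize a state message and prepend its MD5 digest."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.md5(payload).hexdigest().encode("ascii")
    return digest + b"\n" + payload

def unwrap_message(raw: bytes) -> dict:
    """Verify the checksum; reject the message if the payload was corrupted."""
    digest, _, payload = raw.partition(b"\n")
    if hashlib.md5(payload).hexdigest().encode("ascii") != digest:
        # In a real system this is where the corrupt message would be
        # logged and dropped instead of being gossiped onward.
        raise ValueError("checksum mismatch: rejecting corrupted state message")
    return json.loads(payload)

# A single flipped bit is caught:
msg = wrap_message({"server": "node-17", "status": "failed"})
corrupted = bytearray(msg)
corrupted[-1] ^= 0x01  # flip one bit in the payload
try:
    unwrap_message(bytes(corrupted))
except ValueError:
    print("corrupted message rejected")
```

The point of the sketch is the asymmetry Amazon calls out: customer objects were already protected this way, but the internal state messages were not, so a one-bit error was treated as valid state.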
The report presents a management perspective as well:
During our post-mortem analysis we've spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we're taking: (a) we've deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we've deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we've added additional monitoring and alarming of gossip rates and failures; and, (d) we're adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.
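Action (c), monitoring and alarming on gossip rates, amounts to watching the volume of failure gossip and raising an alarm when it exceeds a threshold. A minimal sliding-window sketch, with hypothetical class names and thresholds not drawn from Amazon's systems:

```python
from collections import deque

class GossipRateMonitor:
    """Alarm when gossip messages in a sliding window exceed a threshold."""

    def __init__(self, window_seconds: float, alarm_threshold: int):
        self.window = window_seconds
        self.threshold = alarm_threshold
        self.events = deque()  # timestamps of recent gossip messages

    def record(self, now: float) -> bool:
        """Record one gossip message; return True if the rate alarms."""
        self.events.append(now)
        # Drop events that have aged out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) > self.threshold

monitor = GossipRateMonitor(window_seconds=1.0, alarm_threshold=100)
```

An alarm like this would have flagged the runaway gossip on Sunday long before its effects became system-wide, which is precisely the early-warning role Amazon assigns it.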
THE PROJECT FAILURES ANALYSIS
In analyzing the failure, Amazon asked four questions:
What happened? The first step to a successful post-mortem is establishing a clear understanding of what went wrong. You can't analyze what you don't understand.
Why did it happen? After determining the facts, the post-mortem team should assess why the failure occurred. In addition to looking strictly at technical causes of failure, also examine the underlying organizational, management, and team environment. Be aware that some team members may ignore warning signs of impending disaster, fearing blame for issues over which they have no control.
How did we respond and recover? The mirror of hindsight can be painful, which is why many organizations fail at this stage. A useful post-mortem depends on the analysis team gaining a reasonable level of honesty, insight, and cooperation from the organization. If your company is saddled with a culture of blame, where management avoids its own responsibilities and turns individuals into scapegoats, then the entire post-mortem process is probably a waste of time. (If that's the case, I suggest team members consider finding another job; life is just too short.)
How can we prevent similar unexpected issues from having system-wide impact? Unexpected technical issues do arise in mission-critical or complex hardware and software systems. However, the key to prevention is technical planning to prevent narrow problems from propagating through the entire system. Planning must also consider the business process and management responses the team initiates when a failure occurs. A complete post-mortem addresses both technical and management issues.
Amazon's technical failure disrupted its customers' business and hurt the company's credibility. However, its open and transparent response to the failure and its aftermath demonstrates a level of organizational maturity rarely found among Enterprise 2.0 companies.