How Amazon Web Services crashed and rose again

September 20th was a bad day for Amazon. For five hours, AWS's US-East region was misbehaving. Here's why it happened and how AWS was restored to service.
Written by Steven Vaughan-Nichols, Senior Contributing Editor

If Amazon Web Services (AWS) had gone down on the morning of Monday, September 21, instead of Sunday, September 20, people would still be screaming about it. Instead, it went down around 3 AM Pacific Daylight Time (PDT), and barely anyone noticed.

Unless, of course, you were a system administrator for a popular user service or website, such as Amazon Video or Reddit. If you were one of those people, you noticed. Boy, did you notice.

Only one major AWS customer, Netflix, seems to have been ready for a major AWS data-center failure. No one else seems to have been.

You see, this wasn't a "simple" data-center problem, like a backhoe taking out AWS US-East's Internet backbone. No, it was much more complicated.

It all began with the Amazon DynamoDB service in Virginia having problems. DynamoDB is a fast, flexible NoSQL database service. It's designed to support applications that require consistent, single-digit-millisecond latency at scale. That, as you would guess, means it's used by many, if not all, time-sensitive AWS cloud services.
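DynamoDB delivers that latency at scale by hashing each item's partition key to pick the storage partition that holds it, with internal metadata tracking which partitions belong to which table. Here's a toy sketch of hash-based partition routing; the function and names are illustrative, not DynamoDB's actual internals:

```python
import hashlib

def partition_for(table_partitions, partition_key):
    """Route an item's partition key to one of a table's partitions.

    A toy illustration of hash-based partitioning, the general scheme
    DynamoDB uses to spread a table's items across storage partitions.
    The real service's hash function and partition metadata are
    internal details; this is only a sketch of the idea.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    # Same key always hashes to the same partition, so lookups are O(1)
    # once the table-to-partition metadata is known.
    return table_partitions[int(digest, 16) % len(table_partitions)]
```

Because every read and write must first resolve its key to a partition, the metadata that describes those partitions sits on the critical path of every request.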

Officially, an AWS spokesperson said, "Between 2:13 AM and 7:10 AM PDT on September 20, 2015, AWS experienced significant error rates with read and write operations for the Amazon DynamoDB service in the US-East Region, which impacted some other AWS services in that region, and caused some AWS customers to experience elevated error rates."

When DynamoDB started having read/write issues, its performance began to collapse. That impacted other AWS services in US-East, whose application programming interfaces (APIs) started timing out. From there, services built on AWS started failing.

Some customers were affected more than others. In most cases, customers saw an increase in errors, which kept some of them from reaching their own sites and services. Many of these sites didn't "go down," but their performance fell to unacceptable levels.
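How well a customer rode out those elevated error rates came down largely to how their clients retried failed calls. The standard defensive pattern, which AWS SDKs apply a variant of internally, is exponential backoff with jitter. A minimal generic sketch, assuming a hypothetical zero-argument `api_call` that raises on error:

```python
import random
import time

def call_with_backoff(api_call, max_retries=5, base_delay=0.1):
    """Retry a flaky API call with exponential backoff and jitter.

    `api_call` is any zero-argument callable that raises on failure.
    This is a generic sketch of the technique, not an AWS SDK API.
    """
    for attempt in range(max_retries):
        try:
            return api_call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            # Sleep up to base_delay * 2^attempt, randomized so that
            # thousands of clients don't retry in lockstep and hammer
            # an already-struggling service.
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

Clients that backed off and retried saw degraded latency; clients that retried aggressively, or not at all, saw hard failures.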

According to the AWS Service Health Dashboard entries for DynamoDB that Sunday, here's how the problem unfolded.

3:00 AM PDT We are investigating increased error rates for API requests in the US-EAST-1 Region.

3:26 AM PDT We are continuing to see increased error rates for all API calls in DynamoDB in US-East-1. We are actively working on resolving the issue.

4:05 AM PDT We have identified the source of the issue. We are working on the recovery.

4:41 AM PDT We continue to work towards recovery of the issue causing increased error rates for the DynamoDB APIs in the US-EAST-1 Region.

4:52 AM PDT We want to give you more information about what is happening. The root cause began with a portion of our metadata service within DynamoDB. This is an internal sub-service which manages table and partition information. Our recovery efforts are now focused on restoring metadata operations. We will be throttling APIs as we work on recovery.

So, Amazon took almost two hours to nail down the root cause. It then throttled the DynamoDB APIs so its engineers could work on restoring the metadata service.
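Throttling like this is commonly implemented with a token-bucket rate limiter: requests proceed only while tokens remain, capping the load a recovering backend has to absorb. How AWS throttled internally isn't public; this is just a minimal sketch of the technique:

```python
import time

class TokenBucket:
    """A minimal token-bucket rate limiter.

    A generic illustration of throttling, not AWS's implementation:
    each request spends one token, tokens refill at a fixed rate, and
    requests are rejected once the bucket runs dry.
    """

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Lowering the refill rate during an incident sheds load immediately; raising it step by step is exactly the "removing throttles ... proceeding cautiously" dance the dashboard entries below describe.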

5:22 AM PDT We can confirm that we have now throttled APIs as we continue to work on recovery.

5:42 AM PDT We are seeing increasing stability in the metadata service and continue to work towards a point where we can begin removing throttles.

6:19 AM PDT The metadata service is now stable and we are actively working on removing throttles.

7:12 AM PDT We continue to work on removing throttles and restoring API availability but are proceeding cautiously.

7:22 AM PDT We are continuing to remove throttles and enable traffic progressively.

7:40 AM PDT We continue to remove throttles and are starting to see recovery.

7:50 AM PDT We continue to see recovery of read and write operations and continue to work on restoring all other operations.

8:16 AM PDT We are seeing significant recovery of read and write operations and continue to work on restoring all other operations.

So, from start to finish, it took AWS just over five hours to get back up to full speed.

In theory, the July 16 Amazon DynamoDB release could have helped customers mitigate this problem. That's because this release included DynamoDB cross-region replication. With this client-side solution, AWS customers can maintain identical copies of DynamoDB tables across different AWS regions in near real time. With it, you can, for additional fees of course, use cross-region replication to back up DynamoDB tables or to provide low-latency access to geographically distributed data.
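The idea behind that client-side approach can be sketched in a few lines: mirror every write to a second "region" so reads can fail over when the primary is down. The real library streams table updates between regions; here plain dicts stand in for DynamoDB tables, and all names are illustrative, not the AWS library's API:

```python
class ReplicatedTable:
    """Client-side sketch of cross-region replication.

    A toy model: plain dicts stand in for a primary table (say, in
    us-east-1) and a replica in another region. The real DynamoDB
    cross-region replication library streams table updates; this
    only illustrates the failover idea.
    """

    def __init__(self):
        self.primary = {}     # table in the primary region
        self.replica = {}     # near-real-time copy in another region
        self.primary_up = True

    def put(self, key, value):
        self.primary[key] = value
        self.replica[key] = value  # mirror the write to the replica

    def get(self, key):
        # Fail over to the replica when the primary region is down.
        table = self.primary if self.primary_up else self.replica
        return table.get(key)
```

A customer architected this way could have served reads from another region while US-East recovered, which is essentially what kept Netflix on the air.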

Still, as this episode showed, even the largest cloud provider in the world can have major failures. If your business depends on always being available, investing in DynamoDB cross-region replication would be a smart move.
