AWS: Here's what went wrong in our big cloud-computing outage

AWS says sorry for a long outage after tripping up on address translation between main and internal networks.
Written by Liam Tung, Contributing Writer

Amazon Web Services (AWS) rarely goes down unexpectedly, but you can expect a detailed explainer when a major outage does happen. 


The latest of AWS's major outages occurred at 7:30AM PST on Tuesday, December 7, lasted five hours and affected customers using certain application interfaces in the US-EAST-1 Region. In a public cloud of AWS's scale, a five-hour outage is a major incident.

According to AWS's explanation of what went wrong, the source of the outage was a glitch in its internal network, which hosts "foundational services" such as application/service monitoring, the AWS internal Domain Name Service (DNS), authorization, and parts of the Elastic Compute Cloud (EC2) network control plane. DNS was important in this case as it's the system used to translate human-readable domain names into numeric internet (IP) addresses.
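For readers unfamiliar with the mechanism, DNS translation is a single lookup from name to address. A minimal illustration in Python (resolving "localhost" so the example works without external network access):

```python
import socket

# DNS translation in one call: resolve a human-readable name
# to a numeric IP address.
ip = socket.gethostbyname("localhost")
print(ip)  # typically 127.0.0.1
```

When a service's internal DNS is impaired, every component that still holds only names, not addresses, loses the ability to find its peers, which is why DNS errors featured so heavily in this incident.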

SEE: Having a single cloud provider is so last decade

AWS's internal network underpins parts of the main AWS network that most customers connect to in order to deliver their services. Normally, when the main network scales up to meet a surge in resource demand, the internal network should scale up proportionally via networking devices that handle network address translation (NAT) between the two networks. 

However, on Tuesday last week, the cross-network scaling didn't go smoothly: AWS NAT devices on the internal network became "overwhelmed", delaying communication between the networks, with severe knock-on effects for several customer-facing services that, technically, were not directly impacted. 
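Why does a NAT device get "overwhelmed"? A NAT device tracks each translated connection in a finite table; a surge of new connections can exhaust that capacity. The toy model below is purely conceptual (it is not AWS's implementation, and real devices saturate on packet throughput and state-table churn rather than a simple count), but it shows the basic failure mode:

```python
# Conceptual sketch, NOT AWS code: a NAT device maps internal
# (address, port) pairs to public-side ports in a finite table.
class TinyNAT:
    def __init__(self, capacity):
        self.capacity = capacity
        self.table = {}          # (internal_ip, port) -> public port
        self.next_port = 40000

    def translate(self, internal_ip, port):
        key = (internal_ip, port)
        if key in self.table:            # existing flows keep working
            return self.table[key]
        if len(self.table) >= self.capacity:
            raise RuntimeError("NAT table exhausted: device overwhelmed")
        self.table[key] = self.next_port # new flow claims a table slot
        self.next_port += 1
        return self.table[key]

nat = TinyNAT(capacity=2)
nat.translate("10.0.0.1", 1234)
nat.translate("10.0.0.2", 1234)
try:
    nat.translate("10.0.0.3", 1234)  # the surge: one client too many
except RuntimeError as err:
    print(err)
```

In the real incident the analogous effect was delay and packet loss rather than a clean error, which is what made the congestion feed on itself.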

"At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network," AWS says in its postmortem. 

"This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks." 

The delays increased latency and errors for foundational services communicating between the networks, triggering even more failing connection attempts that ultimately led to "persistent congestion and performance issues" on the internal network devices.   
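This is a classic retry feedback loop: failed connections are retried immediately, adding more load to already-congested devices. The standard mitigation (a general sketch, not a claim about AWS's actual remediation) is exponential backoff with jitter, so retries spread out over time instead of arriving in synchronized waves:

```python
import random

# Sketch of exponential backoff with "full jitter": the wait before
# retry number `attempt` is drawn uniformly from [0, min(cap, base*2^n)],
# so repeated failures push clients apart instead of into lockstep.
def backoff_delay(attempt, base=0.1, cap=5.0):
    """Seconds to sleep before retry `attempt` (0-indexed)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

delays = [round(backoff_delay(n), 3) for n in range(6)]
print(delays)  # upper bound doubles each attempt, capped at 5 seconds
```

Without such damping, every failure generates another near-immediate attempt, which is exactly the "more failing connection attempts" spiral the postmortem describes.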

With the connection between the two networks blocked up, the AWS internal operations team quickly lost visibility into its real-time monitoring services and was forced to rely on past-event logs to figure out the cause of the congestion. After identifying a spike in internal DNS errors, the teams diverted internal DNS traffic away from blocked paths. This work was completed at 9:28AM PST, roughly two hours after the outage began.    

This alleviated impact on customer-facing services but didn't fully fix affected AWS services or unblock NAT device congestion. Moreover, the AWS internal ops team still lacked real-time monitoring data, subsequently slowing recovery and restoration. 

Besides lacking real-time visibility, AWS's internal deployment systems were hampered, again slowing remediation. The third major obstacle to recovery was concern that a fix for internal-to-main network communications would disrupt other customer-facing AWS services that weren't affected. 

"Because many AWS services on the main AWS network and AWS customer applications were still operating normally, we wanted to be extremely deliberate while making changes to avoid impacting functioning workloads," AWS said. 

So which AWS customer services were impacted?    

First, the main AWS network was not affected, so AWS customer workloads were "not directly impacted", AWS says. Rather, customers were affected by AWS services that rely on its internal network. 

However, the knock-on effects of the internal network glitch reached far and wide across customer-facing AWS services, affecting everything from compute, container and content distribution services to databases, desktop virtualization and network optimization tools.  

AWS control planes are used to create and manage AWS resources. These control planes were affected as they are hosted on the internal network. So, while EC2 instances were not affected, the EC2 APIs customers use to launch new EC2 instances were. Higher latency and error rates were the first impacts customers saw at 7:30AM PST. 
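The distinction the paragraph above draws is between the control plane (the management APIs that create and modify resources) and the data plane (the resources already running). The toy model below is illustrative only, not AWS code, but it captures why existing workloads kept serving traffic while new launches failed:

```python
# Toy sketch of the control-plane / data-plane split (NOT AWS code):
# the data plane (already-running instances) keeps working even when
# the control plane (the management API) is impaired.
class ControlPlane:
    def __init__(self):
        self.available = True

    def launch_instance(self, fleet):
        if not self.available:
            raise RuntimeError("API error: cannot launch new instances")
        fleet.append(f"instance-{len(fleet)}")

fleet = ["instance-0"]        # workloads already running (data plane)
api = ControlPlane()
api.available = False         # internal-network outage hits the control plane

print(fleet)                  # existing instances still serve traffic
try:
    api.launch_instance(fleet)  # but new launches fail with API errors
except RuntimeError as err:
    print(err)
```

This split is why AWS could say customer workloads were "not directly impacted" even as customers saw elevated error rates from the EC2 APIs.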

SEE: Cloud security in 2021: A business guide to essential tools and best practices

With this capability gone, customers had trouble with Amazon RDS (Relational Database Service) and the Amazon EMR big data platform, while customers of the Amazon WorkSpaces managed desktop virtualization service couldn't create new resources. 

Similarly, AWS's Elastic Load Balancers (ELB) were not directly affected but, since the ELB APIs were, customers couldn't add new instances to existing ELBs as quickly as usual.   

Route 53 (DNS) APIs were also impaired for five hours, preventing customers from changing DNS entries. There were also login failures to the AWS Console, latency affecting the AWS Security Token Service used by third-party identity providers, delays to CloudWatch, impaired access to Amazon S3 buckets and DynamoDB tables via VPC Endpoints, and problems invoking serverless Lambda functions.   

The December 7 incident shared at least one trait with a major outage that occurred this time last year: it stopped AWS from communicating swiftly with customers about the incident via the AWS Service Health Dashboard. 

"The impairment to our monitoring systems delayed our understanding of this event, and the networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region," AWS explained. 

Additionally, the AWS support contact center relies on the AWS internal network, so staff couldn't create new cases at normal speed during the five-hour disruption.

AWS says it will release a new version of its Service Health Dashboard in early 2022, which will run across multiple regions to "ensure we do not have delays in communicating with customers."

Cloud outages do happen. Google Cloud has had its fair share, and Microsoft in October had to explain its eight-hour outage. While rare, these outages are a reminder that although the public cloud may be more reliable than conventional data centers, things do go wrong, sometimes catastrophically, and can affect a wide range of critical services. 

"Finally, we want to apologize for the impact this event caused for our customers," said AWS. "While we are proud of our track record of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further."
