Amazon knocked AWS sites offline because of typo

AWS explains how S3 storage at its massive US-EAST-1 region was disrupted and what it's doing to prevent this from happening again.
Written by Stephanie Condon, Senior Writer

The hours-long Amazon Web Services incident that knocked major sites offline and caused problems for several others on Tuesday was caused by a typo, AWS reported Thursday.

The cloud infrastructure provider issued the following explanation:

The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

The mistake inadvertently took down two subsystems, index and placement, that are critical to all S3 objects in the US-EAST-1 region -- a massive datacenter location that also happens to be Amazon's oldest. Both systems required a full restart. That process, along with running the necessary safety checks, "took longer than expected," Amazon noted.

While the two subsystems were being restarted, S3 was unable to service requests. Other AWS services in the region that rely on S3 for storage were also impacted, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda.

The index subsystem was fully recovered by 1:18 pm PT, Amazon noted, while the placement subsystem was recovered by 1:54 pm PT. By that point, S3 was operating normally.

AWS noted that it's making "several changes" as a result of the incident, including steps that would prevent an incorrect input from triggering such problems in the future.

"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly," the blog explained. "We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
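The two safeguards AWS describes, removing capacity in small steps and refusing any request that would drop a subsystem below its minimum, can be illustrated with a minimal Python sketch. This is not AWS's tool: the `Subsystem` class, the `plan_removal` function, the `max_per_step` parameter, and all server counts are hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


class CapacityError(ValueError):
    """Raised when a removal request would breach the capacity floor."""


@dataclass(frozen=True)
class Subsystem:
    # All fields are illustrative; AWS has not published real values.
    name: str
    active_servers: int
    min_required: int


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> list[int]:
    """Return the batch sizes for a removal, or refuse the request outright.

    The whole request is rejected before anything is removed if it would
    leave the subsystem below its minimum required capacity.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    if max_per_step < 1:
        raise ValueError("max_per_step must be at least 1")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"removing {requested} servers from {subsystem.name} would leave "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    # Split the request into small batches so capacity drains slowly.
    batches: list[int] = []
    left = requested
    while left > 0:
        step = min(left, max_per_step)
        batches.append(step)
        left -= step
    return batches
```

With a hypothetical ten-server subsystem that needs at least six servers, a request to remove three is split into batches of two and one, while a mistyped request to remove five is rejected before any server is touched.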

Among other notable steps, AWS has begun work on partitioning parts of the index subsystem into smaller cells. The company has also changed the administration console for the AWS Service Health Dashboard so that it runs across multiple AWS regions. (Ironically, the typo took out the dashboard on Tuesday, so AWS had to rely on Twitter to keep customers updated on the problems.)
