Amazon Web Services' prolonged downtime over the weekend was caused by lightning, according to the company - but was lengthened by software bugs in the cloud provider’s infrastructure.
The bugs, disclosed by Amazon in an analysis of the failure, show how the scale at which cloud providers operate can make them acutely vulnerable to failures in their software systems, and act as an object lesson in Google's statement that "at scale, everything breaks."
According to Amazon, its US East region (10 datacentres split across four availability zones) was initially troubled by utility power fluctuations, probably caused by an electrical storm.
The outage took out key AWS technologies such as EC2.
However, it was a variety of unforeseen bugs appearing in Amazon's software that caused the outage to last so long: for example, one datacentre failed to switch over to its backup generators and eventually the stores of energy in its uninterruptible power supply (UPS) were depleted, shutting down hardware in the region.
This unavailability of instances caused a "significant impact to many customers," Amazon said, but was worsened by degradation of the "control planes" — software that lets customers create, remove and change resources across the region. This inflexibility hobbled customers’ ability to respond to the outage.
And the problems didn't end there: a bottleneck appeared in Amazon’s server booting process, which meant it took longer than expected to bring key AWS components like EC2 and EBS back online. This led to a further problem for Amazon: when it brings EBS back, it needs to perform various technical operations to ensure that data stored in the technology is preserved, and due to the number of affected pieces of hardware "it still took several hours to complete the backlog."
Together, these difficulties lengthened the recovery process well beyond what it would have been if the company had simply been bringing a generator back online.
"A bug we hadn’t seen before"
Perhaps the most critical problem for Amazon was the unforeseen bug that appeared in its Elastic Load Balancer (ELB), which is used to route traffic to servers with capacity.
When key AWS components like EC2 go down, the ELB system frantically tries to assign workloads to servers with space. However, as Amazon’s cloud rebooted, "a large number of ELBs came up in a state which triggered a bug we hadn’t seen before," the company said.
The bug meant that Amazon tried to rapidly scale the affected ELBs up to a larger size, flooding Amazon’s cloud with requests that caused a backlog in its control plane.
This combined with a rise in the number of new servers being provisioned by customers in unaffected availability zones, which added even more requests to the control plane and increased the backlog still further.
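The dynamics of such a backlog can be illustrated with a toy model. The sketch below is not based on Amazon's systems: the function name `simulate_backlog`, the fixed per-tick capacity and all of the request figures are hypothetical, chosen only to show that when requests arrive faster than a control plane can process them, the queue keeps growing and takes a while to drain even after the surge has ended.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick):
    """Return the queue length after each tick for a fixed-capacity processor.

    Toy model: each tick, new requests join the queue and up to
    `capacity_per_tick` requests are processed.
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_tick:
        # Unprocessed requests carry over; the queue cannot go negative.
        backlog = max(0, backlog + arrivals - capacity_per_tick)
        history.append(backlog)
    return history


# Hypothetical figures, not taken from Amazon's report.
normal = [80] * 3     # ordinary load, below capacity
surge = [150] * 4     # e.g. load balancer scaling requests plus customer launches
recovery = [80] * 4   # load returns to normal

history = simulate_backlog(normal + surge + recovery, capacity_per_tick=100)
print(history)  # [0, 0, 0, 50, 100, 150, 200, 180, 160, 140, 120]
```

In this sketch the queue is still 120 requests deep four ticks after the surge stops, because the spare capacity (20 requests per tick) is small compared with the backlog that built up.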
A similar bug occurred in the recovery process for components of Amazon's Relational Database Service (RDS). Due to changes made to how Amazon dealt with storage failures, a bug appeared that meant some RDS instances spread across multiple availability zones did not complete failover, rendering them useless.
This bug is one which "only manifested when a certain sequence of communication failure is experienced," Amazon said, "situations we saw during this event as a variety of server shutdown sequences occurred."
All in all, though Amazon's outage happened because of an electrical storm, it was the hidden bugs in the cloud provider’s infrastructure that led to the real problems.
This strengthens the argument for cloud providers like Amazon to fully disclose their IT systems to either their customers or an independent third party for assessment, testing and inspection, as Yale academic Bryan Ford has argued in his paper "Icebergs in the clouds: the other risks of cloud computing" (PDF).
It also gives a real-world example of the types of problems that he has theorised might arise in sufficiently large clouds.
Amazon has promised customers that it will "spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make further changes to improve our services and processes," but I do not think this goes far enough. In my opinion, more eyes are better, and Amazon should consider publishing further details of its technical infrastructure so customers can have as much insight into its infrastructure as it does.