On Thursday, retailer-turned-cloud giant Amazon suffered an outage of its Amazon Web Services at a Northern Virginia datacenter.
Many popular websites, including Quora, Hipchat, and Heroku --- a division of Salesforce --- were knocked offline for hours during the evening. Even Dropbox stumbled as a result of the outage.
Amazon was quick to detail what had gone wrong, when, and roughly why, in a feat of transparency rarely seen from cloud providers, with the possible exception of Google.
Only a few days later, Amazon explained that the cause of the fault --- which hit its Elastic Compute Cloud (EC2) service --- was none other than a power failure.
For those whose browsers don't speak RSS, Amazon explained:
"At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power."
And then an epic stroke of bad luck kicked in, as one of the vital power generators checked out:
"At 8:53PM PDT, one of the generators overheated and powered off because of a defective cooling fan. At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity).
"Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power."
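For readers who think better in code, the cascade Amazon describes can be sketched as a toy failover chain. This is only an illustration under stated assumptions, not a model of Amazon's actual systems: the `PowerSource` class, the `simulate` function, and the load and breaker figures are all hypothetical, and the sketch collapses the gap between 8:53PM and 8:57PM into a single step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PowerSource:
    name: str
    healthy: bool = True
    # Load at which this source's breaker opens (arbitrary units, hypothetical).
    breaker_trip_load: float = float("inf")

    def can_carry(self, load: float) -> bool:
        # Usable only if the source is running and the load stays
        # below the breaker's trip point.
        return self.healthy and load < self.breaker_trip_load


def active_source(chain: List[PowerSource], load: float) -> Optional[str]:
    """Return the name of the first source, in priority order, that can carry the load."""
    for source in chain:
        if source.can_carry(load):
            return source.name
    return None  # nothing left: instances lose power


def simulate(secondary_trip_load: float, load: float = 100.0) -> List[Optional[str]]:
    utility = PowerSource("utility")
    generator = PowerSource("generator")
    secondary = PowerSource("secondary backup", breaker_trip_load=secondary_trip_load)
    chain = [utility, generator, secondary]

    history = [active_source(chain, load)]      # normal operation
    utility.healthy = False                     # cable fault takes out utility power
    history.append(active_source(chain, load))
    generator.healthy = False                   # generator overheats and powers off
    history.append(active_source(chain, load))
    return history


if __name__ == "__main__":
    # Breaker set below the real load (assumed figures): the chain runs out.
    print(simulate(secondary_trip_load=80.0))
    # Breaker set above the real load: the secondary backup holds.
    print(simulate(secondary_trip_load=150.0))
```

The two runs differ only in the breaker setting on the last circuit, which is the kind of fault that a full-load test of the whole chain would be expected to expose.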
Hacker News readers were quick to point out some of the flaws in the logic. One suggested that while Amazon had a "correct setup" of generator fallback rather than a battery solution, it failed in the testing department.
And then it got awfully geeky, terribly quickly.
The power failed: it's as simple as that. That should be blame-game reason number one. Yet reasons two, three and four --- and on to the nth degree --- come down to poor testing of the series of backup power systems.
At least Amazon had the guts to flat-out admit it. One thing prevails over all others: Amazon kept its customers in the loop, which says a lot when compared with many other major cloud providers --- naming no names.