Amazon Web Services has blamed the severe storms that tore through Sydney last weekend for the power outage that affected a number of EC2 instances and EBS volumes within the AP-Southeast-2 region.
"Our utility provider suffered a loss of power at a regional substation as a result of severe weather in the area," Amazon Web Services said in a statement it released on its website.
As a result, AWS' Sydney service suffered connectivity issues, leaving the company working until Monday morning to restore service.
The company explained that when utility power fails, the electrical load is normally maintained by multiple layers of power redundancy; during the weekend's outage, however, the affected instances lost both their primary and secondary power amid an "unusually long voltage sag".
"This failure resulted in a total loss of utility power to multiple AWS facilities. In one of the facilities, our power redundancy didn't work as designed, and we lost power to a significant number of instances in that Availability Zone," the company said.
Domino's Pizza, Foxtel Play, Foxtel Go, and Stan were among the AWS customers impacted by the disruption.
AWS said it restored power to the facility within an hour of the loss and was able to recover the majority of the affected instances; only a small number of instances and volumes took longer to bring back.
The company explained that this small number of instances and volumes sat on hardware with failed hard drives, which destroyed the data stored on those servers; as a result, it could not restore the volumes automatically and had to manually recover the damaged storage servers.
"This is a slow process, which is why some volumes took much longer to return to service," the company said.
AWS has acknowledged that following this event it will need to enhance its design to prevent similar outages, and plans to roll out changes to the Sydney region in July.
These changes will include adding breakers to allow generators to activate faster; reducing the latency of its recovery systems; and regularly testing its recovery processes on unoccupied, long-running hosts. The company has also committed to making its APIs more resilient to failures.
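While the hardening described above is on AWS' side, customers affected by outages like this commonly protect themselves against transient API failures with client-side retries. The sketch below is a generic exponential-backoff-with-jitter pattern, not part of any AWS SDK; the function and parameter names are illustrative assumptions.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky zero-argument callable with exponential backoff and full jitter.

    Illustrative helper, not an AWS API: `operation` is any callable that
    raises on transient failure and returns a value on success.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the final error
            # Full jitter: sleep a random duration up to the capped exponential delay,
            # so many retrying clients don't hammer a recovering service in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

In practice a caller would wrap an individual API request, e.g. `call_with_backoff(lambda: client.describe_instances())`, and tune the delays to the service's documented retry guidance.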