Amazon has told its EC2 customers in Europe that some of them could face outages of as long as 24 to 48 hours as the cloud provider struggles to recover from a lightning strike that disrupted power to its Dublin, Ireland data center. It took three hours to recover the first of the affected instances yesterday evening European time (midday Pacific time), and after almost 12 hours a quarter of them remained offline, with knock-on effects likely to slow their recovery further. From Amazon's status page (12:08am PDT update):
"Due to the scale of the power disruption, a large number of EBS servers lost power and require manual operations before volumes can be restored. Restoring these volumes requires that we make an extra copy of all data, which has consumed most spare capacity and slowed our recovery process. We've been able to restore EC2 instances without attached EBS volumes, as well as some EC2 instances with attached EBS volumes. We are in the process of installing additional capacity in order to support this process both by adding available capacity currently onsite and by moving capacity from other availability zones to the affected zone. While many volumes will be restored over the next several hours, we anticipate that it will take 24-48 hours until the process is completed. In some cases EC2 instances or EBS servers lost power before writes to their volumes were completely consistent. Because of this, in some cases we will provide customers with a recovery snapshot instead of restoring their volume so they can validate the health of their volumes before returning them to service. We will contact those customers with information about their recovery snapshot."
The outage struck servers in one of the three availability zones in the EU-WEST-1 region, but recovery efforts have had knock-on effects on capacity in the other two zones. The Relational Database Service (RDS) is also badly affected. EU-WEST-1 is Amazon's only region in Europe, which means that customers who must keep their data within Europe for data protection compliance have no other Amazon location available as a failover target.
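The bind those customers are in can be sketched as a simple failover-selection rule. This is a hypothetical illustration, not Amazon's API: the zone names mirror AWS conventions, but the health flags and the `failover_targets` helper are invented for the example.

```python
# Hypothetical sketch (not Amazon's API) of picking a failover location
# under a data-residency constraint. Health flags here are illustrative.
ZONES = {
    "eu-west-1": {"eu-west-1a": False, "eu-west-1b": True, "eu-west-1c": True},
    "us-east-1": {"us-east-1a": True, "us-east-1b": True},
}

EU_REGIONS = {"eu-west-1"}  # the only EU region at the time of the outage

def failover_targets(eu_only):
    """Return healthy zones a workload could move to, honoring residency."""
    targets = []
    for region, zones in ZONES.items():
        if eu_only and region not in EU_REGIONS:
            continue  # data-protection compliance rules out non-EU regions
        targets.extend(z for z, healthy in zones.items() if healthy)
    return targets

# An EU-restricted customer can only shuffle workloads between the two
# surviving zones of the same region -- zones the recovery effort has
# itself left short of capacity.
print(failover_targets(eu_only=True))   # only eu-west-1b and eu-west-1c
print(failover_targets(eu_only=False))  # unrestricted customers see more
```

The point the sketch makes: with a single EU region, the compliance filter and the outage together leave almost nothing to fail over to.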
How the outage happened, from Amazon's status page history:
"We understand at this point that a lightning strike hit a transformer from a utility provider to one of our Availability Zones in Dublin, sparking an explosion and fire. Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators. The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them. Power sources must be phase-synchronized before they can be brought online to load. Bringing these generators online required manual synchronization. We've now restored power to the Availability Zone and are bringing EC2 instances up. We'll be carefully reviewing the isolation that exists between the control system and other components. The event began at 10:41 AM PDT with instances beginning to recover at 1:47 PM PDT."
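The phase-synchronization requirement Amazon describes can be illustrated numerically: two AC sources may only be paralleled when their voltage phase angles agree closely, otherwise large circulating currents flow at the moment the breaker closes. A toy sketch, with function names and the 10-degree tolerance chosen purely for illustration (real synchronizing relays also check frequency and voltage magnitude):

```python
# Toy illustration of generator phase synchronization -- not Amazon's
# control logic. The tolerance value is illustrative, not a standard.
SYNC_TOLERANCE_DEG = 10.0

def phase_difference_deg(theta_a, theta_b):
    """Smallest angular difference between two phase angles, in degrees."""
    diff = (theta_a - theta_b) % 360.0
    return min(diff, 360.0 - diff)

def safe_to_close_breaker(theta_bus, theta_generator):
    """True if the generator may be connected to the bus."""
    return phase_difference_deg(theta_bus, theta_generator) <= SYNC_TOLERANCE_DEG

print(safe_to_close_breaker(0.0, 5.0))    # nearly in phase -> True
print(safe_to_close_breaker(0.0, 170.0))  # badly out of phase -> False
```

With the automated synchronizer disabled by the transient, that check had to be performed manually for each generator before it could take load, which is what stretched out the restoration.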
In what seems to be a typical pattern when Amazon experiences large-scale outages, its customers have been complaining that too little information is coming out to help them recover. "With AWS it is more a process of figuring it out through trail and error with little or poor feedback from Amazon," wrote one poster to a thread about the outage on its discussion boards. "I hope they get the remaining instances up but from their service dashboard it says 24-48 hours. This can totally ruin my company."