Google has told Compute Engine customers to build redundancy into their cloud apps after a major power outage hit its European datacentre.
Google said the outage at its European Google Compute Engine (GCE) datacentre was caused by four consecutive lightning strikes that hit the grid powering it last Thursday. While the company has taken full responsibility for the outage, it has also told customers who suffered downtime they shouldn't be relying on a single compute zone.
The incident also caused the permanent loss of some data from 'persistent disks' -- the term it uses for a certain type of storage for virtual machine instances hosted on GCE -- in its europe-west1-b zone, located in St Ghislain, Belgium.
Google said in its analysis of the outage that between 13 August and 17 August, a small portion of the disks were sporadically returning I/O errors to their attached GCE instances as well as experiencing errors in snapshot creation. In total, five percent of its 'standard persistent disks' in the zone experienced the issues. However, by Monday, "less than 0.000001 percent of the space of allocated persistent disks" remained affected, Google said.
While the company does have battery backup systems that kicked in automatically during the power outage, Google has admitted that older storage hardware that it had yet to upgrade was more susceptible to power failure than its newer kit.
"Some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain," it said.
"In almost all cases the data was successfully committed to stable storage, although manual intervention was required in order to restore the systems to their normal serving state. However, in a very few cases, recent writes were unrecoverable, leading to permanent data loss on the Persistent Disk," it added.
Google said it had already been in the process of upgrading that storage hardware prior to the incident and that most of its persistent disk storage is already running on the later hardware.
Following the incident, Google said it will continue the hardware upgrade, implement systems to improve persistent disk resilience, and improve response procedures for engineers.
One of the last major incidents affecting GCE occurred in February when instances lost outbound network connectivity due to an error in its virtual network.
According to the European Union Agency for Network and Information Security (ENISA), bad weather and other natural phenomena were the top cause of lengthy system outages in 2013, mostly due to power outages and cable cuts.