Google: 'Sorry for wide-scope outage but canary testing brought our cloud down'

Google's recent cloud outage was caused by updates getting stuck in its 'canary' test deployment.
Written by Liam Tung, Contributing Writer

A botched software update triggered last month's two-hour outage affecting Google Compute Engine (GCE) instances, cloud VPNs, and network load balancers.

While the incident wasn't as serious as a past network outage, Google had promised a full explanation due to the "wide scope" of this one, which dropped connections to all GCE instances, cloud VPN tunnels and network load balancers that were created or live-migrated on Monday, January 30.

"We apologize for the wide scope of this issue and are taking steps to address the scope and duration of this incident as well as the root cause itself," said Google's Cloud Platform engineers.

A "large set" of updates to Google's load-balancing gear triggered the incident, but the outage itself was caused by those updates getting jammed during testing inside a canary deployment.

"All inbound networking for GCE instances, load balancers and VPN tunnels enter via shared layer 2 load balancers. These load balancers are configured with changes to IP addresses for these resources, then automatically tested in a canary deployment, before changes are globally propagated," explained Google.

"The issue was triggered by a large set of updates, which were applied to a rarely used load-balancing configuration. The application of updates to this configuration exposed an inefficient code path, which resulted in the canary timing out. From this point all changes of public addressing were queued behind these changes that could not proceed past the testing phase," it added.

Google's short-term response was to increase the canary timeout so that, if the same series of errors occurs again, it will only slow network changes rather than stop them completely. Over the longer term, it plans to improve the inefficient code path.
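
The failure mode Google describes, and the effect of a longer canary timeout, can be illustrated with a toy model. This is only a sketch under stated assumptions, not Google's implementation: the function `propagate`, the change names, and all timings are hypothetical, and the model assumes a strictly ordered queue in which a change that exceeds the canary timeout never leaves the testing phase.

```python
from collections import deque


def propagate(changes, canary_timeout):
    """Simulate a strictly ordered queue of address changes (toy model).

    changes: list of (name, canary_seconds) pairs; all values are made up.
    canary_timeout: longest canary run, in seconds, that is allowed to pass.
    Returns (applied, stuck, elapsed_seconds).
    """
    queue = deque(changes)
    applied = []
    elapsed = 0
    while queue:
        name, canary_seconds = queue[0]
        if canary_seconds > canary_timeout:
            # The canary times out, so this change cannot proceed past
            # testing, and every change queued behind it waits as well.
            break
        queue.popleft()
        elapsed += canary_seconds
        applied.append(name)
    stuck = [name for name, _ in queue]
    return applied, stuck, elapsed


# Hypothetical workload: one large batch that hits a slow code path.
changes = [("small-1", 2), ("large-batch", 90), ("small-2", 2), ("small-3", 2)]

short_timeout = propagate(changes, canary_timeout=30)
long_timeout = propagate(changes, canary_timeout=120)

print(short_timeout)  # the large batch blocks everything behind it
print(long_timeout)   # every change is applied, only more slowly
```

With the shorter timeout in this sketch, one slow change stalls every change behind it. With the longer timeout, the same workload completes and the slow change only adds delay, which matches the trade-off Google describes for its short-term fix.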

Google has also begun work to "replace global propagation of address configuration with decentralized routing".

"This work is being accelerated as it will prevent issues with this layer having global impact," it said.
