If you had trouble accessing certain apps on Monday, there's a chance it was caused by a brief but widespread outage on Google's infrastructure-as-a-service offering, Google Compute Engine (GCE).
Even small Google service outages draw attention if they affect its own products such as Gmail or Drive. But on Monday what hit a brick wall was GCE, the Google infrastructure that powers virtual servers rented out by businesses and app developers.
The outage only lasted 18 minutes, but it knocked out GCE instances in all regions. That's not good news for customers who would expect Google's multi-region datacenters to offer some failover capability.
But that option wasn't available to customers on Monday. That failing explains why Google vice president of 24x7 Benjamin Treynor Sloss on Wednesday wrote an unusually detailed account of what wrong.
"We take all outages seriously, but we are particularly concerned with outages which affect multiple zones simultaneously because it is difficult for our customers to mitigate the effect of such outages," Sloss said.
"This incident report is both longer and more detailed than usual precisely because we consider the April 11 event so important, and we want you to understand why it happened and what we are doing about it."
In short, a routing issue knocked out GCE instances in all regions, as well as dependent VPNs and L3 network load balancers.
The outage didn't affect Google Cloud Platform (GCP) itself, but did hit GCP applications. Affected GCP customers will get service credits equal to respectively 10 percent and 25 percent of GCE and VPN monthly charges.
Sloss describes a cascading set of problems that occurred due to two bugs in its network-management software, which tripped up two safeguards when engineers propagated a faulty configuration across its network.
A few hours before the outage, Google's engineers removed one unused GCE IP block from the network configuration, which is usually a harmless change in itself, according to Sloss. These IP blocks allow systems outside Google's network to find GCP services.
The engineers attempted to propagate the new configuration, which Google's network-configuration software detected was problematic due to a "timing quirk in the IP block removal".
But because of the software bug in its maintenance software, instead of reverting to a known, safe configuration, it removed all GCE IP blocks from the new configuration and pushed that out to the network.
The second safeguard that failed was Google's 'canary step', which would have, in the absence of bug number two, ensured that a new configuration is constrained at one site until its proven to be safe before a broader rollout.
"In this event, the canary step correctly identified that the new configuration was unsafe. Crucially, however, a second software bug in the management software did not propagate the canary step's conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout."
The series of system errors meant that by 7.09pm PT, the time at which the outage commenced, "Inbound traffic from the internet to GCE dropped quickly, reaching >95 percent loss," Sloss said.
That's when Google engineers, who only 20 minutes earlier were looking into an issue they thought was localized to Google's asia-east1 VPN, knew they had a bigger problem on their hands. By 7.21pm, Google had issued an alert that it was experiencing "severe network connectivity issues in all regions".
Sloss said Google engineers will be working over the next few weeks on further prevention, detection and mitigation systems to bolster its production safeguards.