When Google migrates apps between its cloud datacenters, the tech giant describes the process as normally being "graceful". A recent move, however, proved to be anything but.
The mishap, which Google has traced back to a router update and datacenter automation gone awry, triggered a nearly two-hour disruption to Google App Engine services on August 11.
The incident resulted in just over one-fifth of apps hosted on Google App Engine's US-CENTRAL region suffering from errors at a significantly higher rate than usual. Google has apologized and has implemented plans to ensure the failure doesn't happen again.
"On Thursday 11 August 2016 from 13:13 to 15:00 PDT, 18 percent of applications hosted in the US-CENTRAL region experienced error rates between 10 percent and 50 percent, and three percent of applications experienced error rates in excess of 50 percent," Google explained in a post-mortem of the incident published on Tuesday.
A larger group of apps was affected less severely, with elevated error rates and slower responses that end users may have noticed as sluggish load times: "the 37 percent of applications which experienced elevated error rates also observed a median latency increase of just under 0.8 seconds per request", Google said. The remaining 63 percent of apps in the region were not impacted.
According to Google's root cause analysis, the outage was self-inflicted. It occurred during a routine rebalancing of traffic between regional datacenters, in this case US-CENTRAL. The procedure normally involves draining traffic from servers in one facility and redirecting it to another, where apps restart automatically on newly-provisioned servers.
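The drain-and-redirect step can be pictured as gradually shifting traffic weight from one datacenter to another. The sketch below is a deliberately simplified model with invented names (`drain`, `dc-a`, `dc-b`); Google's actual tooling is not public.

```python
# Simplified model of a traffic drain between datacenters
# (hypothetical structure; not Google's actual tooling).

def drain(weights: dict[str, float], source: str, target: str,
          step: float) -> dict[str, float]:
    """Shift up to `step` of the traffic weight from `source` to `target`."""
    moved = min(step, weights[source])
    new = dict(weights)
    new[source] -= moved
    new[target] += moved
    return new

weights = {"dc-a": 1.0, "dc-b": 0.0}
while weights["dc-a"] > 0:
    weights = drain(weights, "dc-a", "dc-b", 0.25)
    # ...in the real system, apps restart on newly-provisioned
    # servers in dc-b as their traffic share arrives.
print(weights)  # {'dc-a': 0.0, 'dc-b': 1.0}
```

The key property is that the drain is incremental, so the receiving datacenter can absorb restarts gradually rather than all at once.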
However, on this occasion, the handoff occurred while a software update on traffic routers was underway. Google didn't foresee that doing both simultaneously would overload its traffic routers.
"This update triggered a rolling restart of the traffic routers. This temporarily diminished the available router capacity," Google said.
"The server drain resulted in rescheduling of multiple instances of manually-scaled applications. App Engine creates new instances of manually-scaled applications by sending a startup request via the traffic routers to the server hosting the new instance.
"Some manually-scaled instances started up slowly, resulting in the App Engine system retrying the start requests multiple times which caused a spike in CPU load on the traffic routers. The overloaded traffic routers dropped some incoming requests.
"Although there was sufficient capacity in the system to handle the load, the traffic routers did not immediately recover due to retry behavior which amplified the volume of requests."
Google added that a few things did go as planned during the mishap. Its engineers attempted to roll back the update within 11 minutes of noticing a problem, though by then it was too late to reverse the CPU overload. And while App Engine automatically redirected requests to other datacenters, Google's engineers still had to manually redirect traffic to mitigate the issue.
During the recovery, Google's engineers also discovered a "configuration error that caused an imbalance of traffic in the new datacenters". Fixing this resolved the incident at 15:00 PDT.
Google has now added more traffic-routing capacity to support future load-balancing efforts and has changed its automated rescheduling routines to avoid a repeat.
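Google doesn't say exactly how the rescheduling routines changed, but the textbook defense against this kind of retry storm is capped exponential backoff with jitter, so that restart requests spread out rather than piling onto a recovering router. A generic sketch, with illustrative parameters that are not Google's:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay (seconds) before retry number `attempt` (1-based).

    Doubles the window each attempt, caps it, and picks a random point
    in the window ("full jitter") so synchronized clients don't retry
    in lockstep.
    """
    window = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0.0, window)

# Successive retries back off over ever-wider randomized windows.
for n in range(1, 5):
    print(f"retry {n}: wait up to {min(30.0, 0.5 * 2 ** (n - 1))}s")
```

The jitter matters as much as the backoff: without it, thousands of instances that failed at the same moment would all retry at the same moment, too.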
"We know that you rely on our infrastructure to run your important workloads and that this incident does not meet our bar for reliability. For that we apologize," Google said.
The August 11 outage followed a several-hour App Engine disruption in December, which occurred while Google was migrating the Google Accounts system to new storage equipment. That outage was reportedly behind Snapchat's brief problems in receiving new posts from users.
In April, Google blamed an 18-minute worldwide outage of Google Compute Engine, its infrastructure-as-a-service offering, on two software bugs in its networking gear.