Google: Here's what caused Sunday's big outage

An incorrect server configuration change strangled network capacity in multiple regions.

Google has provided some details about the cause of Sunday's massive outage, which affected major tech brands that use Google Cloud as well as Google's own services, including YouTube, Gmail, Google Search, G Suite, Google Drive, and Google Docs.

Benjamin Treynor Sloss, Google's VP of engineering, explained in a blog post that the root cause of Sunday's outage was a configuration change intended for a small group of servers in one region that was wrongly applied to a larger number of servers across several neighboring regions.

That error caused those regions to stop using more than half of their available network capacity.

The impact was severe for high-bandwidth platforms like YouTube but less so for low-bandwidth services like Google Search, which saw only a brief increase in latency. 

"Overall, YouTube measured a 10 percent drop in global views during the incident, while Google Cloud Storage measured a 30 percent reduction in traffic," said Sloss. 

"Approximately one percent of active Gmail users had problems with their account; while that is a small fraction of users, it still represents millions of users who couldn't receive or send email."

The Google Cloud status dashboard states that Google Cloud Networking experienced network congestion in the eastern US, affecting Google Cloud, G Suite, and YouTube. The disruption lasted four hours, with the issue resolved at 4pm Pacific Time.

Sloss explained that the regions operating with constrained capacity became clogged as they attempted to cram inbound and outbound traffic into the capacity that remained.

"The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam," he noted. 
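The triage Sloss describes can be illustrated with a minimal Python sketch. It is not Google's implementation, which the post does not detail. The `Flow` type, the `triage` function, and the bandwidth figures are hypothetical, and the sketch assumes a single link with a fixed capacity that admits latency-sensitive, smaller flows first and drops whatever no longer fits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """A hypothetical traffic flow competing for constrained capacity."""
    name: str
    size: float              # bandwidth demand, in arbitrary units
    latency_sensitive: bool


def triage(flows, capacity):
    """Split flows into (admitted, dropped) for a link with limited capacity.

    Latency-sensitive flows are considered first, and smaller flows before
    larger ones, so large bulk traffic is the first to be dropped.
    """
    ordered = sorted(flows, key=lambda f: (not f.latency_sensitive, f.size))
    admitted, dropped, used = [], [], 0.0
    for flow in ordered:
        if used + flow.size <= capacity:
            admitted.append(flow)
            used += flow.size
        else:
            dropped.append(flow)
    return admitted, dropped
```

With made-up demands of 80 units for video, 40 for storage, 5 for mail, and 1 for search on a link cut to 50 units, this sketch admits search, mail, and storage and drops video, which loosely mirrors the pattern of high-bandwidth services suffering most.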

And while Google's engineers detected the issue "within seconds", remediating the problem took "far longer" than the company's target of a few minutes, in part because the network congestion hampered engineers' ability to restore the correct configurations.

Additionally, as one Google employee explained in a Hacker News post, the disruption took down internal tools that Google engineers had been using to communicate with each other about the outage.

Sloss's post isn't the full post-mortem report the company has promised to provide to customers. That investigation is still underway and aims to uncover all the contributing factors behind the network capacity loss and the slow restoration.
