Well, wasn't that fun? For about two hours, Google and its main services, such as Gmail, Google Docs, and YouTube, were down for US East Coast users. So, what happened?
It was not, as many feared, because Google had faltered under the load of coronavirus traffic. Urs Hölzle, senior vice president of technical infrastructure at Google Cloud, said, "We're very sorry about that! We had a router failure in Atlanta, which affected traffic routed through that region. Things should be back to normal now. Just to make sure: This wasn't related to traffic levels or any kind of overload, our network is not stressed by."
According to the Google Cloud Status Dashboard:
Some of our users experienced a service disruption today, as a result of a significant router failure at 08:18am Pacific in one of our data centers in Atlanta, causing network congestion. As a result, Google services running in that data center were directly impacted and were unavailable until our engineers rerouted the traffic and moved those services to alternate facilities. Users in the South Eastern US may also have seen temporary difficulties in accessing a wider range of Google services due to the network congestion.
A little over half an hour later, the dashboard reported, "The majority of directly impacted Google services were moved to alternative data centers by 8:50am Pacific, and networking impact was mitigated by 9:21am Pacific, with some services taking longer to recover."
Needless to say, Google is "working on mitigating the issue and taking steps to avoid a recurrence."
ThousandEyes, an internet and networking monitoring company, reported that Google's explanation holds water.
For about 20 minutes (15:35 - 15:55 UTC), East Coast users couldn't reach Google services due to a 100% traffic loss. That exactly aligns with the reported root cause of an Atlanta-based router failure.
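To see how the two timelines line up, here is a rough Python sketch that puts Google's Pacific times and the ThousandEyes UTC window on the same clock. It assumes Pacific Daylight Time (UTC-7) was in effect, which neither statement says outright, and the helper names are purely illustrative.

```python
from datetime import datetime, timedelta

# Assumption: the dashboard's "Pacific" times are PDT, i.e. UTC-7.
PDT_TO_UTC = timedelta(hours=7)


def pacific(clock: str) -> datetime:
    """Parse a dashboard time such as '8:50am' and shift it to UTC."""
    return datetime.strptime(clock, "%I:%M%p") + PDT_TO_UTC


def utc(clock: str) -> datetime:
    """Parse a 24-hour UTC time such as '15:35'."""
    return datetime.strptime(clock, "%H:%M")


def minutes(delta: timedelta) -> int:
    """Whole minutes in a time difference."""
    return int(delta.total_seconds() // 60)


failure = pacific("08:18am")      # router failure
moved = pacific("8:50am")         # most services moved elsewhere
mitigated = pacific("9:21am")     # networking impact mitigated
loss_start, loss_end = utc("15:35"), utc("15:55")  # ThousandEyes window

print(failure.strftime("%H:%M"), "UTC: router failure")
print(minutes(moved - failure), "minutes until most services were moved")
print(minutes(mitigated - failure), "minutes until networking was mitigated")
print(minutes(loss_end - loss_start), "minutes of 100% traffic loss")
print(minutes(loss_start - failure), "minutes from failure to total loss")
```

Under that assumption, the 20-minute window of total traffic loss falls inside the roughly hour-long span between the router failure and Google's reported network mitigation.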
The effect of the router crash was felt far outside the East Coast. Other users throughout the US were also impacted. Even Google's main search site was affected. These users intermittently saw HTTP 500 server errors. "These errors are consistent," Angelique Medina, ThousandEyes's Director of Product Marketing, said, "with an inability to reach the backend systems necessary to correctly load various services. Any traffic traversing the affected region -- connecting from Google's front-end servers to backend services -- would have been impacted."
This explains why even users on the West Coast saw some service failures.
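As a rough illustration of the failure mode Medina describes, the toy Python sketch below models a front end that answers with a 500 whenever its backend call has to cross a failed region. The region names, routes, and function names are all hypothetical; this is a simplified model, not a description of Google's actual architecture.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """A hypothetical path from a front-end server to a backend service."""
    frontend: str
    transit_regions: tuple[str, ...]


def handle_request(route: Route, failed_regions: set[str]) -> int:
    """Return 500 if the backend path crosses a failed region, else 200."""
    if any(region in failed_regions for region in route.transit_regions):
        return 500
    return 200


# Hypothetical routes: the same West Coast front end may reach one backend
# locally and another through the failed region.
routes = [
    Route("west-coast-frontend", ("us-west",)),
    Route("west-coast-frontend", ("us-west", "atlanta")),
    Route("east-coast-frontend", ("atlanta",)),
]
failed = {"atlanta"}

statuses = [handle_request(route, failed) for route in routes]
print(statuses)  # [200, 500, 500]
```

In this toy model the same West Coast front end succeeds on one request and fails on another, depending only on where the backend sits, which is one way errors could show up intermittently far from Atlanta.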
As we worry about just how much of a load the internet can take as many of us move to working from home and video-conferencing replaces meetings, this is an unwelcome reminder that the internet is not as stable as we'd like. Yes, this particular instance didn't have anything to do with the coronavirus. But, if all it takes is one major router failing to knock out Google for tens of millions of users, that's worrisome. And, these days, we don't need any more worries.