'All of this happened very fast': How a routine event brought everything down at Facebook

Facebook has a slightly mundane but detailed explanation for an exceptional outage that took down Facebook Messenger, WhatsApp, Facebook, and Instagram.

You couldn't use WhatsApp or Instagram because Facebook's data centers were completely disconnected after a seemingly mundane event caused a catastrophic outage. 

Facebook's hours-long outage on Monday was a stark reminder of how dependent people have become on one company's data centers, which power the world's biggest social networks.

As ZDNet's Steven J. Vaughan-Nichols reported this week, Facebook's servers for its internet address book, the Domain Name System (DNS), weren't reachable, making Facebook, WhatsApp and Instagram unavailable for reasons that most of its two billion users won't understand.

Facebook's DNS was broken because the Border Gateway Protocol (BGP) routes into Facebook's sites had been withdrawn. DNS translates names like 'facebook.com' into numerical IP addresses, while BGP 'advertises' the routes to those addresses across the internet. PCs and smartphones need both to connect to a website.
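
To make the two-step dependency concrete, here is a minimal Python sketch. It is a toy model, not how Facebook's systems or real routers work: the hostname, the addresses (taken from a range reserved for documentation) and the function names are all assumptions made for illustration.

```python
import ipaddress

# Hypothetical sample data: a made-up hostname and a documentation-only
# address range, not taken from any real network.
DNS_RECORDS = {"example.test": "192.0.2.10"}
ADVERTISED_PREFIXES = [ipaddress.ip_network("192.0.2.0/24")]


def resolve(name, records):
    """Step 1 (DNS): map a hostname to an IP address, or None if unknown."""
    return records.get(name)


def has_route(address, prefixes):
    """Step 2 (BGP): check whether an advertised prefix covers the address."""
    ip = ipaddress.ip_address(address)
    return any(ip in prefix for prefix in prefixes)


def can_connect(name, records, prefixes):
    """A client needs both a DNS answer and a route to the answered address."""
    address = resolve(name, records)
    return address is not None and has_route(address, prefixes)
```

With the sample data, can_connect("example.test", DNS_RECORDS, ADVERTISED_PREFIXES) returns True. Pass an empty list of prefixes and it returns False, even though the DNS record itself has not changed.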

The social media giant has now offered a more detailed account of what caused the world's biggest messaging system to vanish for hours on Monday. The incident highlights how Facebook itself has become a single point of failure for global messaging.

The incident itself, however, has a boring explanation. Facebook's labyrinthine networks cracked because a "routine" maintenance job went awry in a way that its networks and data centers weren't built to handle. It ultimately caused a "complete disconnection" between Facebook data centers and the internet, which made Facebook, WhatsApp and Instagram inaccessible. 

"This outage was triggered by the system that manages our global backbone network capacity," explained Santosh Janardhan, vice president of engineering at Facebook, in a blog post titled "More details about the October 4 outage".

"The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers," he continued.

"This was the source of yesterday's outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally."

The outage revealed how heavily people depend on Facebook's infrastructure, and it follows the company's efforts to merge WhatsApp, Facebook Messenger, and Instagram messaging.

It also happened as Facebook was being scrutinized at a Senate inquiry over ethics and its handling of misinformation on its platforms. That inquiry followed a leak of internal documents, published by the Wall Street Journal last month, revealing among other things that Facebook knew Instagram made body-image issues worse for one in three teenage girls.

Janardhan acknowledged that Facebook's infrastructure wasn't equipped to deal with how quickly events unfolded.

"All of this happened very fast," he admitted.

"To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection," he explained.

"In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers."
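
The failure mode Janardhan describes can be sketched in a few lines of Python. This is a simplified model based only on the quotes above, not Facebook's actual code: the DnsSite class, its fields and the site names are hypothetical, and the health check is reduced to a single yes/no flag for the backbone.

```python
from dataclasses import dataclass


@dataclass
class DnsSite:
    """One DNS location in the toy model (all names are hypothetical)."""
    name: str
    server_running: bool = True  # the DNS software itself is up
    advertising: bool = True     # the site is announcing its BGP route

    def health_check(self, backbone_up: bool) -> None:
        # Assumed rule, paraphrasing the quote: keep advertising only
        # while the site can still reach the data centers.
        self.advertising = backbone_up


def run_checks(sites, backbone_up):
    """Apply the health check to every site."""
    for site in sites:
        site.health_check(backbone_up)


def reachable_sites(sites):
    """Sites the rest of the internet can find: running AND advertised."""
    return [s.name for s in sites if s.server_running and s.advertising]


sites = [DnsSite("edge-a"), DnsSite("edge-b")]
run_checks(sites, backbone_up=False)  # the whole backbone goes down at once
```

After the final line, reachable_sites(sites) is an empty list while every site's server_running flag is still True. A safeguard meant to pull one unhealthy location out of service removes all of them when the backbone fails everywhere at the same time.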