Facebook's giant outage: This change caused all the problems

Facebook says a configuration issue knocked its social media apps offline on Monday, October 4.
Written by Owen Hughes, Senior Editor

Facebook blamed its six-hour outage on Monday on a faulty configuration change that affected its vast social media platforms and internal systems.

Facebook, alongside WhatsApp and Instagram, suffered a global outage on Monday, October 4 that began at approximately 11:44 EDT and dragged on well into the afternoon.

The social media giant's services were back online as of 17:28 EDT.


In a subsequent blog post, Facebook's VP of infrastructure, Santosh Janardhan, said the outage had been caused by a technical issue affecting its Border Gateway Protocol (BGP) routing system, which had "a cascading effect on the way our data centers communicate, bringing our services to a halt."

Monday's outage also affected internal tools at Facebook, which made diagnosing and fixing the problem more difficult, said Janardhan. According to the New York Times, the outage rendered engineers' access cards useless, meaning staff couldn't get into the buildings where the affected servers were housed.

"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication," said Janardhan.

"Our services are now back online and we're actively working to fully return them to regular operations. We want to make clear at this time we believe the root cause of this outage was a faulty configuration change."

BGP was originally designed to interconnect internet service providers across the globe. It now forms the routing backbone of the internet.
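
To make the role of routing announcements concrete, here is a minimal Python sketch of a routing table that forwards traffic only to prefixes that have been announced to it. It is a toy illustration, not a model of Facebook's network or of a real BGP implementation: the RoutingTable class, the "peer-A" next hop, and the 192.0.2.0/24 prefix (a range reserved for documentation) are all assumptions made for this example.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: maps announced prefixes to a next hop (hypothetical)."""

    def __init__(self):
        self.routes = {}  # ip_network -> next-hop label

    def announce(self, prefix, next_hop):
        # A neighbor advertises that it can reach this prefix.
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        # The advertisement is retracted; the prefix is no longer reachable.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Choose the most specific (longest) announced prefix containing the address.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "peer-A")
print(table.lookup("192.0.2.53"))  # prints peer-A

table.withdraw("192.0.2.0/24")
print(table.lookup("192.0.2.53"))  # prints None: no route, so traffic cannot be delivered
```

Once a prefix is no longer announced, a lookup for any address inside it returns nothing, even though the servers behind that address may still be running. This is why a routing configuration error can make healthy systems unreachable.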

Facebook also uses BGP as a foundation for its data center routing design. In a blog post published in May 2021, Facebook researchers said the routing design was intended to allow the company to "build our network quickly and provide high availability of our services, while keeping the design itself scalable."


However, the researchers also noted that BGP "requires tight codesign with the data center topology, configuration, switch software, and data center–wide operational pipeline." Ironically, Facebook's data center routing configuration was designed specifically to minimize the impact of failures.

No user data was compromised in Monday's outage, Facebook said.
