US internet service provider CenturyLink suffered a major technical outage on Sunday after a misconfiguration in one of its data centers created havoc all over the internet.
Due to the technical nature of the outage -- involving both firewall rules and BGP routing -- the error spread outward from CenturyLink's network and also impacted other internet service providers, causing connectivity problems for many other companies.
The list of tech giants that had services go down because of the CenturyLink outage includes big names like Amazon, Twitter, Microsoft (Xbox Live), EA, Blizzard, Steam, Discord, Reddit, Hulu, Duo Security, Imperva, NameCheap, OpenDNS, and many more.
Cloudflare, which was also severely impacted, said CenturyLink's outward-propagating issue led to a 3.5% drop in global internet traffic, which would make this one of the biggest internet outages ever recorded.
According to a CenturyLink status page, the issue originated from CenturyLink's data center in Mississauga, a city near Toronto, Canada.
The telco says the root cause of the incident was an incorrect Flowspec announcement.
Flowspec is an extension to BGP that allows companies to use BGP routes to distribute firewall rules across their network. Flowspec announcements are usually used when dealing with security incidents, such as BGP hijacks or DDoS attacks, as they allow companies to reconfigure their entire network to react to and mitigate attacks within seconds.
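To make the idea concrete, a Flowspec rule can be thought of as a traffic match paired with an action that every router receiving it applies. The Python sketch below is a simplified illustration only: the `FlowspecRule` class, its fields, and the `apply_rules` helper are hypothetical names invented here, and real Flowspec rules are encoded inside BGP messages rather than as Python objects. The addresses come from a range reserved for documentation.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass(frozen=True)
class FlowspecRule:
    """Hypothetical, simplified stand-in for a Flowspec rule."""
    destination: str            # destination prefix to match
    protocol: Optional[int]     # IP protocol number (6 = TCP, 17 = UDP); None = any
    dest_port: Optional[int]    # destination port; None = any
    action: str                 # "discard" or "accept" in this sketch

    def matches(self, dst_ip: str, protocol: int, dest_port: int) -> bool:
        # A packet matches only if every specified field agrees.
        if ip_address(dst_ip) not in ip_network(self.destination):
            return False
        if self.protocol is not None and protocol != self.protocol:
            return False
        if self.dest_port is not None and dest_port != self.dest_port:
            return False
        return True


def apply_rules(rules: List[FlowspecRule], dst_ip: str,
                protocol: int, dest_port: int) -> str:
    """Return the action of the first matching rule, or "accept" if none match."""
    for rule in rules:
        if rule.matches(dst_ip, protocol, dest_port):
            return rule.action
    return "accept"


# Example: drop UDP traffic to port 53 aimed at a victim prefix during an attack.
rules = [FlowspecRule("203.0.113.0/24", protocol=17, dest_port=53, action="discard")]
attack_packet = apply_rules(rules, "203.0.113.10", protocol=17, dest_port=53)
normal_packet = apply_rules(rules, "203.0.113.10", protocol=6, dest_port=443)
```

Because the same rule reaches every router at once, a narrowly scoped rule blunts an attack quickly, while a rule that matches too much can block legitimate traffic just as quickly.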
However, CenturyLink said that its Mississauga data center sent out an incorrect Flowspec announcement that effectively prevented the company's BGP routes from taking root.
Cloudflare, which observed the incident from afar, believes CenturyLink effectively put its entire network into a loop by announcing a brand new set of BGP routes and then accidentally dropping all routes via the misconfigured Flowspec rule.
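The loop Cloudflare describes can be pictured with a toy model. The sketch below is not CenturyLink's actual router behavior: it assumes, purely for illustration, that the bad rule arrives together with the routes and disappears again once the routes are dropped, so the network can start over and repeat the same steps. The `simulate` function and its event strings are hypothetical.

```python
from typing import List


def simulate(cycles: int) -> List[str]:
    """Toy model of a network that keeps re-learning and re-losing its routes."""
    events: List[str] = []
    routes_up = False
    bad_rule_active = False
    for _ in range(cycles):
        if not routes_up:
            # The network announces a fresh set of BGP routes...
            routes_up = True
            events.append("routes announced")
            # ...and the misconfigured rule arrives along with them (assumption).
            bad_rule_active = True
            events.append("bad rule installed")
        if bad_rule_active:
            # The rule drops the routes that were just learned.
            routes_up = False
            events.append("routes dropped")
            # Assumption: the rule goes away with the routes that carried it.
            bad_rule_active = False
            events.append("bad rule withdrawn")
    return events


timeline = simulate(2)
```

In this model the network never reaches a stable state on its own, since each cycle ends where it began. That is consistent with a fix that required outside intervention.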
BGP routes are the glue that keeps the internet up. They are a type of message that internet companies relay between each other. A BGP route tells other internet providers which blocks of IP addresses can be reached through a given network.
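A small Python sketch shows why losing routes means losing connectivity. It models a routing table as a mapping from announced IP prefixes to the neighbor that announced them and picks the most specific matching prefix, which is how routers generally choose between overlapping routes. The `lookup` helper is hypothetical, and the prefixes and AS numbers are taken from ranges reserved for documentation.

```python
from ipaddress import ip_address, ip_network
from typing import Dict, Optional

# Hypothetical routing table: announced prefix -> neighbor that announced it.
routes: Dict[str, str] = {
    "203.0.113.0/24": "AS64496",
    "203.0.113.128/25": "AS64497",
}


def lookup(table: Dict[str, str], dst: str) -> Optional[str]:
    """Return the neighbor for the most specific prefix covering dst, if any."""
    addr = ip_address(dst)
    matching = [p for p in table if addr in ip_network(p)]
    if not matching:
        return None  # no route: the destination is unreachable
    best = max(matching, key=lambda p: ip_network(p).prefixlen)
    return table[best]


before = lookup(routes, "203.0.113.200")   # covered by both prefixes
routes.clear()                             # every route is dropped
after = lookup(routes, "203.0.113.200")    # nothing left to match
```

Once the table is empty, the lookup returns nothing at all: traffic for that address has nowhere to go, even though the destination itself is still running.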
However, as CenturyLink's incorrect Flowspec command brought down some of the routers inside its network, some of those routers also began to announce incorrect BGP routes to neighboring "Tier 1" internet service providers.
This, in turn, brought down other networks in a domino-like effect.
CenturyLink fixed the issue by taking the rare step of telling all other Tier 1 internet providers to de-peer and ignore any traffic coming from its network. Companies rarely make these kinds of decisions, as doing so results in full connectivity loss for all their customers.
wow wow, that must have been one of the biggest Internet wide outages in a while.. @CenturyLink asking other "tier1"s to de-peer.. that shows how bad it must have been, inability to recover.
— Andree Toonk (@atoonk) August 30, 2020
Customers dropping their peering with 3356, but routes not being withdrawn.. #ouch
On L3/CTL’s request, we’ve disabled all peering sessions until the situation is under control. Great to see industry-wide cooperation at what is undoubtably a hard time for AS3356. https://t.co/lbr38IHhyi
— Johan Gustawsson (@Gustawsson) August 30, 2020
All in all, CenturyLink had to reset all equipment and start with clean BGP routing tables, a process that took almost seven hours to complete, from around 12:13 UTC to 18:58 UTC, the company said.
"This was a significant global Internet outage," said Matthew Prince, co-founder & CEO of Cloudflare, in his analysis of the outage.