The good news is that the Internet's behaving itself today. The bad news is that by November we will see another burst of Internet slowdowns and sites becoming unavailable as we did on August 12.
Here's how it all went wrong and why we can expect it to happen again.
We now know for certain that the Internet started having fits because some Tier 1 Internet routers' Border Gateway Protocol (BGP) routing tables had grown too large. The result: These routers could no longer properly handle Internet traffic.
BGP is the Internet's most important high-level protocol. In BGP, each routing domain is designated as an autonomous system (AS). BGP is used to find the shortest routes between hosts, i.e. the ones that take the fewest AS hops. A Tier 1 network has access to the entire Internet BGP routing table. Typically, a Tier 1 network, such as those run by the big three Tier 1 ISPs (Level 3, NTT, and TeliaSonera), provides 10 to 100Gbps Internet connections to Tier 2 and last-mile ISPs.
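To make "fewest AS hops" concrete, here is a minimal Python sketch. The candidate paths and AS numbers are made up for illustration (they come from the private-use AS range), and real routers weigh other attributes, such as local policy, alongside path length, so treat this as a simplification rather than how any particular router decides.

```python
# Hypothetical candidate routes to the same destination. Each list holds the
# AS numbers a packet would cross; the numbers are illustrative private-use ASNs.
candidate_paths = [
    [64512, 64513, 64514, 64515],
    [64512, 64520, 64515],
    [64512, 64530, 64531, 64532, 64515],
]


def shortest_as_path(paths):
    """Return the path that crosses the fewest autonomous systems."""
    return min(paths, key=len)


best = shortest_as_path(candidate_paths)
# Hops are the links between ASes, so one fewer than the number of ASes listed.
print("Preferred path:", best, "with", len(best) - 1, "AS hops")
```

Every one of those candidate paths has to be stored somewhere before a router can compare them, which is where the memory problem comes in.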
To keep track of these high-speed connections, top-level routers make a map of the routes. This map is kept in a specialized kind of memory called Ternary Content Addressable Memory (TCAM). So far, so good. The problem is that some older routers only have enough TCAM to map 512,000 routes. Once a router runs out of TCAM, as Cisco explained in May 2014, networks could see performance degradation, routing instability, and availability problems.
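The arithmetic behind that limit is simple enough to sketch in a few lines of Python. The 512,000 figure is the one quoted above; the function name and the sample table sizes are assumptions chosen for illustration, and the sketch says nothing about how a given router actually treats the routes that don't fit.

```python
TCAM_ROUTE_LIMIT = 512_000  # route capacity cited for some older routers


def split_routes(route_count, limit=TCAM_ROUTE_LIMIT):
    """Return (routes that fit in TCAM, routes left over) for a table size."""
    fits = min(route_count, limit)
    overflow = max(route_count - limit, 0)
    return fits, overflow


# Illustrative table sizes: comfortably under, exactly at, and over the limit.
for table_size in (500_000, 512_000, 515_000):
    fits, overflow = split_routes(table_size)
    print(f"{table_size:,} routes: {fits:,} fit, {overflow:,} overflow")
```

At 515,000 routes, 3,000 of them have nowhere to go in hardware, and that is the point at which the trouble Cisco warned about begins.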
Well, they got that right.
According to BGPMon, a high-level networking traffic monitoring company, there were outages for 2,587 autonomous systems. This was caused by "15,000 new prefixes introduced into the global routing table." Since the full BGP map was hovering just below 500,000, that was enough to push non-upgraded Tier 1 Cisco routers over their TCAM limit. This, in turn, started causing havoc over the entire Internet.
By BGPMon's reckoning, almost all of those 15,000 new prefixes that broke the camel's back originated from Verizon. Andree Toonk, a network engineer and BGPMon's founder, wrote, "Whatever happened internally at Verizon caused aggregation for these prefixes to fail, which resulted in the introduction of thousands of new routes into the global routing table. This caused the routing table to temporarily reach 515,000 prefixes and that caused issues for older Cisco routers."
Fortunately, Toonk continued, "Verizon quickly solved the de-aggregation problem, so we’re good for now. However the Internet routing table will continue to grow organically and we will reach the 512,000 limit soon again."
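The de-aggregation Toonk describes can be illustrated with Python's standard ipaddress module. The 10.0.0.0/16 block and the split into /24s are assumptions picked for the example; the prefixes actually involved in the incident are not given here.

```python
import ipaddress

# Hypothetical aggregate: one announcement covering a whole /16.
aggregate = ipaddress.ip_network("10.0.0.0/16")

# If aggregation fails, the same address space may be announced as /24s.
more_specifics = list(aggregate.subnets(new_prefix=24))
print("Routes when aggregated:   ", 1)
print("Routes when de-aggregated:", len(more_specifics))

# Re-aggregating collapses the contiguous /24s back into the single /16.
collapsed = list(ipaddress.collapse_addresses(more_specifics))
print("Routes after re-aggregation:", len(collapsed))
```

One route becoming 256 shows how a single network's aggregation failure can add thousands of entries to everyone else's routing table in a hurry.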
Verizon was asked to comment on this but had not replied by the time this story was published.
It's that last bit about the BGP table continuing to grow that makes me certain we'll see this problem reappear. By October, it's estimated, we'll be over 512,000 records once and for all.
The problem behind the technical problem is ignorance. Warren Kumari, a Google senior network security engineer, wrote on the North American Network Operators Group (NANOG) mailing list: "Sadly enough, not everyone knew about the issue - there are a large number of folk running BGP on [Cisco] 65xx and taking full tables who are not plugged into NANOG / the community. In many cases they are single-homed enterprise folk, but run BGP anyway, because some consultant set it up, some employee with clue did it years ago and then left, etc."
Other network engineers report that some service providers made the necessary correction in some, but not all, of their routers. This leads to a "lather, rinse, repeat" cycle of trying to recover: a fixed router starts feeding routes to a not-fixed router, and the not-fixed router falls back into the state of dropping routes.
In short, ignorance certainly played a role in this problem's proliferation.
Kumari continued that some network engineers did know about the issue, "but convincing management to spend the cash to buy hardware that doesn't suck was hard, because 'everything is working fine at the moment.' Some folk needed things to fail spectacularly to be able to justify shelling out the $$$."
So, there you have it. A little incompetence, some people who simply didn't know they were running out of memory, and not enough money all led to the Internet having a coughing fit. While some ISPs will doubtless have learned their lesson from this experience, I am dead certain that we will see another round of problems appear when we finally go over the 512,000 barrier once and for all.