If you found your Internet speed has been pathetic today and some sites wouldn't load at all, you're not alone.
Many tier-one Internet service providers (ISPs) and, in turn, the last-mile ISPs they support experienced technical problems that resulted in bad service throughout the US and some parts of Canada.
According to postings in the North American Network Operators Group (NANOG) mailing list, the professional association for Internet engineering and architecture, there have been "major problems with multiple ISPs since around 4-5 AM EST."
According to NANOG, and complaints tracker DownDetector, many Internet providers — including Level 3, AT&T, Cogent, Sprint, Verizon, and others — have suffered from serious performance problems at various times on Tuesday.
And they won't be the last.
Most of the ISPs have not commented on these disruptions. Level 3, in a statement, did say, "Our network is currently experiencing limited service disruptions affecting some of our customers. Ensuring the stability of our network and communications services is our primary concern and we are dedicated to minimizing impact to our customers. Our technicians are currently working to restore services as quickly as possible, and we are in close contact with affected customers."
As a result of these problems, some Web hosting companies, such as LiquidWeb, and the sites they host have been effectively knocked offline.
The company reported on Twitter that the problem at first appeared to be the result of maintenance work, saying a "large network provider is performing maintenance."
While ISP maintenance may have been a factor, the real problem was that Border Gateway Protocol (BGP) routing tables had grown too large for some top-level Internet routers to handle. As a result, these routers could no longer properly handle Internet traffic.
BGP is the routing protocol used to share the master routes, or map, of the Internet. On top of this the Domain Name System (DNS) is layered so that when you click on "www.zdnet.com" you're taken to ZDNet.
"Some routing tables hit 512K routes today. Some old hardware and software can't handle that and either crash or ignore newly learned routes. So this may cause some disturbances in the Force."
By this, Vink meant that some routers have only a limited amount of memory for their maps of the Internet. These BGP routing tables are typically kept in a specialized kind of memory called Ternary Content Addressable Memory (TCAM). Once there were more than 512,000 routes, many older routers could no longer properly track them.
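To make the failure mode concrete, here is a minimal Python sketch of a forwarding table with a hard entry limit that ignores newly learned routes once it is full. It is a toy model, not how any real router is implemented: the class name `FixedCapacityRouteTable`, the peer labels, and the capacity of 3 (standing in for the 512K limit) are all assumptions made for illustration, and real hardware performs the lookup in TCAM rather than in a loop.

```python
import ipaddress


class FixedCapacityRouteTable:
    """Toy forwarding table with a hard entry limit (hypothetical model of a full TCAM)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.routes = {}   # IPv4 network -> next hop
        self.dropped = 0   # routes ignored because the table was full

    def learn(self, prefix, next_hop):
        """Install a route; return False if it had to be ignored."""
        network = ipaddress.IPv4Network(prefix)
        if network in self.routes:
            # Updating an existing entry does not need a new slot.
            self.routes[network] = next_hop
            return True
        if len(self.routes) >= self.capacity:
            # Table is full: the newly learned route is silently ignored.
            self.dropped += 1
            return False
        self.routes[network] = next_hop
        return True

    def lookup(self, address):
        """Longest-prefix match; None means there is no usable route."""
        addr = ipaddress.IPv4Address(address)
        best = None
        for network, next_hop in self.routes.items():
            if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
                best = (network, next_hop)
        return best[1] if best else None


# Assumed demo values: a capacity of 3 stands in for the 512K limit.
table = FixedCapacityRouteTable(capacity=3)
prefixes = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "198.51.100.0/24"]
for i, prefix in enumerate(prefixes):
    table.learn(prefix, f"peer-{i}")

print(table.lookup("192.168.1.1"))   # peer-2
print(table.lookup("198.51.100.7"))  # None: this route arrived after the table filled up
print(table.dropped)                 # 1
```

In this sketch, traffic to the prefix that arrived last simply has nowhere to go, which is roughly the "ignore newly learned routes" behavior described in the quote above. The other failure mode it mentions, a router crashing outright, is not modeled here.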
Adding insult to injury, Internet engineers who were paying attention knew this problem was coming as early as May. As one IPv4 address reselling site explained:
"We expect to see/hear of some bugs once the Internet reaches 512k routes. If the growth of the routing table will continue as in the past months, we expect to see 512k routes in the global routing table not earlier than August and not later than October."
Lucky us. We got there early.
Cisco also warned its customers in May that this BGP problem was coming and that, in particular, a number of routers and networking products would be affected. There are workarounds and, of course, the equipment could have been replaced. But in all too many cases, this was not done.
Still, it could have been far worse. Instead of sporadic Internet problems we could have seen entire swathes of the Internet go out of service for hours at a time.
Sources at several major tier-one ISPs admitted that the BGP routing map problem was indeed the source of the service troubles. All of them are working on correcting it as quickly as possible.
Unfortunately, we can expect more hiccups on the Internet as ISPs continue to deal with the BGP problem. In a week or two the problem should be fixed once and for all, but while older routers are being upgraded or replaced we will see more Internet blockages and slowdowns.