Internet hiccups today? You're not alone. Here's why

Summary: It's not just you. Many Internet providers have been having trouble as they run into long-expected (but not adequately prepared for) routing table problems.

TOPICS: Networking, Cisco
If you found your Internet speed has been pathetic today and some sites wouldn't load at all, you're not alone.

The Border Gateway Protocol (BGP) routing tables hit the limit, and older routers failed — taking some of the Internet with them.

Many tier-one Internet service providers (ISPs) and, in turn, the last-mile ISPs they support experienced technical problems that resulted in bad service throughout the US and some parts of Canada.

According to postings in the North American Network Operators Group (NANOG) mailing list, the professional association for Internet engineering and architecture, there have been "major problems with multiple ISPs since around 4-5 AM EST." 

According to NANOG and the complaint tracker DownDetector, many Internet providers, including Level 3, AT&T, Cogent, Sprint, Verizon, and others, have suffered from serious performance problems at various times on Tuesday.

And they won't be the last.

Most of the ISPs have not commented on these disruptions. Level 3, in a statement, did say, "Our network is currently experiencing limited service disruptions affecting some of our customers. Ensuring the stability of our network and communications services is our primary concern and we are dedicated to minimizing impact to our customers. Our technicians are currently working to restore services as quickly as possible, and we are in close contact with affected customers."

As a result of these problems, some Web hosting companies, such as LiquidWeb, and the sites they host have been effectively knocked offline.

The company reported on Twitter that the problem first appeared to arise because a "large network provider is performing maintenance."

While an ISP's maintenance activity may have been a factor, the real problem was that the Border Gateway Protocol (BGP) routing tables had grown too large for some top-level Internet routers to handle. The result was that these routers could no longer properly handle Internet traffic.

BGP is the routing protocol used to share the master routes, or map, of the Internet. The Domain Name System (DNS) is layered on top of this, so that when you click on "www.zdnet.com" you're taken to ZDNet.
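
To make the "map" idea concrete, here is a minimal Python sketch of what a router does with the routes it has learned: for each destination, it picks the most specific prefix that covers the address. The prefixes are reserved documentation ranges and the next-hop labels are hypothetical. Real routers perform this lookup in dedicated hardware rather than in a Python dictionary.

```python
import ipaddress

# Hypothetical routing table: prefix -> next-hop label (illustrative only).
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): "default-upstream",
    ipaddress.ip_network("198.51.100.0/24"): "peer-a",
    ipaddress.ip_network("198.51.100.128/25"): "peer-b",
}


def lookup(destination: str) -> str:
    """Return the next hop for the longest (most specific) matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


print(lookup("198.51.100.200"))  # peer-b (the /25 is more specific than the /24)
print(lookup("198.51.100.5"))    # peer-a
print(lookup("203.0.113.9"))     # default-upstream
```

Every extra prefix announced to the Internet becomes one more entry that a table like this has to hold.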

When the BGP maps grow too large for the routers' memory, the result is, as the Internet Storm Center put it, "BGP is flapping."

Dutch Internet expert Teun Vink explained:

"Some routing tables hit 512K routes today. Some old hardware and software can't handle that and either crash or ignore newly learned routes. So this may cause some disturbances in the Force."

By this, Vink meant that some routers have only a limited amount of memory for their maps of the Internet. These BGP routing tables are typically kept in a specialized kind of memory called Ternary Content-Addressable Memory (TCAM). Once there were more than 512,000 routes, many older routers could no longer properly track the routes.
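
The failure mode Vink describes can be sketched as a toy model. The Python class below is a hypothetical, fixed-capacity table that silently ignores newly learned prefixes once it is full, which is one of the two behaviors he mentions (the other being a crash). The capacity of 4 is a stand-in for the roughly 512K-entry limit, and the class and method names are invented for illustration; this is not how any particular vendor's firmware is written.

```python
import ipaddress


class FixedCapacityTable:
    """Toy forwarding table that ignores new prefixes once it is full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.routes = {}   # network -> next-hop label
        self.dropped = 0   # count of prefixes that did not fit

    def learn(self, prefix: str, next_hop: str) -> bool:
        """Install a route; return False if the table had no room for it."""
        net = ipaddress.ip_network(prefix)
        if net not in self.routes and len(self.routes) >= self.capacity:
            self.dropped += 1
            return False
        self.routes[net] = next_hop
        return True


# Capacity of 4 stands in for the ~512K-route limit (an assumption for brevity).
table = FixedCapacityTable(capacity=4)

# Splitting a /8 into /11s yields 8 prefixes, twice what the table can hold.
for net in ipaddress.ip_network("10.0.0.0/8").subnets(new_prefix=11):
    table.learn(str(net), "upstream")

print(len(table.routes), table.dropped)  # 4 4
```

Traffic for the four dropped prefixes would have no specific route in this model, which is roughly why some destinations became unreachable while others kept working.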

Adding insult to injury, Internet engineers who were paying attention knew this problem was coming as early as May. As one IPv4 address reselling site explained:

"We expect to see/hear of some bugs once the Internet reaches 512k routes. If the growth of the routing table will continue as in the past months, we expect to see 512k routes in the global routing table not earlier than August and not later than October."

Lucky us. We got there early.

Cisco also warned its customers in May that this BGP problem was coming and that, in particular, a number of routers and networking products would be affected. There are workarounds and, of course, the equipment could have been replaced. But in all too many cases this was not done.

Still, it could have been far worse. Instead of sporadic Internet problems we could have seen entire swathes of the Internet go out of service for hours at a time.

Sources at several major tier-one ISPs admitted that the BGP routing map problem was indeed the source of the service troubles. All of them are working on correcting it as quickly as possible. 

One site, the well-regarded security service provider LastPass, appeared at first to be impacted by the problem. In the event, LastPass went down because one of its datacenters had failed.

LastPass' services have since been restored.

Unfortunately, we can expect more hiccups on the Internet as ISPs continue to deal with the BGP problem. In a week or two the problem should be fixed once and for all, but while older routers are being upgraded or replaced we will see more Internet blockages and slowdowns.

Talkback

20 comments
  • Internet hiccup or a portent for Cloud?

    All hail the Cloud...until there's a hiccup...or someone decides to cut you off...but it's OK...

    "In a week or two the problem should be fixed for once and for all."

    That's great news for all those 'savvy' businesses that drank the Cloud Kool-Aid.

    In the meantime, how's that Cloud thing working out for your business? The shareholders must be just as pleased as the staff sitting around waiting for the Cloud to come back.
    Scott W-ef9ad
    • WTF Scott W-efag

      What are you talking about? This has NOTHING to do with the "cloud". Why are you even commenting here if you have no idea what you're talking about? You're talking baseball during a football game.
      sherlockboneman
  • So, this is the future of everything, eh?

    So, this is the future of everything, eh?

    Where I'm located, our power actually went out today, so I didn't really get the chance to "enjoy" the rest of the problems the Internet was having.

    . . . and this is why I tend to opt for hybrid solutions over pure cloud solutions. At least they're somewhat functional when the Internet is down, and when I still have battery life left and the power is down.
    CobraA1
  • Fibre line cut in Toronto at Allstream

    I wonder if this had anything to do with it?

    "David Paul Federbush · Queensboro Community College
    The issue is outside of Liquid Web's network. from what we've gathered so far, there was a fibre line cut in Toronto at Allstream, and this is affecting all major providers in the U.S. but in particular, between Chicago and Toronto, and unfortunately, Liquid Web is right in the middle of those cities. The last report stated that Allstream had a technician on site working on repairs."

    http://downdetector.com/status/comcast-xfinity
    toolhead2001
    • That may have added to the problem.

      Causing large numbers of temporary bypass routes could have made the problem show up sooner in those "old" servers.
      jallan32
  • Since Yesterday?

    Those hiccups have already been going on for at least a week...
    Erwin.Craps@...
  • Really?

    "Cisco also warned its customers in May that this BGP problem was coming and that, in particular, a number of routers and networking products would be affected. There are workarounds, and, of course the equipment could have been replaced. But, in all too many cases this was not done."

    That's a bit like your daughter telling you that she is getting married on the other side of the world in 3 months' time and expecting everybody to be ready. If everybody had taken action, would the service companies installing the equipment have been able to cope?

    And did Cisco warn everybody individually or just put something on their website and hope you would spot it?

    I'm not technically minded, so feel free to rip me to pieces for having an opinion (people usually do), but it seems to me that businesses should have been warned a long time ago. It can take more than three months just to get hold of the equipment you need in some cases, let alone install it.
    coolcity
    • Cisco's warning:

      FYI, when one of the bigger companies in the technical arena posts a warning or message on their site, it gets a lot of attention. Sure, more lead time would have been nice, but if they posted it as a 'heads-up', all of the other players surely saw it.

      There are entire departments in tech companies that follow other companies around just for information like this; within moments of this being posted, I'll guarantee you that all the big tech companies (Oracle, MS, Google, HP, IBM, Juniper, Barracuda, name any other big tech company) knew about it, and posted a similar bulletin on their site. There's no percentage in it for any of these companies for their competitors to be caught with their metaphorical pants down; in cases like these, it's imperative that the companies work together to get these flaws addressed as quickly as possible.
      rshores
      • Just like railroads, truck lines and airlines

        I'm sure CSX on the East Coast does not gloat if a Union Pacific track across the Rockies goes down; they have customers who want to ship freight to Oregon, and customers who want to get freight from there. Freight cars go all over everyone else's tracks, and the accounting programs compute the splits (example: Microsoft ships pallets full of Surfaces to Florida, the car switches to CSX somewhere in the Midwest, probably Chicago, and UP collects from the customer and sends a fraction of the payment to CSX). Internet works the same way, only millions of times faster, so to quote Donne, no ISP is an island ... any company's downtime diminishes mine.
        jallan32
      • Cisco's warning:

        Spot on, rshores, not to mention Cisco generally dedicates internal employees to large providers. When I worked for a Tampa-based ISP (Bright House) we had two full-time engineers from Cisco sitting with the rest of the engineering staff. I can't vouch for all ISPs, but I know Comcast does this as well.
        jtmajorx
  • This isn't a BGP problem. This is an ooooooooooold equipment problem.

    I keep hearing everyone go on about how this is supposed to be a BGP problem. That could not be farther from the truth. BGP is working exactly as it is supposed to. The problem is being caused by networks that refuse to upgrade to modern equipment that has higher route scale.

    Suddenly, everyone I'm talking to today about this no longer thinks I was nuts for recommending a minimum FIB capacity of 1M routes for the past couple of years. Many newer platforms support 4M in the FIB. And no, that's not the RIB - that's the FIB. Those same platforms support 15-20M in the RIB.

    Boys and girls, it's time to refresh those ancient 7600s and 6509s with more modern platforms.
    jcostom
    • Amen JCostom

      You can push a 7600 pretty freaking far, but I agree. Time to migrate to the ASR platform, though I'm totally not going to lie: I have no clue on FIB entries for, say, a 9K. So, guess I'm looking that up later.
      jtmajorx
    • RTFM - Engineering Guidelines

      Agreed. Color me an old curmudgeon, but back in 1997 we were tracking BGP max route limits, and at Nortel it was a requirement of the engineering team to review this with the customer prior to deployment. This comes down to the usual assortment of "shoulda done that" engineering hoo-ha. Too many folks too eager to make a sale and too few folks eager to go through a migration/upgrade. Fix it when it breaks.
      AlexCloudBoyMP2
  • Why so many routes?

    My experience is in programming, not networking. A little background on how BGP works and why it is growing so rapidly would have been helpful.
    technojoe
    • It is partly the nature of IPv4

      Each public network has to be routed... and that has to be entered everywhere (eventually). As public networks get smaller and smaller (as in not even a class C network) the number of routes grows... I won't say exponentially, but close.
      jessepollard
    • Re: Why so many routes?

      Think of it as the branches of a tree. Every public network is a branch, or perhaps even a twig, off of it.
      Think of every house that has Internet access by land-line, be it DSL, fiber, cable or anything else. Each provider has a virtual forest within, or perhaps more aptly a coral reef. All of those do an untold amount of routing, and then pass a shrunk-down version of that to the next level up. Do that a time or two, and even the shrunk-down versions get hellishly complex.
      To that, add that as mergers, buyouts, etc. take place, these providers may not keep fully contiguous sets of IPs to give out (e.g., initially company A has 199.0.0.0-200.255.255.254 as a range it can use for all of its clients. But then it bought out company B and added 103.0.0.0-103.255.255.254. Now, instead of one smaller 'batch' of routes, it has to use two for the same things).
      Now, take that and add in wireless networking as well, which can be even more fractured.

      Depending on the provider, it's literally possible that a route from a house in Town A to Town B (just a few miles away) has to go through 20+ different routers. And it is the router at the midpoint that has all the routes for that path. Now take the core-level routers that allow country-to-country routing, which try to track far more. Those are the very old hardware, which basically chokes because its RAM is full.
      jonrosen
      • "...Think of every house that has Internet access by land-line..."

        Homes do not contribute to the problem usually.

        Homes are usually a single IP number (that then supports the rest of the house via NAT, which is not a public network). Now a large collection of houses might be on a subnet, but not a single house (some places are even using a 10.x.x.x address, which is also not a public network).
        jessepollard
  • Hiccup??? More like majorly flat on its back!

    I'm amazed this seems to be getting so little attention. It's been a couple of days since things got crazy, and I search for updates on what's going on with it and NOTHING! ...not a mention of it anywhere, except for the original reports. For me, this is big - the day it happened I was without any connectivity whatsoever, and now that I can at least connect I'm finding that my "broadband" connection is slower than dial-up! I've been trying to upload a photo to a Flickr account and it's died three times... timed out! I can't get the most basic things to work, due to the ridiculous connection speed. I hope something is happening behind the scenes - it would be a big help if someone in the know would at least provide some updates so some of us could get some idea what awaits us where this is concerned.
    wa1den_b@...
    • As long as IPv4 is being used, it will be a problem.

      IPv6 has better routing support, which requires fewer routes to maintain.
      jessepollard
  • FiOS customers w/ odd outages in nyc, nj, baltimore, dc, dallas, florida, l

    So, I was right - the entry at http://luxe.marketing/luxeblog/entry/nyc-metro-region-affected-by-verizon-fios-brownouts-via-alter-net#sthash.1yuoENyZ.dpbs basically is part of a larger more dramatic problem with Verizon's FiOS network. Thanks to Sam at LiquidWeb, I have been able to piece this all together, and it's not good for Verizon customers.

    http://www.zdnet.com/internet-hiccups-today-youre-not-alone-heres-why-7000032566/

    The issue that I reported to you lies within a larger issue and finally I have been able to piece it all together with network operator assistance. Apparently back in May, Cogent / L3 pushed a new routing table to all network operators / providers which accidentally exceeded the memory size footprint acceptable to most routers, causing a router meltdown and an alert from Cisco among other things. The same issue I am experiencing happened to people on nearly all Internet services, however most providers repaired their routing tables immediately. Apparently, large numbers of specifically FiOS / Verizon users, who live in several large geo-specific regions, are still having severe and ongoing problems relating to the issue, specifically Baltimore, NYC/NJ, DC Metro, Los Angeles, Florida, Dallas, Atlanta and North Carolina.

    It seems that my ticketing the issue, helped Verizon to identify one of the affected routing issues within their own networking that was still un-repaired, that said the issue still was and is occurring on Verizon’s network. It has nothing to do with trace or latency or ftp timeout or my modem or whatever else – it’s basically that a large amount of routers on the FiOS network are experiencing routing issues, but Verizon doesn't know which routers need table updates, and which ones don't seemingly, and those routers being spread across the nation is another factor.

    The result of this problem, for me, was that I was being forced to wait, and wait, and wait until finally the network operation timed out, hopefully bumping my IP transaction over to an unaffected router. Often, that did not happen - resulting in massive packet loss and, at a minimum, high latency, but more like total disappearance of the outside world - the internet.

    Now, I want free and clear out of my FIOS contract frankly, but if I cancel it because they failed to provide me internet – I’ll end up in collections with a balance and it will be on my credit. Many people are locked in this situation, and because it’s a ridiculous situation to have a contract of service violated by the provider, they should all be allowed a legal out.

    But unless there’s a legal stance on the issue, it’s part of the larger Net Neutrality thing that Verizon refuses to acknowledge a problem publicly to its customers - given ongoing support of anti-net neutrality, it looks really bad to me, that you support something in interest profit wise like market-value rates for faster speed of an IP transaction, meanwhile at the same time you know you’re failing to provide your customers internet access in many markets, much less internet at a speed acceptable to consider internet access at all.

    For support contact details regarding the issue, and more information visit the thread I started at the VZ forums.

    http://forums.verizon.com/t5/FiOS-Internet/FiOS-customers-w-odd-outages-in-nyc-nj-baltimore-dc-dallas/td-p/733309

    - See more at: http://luxe.marketing/luxeblog/entry/verizon-experiencing-massive-outages-in-nyc-nj-baltimore-dc-dallas-florida-los-angeles-and-north-carolina-on-fios-network#sthash.IYN5iMEt.dpuf
    luxedesignnyc