Why the Internet hiccuped

Summary: And why it will hiccup again soon.

The good news is that the Internet's behaving itself today. The bad news is that by November we will see another burst of Internet slowdowns and sites becoming unavailable as we did on August 12.

This spike in new BGP routes from Verizon is what caused the Internet to hiccup on August 12.

Here's how it all went wrong and why we can expect it to happen again.

We now know for certain that the Internet started having fits because some Tier 1 Internet routers' Border Gateway Protocol (BGP) routing tables had grown too large. The result: These routers could no longer properly handle Internet traffic.

BGP is the Internet's most important high-level protocol. In BGP, each routing domain is designated as an autonomous system (AS). BGP is used to find the shortest routes, i.e., the ones that take the fewest AS hops, between hosts. A Tier 1 network has access to the entire Internet BGP routing table. Typically, Tier 1 networks, such as the big three Tier 1 ISPs (Level 3, NTT, and TeliaSonera), provide 10 to 100Gbps Internet connections to Tier 2 and last-mile ISPs.
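To make the "fewest AS hops" idea concrete, here is a minimal sketch in Python. It assumes each candidate route is simply a list of AS numbers and compares only path length; real BGP implementations weigh other attributes, such as local preference, before they look at AS-path length. The function name and the AS numbers (taken from the private-use range) are invented for illustration.

```python
from typing import List


def shortest_as_path(candidates: List[List[int]]) -> List[int]:
    """Return the candidate route that crosses the fewest autonomous systems.

    Simplification: only AS-path length is compared. Ties go to the
    first candidate listed.
    """
    if not candidates:
        raise ValueError("no candidate routes")
    return min(candidates, key=len)


# Hypothetical AS paths to the same destination (private-use AS numbers).
routes = [
    [64512, 64513, 64514],         # three AS hops
    [64520, 64514],                # two AS hops
    [64530, 64531, 64532, 64514],  # four AS hops
]

best = shortest_as_path(routes)
print(best)  # [64520, 64514]
```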

To keep track of these high-speed connections, top-level routers make a map of the routes. This map is kept in a specialized kind of memory called Ternary Content Addressable Memory (TCAM). So far, so good. The problem is that some older routers only have enough TCAM to map 512,000 routes. Once a router runs out of TCAM, as Cisco explained in May 2014, networks could see performance degradation, routing instability, and availability problems.

Well, they got that right.

According to BGPMon, a high-level networking traffic monitoring company, there were outages for 2,587 autonomous systems. This was caused by "15,000 new prefixes introduced into the global routing table." Since the full BGP map was hovering just below 500,000, that was enough to push non-upgraded Tier 1 Cisco routers over their TCAM limit. This, in turn, started causing havoc over the entire Internet.
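The arithmetic behind that overflow is simple enough to sketch. The snippet below assumes a hard cap of 512,000 routes and rounds the pre-incident table to 500,000, which is approximate since the table was hovering just below that figure. The function name is invented, and how a real router behaves once it runs out of space depends on the platform; this only counts the routes that no longer fit.

```python
TCAM_ROUTE_LIMIT = 512_000  # capacity of the older routers, per the article


def routes_that_do_not_fit(current_routes: int, new_routes: int,
                           limit: int = TCAM_ROUTE_LIMIT) -> int:
    """Return how many routes exceed the table limit once the new ones arrive."""
    return max(0, current_routes + new_routes - limit)


# Roughly 500,000 existing routes plus the 15,000 new prefixes.
overflow = routes_that_do_not_fit(500_000, 15_000)
print(overflow)  # 3000
```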

By BGPMon's reckoning, almost all of those 15,000 new prefixes that broke the camel's back originated from Verizon. Andree Toonk, a network engineer and BGPMon's founder, wrote, "Whatever happened internally at Verizon caused aggregation for these prefixes to fail, which resulted in the introduction of thousands of new routes into the global routing table. This caused the routing table to temporarily reach 515,000 prefixes and that caused issues for older Cisco routers."

Fortunately, Toonk continued, "Verizon quickly solved the de-aggregation problem, so we’re good for now. However the Internet routing table will continue to grow organically and we will reach the 512,000 limit soon again."
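What Toonk calls failed aggregation can be illustrated with Python's standard `ipaddress` module. The sketch uses a private /16 block as a stand-in (not one of the prefixes actually involved): announced as one aggregate it costs one routing-table entry, but announced as individual /24s it costs 256.

```python
import ipaddress

# Example block only; a private range stands in for a real announcement.
aggregate = ipaddress.ip_network("10.0.0.0/16")

# De-aggregation: the same address space announced as /24 more-specifics.
more_specifics = list(aggregate.subnets(new_prefix=24))
print(len(more_specifics))  # 256

# Aggregation restored: the /24s collapse back into a single route.
re_aggregated = list(ipaddress.collapse_addresses(more_specifics))
print(re_aggregated)  # [IPv4Network('10.0.0.0/16')]
```

Multiply that effect across many blocks and a few thousand extra routes appear very quickly.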

Verizon was asked to comment on this but had not replied by the time this story was published. 

It's that last bit about the BGP table continuing to grow that makes me certain we'll see this problem reappear. By October, it's estimated, we'll be over 512,000 records once and for all.

This is not, by the by, a problem related to our running out of Internet Protocol version 4 (IPv4) addresses, as some have speculated. (We are running out of IPv4 addresses. Most of North America is down to its last free IPv4 addresses. By mid-February 2015, the IPv4 cupboard will be bare.)

No, the problem behind the technical problem is ignorance. Warren Kumari, a Google senior network security engineer, wrote on the North American Network Operators Group (NANOG) mailing list: "Sadly enough, not everyone knew about the issue - there are a large number of folk running BGP on [Cisco] 65xx and taking full tables who are not plugged into NANOG / the community. In many cases they are single-homed enterprise folk, but run BGP anyway, because some consultant set it up, some employee with clue did it years ago and then left, etc."

Other network engineers report that some service providers made the necessary correction in some, but not all, of their routers. This leads to a "lather, rinse, repeat" cycle of trying to recover: a fixed router starts feeding routes to a not-fixed router, and the not-fixed router goes back to dropping routes.

In short, ignorance certainly played a role in this problem's proliferation.

Kumari continued that "some network engineers did know about the issue, but convincing management to spend the cash to buy hardware that doesn't suck was hard, because 'everything is working fine at the moment.' Some folk needed things to fail spectacularly to be able to justify shelling out the $$$."

So, there you have it. A little incompetence, some people who simply didn't know they were running out of memory, and not enough money all led to the Internet having a coughing fit. While some ISPs will doubtless have learned their lesson from this experience, I am dead certain that we will see another round of problems when we finally go over the 512,000-route barrier once and for all.

Topics: Networking, Cisco, Verizon

Talkback

21 comments
  • Ahhhh! It's Y2K all over again!

    *runs around room, hands waving, panicking* The Sky is falling! The Sky is falling!
    Oh wait... no.
    If it's truly that bad, it will take only a short time before the specialists figure out a way to cordon off these older routers, and then the people that own them won't have access until they're upgraded.
    On related news, CenturyLink will probably raise my rates again, citing this as the cause, whether it really is or not...
    Zorched
    • The Sky Must Fall First

      This was only a hiccup. Kind of like a shot across the bow. Nothing will be done until the sky really falls this time. As noted in the article, management will not come up with the cash until it really falls. Similarly, if you start blocking older routers before it falls, then there would be all kinds of liability issues. You need a serious fall before this can happen.

      For now we all need to be sure our business can stay up without the web for a couple of days. That may no longer be possible. At least be ready for a lot of screaming and hollering in the next few days. I will bet some IT people lose their jobs for something totally beyond their control.
      MichaelInMA
    • The Fallacy of Y2K being no big deal

      The only reason Y2K had little impact is all the attention it received well before the date of impact. I spent 2 years as part of a government programming team that ferreted out date calculations that would have incorrectly decided how much time had passed and then translated those calculations to $$$. Believe me, if it hadn't been remedied, you would have noticed. There were thousands of other instances just like this that teams of people worked on to make sure everything was set come 1/1/2000. Sure, there were fears of things in vehicle ECUs not working properly that never came to fruition, and then there was Dick Van Patten doing infomercials for Y2K survival kits to ride out the pending apocalypse (why no one has converted those VHS tapes and put 'em up on YouTube is frustrating, as they would make for A grade entertainment now). Make fun of that stuff all you want, but you cannot label all of Y2K as a hoax.
      ejhonda
      • Y2K problem

        I agree with ejhonda. And there were some problems in the public domain in the UK, like some ATMs not working.

        FWIW and it may sound trivial, but my Lotus 1-2-3 spreadsheet app had date problems.
        DAS01
      • Y2K = intercepted asteroid

        Exactly; the Y2K bug was not the predicted disaster because the entire IT economy was tooled up and "overstaffed" to patch software belatedly. It would have been cheaper if more systems were upgraded in the 1970s, but at least the worst symptoms predicted got the last minute budgets to patch it just in time. Some developers were smarter; an online loan company app developed in the 1970s allowed for dates up to the 2030s, because they wrote loans and sales financing contract that were that long; others had "death march" coding parties in the last six months.

        It's like the movie asteroid that gets deflected or blown up at the last minute; no one was hurt BECAUSE somebody said the sky was falling, and WORKED to hold it up.
        jallan32
    • Y2K Was a Very Big Deal

      I don't know where you were working but Y2K had a lot of people out there fixing code and reformatting data. That is the only reason that we didn't have a disaster. So, running around yelling "The sky is falling!" is a necessary thing in the IT world. If we didn't do that, Y2K would have been a very large meltdown.
      hforman@...
  • Weeks not days

    or maybe months but inevitable. !Correction!
    MichaelInMA
  • Close but..

    This IS indirectly related to the shortage of IPv4 addresses since it is one of the underlying causes of the amount of sub-netting that has occurred and therefore an increase in the size of routing tables far beyond what was predicted.

    Secondly, carriers that have the full routing tables go well beyond just the Tier 1 carriers. Many smaller carriers carry the full tables, often it depends on how many upstream transit carriers they employ. Any of those carriers would likely have had this "hiccup" on their routers if:

    And...it is not the amount of TCAM in the routers, but the default allocation setting of that TCAM memory in almost all Cisco routers, old and new, that are set to the 512,000 routes. There is usually that much available for the IPv6 routes as well and it's adjustable. ISPs knew for quite some time that this was coming and this hiccup never needed to happen.
    All that was needed was a simple adjustment from the IPv6 TCAM allocation (Only about 20k routes are needed for that currently) to the IPv4. A reboot of the router is required afterwards but the carriers had months to take care of it during maintenance intervals.
    phantos
    • Comments so far...

      Of all I see, this one seems to have the most value, and implies that proper training and one reboot would keep us going for years to come. Is it that easy? If so, why didn't SJVN mention it in the article? Of course, IPv6 will eventually need that TCAM, but how soon will that happen?

      Someone please verify; this stuff is above my pay grade ;-)
      ClearCreek
  • Gone

    It will be nice when the internet craps out and we can get back to reality. Goes for cell phones also. Oh to have the 60's back again.
    Richard_Bruler
    • Back to Cross pens, yikes!

      Afford me the opportunity to at least reach back into the 80's to sync my contacts to a Sharp Wizard or Palm Pilot. Might be able to recall some rusty Graffiti skills via muscle memory :)
      Tired Tech
    • Irony intended?

      How would you complain about the internet if it went away?
      harry_dyke
  • Hiccuped mispelled - should be Hiccupped

    .
    gpopkey@...
  • Mispelled misspelled - should be Misspelled

    .
    HABAR
  • Why the Internet hiccuped

    bgp is supposed to be for external routing only, normally between large enterprises' AS or between large enterprises' dispersed AS. it was never intended to route small networks. it was a problem before, even when the internet was small, due to lack of experience (early learning curve) dealing with a very large scale world-wide network. although tighter route aggregation might help for some time, the need to upgrade these bgp backbones, as noted in the article, should be priority number one since the internet is growing at a tremendous clip.
    kc63092@...
  • I've been waiting for the Tier 1 services

    to take a lesson from Verizon and other "last mile" deliverers, and start charging THEM for "fast lane service" at a premium cost. Verizon, et al. get all THEIR data from Tier 1 under net neutrality. Why is that? Why not have every service right up the line charge downstream for faster and more dependable routing? Analyze THAT, FCC / Mr. Wheeler.
    I2k4
  • SJVN is at his best

    When he writes about networking. Much better than his desktop linux and Chromebook shill articles.
    otaddy
    • Oh, please!

      Sometimes I wish the words "shill" and "fanboy" (and all of its intentional misspellings) would be banned from tech sites. The accusations get to be quite wearisome (I don't care whose side you are on).
      tr7oy
  • Internet problems

    I don't remember any sites being unavailable on Tuesday. If this is a problem of routers not having enough of this TCAM, can they not simply be upgraded?
    OmegaWolf747
    • In some cases it will require a replacement (with downtime)

      In others to add memory requires testing... and longer downtime.

      These routers are not cheap... and neither is the memory.
      jessepollard