'Screaming car wreck' of internet routing needs a fire brigade: Geoff Huston

After 30 years, this 'massively distributed system that relies on the propagation of rumours' seems unfixable, says APNIC's chief scientist, but digital signatures are starting to help.

Border Gateway Protocol (BGP) is the system used to route traffic around the internet, and it's terrible -- despite decades of global efforts to improve its security.

According to Geoff Huston, chief scientist with the Asia Pacific Network Information Centre (APNIC), BGP is a "screaming car wreck" with "phenomenal insecurity".

"I actually don't think it's a fixable car wreck," Huston told the organisation's twice-yearly conference in Chiang Mai, Thailand on Tuesday.

"BGP is a protocol that dates back to the Bellman-Ford algorithm of 1963. It's older than the moonshot. It's getting on to 60 years," he said.

BGP has certainly been at the core of some serious incidents over the decades.

Back in 2008, Pakistan was attempting to censor videos on the internet when it accidentally knocked YouTube offline globally. Something similar happened in 2014 when Indian telco Bharti Airtel took down Google services.

Other incidents seem less accidental. In June this year, for example, a large chunk of European mobile traffic was rerouted through China for two hours.

One key problem is that BGP relies on everyone telling the truth.

Internet routers use BGP to "advertise" which parts of the internet they can send traffic to, and how efficiently they can do it. Other routers do the same. They're all meant to pass this chatter on to their neighbours without altering it.

When the information becomes outdated -- for example when an internet link fails -- routers are meant to advertise a so-called "withdrawal". They're meant to pass on that information truthfully too.

"What you find in a large complex BGP mesh is the withdraws and the updates tend to struggle against each other. And a single update event at source might become 20 updates and then a withdrawal," Huston said.

Routers out in the broader internet, unlike those that connect end users' edge networks to their ISP, usually don't have a default route programmed in. They exist in what's called the Default-Free Zone, where they're totally reliant on BGP to tell them where to send traffic.

"Any massively distributed system that relies on the propagation of rumours, where every part of that propagation can be altered on a hop-by-hop basis, if you believe that you can secure that and every aspect of its operation, both withdrawals and updates, then I would love to hear what your answer is," Huston said.

Routing engineers do a "wacko job"

Another problem is that network engineers set up BGP in ways that Huston says are a "wacko job".

"Things that I would regard as being grievous anomalies and absolute contraventions of the protocol, you guys think are normal," he said.

The aim of these weirdnesses is usually to improve network efficiency, or to distribute traffic across an organisation's infrastructure. Sometimes it's a commercial decision to send traffic through less expensive links.

But it's difficult to distinguish these deliberate weirdnesses from genuine mistakes or malicious activity.

"These are deliberate things and they're not actually bad. They're quite normal because that's the way you do this. So what's abnormal? What's the lie amongst all that weird behaviour that you seem to think is fun?," Huston asked.

"BGP is incredibly noisy and incredibly unstable. Now spot the anomaly. And don't forget too that the best attack lasts for 15 seconds. The best attack is so fast you don't even notice it," he said.

Job Snijders, IP development engineer with NTT Communications, says that some of the techniques used by so-called BGP optimisers can create problems for other network operators if an organisation's internal routing weirdnesses leak onto the global internet.

"The reality of having installed such appliances is that you may be a ticking time bomb without realising it," Snijders told the conference, even though there are legitimate reasons to use them.

"These BGP optimisers are how you can take entire countries offline."

Snijders says the Default-Free Zone should be treated like a natural resource such as a river, and routing weirdnesses should be handled like toxic chemicals.

"Problems that happen upstream have negative consequences downstream," he said.

"[The Default-Free Zone] it something all of us share. It facilitates all of our businesses. We make money using this shared resource, but we also together have to take care of the resource."

Call in the BGP fire brigade

Huston says that given BGP's inherent problems, we need to approach all this in another way: contain the issue by detecting anomalies quickly.

"It's a bit like the fire brigade. If you keep on making houses that burn, we'll set up a fire brigade to put them out when they're burning," he said.

"We can't stop you trying to burn down your house, but we can stop the mess afterwards being as bad."

That's not going to be easy, however.

Machine learning is not the answer

"You're trying to detect fast-running, rapidly-moving anomalies inside an environment that generates by default fast-running rapidly-moving anomalies," Huston said.

"So it's a challenge," he said, though he did present a number of mathematical techniques for reducing the computation required by what are essentially brute-force processes.

Huston is also highly sceptical of using machine learning techniques.

"There's a huge amount of computing sins and transgressions encompassed by those innocent two words 'machine learning'," he said.

"In general, if you're a research funding agency you're used to hearing this -- that and the word blockchain. And if you apply for research grants, you're used to using these words -- and blockchain -- because that's what gets you money. But on the whole I'm not a big fan of this."

Huston says that when you look into most machine learning systems, you find "some kind of n-dimensional parametric analysis", where the legitimate and erroneous objects tend to form clusters.

"Hell, you don't need to understand it. Just feed it into a cluster tool. There are lots of them around. And the theory goes that if you get your parameter right, all the outliers naturally group themselves going 'Hey I'm a lie'," he said.

"Now I believe in unicorns as well. And I believe in all kinds of things, including Father Christmas and the Easter Bunny."

Huston also pointed out the limits of using an Internet Routing Registry (IRR), a database of internet route objects.

One problem is that IRRs accumulate out-of-date or badly formed information.

IRRs by design are logbooks and whatever goes in them usually stays there, said Anurag Bhatia from Hurricane Electric.

His research showed that filtering new BGP route information against the data registered in the IRRs does "not [work] so well".

At the time he conducted the research, 758,313 route prefixes were visible in the global routing table, counting both IPv4 and IPv6 networks.

Out of those, 603,185 (79.54%) had valid route objects in the IRRs. Some 58,587 (7.73%) had no valid route object, and the remaining 96,514 (12.72%) had mismatched route objects.

This means that if a router had filtered all the BGP information it received against the IRR databases, more than 20% of the routes in the global routing table would be filtered out, making those networks unreachable.
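
The "more than 20%" figure follows directly from the counts above, as this short Python check suggests. It is only a sketch of the arithmetic: the constant names are made up for illustration, and the counts are used exactly as quoted, so the computed percentages may differ from the rounded ones in the text in the last decimal place.

```python
# Counts as quoted in the article; nothing here is re-measured.
TOTAL_PREFIXES = 758_313
VALID = 603_185
NO_VALID_OBJECT = 58_587
MISMATCHED = 96_514


def percent(count, total=TOTAL_PREFIXES):
    """Share of the global routing table, as a percentage."""
    return 100 * count / total


# A strict IRR filter would drop every prefix that is not a clean match.
dropped = NO_VALID_OBJECT + MISMATCHED

print(f"valid route objects: {percent(VALID):.2f}%")
print(f"would be filtered:   {dropped} prefixes ({percent(dropped):.2f}%)")
```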

Both Bhatia and Snijders encouraged network operators to help clean up the IRRs and to start using digital signatures to ensure their routing information is authenticated.

Network operators should help others protect them by creating Resource Public Key Infrastructure (RPKI) Route Origin Authorisations (ROAs) for their own network space, Snijders said. This authenticates the information being added to the IRRs.

Protect yourself and others by deploying RPKI-based BGP Origin Validation, he said.
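
As a rough illustration of what origin validation checks, the Python sketch below classifies a route announcement against a list of ROAs, loosely following the valid / invalid / not-found states defined in RFC 6811. The `Roa` tuple and `validate_origin` function are hypothetical names for this example, the ROA data is invented using documentation address space and AS numbers, and a real deployment would rely on a validator and the router's own implementation rather than code like this.

```python
import ipaddress
from typing import NamedTuple


class Roa(NamedTuple):
    """Illustrative ROA: this ASN may originate this prefix, up to max_length."""
    prefix: str
    max_length: int
    asn: int


def validate_origin(route_prefix, origin_asn, roas):
    """Return 'valid', 'invalid' or 'not-found' for an announced route."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # A ROA is relevant only if its prefix covers the announced route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # Valid needs a matching origin AS and a prefix no longer than allowed.
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by a ROA but never matched: invalid. No covering ROA: not-found.
    return "invalid" if covered else "not-found"


roas = [Roa("192.0.2.0/24", 24, 64500)]

print(validate_origin("192.0.2.0/24", 64500, roas))     # valid
print(validate_origin("192.0.2.0/24", 64511, roas))     # invalid (wrong origin AS)
print(validate_origin("192.0.2.0/25", 64500, roas))     # invalid (more specific than allowed)
print(validate_origin("198.51.100.0/24", 64500, roas))  # not-found (no covering ROA)
```

Note that this checks only the origin of a route. It says nothing about whether the rest of the path was altered along the way.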

Disclosure: Stilgherrian travelled to Chiang Mai, Thailand, as a guest of APNIC.
