CloudFlare pins outage on bad rule for Juniper routers

Summary: Content distribution firm "drops off the internet" after outage.

Content distribution company CloudFlare suffered a worldwide outage for around an hour over the weekend after applying a bad change to its Juniper edge routers that replicated across its network.

CloudFlare "effectively dropped off the internet" after a network-wide failure hit all 23 of its nodes, located in 14 countries. Visitors to sites on CloudFlare's network during the outage, between 09.47 UTC and 10.49 UTC, would have received a DNS error, CloudFlare's CEO Matthew Prince explained on the company's blog.

The outage affected CloudFlare's DNS and any services that rely on its web proxy, an important component for clients, such as WikiLeaks and around 500,000 other organisations, that use it for web optimisation and to stay online in the face of distributed denial-of-service (DDoS) attacks.

The outage began after CloudFlare's engineers applied a "bad rule" to a Juniper edge router while fending off a DDoS attack against one of its clients. The rule then spread across the company's network of edge routers via the Flowspec protocol that the Juniper routers support.

The rule was designed to filter an attack that was sending packets between 99,971 and 99,985 bytes long to the client’s DNS server, but instead caused the routers to malfunction.
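A range like that can be checked against protocol limits before a rule is pushed anywhere: an IPv4 packet cannot exceed 65,535 bytes, because its Total Length header field is 16 bits wide. The sketch below is a minimal illustration of such a pre-deployment check. The function name and its behaviour are hypothetical and do not represent CloudFlare's or Juniper's tooling.

```python
from __future__ import annotations

# Hypothetical pre-deployment sanity check for a packet-length filter rule.
# This is an illustrative sketch, not CloudFlare's or Juniper's tooling.

IPV4_MAX_PACKET_BYTES = 65_535  # IPv4 Total Length is a 16-bit field


def validate_length_range(
    low: int, high: int, max_bytes: int = IPV4_MAX_PACKET_BYTES
) -> list[str]:
    """Return a list of problems with a packet-length match range (empty if none)."""
    problems = []
    if low < 0 or high < 0:
        problems.append("packet lengths cannot be negative")
    if low > high:
        problems.append("lower bound exceeds upper bound")
    if high > max_bytes:
        problems.append(
            f"upper bound {high} exceeds maximum packet size {max_bytes}"
        )
    return problems


if __name__ == "__main__":
    # The range quoted in the article is flagged; a typical range is not.
    print(validate_length_range(99_971, 99_985))
    print(validate_length_range(64, 1_500))
```

A check of this kind would only flag the rule as suspicious. It says nothing about why the routers mishandled it.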

CloudFlare's network-wide outage. Credit: CloudFlare

"Flowspec accepted the rule and relayed it to our edge network. What should have happened is that no packet should have matched that rule because no packet was actually that large. What happened instead is that the routers encountered the rule and then proceeded to consume all their RAM until they crashed," Prince said.

Some routers failed to reboot automatically, forcing network operations teams at the datacentres to physically access them and perform a hard reboot to get them up and running again.

The company said it is investigating whether Juniper is aware of any bugs and will begin testing whether Flowspec rule updates can be targeted to specific datacentres rather than applied network-wide.
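Targeting updates at specific datacentres amounts to a staged rollout: apply the change in one place, confirm that site is still healthy, and only then move on. The sketch below illustrates the idea in general terms. The `push` and `is_healthy` callables and the site names are hypothetical placeholders, not part of Flowspec or of CloudFlare's systems.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    sites: Iterable[str],
    push: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a change one site at a time and stop at the first unhealthy site.

    Returns the sites that received the change, including a failing one.
    `push` and `is_healthy` are hypothetical hooks supplied by the caller.
    """
    applied = []
    for site in sites:
        push(site)
        applied.append(site)
        if not is_healthy(site):
            break  # halt so the remaining sites are never touched
    return applied


if __name__ == "__main__":
    pushed = []
    # Pretend the second site becomes unhealthy after receiving the rule.
    result = staged_rollout(
        ["site-a", "site-b", "site-c"],
        push=pushed.append,
        is_healthy=lambda site: site != "site-b",
    )
    print(result)
```

In this example the third site never receives the change, which limits the damage from a bad rule to the sites already touched.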

CloudFlare intends to issue service credits to accounts covered by service level agreements.

"Any amount of downtime is completely unacceptable to us and the whole CloudFlare team is sorry we let our customers down this morning," said Prince.

Topics: Networking

About

Liam Tung is an Australian business technology journalist living a few too many Swedish miles north of Stockholm for his liking. He gained a bachelors degree in economics and arts (cultural studies) at Sydney's Macquarie University, but hacked (without Norse or malicious code for that matter) his way into a career as an enterprise tech, s...
