How one bad line of code shut down UK air traffic for an hour

Shortly before the Christmas vacation break got under way, a single line of bad code at the UK's national air traffic control center left thousands of people grounded for days.

A view from one of NATS' control towers (Image: NATS Holdings)

For just shy of an hour on a relatively quiet Friday afternoon, Britain's airspace fell into turmoil.

The air traffic system had crashed. Dozens of planes were circling overhead, and hundreds of aircraft were grounded at their gates. Tens of thousands of passengers were waiting patiently for news.

What happened? A single line of bad code drove a crucial flight plan system offline.

The problem with the air traffic system is that, as with any other mode of transport, the smallest of problems can cause significant delays. In the skies, that can lead to days of backlogs and other issues. It would take about a week for Britain's major airline hubs to get back up to full speed.

On Monday, the UK Civil Aviation Authority published an interim report into the failure that led to the collapse of the British skies on December 12. Though a full report is expected in mid-May, the interim findings put the root cause almost entirely down to faulty software.

NATS, the aerospace firm that operates the UK's air traffic, suffered its worst public relations day. The firm's chief executive Richard Deakin was quick to quell fears on BBC News that afternoon.

"The problem was when we had additional terminals coming into use, we had a software problem that we haven't seen before," he explained, "which resulted in the computer that looks after flight plans effectively going offline."

"The good news is that everything came back 45 minutes later," he confirmed.

Deakin said the "backup plan went into action," and the skies were "kept safe."

But the stark admission came when he explained that, of the 50 different systems at its main operations center in Swanwick, which run four million lines of code, a single line of that code was to blame.

Claiming the problem had been "rectified," he said it would not recur.

Here's what the Civil Aviation Authority's interim report said:

The flight plans used by an aircraft's pilots are routed to a "system flight server," which has a shared resource limit to prevent its overloading.

The maximum number of so-called "atomic functions," which ensure the right flight plans are sent to the right place, was defined in two places with different values.

One of the controllers pushed the "select sectors" button, which puts the workstation into "watching mode" -- essentially, allowing one workstation to view what's being displayed on another workstation. When this happened, the primary system flight server thought it had more active atomic functions than the hard-coded maximum capacity.

In such an event, the system flight server is designed to shut down to prevent the risk of supplying wrong data to a controller's workstation. (Nobody wants two planes to go off-course or worse, collide in mid-air or crash into the ground.)
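The mismatch the report describes can be illustrated with a small Python sketch. This is not NATS code: the class, the constant names and the numbers are all hypothetical, chosen only to show how a limit defined in two places with different values can let a request through one check and then trip a fail-stop shutdown at the other.

```python
# Hypothetical values; the real limits are not given in this article.
ADMISSION_LIMIT = 4  # limit used when deciding whether to accept a new atomic function
HARD_CODED_MAX = 3   # the "same" limit, defined a second time for the sanity check


class ServerShutdown(Exception):
    """Raised when the server stops itself rather than risk serving wrong data."""


class FlightServer:
    def __init__(self):
        self.active = 0
        self.running = True

    def add_atomic_function(self):
        # First definition of the limit: the request is admitted.
        if self.active >= ADMISSION_LIMIT:
            return False
        self.active += 1
        self._check_consistency()
        return True

    def _check_consistency(self):
        # Second definition of the limit: the count now looks impossible,
        # so the server fails safe by shutting down.
        if self.active > HARD_CODED_MAX:
            self.running = False
            raise ServerShutdown(
                f"{self.active} active atomic functions exceeds maximum {HARD_CODED_MAX}"
            )


server = FlightServer()
for _ in range(3):
    server.add_atomic_function()  # within both limits

try:
    server.add_atomic_function()  # admitted by one limit, rejected by the other
except ServerShutdown as err:
    print("server offline:", err)
```

Defining the limit once and referring to that single definition everywhere is the usual way to rule out this class of bug.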

The backup system flight server, which was running the same code, kicked into action, but the controller put the workstation back in "watching mode," triggering the same error.

The report said that, for "the first time in the history" of the system flight server, both the active and backup systems failed at the same time.
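Why a backup running identical code offers no protection against this kind of fault can be shown with a short Python sketch. Everything here is hypothetical (the Server class, the request names, the failover function); it only assumes that a particular request deterministically shuts down whichever server handles it, and that the request is retried after failover.

```python
class Server:
    """Hypothetical server that shuts down on one specific request type."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def handle(self, request):
        if request == "watching_mode":  # stand-in for the latent bug
            self.running = False
            raise RuntimeError(f"{self.name} shut down")
        return f"{self.name} handled {request}"


def handle_with_failover(servers, request):
    """Try each running server in order; return the first successful result."""
    for server in servers:
        if not server.running:
            continue
        try:
            return server.handle(request)
        except RuntimeError:
            continue  # fail over and retry the same request on the next server
    return None  # every server has failed


servers = [Server("primary"), Server("backup")]
print(handle_with_failover(servers, "flight_plan"))    # primary handles it
print(handle_with_failover(servers, "watching_mode"))  # None: both servers go down
```

Redundancy of this kind guards against independent faults such as hardware failure. A fault in the shared code follows the request to the backup.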

That single line of code, Deakin said, was to blame. He confirmed it had been present in the systems since the 1990s, and said the company is "investing a huge amount" in new technology to bring the systems up to speed with its European counterparts.

"Over the next five years, we're going to be moving towards internet-based systems which are very modern, and much more resilient than the systems we currently use," he explained.