Last week's technology failure at a major FAA facility caused air traffic delays throughout the country and highlighted the agency's poor computing practices. Unlike major corporations and utilities, the FAA operates its air traffic control system with minimal redundancy using a "fix-on-fail" policy.
Redundancy is the foundational concept behind business continuity planning (BCP), which involves creating logistical and operating plans designed to take effect after a major disaster or critical infrastructure disruption. According to the Associated Press, the FAA maintains less redundancy than water or power utilities:
Redundancy is so critical for power and water utilities that they can be fined hundreds of thousands of dollars a day if they're found insufficiently prepared — and $1 million per day if they're found to be willfully negligent.
"If this (FAA outage) happened at a power plant," [according to security researcher, Jason Larsen,] "I'd be telling them to open up their checkbook and expect to be fined."
The Associated Press article points out pitfalls of the fix-on-fail policy:
"[I]t's the whole `don't fix it if it ain't broke' thing," said Branden Williams, director of a unit of VeriSign Inc. that assesses the security of retailers' payment systems. "It's unfortunate because it's very reactive, and it typically winds up costing you more. If you do fix-on-fail, it usually costs you more."
The AltuisIT blog discusses this same issue:
To reduce their total cost of ownership, industry-leading organizations know that IT systems need to be properly managed and maintained. The “Fix on Fail” approach to systems management results in employee frustration, missed deadlines, increased costs, and lower levels of customer service.
THE PROJECT FAILURES ANALYSIS
The FAA must manage its resources and infrastructure within strict budget limitations. By implementing a fix-on-fail policy, which the agency must have decided on years ago, the FAA made three bets:
- Passenger safety would not be jeopardized
- The system would not likely fail on a regular basis
- Taxpayers would not accept the costs associated with greater redundancy
In other words, sometime in the past, the agency decided the hassles and risks of the current system were acceptable, given the high cost of alternative policies.
The current situation has focused attention on the FAA and its technology policies. The Wall Street Journal reports the agency is currently engaged in a massive system upgrade, although the article doesn't provide much detail:
The Federal Aviation Administration said it is overhauling an error-prone computer system that caused hundreds of delayed flights Tuesday.
The system is part of the aging infrastructure that guides air traffic, which the FAA has been trying to update to reduce chronic delays.
Although the agency must work within a limited budget relative to its large mandate, one wonders whether sufficiently good judgment and good practice are being applied to FAA technology decisions.
It's important to note the FAA consistently states that passenger safety is not compromised by its computing practices.
As an aside, here's an interesting FAA-related story.
Some years ago, I happened to drive by the Boston air traffic control center for the northeast, which is located in Nashua, NH. Being an inquisitive and rather geeky fellow, I pulled up to the main gate and asked the guard for a tour. To my absolute amazement, he phoned someone on the air traffic control floor, who promptly arrived and took me inside. I spent the next hour observing air traffic controllers at work and listening to their conversations with planes.
The place looked like a movie set and was darn cool. Unfortunately, in a post-9/11 world such impromptu visits will never, ever happen again.