Recently, I described an IT failure at the Los Angeles International Airport that caused 20,000 passengers to be stranded while waiting to be processed through U.S. Customs and Border Protection (CBP). According to the LA Times and InformationWeek, the failure was caused by a simple equipment malfunction, which was then compounded by poor management and lack of planning.
Here are details of what actually happened:
Around 1:30 p.m., CBP experienced problems accessing its database containing information on international travelers. Assuming a wide-area network problem, CBP called Sprint, its carrier, to test the lines. After three fruitless hours of remote testing, Sprint sent technicians on-site. Another three hours passed before Sprint concluded that the transmission lines were not the problem, which meant the fault lay inside CBP's local network. After still more hours of troubleshooting, the issue was finally resolved at 11:45 p.m. The real culprit: a failed router.
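The six-plus hours spent ruling out the carrier before anyone examined the local network is exactly the delay that automated, segment-by-segment health checks are designed to prevent: probe each hop in path order, starting closest to the source, and the failing component surfaces in seconds rather than hours. A minimal sketch of the idea (the hostnames, path, and probe results below are hypothetical illustrations, not CBP's actual topology):

```python
# Sketch: localize a network fault by probing each hop in path order,
# from the local router outward to the remote database host.
# All hostnames below are hypothetical examples.

def check_reachable(host: str) -> bool:
    """Placeholder connectivity probe; a real check might use ping or a TCP connect."""
    # Simulated results for illustration: here, the local router has failed.
    simulated = {
        "local-router": False,
        "wan-gateway": True,
        "carrier-link": True,
        "db-server": True,
    }
    return simulated[host]

def localize_fault(path):
    """Return the first unreachable hop (nearest the source), or None if all respond."""
    for hop in path:
        if not check_reachable(hop):
            return hop
    return None

path = ["local-router", "wan-gateway", "carrier-link", "db-server"]
fault = localize_fault(path)
print(fault)  # the failed local router is flagged immediately, before blaming the carrier
```

Checking hops nearest the source first matters: it would have pointed inside CBP's own network before anyone spent six hours testing Sprint's lines.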
As with most IT meltdowns, this situation has "management systems failure" written all over it.
First, the CBP did not have adequate contingency and backup plans in place. From the LA Times:
"We're concerned about the slow response by customs," said Steve Lott, chief spokesman in North America for the International Air Transport Assn. Although "we understand that computer systems are not perfect, the frustration is why customs had no contingency plan."
Michael Fleming, spokesman in Los Angeles for the U.S. Customs and Border Protection agency, said agency officials worked as quickly as possible.
"We did everything we could," he said. "We certainly weren't expecting something of this magnitude. In the past, if we had a little glitch," the computers "came up right away."
Second, the CBP did not have sufficient IT staff on-call. From InformationWeek:
Fleming could not immediately confirm how many IT personnel were on site at the time of the incident or provide further detail about the specifics of the CBP hardware failure. "Since the incident, we are making sure IT staff are there all the times instead of on-call," he said. "We're making changes to staffing, equipment, and procedure regarding this incident. It's just unacceptable to everyone to have a repeat of this problem."
Think about this: a 24/7 high-volume government agency, completely dependent on real-time technology, has no working backup plan? The agency expected only small "glitches," so that's the scenario for which they planned.
Leaps of faith don't qualify as a legitimate management or contingency plan for handling routine IT problems.
Update 8/18/07: Damon Poeter has a final post-mortem over at ChannelWeb. Turns out it was a bad NIC. Customs is planning network upgrades so the problem doesn't happen again.