LAX IT failure: leaps of faith don't work
Recently, I described an IT failure at the Los Angeles International Airport that caused 20,000 passengers to be stranded while waiting to be processed through U.S. Customs and Border Protection (CBP). According to the LA Times and InformationWeek, the failure was caused by a simple equipment malfunction, which was then compounded by poor management and lack of planning.
Here are details of what actually happened:
Around 1:30 p.m., the CBP experienced problems accessing its database containing information on international travelers. Assuming this to be a wide-area network problem, CBP called Sprint, its carrier, to test the lines. After three fruitless hours of remote testing, Sprint finally sent technicians on-site. Another three hours passed before Sprint finally concluded that transmission lines were not the problem, meaning the problem was inside the CBP local network. After more hours of troubleshooting, the issue was finally resolved at 11:45 p.m. The real culprit: a failed router.
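The sequence above started at the carrier and worked inward. As a rough illustration of the opposite order, here is a minimal sketch that checks each hop from the nearest device outward, so a local fault is found before anyone calls the carrier. The hop names, the `probe` callback, and the function names are all hypothetical; nothing here reflects CBP's actual network or tooling.

```python
from typing import Callable, Optional, Sequence, Tuple

# A hop is (name, is_local). The names used below are hypothetical placeholders.
Hop = Tuple[str, bool]


def first_failing_hop(hops: Sequence[Hop], probe: Callable[[str], bool]) -> Optional[Hop]:
    """Return the first hop, nearest first, that fails the probe, or None if all pass."""
    for hop in hops:
        name, _is_local = hop
        if not probe(name):
            return hop
    return None


def who_to_call(hops: Sequence[Hop], probe: Callable[[str], bool]) -> str:
    """Say whether the first fault sits inside the local network or with the carrier."""
    failed = first_failing_hop(hops, probe)
    if failed is None:
        return "no fault found"
    name, is_local = failed
    return f"local IT: {name}" if is_local else f"carrier: {name}"


if __name__ == "__main__":
    # Assumed path from an inspection desk to the remote database.
    path = [
        ("desk-switch", True),
        ("core-router", True),
        ("wan-link", False),
        ("database", False),
    ]
    # With the router down, everything behind it also looks unreachable.
    unreachable = {"core-router", "wan-link", "database"}
    print(who_to_call(path, lambda name: name not in unreachable))
```

Because the walk stops at the first failing hop, a dead local router is reported as a local problem even though the WAN link and database behind it also appear unreachable.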
As with most IT meltdowns, this situation has "management systems failure" written all over it.
First, the CBP did not have adequate contingency and backup plans in place. From the LA Times:
"We're concerned about the slow response by customs," said Steve Lott, chief spokesman in North America for the International Air Transport Assn. Although "we understand that computer systems are not perfect, the frustration is why customs had no contingency plan."
Michael Fleming, spokesman in Los Angeles for the U.S. Customs and Border Protection agency, said agency officials worked as quickly as possible.
"We did everything we could," he said. "We certainly weren't expecting something of this magnitude. In the past, if we had a little glitch," the computers "came up right away."
Second, the CBP did not have sufficient IT staff on-call. From InformationWeek:
Fleming could not immediately confirm how many IT personnel were on site at the time of the incident or provide further detail about the specifics of the CBP hardware failure. "Since the incident, we are making sure IT staff are there all the times instead of on-call," he said. "We're making changes to staffing, equipment, and procedure regarding this incident. It's just unacceptable to everyone to have a repeat of this problem."
Think about this: a 24/7 high-volume government agency, completely dependent on real-time technology, has no working backup plan? The agency expected only small "glitches," so that's the scenario for which they planned.
Leaps of faith don't qualify as a legitimate management or contingency plan for handling routine IT problems.
Update 8/18/07: Damon Poeter has a final post-mortem over at ChannelWeb. Turns out it was a bad NIC card. Customs is planning network upgrades so the problem doesn't happen again.
Talkback
How could they not tell
The only way to miss this is to have no monitoring going at all. Then you'd be stuck trying to figure out which router went down and where it is.
This isn't LAX IT, it's incompetent IT.
How do you not check that first?
Architecture Again
(2) They should have had automated monitoring and diagnostics
For some reason, someone took technical shortcuts, and this is the result.
Attributing it to insufficient planning or on-call staff is like blaming the landing gear for a plane crash that was caused by engine failure, just because the landing gear also happened not to lower.
The management mistakes involved here were made permanent long before the incident actually happened. Adding plans and staff now will just add cost. It doesn't fix the root cause of the problem.
Telcos have no business...
This problem should have been dealt with by an IT specialist before Sprint was even contacted.
Telcos have been responsible for screwing up many networks by going where they have no business going.
Here is the real problem
I work for the State and for years our IT department has been asking for money to have a couple of extra routers, switches, keyboards, and other equipment on hand for when something fails.
Management has denied these requests, stating that the money can be used for other projects. When a piece of equipment dies, then and only then will money be available to replace it.
I cannot tell you how many routers and switches we have lost, or how much productivity we have lost along with them.
That is the problem. Management does NOT want to spend money until there is a problem.
I would rather spend money to prevent a problem.
Did you read your own post?
Now, put two and two together. You're government and money needs to be spent. It's going to get spent on things that buy votes, not you.
That's why Roads to Nowhere get built and bridges collapse.
One problem though
I'm going to take a guess here...
First, I have no idea what the computing infrastructure looks like, what their database platform is, etc.
However, I'm going to conclude that they are using Microsoft SQL Server. Reason: Microsoft's marketing strategy is that you can use underskilled, lower-paid folks to run their wares. When a problem arises, this idea doesn't work out in real life as well as it does on paper.
I know - nobody mentioned Microsoft, and I'm sorry for bringing it up - BUT I wanted to point out that 'real' IT professionals are getting harder and harder to come by. And with the avalanche of Microsoft products and Microsoft philosophy being introduced in the data center, we're drowning in a sea of mediocrity.
Without fail - whenever I've worked with a 'softie (a tech whose only exp. lies with MS products), they invariably have a tenuous grasp on what is going on and usually have very poor problem solving skills.
-Mike
And a
It's both technical and management
Isn't it obvious that the issues stem from problems on both the technical and management sides?
This kind of thing happens all the time, in both government and private industry. We hear it more on government projects because they are less able to hide it than private firms.
Technical Management
Manual operations, additional staff, and intricate plans are all ineffectual management bandaids used to give the appearance of decisive action and provide the illusion of security.
Disaster Plan?
Surely there is truth in what you say
You guys got it all wrong..
Router failure "a disaster?"
Disasters are events like hurricanes and earthquakes knocking out both the main power and the backup generators.
But you're right. I'm assuming that a security system at a major international airport should be on the same availability level as, say, an ERP system at a major company. Determining availability requirements, especially for non-line-of-business systems, is hard.
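To put a rough number on "availability level", here is a small sketch that converts an availability target into a yearly downtime budget and compares it with the outage described in the article (1:30 p.m. to 11:45 p.m., about 10.25 hours). The targets shown are common illustrative tiers, not anything CBP or any ERP vendor has published, and the function names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability: float) -> float:
    """Yearly downtime budget, in hours, for a target such as 0.999."""
    return HOURS_PER_YEAR * (1.0 - availability)


def achieved_availability(downtime_hours: float) -> float:
    """Fraction of the year a system was up, given its total downtime in hours."""
    return 1.0 - downtime_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    outage_hours = 10.25  # 1:30 p.m. to 11:45 p.m., per the article
    print(f"availability if this were the only outage: {achieved_availability(outage_hours):.4%}")
    for target in (0.99, 0.999, 0.9999):  # illustrative tiers only
        budget = allowed_downtime_hours(target)
        verdict = "within budget" if outage_hours <= budget else "over budget"
        print(f"{target:.2%} target allows {budget:.2f} h/year: {verdict}")
```

A 99.9% target allows about 8.76 hours of downtime a year, so this one incident alone would have used up more than a full year's budget at that level.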