LAX IT failure: leaps of faith don't work


Recently, I described an IT failure at the Los Angeles International Airport that caused 20,000 passengers to be stranded while waiting to be processed through U.S. Customs and Border Protection (CBP). According to the LA Times and InformationWeek, the failure was caused by a simple equipment malfunction, which was then compounded by poor management and lack of planning.

Here are details of what actually happened:

Around 1:30 p.m., the CBP experienced problems accessing its database containing information on international travelers. Assuming this to be a wide-area network problem, CBP called Sprint, its carrier, to test the lines. After three fruitless hours of remote testing, Sprint finally sent technicians on-site. Another three hours passed before Sprint concluded that transmission lines were not the problem, meaning the fault was inside the CBP local network. After more hours of troubleshooting, the issue was finally resolved at 11:45 p.m. The real culprit: a failed router.
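The sequence above suggests a simple rule: check the nearest equipment first and work outward, so the carrier only gets called once the local network has been ruled out. A minimal sketch of that idea follows. The hop labels, the documentation-range addresses, and the helper names `first_failing_hop` and `tcp_probe` are all hypothetical, invented for illustration; nothing here describes the CBP's actual network.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Hop = Tuple[str, str]  # (label, address)

# Hypothetical checkpoints, ordered from nearest to farthest.
# The addresses come from reserved documentation ranges.
HOPS: Sequence[Hop] = (
    ("local router", "192.0.2.1"),
    ("carrier edge", "198.51.100.1"),
    ("remote database", "203.0.113.10"),
)


def tcp_probe(address: str, port: int = 22, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to address:port succeeds.

    The port is an assumption; use whatever service the device exposes.
    """
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_failing_hop(
    hops: Sequence[Hop], probe: Callable[[str], bool]
) -> Optional[str]:
    """Return the label of the nearest hop that fails the probe.

    Returns None when every hop responds.
    """
    for label, address in hops:
        if not probe(address):
            return label
    return None
```

Calling `first_failing_hop(HOPS, tcp_probe)` walks the list in order and stops at the first device that does not answer. Passing the probe in as a parameter keeps the ordering logic testable without touching a real network.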

As with most IT meltdowns, this situation has "management systems failure" written all over it.

First, the CBP did not have adequate contingency and backup plans in place. From the LA Times:

"We're concerned about the slow response by customs," said Steve Lott, chief spokesman in North America for the International Air Transport Assn. Although "we understand that computer systems are not perfect, the frustration is why customs had no contingency plan."

Michael Fleming, spokesman in Los Angeles for the U.S. Customs and Border Protection agency, said agency officials worked as quickly as possible.

"We did everything we could," he said. "We certainly weren't expecting something of this magnitude. In the past, if we had a little glitch," the computers "came up right away."

Second, the CBP did not have sufficient IT staff on-call. From InformationWeek:

Fleming could not immediately confirm how many IT personnel were on site at the time of the incident or provide further detail about the specifics of the CBP hardware failure. "Since the incident, we are making sure IT staff are there all the times instead of on-call," he said. "We're making changes to staffing, equipment, and procedure regarding this incident. It's just unacceptable to everyone to have a repeat of this problem."

Think about this: a 24/7 high-volume government agency, completely dependent on real-time technology, has no working backup plan? The agency expected only small "glitches," so that's the scenario for which they planned.

Leaps of faith don't qualify as a legitimate management or contingency plan for handling routine IT problems.

Update 8/18/07: Damon Poeter has a final post-mortem over at ChannelWeb. Turns out it was a bad NIC card. Customs is planning network upgrades so the problem doesn't happen again.

Topics: Telcos, Data Management


Talkback

15 comments
  • How could they not tell

    A router is down. How do they not notice that? The security guys should have caught that right away. Assuming they have no security, the network guys should have caught it too.

    The only way to miss this is to have no monitoring going at all. Then you'd be stuck trying to figure out which router went down and where it is.

    This isn't LAX IT, it's incompetent IT.
    voska
  • How do you not check that first?

    Even without network monitoring, how do you not think to check your router before calling the WAN circuit provider? Even more so, how do you not have redundant routers in place to avoid a loss of connectivity? Even planning for only small problems should account for hardware failure, which is a common occurrence.
    jfp
  • Architecture Again

    (1) They should have had redundancy
    (2) They should have had automated monitoring and diagnostics

    For some reason, someone took technical shortcuts, and this is the result.

    Attributing it to insufficient planning or on-call staff is like blaming the landing gear for a plane crash that was caused by engine failure, just because the landing gear also happened not to lower.

    The management mistakes involved here were made permanent long before the incident actually happened. Adding plans and staff now will just add cost. It doesn't fix the root cause of the problem.
    Erik Engbrecht
  • Telcos have no business...

    troubleshooting anything except the connection to the DSL, T1, or other broadband device. I have yet to meet a telco technician who can do anything but plug in the modem or smart switch. They have no idea how to even configure their own equipment.

    This problem should have been dealt with by an IT specialist before Sprint was even contacted.

    Telcos have been responsible for screwing up many networks by going where they have no business going.
    bjbrock
  • Here is the real problem

    Companies DO NOT want to spend the extra money on backup equipment.

    I work for the State and for years our IT department has been asking for money to have a couple of extra routers, switches, keyboards, and other equipment on hand for when something fails.

    The management has denied these requests, stating that the money can be used for other projects. When any peripherals die, then and only then will money be available to replace them.

    I cannot tell you how many routers and switches we have lost, along with the amount of lost productivity.

    That is the problem. Management does NOT want to spend money until there is a problem.

    I would rather spend money to prevent a problem.
    BroGnorik
    • Did you read your own post?

      First, you work for the state, not a company. So the problem is government. Second, it's not that they didn't want to spend the money, it's that they didn't want to spend the money on YOU.

      Now, put two and two together. You're government and money needs to be spent. It's going to get spent on things that buy votes, not you.

      That's why Roads to Nowhere get built and bridges collapse.
      frgough
      • One problem though

        Systems that are down a lot don't buy you votes. So if an underfunded IT department in government is causing voters grief who will they blame?
        voska
  • I'm going to take a guess here...

    I should know better by now, but I'm going to say this anyway:

    First, I have no idea what the computing infrastructure looks like, what their database platform is, etc.

    However, I'm going to conclude that they are using Microsoft SQL Server. Reason: Microsoft's marketing strategy is that you can use underskilled, lower-paid folks to run their wares. When a problem arises, this idea doesn't work out in real life as well as it does on paper.

    I know - nobody mentioned Microsoft, and I'm sorry for bringing it up - BUT I wanted to point out that 'real' IT professionals are getting harder and harder to come by. And with the avalanche of Microsoft products and Microsoft philosophy being introduced in the data center, we're drowning in a sea of mediocrity.

    Without fail - whenever I've worked with a 'softie (a tech whose only exp. lies with MS products), they invariably have a tenuous grasp on what is going on and usually have very poor problem solving skills.

    -Mike
    SpikeyMike
    • And a

      Spade is a spade and a deuce a deuce. Good call! ]:)
      Linux User 147560
  • It's both technical and management

    Hey Guys,

    Isn't it obvious that the issues stem from problems on both the technical and management sides?

    This kind of thing happens all the time, in both government and private industry. We hear it more on government projects because they are less able to hide it than private firms.
    mkrigsman@...
    • Technical Management

      Yes, the problem is management's fault, because management either failed to build a decent technical team or failed to listen to the technical team they built.

      Manual operations, additional staff, and intricate plans are all ineffectual management bandaids used to give the appearance of decisive action and provide the illusion of security.
      Erik Engbrecht
  • Disaster Plan?

    Without knowing the details I would speculate that they had absolutely no disaster recovery plan, and they followed it to the letter. Most of the posts here suggest a need for more upskilled IT employees and improved technical management, and this would likely help. I would offer that they need to work the problem "backwards" by identifying what the minimal acceptable system availability needs to be. Once you lock in on that (usually not as easy as it seems) you can then plan the systems, hardware and WAN/LAN redundancy, personnel response times, etc. They really need to plan the work, then work the plan.
    rd55127@...
    • Surely there is truth in what you say

      However, something did go very wrong, and personally I think there was no excuse for it. Think of all the trouble this caused: to the passengers, airlines, police, emergency crews, etc. And all because a NIC card failed.
      mkrigsman@...
    • You guys got it all wrong..

      This wasn't about a router failing, or how long it took to troubleshoot a problem. This was a failure that occurred years earlier in a boardroom when they were developing the operation guidelines for the system. Apparently, they never thought about how to handle business if their system failed. This is an operational policy failure. All systems go down; if that stops your business dead, that's a policy failure.
      gurg13
    • Router failure "a disaster?"

      Hardware fails. Period. In a high-availability system a hardware failure is NOT a disaster, it's an expected event that will be automatically handled.

      Disasters are events like hurricanes and earthquakes knocking out both the main power and the backup generators.

      But you're right. I'm assuming that a security system at a major international airport should be at the same availability level as, say, an ERP system at a major company. Determining availability requirements, especially for non-line-of-business systems, is hard.
      Erik Engbrecht
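
The thread's suggestion to work "backwards" from a minimum acceptable availability can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the 99% and 99.9% figures are assumptions, not numbers from the CBP, and the redundancy formula assumes units fail independently, which real hardware sharing power or cabling often does not.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(target: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return (1.0 - target) * HOURS_PER_YEAR


def parallel_availability(single: float, units: int) -> float:
    """Availability of `units` redundant devices, any one of which suffices.

    Assumes failures are independent (an idealization).
    """
    return 1.0 - (1.0 - single) ** units


def units_needed(single: float, target: float, max_units: int = 10) -> int:
    """Smallest number of redundant units that meets the target."""
    for units in range(1, max_units + 1):
        if parallel_availability(single, units) >= target:
            return units
    raise ValueError("target not reachable within max_units")
```

Under these assumptions, a 99.9% target allows under nine hours of downtime a year, so the roughly ten-hour outage described in the article would exceed that budget on its own. A single device at a hypothetical 99% availability falls short of that target, while a redundant pair meets it on paper.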