LAX IT failure: leaps of faith don't work

By | August 15, 2007, 4:55am PDT

Recently, I described an IT failure at the Los Angeles International Airport that caused 20,000 passengers to be stranded while waiting to be processed through U.S. Customs and Border Protection (CBP). According to the LA Times and InformationWeek, the failure was caused by a simple equipment malfunction, which was then compounded by poor management and lack of planning.

Here are details of what actually happened:

Around 1:30 p.m., the CBP experienced problems accessing its database containing information on international travelers. Assuming this to be a wide-area network problem, CBP called Sprint, its carrier, to test the lines. After three fruitless hours of remote testing, Sprint finally sent technicians on-site. Another three hours passed before Sprint finally concluded that transmission lines were not the problem, meaning the problem was inside the CBP local network. After more hours of troubleshooting, the issue was finally resolved at 11:45 p.m. The real culprit: a failed router.
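
The order of those checks is the heart of the story. As a minimal sketch, assuming a hypothetical network path listed from the workstation outward (the hop names and the `probe` callable below are illustrative and are not CBP's actual topology), an inside-out check tests local equipment first and escalates to the carrier only when every local hop responds:

```python
from typing import Callable, Optional, Sequence


def first_failing_hop(hops: Sequence[str], probe: Callable[[str], bool]) -> Optional[str]:
    """Return the nearest hop that fails the probe, or None if all respond.

    `hops` must be ordered nearest-first, so a local fault is reported
    before anything farther out is even tested.
    """
    for hop in hops:
        if not probe(hop):
            return hop
    return None


def next_action(hops: Sequence[str], local_hops: set, probe: Callable[[str], bool]) -> str:
    """Decide who should troubleshoot, based on where the first failure is."""
    failed = first_failing_hop(hops, probe)
    if failed is None:
        return "no fault found on path"
    if failed in local_hops:
        return f"fix locally: {failed}"
    return f"call carrier about: {failed}"


# Hypothetical path from a workstation to a remote database.
PATH = ["local-switch", "local-router", "carrier-edge", "remote-db"]
LOCAL = {"local-switch", "local-router"}

# Simulated probe: the local router is dead, so everything past it is unreachable too.
dead = {"local-router", "carrier-edge", "remote-db"}
print(next_action(PATH, LOCAL, lambda hop: hop not in dead))
```

In a real deployment `probe` might wrap a ping or an SNMP query; it is simulated here so the decision logic can be run anywhere.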

As with most IT meltdowns, this situation has “management systems failure” written all over it.

First, the CBP did not have adequate contingency and backup plans in place. From the LA Times:

“We’re concerned about the slow response by customs,” said Steve Lott, chief spokesman in North America for the International Air Transport Assn. Although “we understand that computer systems are not perfect, the frustration is why customs had no contingency plan.”

Michael Fleming, spokesman in Los Angeles for the U.S. Customs and Border Protection agency, said agency officials worked as quickly as possible.

“We did everything we could,” he said. “We certainly weren’t expecting something of this magnitude. In the past, if we had a little glitch,” the computers “came up right away.”

Second, the CBP did not have sufficient IT staff on-call. From InformationWeek:

Fleming could not immediately confirm how many IT personnel were on site at the time of the incident or provide further detail about the specifics of the CBP hardware failure. “Since the incident, we are making sure IT staff are there all the times instead of on-call,” he said. “We’re making changes to staffing, equipment, and procedure regarding this incident. It’s just unacceptable to everyone to have a repeat of this problem.”

Think about this: a 24/7 high-volume government agency, completely dependent on real-time technology, has no working backup plan? The agency expected only small “glitches,” so that’s the scenario for which they planned.

Leaps of faith don’t qualify as a legitimate management or contingency plan for handling routine IT problems.

Update 8/18/07: Damon Poeter has a final post-mortem over at ChannelWeb. Turns out it was a bad NIC card. Customs is planning network upgrades so the problem doesn’t happen again.

Michael Krigsman is a recognized authority on the causes and prevention of IT failures.

Disclosure

Michael Krigsman

Michael Krigsman writes and speaks about technology in a manner that most observers consider to be fair and balanced. Michael believes that writing about IT failures, which often have complex causes, creates a unique obligation to be reasonable and accurate in both reporting and analysis.

Michael maintains active personal and professional relationships with enterprise technology buyers, vendors, analyst firms (or individual analysts), consultants, and system integrators. As CEO of Asuret, Michael sells and delivers paid services to members of these same groups.

Vendors regularly reimburse Michael's out-of-pocket travel expenses to attend industry conferences and events. Conference organizers frequently waive entry fees when Michael attends industry events. Michael often speaks at industry conferences and events.

He is a member of the Enterprise Irregulars, a loose association of consultants, investors, industry representatives, analysts, and users of enterprise software.

For daily updates on Michael's activities, follow him on Twitter.

Biography

Michael Krigsman is CEO of Asuret, Inc., a consulting company dedicated to reducing technology implementation failures. Asuret's suite of software tools improves the success rate of enterprise software deployments by quantifying and measuring governance issues that cause most project failures. Michael led the research effort underlying Asuret's model of collective intelligence and its practical application to reducing IT failures in consulting environments. He is a recognized authority on the causes and prevention of IT failures and is frequently quoted in the press on IT project and related CIO issues. He is considered an enterprise software industry "influencer" and provides advice to technology buyers, vendors, and services firms.

Previously, Michael served as CEO of Cambridge Publications, which develops tools and processes for software implementations and related business practice automation projects. Michael has been involved with hundreds of software development projects, for companies ranging from small startups to Fortune 500 organizations. Michael graduated with an M.B.A. from Boston University and a B.A. from Bard College. He is a Board member of the America's Cup Hall of Fame and the Herreshoff Marine Museum in Bristol, RI.

Talkback Most Recent of 15 Talkback(s)

  • How could they not tell
    A router is down. How do they not notice that? The security guys should have caught that right away. Assuming they have no security, the network guys should have caught it too.

    The only way to miss this is to have no monitoring going at all. Then you'd be stuck trying to figure out which router went down and where it is.

    This isn't LAX IT, it's incompetent IT.
    voska
    15th Aug 2007
  • How do you not check that first?
    Even without network monitoring, how do you not think to check your router before calling the WAN circuit provider? Even more so, how do you not have redundant routers in place to avoid a loss of connectivity? Even planning for only small problems should account for hardware failure, which is a common occurrence.
    jfp
    15th Aug 2007
  • Architecture Again
    (1) They should have had redundancy
    (2) They should have had automated monitoring and diagnostics

    For some reason, someone took technical shortcuts, and this is the result.

    Attributing it to insufficient planning or on-call staff is like blaming a plane crash caused by engine failure where the landing gear happened to not lower on the landing gear.

    The management mistakes involved here were made permanent long before the incident actually happened. Adding plans and staff now will just add cost. It doesn't fix the root cause of the problem.
    Erik Engbrecht
    15th Aug 2007
  • Telcos have no business...
    troubleshooting anything except the connection to the dsl, T1, or other broadband device. I have yet to meet a telco technician that can do anything but plug up the modem or smart switch. They have no idea how to even configure their own equipment.

    This problem should have been dealt with by an IT specialist before Sprint was even contacted.

    Telcos have been responsible for screwing up many networks by going where they have no business going.
    bjbrock
    15th Aug 2007
  • Here is the real problem
    Companies DO NOT want to spend the extra money on backup equipment.

    I work for the State and for years our IT department has been asking for money to have a couple of extra routers, switches, keyboards, and other equipment on hand for when something fails.

    The management has denied these requests, stating that the money can be used for other projects. When any peripherals die, then and only then will money be available to replace them.

    I cannot tell you how many routers and switches we have lost, along with the amount of lost productivity.

    That is the problem. Management does NOT want to spend money until there is a problem.

    I would rather spend money to prevent a problem.
    BroGnorik
    15th Aug 2007
  • Did you read your own post?
    First, you worked for the state, not a company. So the problem is government. Secondly it's not that they didn't want to spend the money, it's that they didn't want to spend the money on YOU.

    Now, put two and two together. You're government and money needs to be spent. It's going to get spent on things that buy votes, not you.

    That's why Roads to Nowhere get built and bridges collapse.
    frgough
    15th Aug 2007
  • One problem though
    Systems that are down a lot don't buy you votes. So if an underfunded IT department in government is causing voters grief who will they blame?
    voska
    15th Aug 2007
  • I'm going to take a guess here...
    I should know better by now, but I'm going to say this anyway:

    First, I have no idea what the computing infrastructure looks like, what their database platform is, etc.

    However, I'm going to conclude that they are using Microsoft SQL Server. Reason: Microsoft's marketing strategy is that you can use underskilled, lower-paid folks to run their wares. When a problem arises, this idea doesn't work out in real life as well as it does on paper.

    I know - nobody mentioned Microsoft, and I'm sorry for bringing it up - BUT I wanted to point out that 'real' IT professionals are getting harder and harder to come by. And with the avalanche of Microsoft products and Microsoft philosophy being introduced in the data center, we're drowning in a sea of mediocrity.

    Without fail - whenever I've worked with a 'softie (a tech whose only exp. lies with MS products), they invariably have a tenuous grasp on what is going on and usually have very poor problem solving skills.

    -Mike
    SpikeyMike
    15th Aug 2007
  • And a
    Spade is a spade and a deuce a deuce. Good call!
    Linux User 147560
    15th Aug 2007
  • ZDNet Blogger

    It's both technical and management
    Hey Guys,

    Isn't it obvious that the issues stem from problems on both the technical and management sides?

    This kind of thing happens all the time, in both government and private industry. We hear it more on government projects because they are less able to hide it than private firms.
    mkrigsman@...
    15th Aug 2007
  • Technical Management
    Yes, the problem is management's fault, because management either failed to build a decent technical team or failed to listen to the technical team they built.

    Manual operations, additional staff, and intricate plans are all ineffectual management bandaids used to give the appearance of decisive action and provide the illusion of security.
    Erik Engbrecht
    15th Aug 2007
  • Disaster Plan?
    Without knowing the details I would speculate that they had absolutely no disaster recovery plan, and they followed it to the letter. Most of the posts here suggest a need for more upskilled IT employees and improved technical management, and this would likely help. I would offer that they need to work the problem "backwards" by identifying what the minimal acceptable system availability needs to be. Once you lock in on that (usually not as easy as it seems) you can then plan the systems, hardware and WAN/LAN redundancy, personnel response times, etc. They really need to plan the work, then work the plan.
    rd55127@...
    15th Aug 2007
  • ZDNet Blogger

    Surely there is truth in what you say
    However, something did go very wrong, and personally I think there was no excuse for it. Think of all the trouble this caused: to the passengers, airlines, police, emergency crews, etc. And all because a NIC card failed.
    mkrigsman@...
    15th Aug 2007
  • You guys got it all wrong..
    This wasn't about a router failing, or how long it took to troubleshoot a problem. This was a failure that occurred years earlier in a boardroom when they were developing the operation guidelines for the system. Apparently, they never thought about how to handle business if their system failed. This is an operational policy failure. All systems go down; if it stops your business dead, that's a policy failure.
    gurg13
    15th Aug 2007
  • Router failure "a disaster?"
    Hardware fails. Period. In a high-availability system a hardware failure is NOT a disaster, it's an expected event that will be automatically handled.

    Disasters are events like hurricanes and earthquakes knocking out both the main power and the backup generators.

    But you're right. I'm assuming that a security system at a major international airport should be on the same availability level as, say, an ERP system at a major company. Determining availability requirements, especially for non-line-of-business systems, is hard.
    Erik Engbrecht
    16th Aug 2007
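
Several of the comments above come back to redundancy and to working backwards from an availability target. As a rough back-of-the-envelope sketch, assuming independent failures and a purely hypothetical 99% availability for a single router (neither assumption comes from the incident reports), the arithmetic looks like this:

```python
HOURS_PER_YEAR = 24 * 365


def parallel_availability(single: float, count: int) -> float:
    """Availability of `count` redundant units, assuming independent failures.

    The group is down only when every unit is down at the same time.
    """
    return 1 - (1 - single) ** count


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


for routers in (1, 2):
    a = parallel_availability(0.99, routers)  # 0.99 is an assumed, illustrative figure
    print(f"{routers} router(s): {a:.4%} available, "
          f"{downtime_hours_per_year(a):.1f} h/year down")
```

Real failures are rarely fully independent (shared power, shared configuration, shared staff), so the figure for two routers should be read as an optimistic upper bound rather than a prediction.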
