Between the Lines

Larry Dignan, Andrew Nusca and Rachel King

Virginia's IT outage doesn't pass management sniff test

By Larry Dignan | August 30, 2010, 9:44am PDT

Summary: It has been a rough few days for anyone interacting with the state of Virginia following an IT outage that affected 26 state agencies. Can a storage area networking failure really cripple a state’s IT systems?

The outage in Virginia’s IT infrastructure, which is managed by Northrop Grumman, has prompted statements from several agencies. Notably, Virginia’s Department of Motor Vehicles hasn’t been able to process requests for licenses and ID cards. These systems are supposed to be up and running on Tuesday, six days after the outages started to appear.

Meanwhile, the Virginia Information Technologies Agency (VITA) said in a statement that teams have been working throughout the weekend to restore data. In a nutshell, the IT infrastructure of the state of Virginia was reportedly crushed by an EMC storage area network failure. Specifically, the EMC DMX-3 was behind the hardware failure. The Richmond Times-Dispatch reports that several systems are still down. The same paper said that Northrop Grumman will have to pay a fine for the failure. And the real kicker is that Virginia recently revised its contract with Northrop Grumman and extended the deal for three years. The state paid an additional $236 million for better service from Northrop Grumman.

Needless to say, Virginia residents aren’t pleased. We’ve received a few emails and calls, and the comments on the Richmond Times-Dispatch site are summed up by this one:

Highlights of the Revised Contract

Operational Efficiencies

Consolidates and strengthens Performance Level Standards with a 15% increase in penalties across the board if Northrop Grumman fails to perform on clearly identified and measured performance standards. - PAY-UP

Improves Incident Response teams to determine technology failures and expedite repair - FAILED

Institutes clear performance measurements for Northrop Grumman that agencies can easily track - FAILED

Adds new services to contract such as improved disaster recovery and enhanced security features - FAILED

Among the key parts of the VITA statement:

  • Successful repair to the storage system hardware is complete, and all but three or possibly four agencies out of the 26 agency systems have been restored. Agencies continue to perform verification testing.
  • Progress continues, but work is not yet complete for the three or four agencies that have some of the largest and most complex databases. These databases make the restoration process extremely time consuming. The unfortunate result is the agencies will not be able to process some customer transactions until additional testing and validation are complete.
  • According to the manufacturer of the storage system (EMC), the events that led to the outage appear to be unprecedented. The manufacturer reports that the system and its underlying technology have an exemplary history of reliability, industry-leading data availability of more than 99.999% and no similar failure in one billion hours of run time.
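To put the quoted availability figure in perspective, the arithmetic can be sketched in a few lines of Python. This is only a rough illustration: it assumes availability is measured over a 365-day year of continuous operation, and it treats the outage as a flat six days, the figure cited above for the DMV. The vendor's 99.999% is a claim about the product line, not a measurement of Virginia's installation, so the comparison is indicative at best. The function names are hypothetical.

```python
# Rough downtime budget implied by an availability percentage.
# Assumption: a 365-day year of continuous operation.
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR * 60


def availability_after_outage(outage_hours: float) -> float:
    """Availability (percent) for a year containing one outage of this length."""
    return (1 - outage_hours / HOURS_PER_YEAR) * 100


five_nines_budget = allowed_downtime_minutes(99.999)
six_day_result = availability_after_outage(6 * 24)

print(f"99.999% allows about {five_nines_budget:.2f} minutes of downtime per year")
print(f"A six-day outage leaves about {six_day_result:.2f}% availability for that year")
```

Under these assumptions, "five nines" leaves a budget of a little over five minutes a year, while a six-day outage puts that year's availability a bit above 98 percent.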

The official explanation for the outage leaves a bit to be desired and frankly doesn’t pass the sniff test. The outage was blamed on the failure of two circuit boards installed and maintained by EMC. EMC said it couldn’t comment on the outage and that the state of Virginia and Northrop Grumman were taking the lead on messaging.

Simply put, it’s a bit disconcerting that two circuit boards can bring down a state’s IT infrastructure for nearly a week. Talk about a lack of redundancy.

Among the things that don’t add up in the Virginia IT outage:

  • Why wouldn’t these boards be replaced quickly?
  • Why was there a single point of failure?
  • According to the Washington Post, service was restored for 16 agencies, but 10 require “a lengthy restoration of data.” Where was the disaster planning? After all, Northrop Grumman touted its disaster recovery for the state just two years ago.
  • Where did the IT management fail?

We’re told that Northrop Grumman knows about its IT management issues and is working on correcting the problems. Northrop Grumman was awarded a $2.3 billion IT services contract in 2005. And the company has touted some of the state’s successes. Meanwhile, Northrop Grumman even relocated to Virginia. Hopefully, that proximity will lead to better IT management.

Update: There are a lot of good comments in the talkbacks below, but one IT worker in the U.S. government sent me one via email that captures the questions about this incident well. He wrote:

What has happened to multiple printed copies of disaster recovery documentation & SOPs (standard operating procedures)? What has happened to mandatory random data restoration, accessing & printing that data? What has happened to viewing IT sales vendors the same way an intelligent person looks at a used car salesman? What has happened to triple redundancy on multiple hardware & software platforms? I have worked for both D.C. and U.S. Federal government for over 20 years and we have never lost any data. Why? As above: mandatory random data restoration to check data backup hardware, software & most especially whatever medium is chosen, disc to tape or disc to disc, with printed reports from that data time stamped and signed by multiple teams that verify the data backups and restoration, which is part of their performance reviews. Someone wasn’t doing test data restoration and verification from multiple discs at random intervals. Yes it costs a little more to do this but I bet it isn’t even .001% of what this multiple day outage has cost.

When these IT backup vendors come onsite, have guts, stand up for yourself & demand they show you real time data restoration and verification with data time stamped, printed & checked from these data restorations.

Don’t be dazzled by backup speed, it’s not worth a darn if you can’t retrieve the data.

It’s all about being able to RESTORE and VERIFY data from your backup medium and being able to access, use and print it in as timely a manner as humanly possible. One of your family members or loved ones at some point may depend upon RESTORATION and VERIFICATION of data for their very lives, especially if it is Health Care or Transportation data, to name just a few.
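The random restore test the reader describes can be sketched roughly as follows. This is a minimal illustration, not a description of how Virginia, Northrop Grumman or any backup product actually works: the function names are hypothetical, the "restore" is simulated by writing files into a second directory, and the check simply compares SHA-256 hashes of a random sample of files.

```python
import hashlib
import random
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_random_restores(original_dir: Path, restored_dir: Path,
                           sample_size: int, seed=None) -> list[str]:
    """Return sampled relative paths that are missing or differ after a restore."""
    files = sorted(p.relative_to(original_dir)
                   for p in original_dir.rglob("*") if p.is_file())
    sample = random.Random(seed).sample(files, min(sample_size, len(files)))
    failures = []
    for rel in sample:
        restored = restored_dir / rel
        if not restored.is_file() or sha256_of(original_dir / rel) != sha256_of(restored):
            failures.append(str(rel))
    return failures


# Demo with made-up data: three files, one of which "restores" corrupted.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original"
    restored = Path(tmp) / "restored"
    original.mkdir()
    restored.mkdir()
    for name, data in {"a.txt": b"alpha", "b.txt": b"bravo", "c.txt": b"charlie"}.items():
        (original / name).write_bytes(data)
        (restored / name).write_bytes(b"corrupt" if name == "b.txt" else data)
    demo_failures = verify_random_restores(original, restored, sample_size=3, seed=1)

print(demo_failures)
```

In a real environment the live originals may have changed since the backup was taken, so a more realistic variant would compare restored files against checksums recorded at backup time. The point of the sketch is only that restore verification can be automated, sampled at random and logged.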

Update 2: The state of Virginia has released the following statement.

STATEMENT FROM VIRGINIA SECRETARY OF TECHNOLOGY JIM DUFFEY

5 p.m., Monday, August 30, 2010

On Wednesday, August 25, at approximately 3 p.m., the Commonwealth of Virginia experienced an information technology (IT) infrastructure outage that affected 27 of the Commonwealth’s 89 agencies and caused 13 percent of the Commonwealth’s file servers to fail. The failure was in the equipment used for data storage, commonly known as a storage area network (SAN). Specifically, the SAN that failed was an EMC DMX-3.

According to the manufacturer of the storage system, the events that led to the outage appear to be unprecedented. The manufacturer reports that the system and its underlying technology have an exemplary history of reliability, industry-leading data availability of more than 99.999 percent and no similar failure has occurred in more than one billion hours of run time. A root cause analysis of the failure is currently being conducted.

The storage unit has been repaired and we have been in the meticulous process of carefully restoring data since the failure. This is a time-consuming process that requires close collaboration with the impacted agencies, especially those agencies with large, complex amounts of data.

Twenty-four of the 27 affected agencies were up and running this morning. However, three agencies are not yet fully operational. These agencies are the Department of Motor Vehicles, Department of Taxation and the State Board of Elections. Other agencies continue to experience minor issues.

The DMV was heavily impacted by this hardware failure and has been unable to process in-person driver’s licenses or ID cards at its 74 customer service centers. Please keep checking the DMV website for updates concerning this situation. We understand that this is a great inconvenience for our citizens and we are doing everything in our power to restore service as quickly as possible.

Teams and staff from the affected state agencies, the Virginia Information Technologies Agency, Northrop Grumman and EMC continue to work around the clock to correct the situation. I have confidence in the teams that are working on this problem. They are aggressively executing our recovery plan and are working tirelessly to restore all the affected agencies to a fully operational status. They have made significant progress and continue to do so. I ask for the continued understanding and patience of state employees and citizens as this work continues.

Larry Dignan is Editor in Chief of ZDNet and SmartPlanet as well as Editorial Director of ZDNet's sister site TechRepublic.

Disclosure

Larry Dignan

Larry Dignan has nothing to disclose. He doesn’t hold investments in the technology companies he covers.

Biography

Larry Dignan

Larry Dignan is Editor in Chief of ZDNet and SmartPlanet as well as Editorial Director of ZDNet's sister site TechRepublic. He was most recently Executive Editor of News and Blogs at ZDNet. Prior to that he was executive news editor at eWeek and news editor at Baseline. He also served as the East Coast news editor and finance editor at CNET News.com. Larry has covered the technology and financial services industry since 1995, publishing articles in WallStreetWeek.com, Inter@ctive Week, The New York Times, and Financial Planning magazine. He's a graduate of the Columbia School of Journalism and the University of Delaware.

Talkback Most Recent of 71 Talkback(s)

  • RE: Virginia's IT outage doesn't pass management sniff test
    This will be great fodder for me when the SAN vendors come in and try to sell me some more wonderful "money saving" solutions. For years everyone has tried to get me to go to a complete SAN solution and my argument has always been: why do I want to put in a single point of failure for multiple systems?
    gwoodson
    30th Aug 2010
  • RE: Virginia's IT outage doesn't pass management sniff test
    @gwoodson
    Hello, just to say that a REAL SAN solution does not have any SPOF. If there is one, then it comes from an incompetent or dishonest SAN solution provider.
    In the company I worked for we had incidents but NEVER lost the data. The lengthiest recovery was 36 hours.
    A SAN is an excellent solution for mission critical and high availability systems. In fact it is the only one (IMHO).
    njoncic
    30th Aug 2010
  • RE: Virginia's IT outage doesn't pass management sniff test
    @njoncic
    I am sorry, 36 hours is a joke. I have never lost data and have never been down for over two hours. You can only have true redundancy if you're willing to buy two SAN systems. I have been in this business for over 20 years. I have dealt with many SAN systems and most are not redundant. They claim redundancy by having multiple storage processors, two fiber switches, etc., but unless you're willing to invest in more than one, you are still at a SPOF.
    gwoodson
    30th Aug 2010
  • RE: Virginia's IT outage doesn't pass management sniff test
    @gwoodson I wouldn't nix going to a SAN solution because of this. As long as the proper redundancy and failovers are incorporated, SANs are a great way to save A LOT of money by consolidating storage and virtualizing servers. This sounds more like Northrop Grumman was cutting corners and not putting in the required redundancy, and somebody with the state of Virginia wasn't auditing them to make sure the state was getting what they were being promised.
    Flying Pig
    30th Aug 2010
  • RE: Virginia's IT outage doesn't pass management sniff test
    @Flying Pig
    Do you work for EMC or Hitachi? Just kidding. SANs have their place but not always. You either have to be willing to spend what it takes or do not implement it at all.
    gwoodson
    30th Aug 2010
  • RE: Virginia's IT outage doesn't pass management sniff test
    @Flying Pig Yeah, I have to agree: a properly set up solution does not have a SPOF, and when you are dealing with hundreds of TB of data, you are better off with a SAN. I probably wouldn't use EMC, not because of this story, but because they have a pretty lackluster product that is ridiculous to manage. gwoodson probably doesn't have that much data to deal with, and isn't constrained as much by federal regulations where he has to keep all that data for at least 7 years. As for your comment about virtualizing servers, and desktops for that matter, we use a SAN in conjunction with our thin client solution. The beauty is we are redundant, on the SANs and with another SAN in another data center. And I hate to say it, but 36 hours of downtime isn't bad, depending on how much data you are recovering.
    nickdangerthirdi@...
    30th Aug 2010
  • RE: Virginia's IT outage doesn't pass management sniff test
    @Flying Pig
    No, I am just dealing with HIPAA and SAS70. Contractual requirements to have availability within 8 hours during a disaster recovery situation. This is the poster child for having too many systems that should be discrete, interconnected in a way that makes them vulnerable. Too many young engineers have gotten lazy in believing the hardware is bulletproof and that the SAN is the only solution. I am not saying that SANs do not have their place, but knowing what that is, is what is valuable.
    gwoodson
    30th Aug 2010
  • RE: Virginia's IT outage doesn't pass management sniff test
    @gwoodson
    You seem to be making quite a lot of judgments about something where you have no information to back up your assertions.

    The fact is that redundant SAN systems very successfully deliver on performance, availability and recovery requirements. Thousands and thousands of them. This is not a failure of all SAN implementations, but one specific instance which, for all we know, could be a one-in-a-million failure scenario. You have no basis on which to justify your claim that it's a "poster child" for anything.

    You also don't know anything about the DMV database, and it's entirely possible that even if it had been on non-SAN storage, it could take over 36 hours to recover in the event of a catastrophic failure. I've seen some that take longer.

    Contractual SLAs are great, but there are also always long-term financial considerations; and often the people in control of the finances don't allow you to implement all of the redundancies necessary. In other cases, they make good, clear risk assessments and determine that certain costs are not worth incurring based on probabilities. This is true even in your non-SAN environment.

    This kind of disaster might be avoidable with triple redundancy, but unless the customer (VITA) is willing to pay for that level of assurance, then it won't be implemented.

    And, in this case, NG is being fined for not meeting their contractual obligations. That's how it works.
    JeffLS
    31st Aug 2010
  • RE: Virginia's IT outage doesn't pass management sniff test
    @gwoodson
    There are many benefits to be gained from shared infrastructures.

    Sure, they have to be designed, implemented, managed, and tested regularly to maintain availability.

    The VITA situation is hard to fully assess without being there and analyzing their configuration. If two boards in two different, redundant arrays went out at the same time, that's a pretty bizarre situation; and in that case wouldn't be a single point of failure.

    I've worked on SANs for 10 years and they can be extremely highly available, with data replication across large geographies where necessary. Proper planning, design, and execution is important to be sure.

    And if this is truly *one* of those incredibly odd, multiple outage situations, well you can hardly make a serious argument that tens of thousands of successful implementations are mistakes.
    JeffLS
    31st Aug 2010
  • "Why was there a single point of failure?"
    Did you read what you wrote?

    "The outage was blamed on the failure of two circuit boards installed and maintained by EMC."

    That is a two point failure. I do agree, however, it seems pretty benign to bring the entire storage system to a halt for 5 days.
    Bruizer
    30th Aug 2010
  • RE: Virginia's IT outage doesn't pass management sniff test
    @Bruizer
    I think gwoodson was speaking about the SAN as the "...single point of failure...".

    I tend to agree with him, having experienced a couple of SAN failures ourselves. I think that if two circuit boards can bring down the entire system for six days, it wouldn't be too much to ask that Northrop Grumman have spares more available--maybe not onsite, but certainly better availability than what they were providing.
    TranMan
    30th Aug 2010
  • Without knowing the system design...
    @TranMan

    The redundant SAN might have also failed. Two circuit boards in two different systems.

    Key point, it may have NOT been a single point failure.
    Bruizer
    30th Aug 2010
  • RE: Virginia's IT outage doesn't pass management sniff test
    The question is, did the failure of the circuit boards bring down the system or did they fail in some undetected fashion so that they spent days/weeks corrupting data and that's why it is taking so long to restore?
    r_rosen
    30th Aug 2010
  • RE: Virginia's IT outage doesn't pass management sniff test
    I wonder when the last failover test occurred...
    abear4562
    30th Aug 2010