Virginia's IT outage doesn't pass management sniff test

Summary: It has been a rough few days for anyone interacting with the state of Virginia following an IT outage that affected 26 state agencies. Can a storage area networking failure really cripple a state's IT systems?

Updated: It has been a rough few days for anyone interacting with the state of Virginia following an IT outage that affected 26 state agencies. Can a storage area networking failure really cripple a state's IT systems?

The outage of Virginia's IT infrastructure, which is managed by Northrop Grumman, has prompted statements from several agencies. Notably, Virginia's Department of Motor Vehicles hasn't been able to process requests for licenses and ID cards. Those systems are supposed to be back up and running on Tuesday, six days after the outages started to appear.

Meanwhile, the Virginia Information Technologies Agency (VITA) said in a statement that teams have been working throughout the weekend to restore data. In a nutshell, the IT infrastructure of the state of Virginia was reportedly crushed by an EMC storage area network failure; specifically, an EMC DMX-3 was the hardware that failed. The Richmond Times-Dispatch reports that several systems are still down. The same paper said that Northrop Grumman will have to pay a fine for the failure. And the real kicker: the state recently revised its contract with Northrop Grumman and extended the deal for three years, paying an additional $236 million for better service from Northrop Grumman.

Needless to say, Virginia residents aren't pleased. We've received a few emails and calls, and the comments on the Richmond Times-Dispatch site are summed up by this one:

Highlights of the Revised Contract

Operational Efficiencies

Consolidates and strengthens Performance Level Standards with a 15% increase in penalties across the board if Northrop Grumman fails to perform on clearly identified and measured performance standards. - PAY-UP

Improves Incident Response teams to determine technology failures and expedite repair - FAILED

Institutes clear performance measurements for Northrop Grumman that agencies can easily track - FAILED

Adds new services to contract such as improved disaster recovery and enhanced security features - FAILED

Among the key parts of the VITA statement:

  • Successful repair to the storage system hardware is complete, and all but three or possibly four agencies out of the 26 agency systems have been restored. Agencies continue to perform verification testing.
  • Progress continues, but work is not yet complete for the three or four agencies that have some of the largest and most complex databases. These databases make the restoration process extremely time consuming. The unfortunate result is the agencies will not be able to process some customer transactions until additional testing and validation are complete.
  • According to the manufacturer of the storage system (EMC), the events that led to the outage appear to be unprecedented. The manufacturer reports that the system and its underlying technology have an exemplary history of reliability, industry-leading data availability of more than 99.999% and no similar failure in one billion hours of run time.
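
For context, here's some back-of-the-envelope arithmetic of my own on what 99.999 percent availability actually allows. The six-day figure comes from the timeline above; nothing in this sketch comes from VITA or EMC.

# Illustrative arithmetic only: compares a "five nines" availability claim
# with a multi-day outage. The six-day figure is from the article's timeline;
# everything else is simple math.

HOURS_PER_YEAR = 365 * 24                      # 8,760 hours

five_nines = 0.99999                           # 99.999% availability
allowed_downtime_min = HOURS_PER_YEAR * (1 - five_nines) * 60
print(f"Downtime budget at five nines: {allowed_downtime_min:.1f} minutes/year")
# -> about 5.3 minutes per year

outage_hours = 6 * 24                          # roughly six days of disruption
year_availability = 1 - outage_hours / HOURS_PER_YEAR
print(f"Availability for a year with a six-day outage: {year_availability:.2%}")
# -> about 98.36%

In other words, a six-day disruption burns through something like 1,600 years' worth of five-nines downtime budget, which is why the "unprecedented" framing does little to reassure anyone.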

The official explanation for the outage leaves a bit to be desired and frankly doesn't pass the sniff test. The outage was blamed on the failure of two circuit boards installed and maintained by EMC. EMC said it couldn't comment on the outage and the state of Virginia and Northrop Grumman were taking the lead on messaging.

Simply put, it's a bit disconcerting that two circuit boards can bring down a state's IT infrastructure for nearly a week. Talk about a lack of redundancy.

Among the things that don't add up in the Virginia IT outage:

  • Why wouldn't these boards be replaced quickly?
  • Why was there a single point of failure?
  • According to the Washington Post, service was restored for 16 agencies, but 10 require "a lengthy restoration of data." Where was the disaster planning? After all, Northrop Grumman touted its disaster recovery for the state just two years ago.
  • Where did the IT management fail?

We're told that Northrop Grumman knows about its IT management issues and is working on correcting the problems. Northrop Grumman was awarded a $2.3 billion IT services contract in 2005, and the company has touted some of the state's IT successes. Meanwhile, Northrop Grumman has even relocated its headquarters to Virginia. Hopefully, that proximity will lead to better IT management.

Update: There are a lot of good comments in the talkbacks below, but one IT worker in the U.S. government sent me one via email that captures the questions about this incident well. He wrote:

What has happened to multiple printed copies of disaster recovery documentation & SOPs (standard operating procedures)? What has happened to mandatory random data restoration, accessing & printing that data? What has happened to viewing IT sales vendors the same way an intelligent person looks at a used car salesman? What has happened to triple redundancy on multiple hardware & software platforms?

I have worked for both D.C. and the U.S. federal government for over 20 years and we have never lost any data. Why? As above: mandatory random data restoration to check the backup hardware, software & most especially whatever medium is chosen, disc to tape or disc to disc, with printed reports from that data time stamped and signed by multiple teams that verify the data backups and restoration, and it is part of their performance reviews. Someone wasn't doing test data restoration and verification from multiple discs at random intervals. Yes, it costs a little more to do this, but I bet it isn't even .001% of what this multiple-day outage has cost.

When these IT backup vendors come onsite, have guts, stand up for yourself & demand they show you real time data restoration and verification with data time stamped, printed & checked from these data restorations.

Don't be dazzled by backup speed, it’s not worth a darn if you can’t retrieve the data.

It's all about being able to RESTORE and VERIFY data from your backup medium and being able to access, use and print it in as timely a manner as humanly possible. One of your family members or loved ones at some point may depend upon RESTORATION and VERIFICATION of data for their very lives, especially if it is health care or transportation data, to name just a few.
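
The restore-and-verify discipline he describes is straightforward to automate. Here is a minimal sketch, and only a sketch: the backup directory, the checksum manifest and the tar-based restore are assumptions for illustration, not anything Virginia or Northrop Grumman actually runs. It picks a few archives at random, restores each into scratch space to prove the archive can actually be extracted, compares its checksum against a manifest recorded at backup time, and prints a timestamped report that teams could sign and file.

# Minimal sketch of a random restore-and-verify check. Paths, the manifest
# format and the restore command are illustrative assumptions.
import hashlib
import json
import random
import subprocess
import tempfile
from datetime import datetime, timezone
from pathlib import Path

BACKUP_DIR = Path("/backups")                # assumed location of backup archives
MANIFEST = BACKUP_DIR / "manifest.json"      # assumed {archive name: sha256} recorded at backup time
SAMPLE_SIZE = 3                              # how many random archives to test per run

def sha256(path: Path) -> str:
    """Checksum a file in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_random_backups() -> None:
    manifest = json.loads(MANIFEST.read_text())
    sample = random.sample(sorted(manifest), min(SAMPLE_SIZE, len(manifest)))
    lines = [f"Restore verification run: {datetime.now(timezone.utc).isoformat()}"]
    for name in sample:
        archive = BACKUP_DIR / name
        with tempfile.TemporaryDirectory() as scratch:
            # Actually restore into scratch space -- don't just trust the tape label.
            subprocess.run(["tar", "-xf", str(archive), "-C", scratch], check=True)
        ok = sha256(archive) == manifest[name]
        lines.append(f"{name}: {'OK' if ok else 'CHECKSUM MISMATCH'}")
    # In practice this report would be printed, time stamped and signed off by multiple teams.
    print("\n".join(lines))

if __name__ == "__main__":
    verify_random_backups()

The particular script doesn't matter; the point is that the restore path gets exercised and documented on a schedule, by people whose performance reviews depend on it.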

Update 2: The state of Virginia has released the following statement.

STATEMENT FROM VIRGINIA SECRETARY OF TECHNOLOGY JIM DUFFEY

5 p.m., Monday, August 30, 2010

On Wednesday, August 25, at approximately 3 p.m., the Commonwealth of Virginia experienced an information technology (IT) infrastructure outage that affected 27 of the Commonwealth’s 89 agencies and caused 13 percent of the Commonwealth’s file servers to fail. The failure was in the equipment used for data storage, commonly known as a storage area network (SAN). Specifically, the SAN that failed was an EMC DMX-3.

According to the manufacturer of the storage system, the events that led to the outage appear to be unprecedented. The manufacturer reports that the system and its underlying technology have an exemplary history of reliability, industry-leading data availability of more than 99.999 percent and no similar failure has occurred in more than one billion hours of run time. A root cause analysis of the failure is currently being conducted.

The storage unit has been repaired and we have been in the meticulous process of carefully restoring data since the failure. This is a time-consuming process that requires close collaboration with the impacted agencies, especially those agencies with large, complex amounts of data.

Twenty-four of the 27 affected agencies were up and running this morning. However, three agencies are not yet fully operational. These agencies are the Department of Motor Vehicles, Department of Taxation and the State Board of Elections. Other agencies continue to experience minor issues.

The DMV was heavily impacted by this hardware failure and has been unable to process in-person driver’s licenses or ID cards at its 74 customer service centers. Please keep checking the DMV website for updates concerning this situation. We understand that this is a great inconvenience for our citizens and we are doing everything in our power to restore service as quickly as possible.

Teams and staff from the affected state agencies, the Virginia Information Technologies Agency, Northrop Grumman and EMC continue to work around the clock to correct the situation. I have confidence in the teams that are working on this problem. They are aggressively executing our recovery plan and are working tirelessly to restore all the affected agencies to a fully operational status. They have made significant progress and continue to do so. I ask for the continued understanding and patience of state employees and citizens as this work continues.

Talkback

  • RE: Virginia's IT outage doesn't pass management sniff test

    This will be great fodder for me when the SAN vendors come in and try to sell me some more wonderful "money saving" solutions. For years everyone has tried to get me to go to a complete SAN solution, and my argument has always been: why do you want to put in a single point of failure for multiple systems?
    gwoodson
    • RE: Virginia's IT outage doesn't pass management sniff test

      @gwoodson
      Hello, just to say that a REAL SAN solution does not have any SPOF. If there is one, then the SAN solution provider was incompetent or dishonest.
      In the company I worked for we had incidents but NEVER lost the data. The lengthiest recovery was 36 hours.
      A SAN is an excellent solution for mission-critical and high-availability systems. In fact it is the only one (IMHO).
      njoncic
      • RE: Virginia's IT outage doesn't pass management sniff test

        @njoncic
        I am sorry, 36 hours is a joke. I have never lost data and have never been down for over two hours. You can only have true redundancy if you're willing to buy two SAN systems. I have been in this business for over 20 years. I have dealt with many SAN systems and most are not redundant. They claim redundancy by having multiple storage processors, two fiber switches, etc., but unless you're willing to invest in more than one, you still have a SPOF.
        gwoodson
    • RE: Virginia's IT outage doesn't pass management sniff test

      @gwoodson I wouldn't nix going to a SAN solution because of this. As long as the proper redundancy and failovers are incorporated, SANs are a great way to save A LOT of money by consolidating storage and virtualizing servers. This sounds more like Northrop Grumman was cutting corners and not putting in the required redundancy, and somebody with the state of Virginia wasn't auditing them to make sure the state was getting what they were being promised.
      Flying Pig
      • RE: Virginia's IT outage doesn't pass management sniff test

        @Flying Pig
        Do you work for EMC or Hitachi? Just kidding. SANs have their place but not always. You either have to be willing to spend what it takes or do not implement it at all.
        gwoodson
      • RE: Virginia's IT outage doesn't pass management sniff test

        @Flying Pig Yeah, I have to agree: a properly set up solution does not have a SPOF, and when you are dealing with hundreds of TBs of data, you are better off with a SAN. I probably wouldn't use EMC, not because of this story, but because they have a pretty lackluster product that is ridiculous to manage. gwoodson probably doesn't have that much data to deal with, and isn't constrained as much by federal regulations requiring him to keep all that data for at least 7 years. Not to mention your comment about virtualizing servers, and desktops for that matter: we use a SAN in conjunction with our thin client solution. The beauty is that we are redundant on the SANs, with another SAN in another data center. And I hate to say it, but 36 hours of downtime isn't bad, depending on how much data you are recovering.
        nickdangerthirdi@...
      • RE: Virginia's IT outage doesn't pass management sniff test

        @Flying Pig
        No, I am just dealing with HIPAA and SAS 70. Contractual requirements to have availability within 8 hours during a disaster recovery situation. This is the poster child for having too many systems that should be discrete interconnected in a way that makes them vulnerable. Too many young engineers have gotten lazy in believing the hardware is bulletproof and that the SAN is the only solution. I am not saying that SANs do not have their place, but knowing what that place is, is what is valuable.
        gwoodson
      • RE: Virginia's IT outage doesn't pass management sniff test

        @gwoodson
        You seem to be making quite a lot of judgments about something where you have no information to back up your assertions.

        The fact is that redundant SAN systems very successfully deliver on performance, availability and recovery requirements. Thousands and thousands of them. This is not a failure of all SAN implementations, but one specific instance which, for all we know, could be a one-in-a-million failure scenario. You have no basis on which to justify your claim that it's a "poster child" for anything.

        You also don't know anything about the DMV database, and it's entirely possible that even if it had been on non-SAN storage, it could take over 36 hours to recover in the event of a catastrophic failure. I've seen some that take longer.

        Contractual SLAs are great, but there are also always long-term financial considerations; and often the people in control of the finances don't allow you to implement all of the redundancies necessary. In other cases, they make good, clear risk assessments and determine that certain costs are not worth incurring based on probabilities. This is true even in your non-SAN environment.

        This kind of disaster might be avoidable with triple redundancy, but unless the customer (VITA) is willing to pay for that level of assurance, then it won't be implemented.

        And, in this case, NG is being fined for not meeting their contractual obligations. That's how it works.
        nobodynowherezz
    • RE: Virginia's IT outage doesn't pass management sniff test

      @gwoodson
      There are many benefits to be gained from shared infrastructures.

      Sure, they have to be designed, implemented, managed, and tested regularly to maintain availability.

      The VITA situation is hard to fully assess without being there and analyzing their configuration. If two boards in two different, redundant arrays went out at the same time, that's a pretty bizarre situation, and in that case it wouldn't be a single point of failure.

      I've worked on SANs for 10 years and they can be extremely highly available, with data replication across large geographies where necessary. Proper planning, design, and execution is important to be sure.

      And if this is truly *one* of those incredibly odd, multiple outage situations, well you can hardly make a serious argument that tens of thousands of successful implementations are mistakes.
      nobodynowherezz
  • "Why was there a single point of failure?"

    Did you read what you wrote?

    [i]"The outage was blamed on the failure of two circuit boards installed and maintained by EMC."[/i]

    That is a two-point failure. I do agree, however, it seems like a pretty benign failure to bring the entire storage system to a halt for 5 days.
    Bruizer
    • RE: Virginia's IT outage doesn't pass management sniff test

      @Bruizer
      I think gwoodson was speaking about the SAN as the "...single point of failure...".

      I tend to agree with him, having experienced a couple of SAN failures ourselves. I think that if two circuit boards can bring down the entire system for six days, it wouldn't be too much to ask that Northrop Grumman keep spares more readily available--maybe not onsite, but certainly with better availability than what they were providing.
      TranMan
      • Without knowing the system design...

        @TranMan

        The redundant SAN might have also failed. Two circuit boards in two different systems.

        Key point: it may NOT have been a single point of failure.
        Bruizer
  • RE: Virginia's IT outage doesn't pass management sniff test

    The question is, did the failure of the circuit boards bring down the system or did they fail in some undetected fashion so that they spent days/weeks corrupting data and that's why it is taking so long to restore?
    r_rosen
  • RE: Virginia's IT outage doesn't pass management sniff test

    I wonder when the last failover test occurred...
    abear4562
  • $2.3 billion contract...

    and you still don't have total redundancy!? So much for the excuse of 'well, we couldn't afford it'. EMC is going to get the finger pointed at them, but the total blame is on NG and the government agencies that had oversight for not having redundancy or even a Plan B in place.

    As we've seen with the BP disaster the 'big boys' at the big companies really aren't any smarter...they just have more expensive toys that break.
    bobavery
    • RE: Virginia's IT outage doesn't pass management sniff test

      @bobavery

      Not VITA's fault unless they didn't specify redundancy in the contract. It is more likely that NG "thought" the system was "redundant enough", and this failure was an untested scenario.

      For a state installation, I would think the minimum should be two geographically separate mirrors for the data. Then who cares if one fails?
      brichter
      • RE: Virginia's IT outage doesn't pass management sniff test

        @brichter Agreed...regardless of the hardware failure, Virginia is still on the coast (last time I checked) and within hurricane distance (making separate data center locations a no brainer).

        Most likely scenario was the bureaucrat not asking the tough questions and pushing for the low-ball price, quality be da*ned.
        bobavery
      • RE: Virginia's IT outage doesn't pass management sniff test

        @brichter You got that perfectly right :-) An asynchronous mirror to another geographically dispersed array should have been in place, constantly mirroring the LUNs on the array. Apart from that: local LUN snapshots using SnapView and SAN Copy, copied to another array on the same site, or snapshots presented to a backup server, not to mention the countless possibilities if all the connected hosts had been virtual servers. This was really a very lousy implementation.
        DaHess
      • RE: Virginia's IT outage doesn't pass management sniff test

        @brichter

        Exactly! Let's be real and honest: there is nothing out there that is 100%, and being down for more than 6 hours is not what I call a "good" thing, nor should it ever be. I have worked in the WAN world for over 20 years with various companies, and while there were outages they did not last days; in fact they lasted only minutes for the most part. In those environments everything is redundant: switches, servers, power supplies in every piece of gear, as well as redundant control boards in each piece of gear, and yes, there was also a mirrored site, just in case.
        These days doing that is cost prohibitive and requires more training of personnel, but it worked extremely well. We are indeed becoming more dependent on technologies, and too many young engineers have gotten lazy, and so has everyone else.
        Still, I question the ability of the state staff to run a test on this system. I have also seen many vendor testing procedures prove nothing except that they can pass a test, not handle the real event.
        NelsonVe