Updated: It has been a rough few days for anyone interacting with the state of Virginia following an IT outage that affected 26 state agencies. Can a storage area networking failure really cripple a state's IT systems?
The outage in Virginia's IT infrastructure, which is managed by Northrop Grumman, has prompted statements from several agencies. Notably, Virginia's Department of Motor Vehicles hasn't been able to process requests for licenses and ID cards. These systems are supposed to be up and running on Tuesday, six days after the outages started to appear.
Meanwhile, the Virginia Information Technologies Agency (VITA) said in a statement that teams have been working throughout the weekend to restore data. In a nutshell, the IT infrastructure of the state of Virginia was reportedly crushed by an EMC storage area network failure. Specifically, an EMC DMX-3 was behind the hardware failure. The Richmond Times-Dispatch reports that several systems are still down. The same paper said that Northrop Grumman will have to pay a fine for the failure. And the real kicker is that the state recently revised its contract with Northrop Grumman and extended the deal for three years. The state paid an additional $236 million for better service from Northrop Grumman.
Needless to say, Virginia residents aren't pleased. We've received a few emails and calls, and the comments on the Richmond Times-Dispatch site are summed up by this one:
Highlights of the Revised Contract
Consolidates and strengthens Performance Level Standards with a 15% increase in penalties across the board if Northrop Grumman fails to perform on clearly identified and measured performance standards. - PAY-UP
Improves Incident Response teams to determine technology failures and expedite repair - FAILED
Institutes clear performance measurements for Northrop Grumman that agencies can easily track - FAILED
Adds new services to contract such as improved disaster recovery and enhanced security features - FAILED
Among the key parts of the VITA statement:
- Successful repair to the storage system hardware is complete, and all but three or possibly four agencies out of the 26 agency systems have been restored. Agencies continue to perform verification testing.
- Progress continues, but work is not yet complete for the three or four agencies that have some of the largest and most complex databases. These databases make the restoration process extremely time consuming. The unfortunate result is the agencies will not be able to process some customer transactions until additional testing and validation are complete.
- According to the manufacturer of the storage system (EMC), the events that led to the outage appear to be unprecedented. The manufacturer reports that the system and its underlying technology have an exemplary history of reliability, industry-leading data availability of more than 99.999% and no similar failure in one billion hours of run time.
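To put the quoted 99.999% figure in perspective, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration, and the six-day outage length is an assumption based on the DMV timeline mentioned above, not a measured value. The vendor's number describes the storage product across its installed base, so the comparison is only a rough yardstick.

```python
# Rough availability arithmetic. All names here are illustrative, not from any vendor tool.
HOURS_PER_YEAR = 365.25 * 24  # average year, including leap years


def allowed_downtime_minutes(availability_pct: float,
                             period_hours: float = HOURS_PER_YEAR) -> float:
    """Minutes of downtime permitted over period_hours at a given availability."""
    return (1 - availability_pct / 100) * period_hours * 60


def achieved_availability_pct(outage_hours: float,
                              period_hours: float = HOURS_PER_YEAR) -> float:
    """Availability over period_hours if the only downtime was outage_hours."""
    return (1 - outage_hours / period_hours) * 100


if __name__ == "__main__":
    # "Five nines" works out to a little over five minutes of downtime per year.
    print(f"99.999% allows {allowed_downtime_minutes(99.999):.2f} minutes per year")

    # Assumption: a six-day (144-hour) outage for the worst-affected agencies.
    print(f"A 144-hour outage means {achieved_availability_pct(144):.2f}% for the year")
```

By this yardstick, a multi-day outage uses up many centuries of a five-nines downtime budget for the agencies that were hit.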
The official explanation for the outage leaves a bit to be desired and frankly doesn't pass the sniff test. The outage was blamed on the failure of two circuit boards installed and maintained by EMC. EMC said it couldn't comment on the outage and that the state of Virginia and Northrop Grumman were taking the lead on messaging.
Simply put, it's a bit disconcerting that two circuit boards can bring down a state's IT infrastructure for nearly a week. Talk about a lack of redundancy.
Among the things that don't add up in the Virginia IT outage:
- Why wouldn't these boards be replaced quickly?
- Why was there a single point of failure?
- According to the Washington Post, service was restored for 16 agencies, but 10 require "a lengthy restoration of data." Where was the disaster planning? After all, Northrop Grumman touted its disaster recovery for the state just two years ago.
- Where did the IT management fail?
We're told that Northrop Grumman knows about its IT management issues and is working on correcting the problems. Northrop Grumman was awarded a $2.3 billion IT services contract in 2005. And the company has touted some of the state's successes. Meanwhile, Northrop Grumman even relocated to Virginia. Hopefully, that proximity will lead to better IT management.
Update: There are a lot of good comments in the talkbacks below, but one IT worker in the U.S. government sent me one via email that captures the questions about this incident well. He wrote:
What has happened to multiple printed copies of disaster recovery documentation & SOP's (Standard Operating procedures)? What has happened to mandatory random data restoration, accessing & printing that data? What has happened to viewing IT Sales vendors the same way an intelligent person looks at a used car salesman? What has happened to triple redundancy on multiple hardware & software platforms? I have worked for both D.C. and U.S. Federal government for over 20 years and we have never lost any data. Why? As above mandatory random data restoration to check data backup hardware, software & most especially whatever medium is chosen: disc to tape or disc to disc with printed reports from that data time stamped and signed by multiple teams that verify the data backups and restoration and is part of their performance reviews. Someone wasn't doing test data restoration and verification from multiple discs at random intervals. Yes it costs a little more to do this but I bet it isn't even .001% of what this multiple day outage has cost.
When these IT backup vendors come onsite, have guts, stand up for yourself & demand they show you real time data restoration and verification with data time stamped, printed & checked from these data restorations.
Don't be dazzled by backup speed, it’s not worth a darn if you can’t retrieve the data.
It's all about being able to RESTORE and VERIFY data from your backup medium and being able to access, use and print it in as timely a manner as humanly possible. One of your family members or loved ones at some point may depend upon RESTORATION and VERIFICATION of data for their very lives especially if it is Health Care or Transportation data to name just a few.
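The reader's advice about random restore-and-verify tests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `restore` callable stands in for whatever a real backup product provides, the checksum manifest is assumed to be recorded at backup time, and the demo fakes a backup with a plain directory copy. None of it reflects how Virginia, Northrop Grumman or EMC actually run backups.

```python
import hashlib
import random
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional

# Hypothetical interface: restore(name, dest_dir) returns the restored file path, or None.
RestoreFn = Callable[[str, Path], Optional[Path]]


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(source_dir: Path) -> Dict[str, str]:
    """Record a checksum for every file (assumed to happen when the backup is taken)."""
    return {
        str(p.relative_to(source_dir)): sha256_of(p)
        for p in sorted(source_dir.rglob("*")) if p.is_file()
    }


def verify_random_restores(manifest: Dict[str, str], restore: RestoreFn,
                           sample_size: int, rng: random.Random) -> List[str]:
    """Restore a random sample of files and return the names that fail verification."""
    names = rng.sample(sorted(manifest), min(sample_size, len(manifest)))
    failures = []
    with tempfile.TemporaryDirectory() as scratch:
        for name in names:
            restored = restore(name, Path(scratch))
            if (restored is None or not restored.is_file()
                    or sha256_of(restored) != manifest[name]):
                failures.append(name)
    return failures


def make_copy_restore(backup_dir: Path) -> RestoreFn:
    """Stand-in for a real backup tool: 'restores' by copying out of a directory."""
    def restore(name: str, dest: Path) -> Optional[Path]:
        src = backup_dir / name
        if not src.is_file():
            return None
        target = dest / name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)
        return target
    return restore


def demo() -> List[str]:
    """Build a tiny data set, back it up, corrupt one backup copy, then verify."""
    with tempfile.TemporaryDirectory() as root:
        live = Path(root) / "live"
        live.mkdir()
        for name in ("a.txt", "b.txt", "c.txt"):
            (live / name).write_text(f"record for {name}\n")
        manifest = build_manifest(live)
        backup = Path(root) / "backup"
        shutil.copytree(live, backup)
        (backup / "b.txt").write_text("corrupted\n")  # simulate a bad backup copy
        return verify_random_restores(manifest, make_copy_restore(backup),
                                      sample_size=3, rng=random.Random(0))


if __name__ == "__main__":
    print("Failed restores:", demo())
```

The point of the sketch is the one the reader makes: the test exercises the restore path and checks the restored bytes, rather than trusting that a backup job reported success.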
Update 2: The state of Virginia has released the following statement.
STATEMENT FROM VIRGINIA SECRETARY OF TECHNOLOGY JIM DUFFEY
5 p.m., Monday, August 30, 2010
On Wednesday, August 25, at approximately 3 p.m., the Commonwealth of Virginia experienced an information technology (IT) infrastructure outage that affected 27 of the Commonwealth’s 89 agencies and caused 13 percent of the Commonwealth’s file servers to fail. The failure was in the equipment used for data storage, commonly known as a storage area network (SAN). Specifically, the SAN that failed was an EMC DMX-3.
According to the manufacturer of the storage system, the events that led to the outage appear to be unprecedented. The manufacturer reports that the system and its underlying technology have an exemplary history of reliability, industry-leading data availability of more than 99.999 percent and no similar failure has occurred in more than one billion hours of run time. A root cause analysis of the failure is currently being conducted.
The storage unit has been repaired and we have been in the meticulous process of carefully restoring data since the failure. This is a time-consuming process that requires close collaboration with the impacted agencies, especially those agencies with large, complex amounts of data.
Twenty-four of the 27 affected agencies were up and running this morning. However, three agencies are not yet fully operational. These agencies are the Department of Motor Vehicles, Department of Taxation and the State Board of Elections. Other agencies continue to experience minor issues.
The DMV was heavily impacted by this hardware failure and has been unable to process in-person driver’s licenses or ID cards at its 74 customer service centers. Please keep checking the DMV website for updates concerning this situation. We understand that this is a great inconvenience for our citizens and we are doing everything in our power to restore service as quickly as possible.
Teams and staff from the affected state agencies, the Virginia Information Technologies Agency, Northrop Grumman and EMC continue to work around the clock to correct the situation. I have confidence in the teams that are working on this problem. They are aggressively executing our recovery plan and are working tirelessly to restore all the affected agencies to a fully operational status. They have made significant progress and continue to do so. I ask for the continued understanding and patience of state employees and citizens as this work continues.