Testing a disaster recovery (DR) plan is something that most of us take for granted. We build the DR solutions, acquire the necessary hardware and software, configure the systems, and make sure all the backups and replication schemes are in place.
Then we test failover and DR response systems to make sure everything will work the way we planned. But after initial testing, most of us forget about the systems and assume everything will work when needed.
However, most organizations don't remain static. They deploy new hardware and software, and they hire new staff with different skill sets and skill levels.
In many cases, the staff who built the systems will eventually move on, leaving the DR plan in the hands of people who had nothing to do with its implementation. While you can train new personnel and check new equipment for compliance, there's no guarantee your DR plan will continue to work as planned.
That's why you must continue to test it on a regular basis. But how do you make sure you're testing the right parts? And how can you be sure you're testing enough, without disrupting business more often than you need to?
In some cases, regulations governing your industry dictate what you must test and how often. For example, many financial organizations are subject to regulations that require testing DR systems at least once a year and that call for staff to perform the kinds of transactions that would occur in real business situations.
One drawback to many of these regulations is that they're not specific about how organizations should carry out such tests. You must design your own internal policies for this purpose.
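One way to make an internal testing policy concrete is to record when each DR system was last tested and flag anything that has gone past the allowed interval. The Python sketch below is only an illustration: the 365-day limit mirrors the "at least once a year" example above, and the function names and system names are hypothetical, not part of any regulation or product.

```python
from __future__ import annotations

from datetime import date

# Assumed policy value: maximum number of days allowed between DR tests.
# 365 mirrors the "at least once a year" example; adjust to your own policy.
MAX_DAYS_BETWEEN_TESTS = 365


def is_test_overdue(last_test: date, today: date,
                    max_days: int = MAX_DAYS_BETWEEN_TESTS) -> bool:
    """Return True when more than max_days have passed since the last DR test."""
    return (today - last_test).days > max_days


def overdue_systems(last_tests: dict[str, date], today: date,
                    max_days: int = MAX_DAYS_BETWEEN_TESTS) -> list[str]:
    """Return the names of systems whose last DR test is older than the policy allows."""
    return sorted(
        name for name, tested_on in last_tests.items()
        if is_test_overdue(tested_on, today, max_days)
    )


if __name__ == "__main__":
    # Hypothetical system names and dates, for illustration only.
    records = {
        "billing-db": date(2023, 1, 10),
        "file-server": date(2023, 11, 5),
    }
    print(overdue_systems(records, today=date(2024, 3, 1)))
```

A check like this only tells you that a test is due; it says nothing about what the test should cover, which is still a policy decision.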
If you don't have regulations to fall back on, or if the regulations aren't specific enough for your needs, it's up to you to determine what you test and how. There are many potential scenarios, but here are three common testing approaches.
- Isolation testing: Isolate DR systems from the rest of the network, and set up a few client machines for testing purposes. This enables you to test DR systems without impacting production. At the end of the test, remirror or otherwise overwrite this test data with copies of the live data from the production systems, making sure you're ready for the next test or an actual disaster.
- Live testing with test data: Actually fail over to the DR systems for a period of time during which no production transactions occur. Instead, process test transactions. Again, overwrite DR systems with production data from live systems after the test is complete.
- Live testing with live data: This is the most cumbersome of the testing procedures. Fail over your entire enterprise to the DR systems, and process live transactions for a period of production time. After completing the test, restore data from the DR systems to the production systems, and resume activity on production systems.
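The practical difference between the three approaches comes down to whether production fails over, what data is processed, and which way data is copied once the test ends. The following Python sketch captures those differences as a small data structure; the class, field, and constant names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DRTestPlan:
    """Illustrative summary of one DR testing approach (names are assumptions)."""
    name: str
    production_fails_over: bool  # does production actually fail over to DR?
    uses_live_data: bool         # are real production transactions processed?


ISOLATION = DRTestPlan("isolation testing",
                       production_fails_over=False, uses_live_data=False)
LIVE_WITH_TEST_DATA = DRTestPlan("live testing with test data",
                                 production_fails_over=True, uses_live_data=False)
LIVE_WITH_LIVE_DATA = DRTestPlan("live testing with live data",
                                 production_fails_over=True, uses_live_data=True)


def resync_direction(plan: DRTestPlan) -> str:
    """Return which way data must be copied once the test is complete.

    Tests that process only test data end by overwriting the DR systems with
    production data. A test that processes live transactions on the DR systems
    ends by restoring that data to production.
    """
    return "dr_to_production" if plan.uses_live_data else "production_to_dr"


if __name__ == "__main__":
    for plan in (ISOLATION, LIVE_WITH_TEST_DATA, LIVE_WITH_LIVE_DATA):
        print(f"{plan.name}: copy {resync_direction(plan)} after the test")
```

Getting the copy direction wrong is the costly mistake here: overwriting the DR systems after a live-data test would discard the transactions processed during the test.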
Which type of testing you perform depends on your internal policies and regulatory requirements, but all three types let you confirm, at the very least, that your DR systems can perform the actions you intended them to. Failing to test on a regular basis could easily leave you with one or more data systems that no longer operate at the DR site, creating a whole new disaster to contend with when you thought you were safe.
Mike Talon is an IT consultant and freelance journalist who has worked for both traditional businesses and dot-com startups.