In this issue of Industry Insider, Weight Watchers Australasia's Neil Lappage shares the finer points of developing and executing a disaster recovery plan.
A plan's not a plan until it has been tested...that's my theory anyway.
In a previous job, I was the IT manager of the second-biggest independent marketing company in the UK. On-site, on my internal servers, I hosted around a terabyte of mission-critical client data for some of the biggest companies in the UK. Each client server hosted multiple SQL databases with little variance in configuration, which made disaster recovery (DR) appear easy.
After joining the company, I worked hard on maintaining the stability of the network and servers and continually added redundancy and fault tolerance where necessary to try to avoid minor disasters -- knowing where to stop with fault tolerance is also important, but that is usually budget-driven. The levels of fault tolerance included disk mirrors, hot redundant servers, and network card and power supply redundancy.
After winning the battle of gaining respect and trust with my user base, who had been neglected by poor support prior to my arrival, I went about writing the company's DR plan. For this project, I set aside a two-month period and took on the role of project co-ordinator and tester.
As the only IT resource on-site, I had to firefight the day-to-day support and systems requests yet find time to work on the DR project.
As an IT manager walking into an environment where the level of internal support was at such a low, one of my goals from the outset was to gain stability of all systems and then to prove that the backup systems actually worked -- failure to deliver the goods would have meant an early exit.
The key to disaster recovery is thorough testing and quality documentation. If a disaster happens, you can be sure that it will happen a long time after your testing, so good-quality documentation is key since it will be referred to heavily. In addition, it's no good writing a plan which cannot be read when disaster strikes, so I always keep a printed copy at my off-site data storage location. I mention this since I think many overlook it.
It's too easy to become dependent on the success messages which backup programs report on a daily basis and to assume from them that your data is restorable. I always set aside a block of time annually or bi-yearly to test my plans and backup tapes.
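One way to go beyond a "backup successful" message is to restore the data to a scratch location and compare it with the source. The following is a minimal sketch of that idea, not the procedure used at this company: the function names and the directory layout are hypothetical, and it assumes the original files are still available and unchanged for comparison.

```python
import hashlib
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original_dir: Path, restored_dir: Path) -> List[str]:
    """List relative paths that are missing or different in the restored copy.

    Hypothetical helper: assumes original_dir still holds the data as it was
    when the backup was taken.
    """
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file() or file_digest(source) != file_digest(restored):
            problems.append(relative.as_posix())
    return problems
```

An empty list from `verify_restore` means every file in the original tree was found, byte for byte, in the test restore. It says nothing about whether the restored system would actually boot or serve users, which is why a full test restore is still needed.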
In the circumstances at this company, testing was the key to my survival. I set up a test environment and then worked my way through restoring each server and documenting the steps required to restore a system in a bare metal disaster recovery situation where the building and all associated infrastructure are lost or severely damaged. During testing I discovered that I was unable to restore my Windows 2000 Global Catalog server, and since I only had one GC in my LAN this would have been catastrophic had we lost the system. The residual effects of this downtime would have affected our mail system and there would've been a considerable amount of fireworks if management couldn't access their e-mail.
The two key measurables of a successful DR plan are the Recovery Point Objective (RPO) and Recovery Time Objective (RTO), being the point in time which you can restore back to and the time which it will take to complete a restore, respectively. Before going about any such testing, it's always good to liaise with the users and find out what their expectations are. Once you have this information you can form an RTO which is achievable. I use the word "achievable" because it does also need to be realistic.
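To make the two measures concrete, here is a small sketch that checks a test restore against its targets. The function names, timestamps and targets are all hypothetical and chosen only for illustration; they are not figures from this project.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Data-loss window: anything written after the last good backup is lost."""
    return failure - last_good_backup


def achieved_rto(failure: datetime, service_restored: datetime) -> timedelta:
    """Downtime: how long users waited for the service to come back."""
    return service_restored - failure


def meets_objectives(last_good_backup: datetime, failure: datetime,
                     service_restored: datetime,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only if both the data-loss window and the downtime are within target."""
    return (achieved_rpo(last_good_backup, failure) <= rpo_target
            and achieved_rto(failure, service_restored) <= rto_target)


# Illustrative values only: a late-evening backup, a failure at noon the
# next day, and service back four hours later.
last_backup = datetime(2004, 3, 1, 23, 0)
failure = datetime(2004, 3, 2, 12, 0)
restored = datetime(2004, 3, 2, 16, 0)

print(achieved_rpo(last_backup, failure))  # 13:00:00
print(achieved_rto(failure, restored))     # 4:00:00
```

With a 24-hour RPO target and a 4-hour RTO target, `meets_objectives` would return `True` for these values; tighten the RTO target to 2 hours and it returns `False`, which is the kind of gap a test restore is meant to expose.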
Since system crashes can vary in nature, I always build the RTO around a scenario -- this one being that we had lost the building and all new infrastructure, it was lunchtime on a Monday and we had co-located to our spare building. This, I find, is typically the toughest time for any internal infrastructure: it's the beginning of the week and the pressure is on to get systems up and running as soon as possible.
Once I finally found the solution to re-installing our Windows 2000 Active Directory Services, Exchange server and other services, we found that the restore time was not acceptable and did not fall in line with our RTO, which was ultimately driven by our user base.
The recovery procedure also proved to be so fiddly that we couldn't rely on it in a real situation. Because of this, I turned to a hard disk cloning software package. This package allowed us to clone boot partitions which are part of a RAID array on a SCSI hot-plug backplane. The package we used was PowerQuest V2i Protector. I was able to reduce my restore time from around 14 hours to 4 hours; the actual restore time to install the server was 40 minutes. The remaining waiting period included how long it would take for the new server to arrive and for our tapes to be delivered from our off-site storage.
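The figures above break down as simple arithmetic. The sketch below uses only the numbers quoted in this column (treating "around 14 hours" as exactly 14 for the sake of the sum); the function and variable names are illustrative.

```python
from datetime import timedelta


def waiting_time(total_recovery: timedelta, hands_on_restore: timedelta) -> timedelta:
    """Time spent waiting on logistics rather than actually restoring."""
    return total_recovery - hands_on_restore


old_total = timedelta(hours=14)        # approximate figure before cloning
new_total = timedelta(hours=4)         # end-to-end figure with disk cloning
hands_on = timedelta(minutes=40)       # restoring the image onto the server

waiting = waiting_time(new_total, hands_on)
saved = old_total - new_total

print(waiting)  # 3:20:00
print(saved)    # 10:00:00
```

In other words, roughly three hours and twenty minutes of the four-hour recovery is spent waiting for hardware and tapes, so any further reduction would have to come from logistics rather than from the restore itself.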
What makes this particular product so good is that you can take snapshots of the system while the server is being used, including important system files which would be open. It is possible to copy backups to a SAN or NAS, but we found the best solution was to burn the image using an externally attached CD burner and then write this image from the CD to the new server in a disaster situation. Solutions will vary depending on the size of your infrastructure.
That's all water under the bridge now and I lived to tell the tale. As a matter of priority I always ensure that DR is fresh in my mind wherever I go. It's easy to think that once you have completed one testing phase you never have to do it again, but think of it as survival -- not undertaking further testing in future could be the last DR decision you ever make!
(Editor's note: PowerQuest was recently acquired by Symantec.)
Neil Lappage is Weight Watchers Australasia's IT manager.
If you would like to become a ZDNet Australia guest columnist, write in to Fran Foo, Editor of Insight, at email@example.com.