Disaster recovery -- a checklist

Real-world computing accepts that things go wrong with technology. Whether it's a solitary disk read error or a disaster that turns your HQ into ash and rubble, a breakdown in systems can ruin your business -- while a sensible recovery plan may be the one thing that keeps it going. Here are the basic steps to creating such a plan and making sure it will work when needed.

Make disaster recovery an integral part of the way your business runs. Someone at the top needs explicit responsibility for overseeing the plan, as it is too easy to make dangerous economies when times are tough.

Prioritise the data and systems that need to be recovered first. Each department thinks that theirs is the most important, but the decision has to be made -- and that usually ends up with the IT department, which may not have the appropriate business insight.

Don't forget to look outside the data centre for things that need protecting. If employees have heavily customised desktops to do their work, how will it affect them if they have to start from scratch? Paper records are important too.

Make sure you have redundancy for critical systems, whether it's a RAID storage system, server mirroring or even a complete duplicate data centre. There should be no one point of failure, including power supplies, telecommunications or even the office building itself, that will disrupt your business for any length of time. Most companies will not survive an unplanned outage of critical systems that exceeds four days, and even lapses substantially less than that can be disproportionately damaging.

Along with redundancy, backup is the most important part of disaster recovery. Once you know what you need to back up, decide when and how you will do your backups. A common scheme is to do a full backup at the beginning of each week, followed by deltas -- backups of changes -- at least daily if not more often. These can be differential backups, where the entire difference from the starting state is copied each time, or incremental, where only the difference since the last backup is stored. Incremental backups take less time but produce more individual backups that have to be restored in order; with differential, you have just two restorations to make.
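The trade-off between the two schemes can be sketched in a few lines of Python. This is purely illustrative -- snapshots are modelled as dicts of filename-to-version, the function names are invented, and deletions are ignored -- but it shows why a differential restore takes two steps while an incremental restore must replay every delta in order.

```python
# Illustrative sketch: snapshots as {filename: version} dicts.
# Not a real backup tool; file deletions are deliberately ignored.

def changed_since(base, current):
    """Files added or modified in `current` relative to `base`."""
    return {f: v for f, v in current.items() if base.get(f) != v}

full = {"a.txt": 1, "b.txt": 1}                    # weekly full backup
monday = {"a.txt": 2, "b.txt": 1}                  # a.txt changed
tuesday = {"a.txt": 2, "b.txt": 2, "c.txt": 1}     # b.txt changed, c.txt added

# Differential: always diff against the full backup. Each delta grows,
# but a restore needs only the full backup plus the latest differential.
diff_tue = changed_since(full, tuesday)

# Incremental: diff against the previous snapshot. Each delta is small,
# but a restore must replay every increment in sequence.
inc_mon = changed_since(full, monday)
inc_tue = changed_since(monday, tuesday)

# Restoring Tuesday's state both ways:
restore_diff = {**full, **diff_tue}                # full + latest differential
restore_inc = {**{**full, **inc_mon}, **inc_tue}   # full + every increment
assert restore_diff == restore_inc == tuesday
```

The asserts at the end make the point concrete: both schemes reach the same state, but the incremental path needs every intermediate backup to be present and applied in the right order.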

Offsite backups are essential, but difficult to manage -- especially for the smaller company. Where teleworking is common, it may be possible to automate the keeping of remote copies of information as part of the standard access arrangements. Whatever the backup process -- and floppy disks, CD-Rs, removable hard disks, tapes, leased lines and VPNs are all common -- ensure that access to offsite backups isn't dependent on just one person. It is common to duplicate the weekly backup and keep it offsite, and also to keep monthly backups.

Don't neglect security. If you need to make backups of sensitive information, is it adequately protected from attack if someone gets access to -- or steals -- the backup? Conversely, if you have a secure backup protected by encryption or severe access controls, is it possible to retrieve the information if key employees are missing?

Run regular tests to shake the bugs out of your plan -- and that means testing absolutely everything. Countless businesses have suffered because the regular backup procedure seemed to be working perfectly until the time came to retrieve information in earnest. Tests that produce no errors aren't tough enough: you're not testing to make sure it works, but to find out when it doesn't. This will also tell you if your recovery procedure is working but too slow or cumbersome -- a system that comes back but takes two days to rebuild may be inappropriate. Deciding backup and restoration strategies should be part of the initial architectural planning of any major system and should influence bus types, storage devices and the segmentation of the network.
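One way to make restore testing routine rather than heroic is to automate a round trip: back the data up, restore it somewhere else, and verify every file byte-for-byte. The sketch below does this with Python's standard library against a throwaway directory; the function names and paths are illustrative, and a real test would of course target your actual backup media and procedure, not a local tar file.

```python
# Hedged sketch of an automated restore test using only the stdlib:
# archive a directory, restore it elsewhere, compare checksums.
import hashlib
import tarfile
import tempfile
from pathlib import Path

def checksums(root: Path) -> dict:
    """SHA-256 of every file under root, keyed by relative path."""
    return {p.relative_to(root): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in root.rglob("*") if p.is_file()}

def restore_test(source: Path) -> bool:
    """Back up `source`, restore it, and verify the trees match."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "backup.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(source, arcname=".")
        restored = Path(tmp) / "restored"
        with tarfile.open(archive) as tar:
            tar.extractall(restored)
        # Pass only if the restored tree matches byte-for-byte.
        return checksums(source) == checksums(restored)

# Exercise it against throwaway data:
with tempfile.TemporaryDirectory() as d:
    src = Path(d)
    (src / "ledger.csv").write_text("date,amount\n2004-01-01,100\n")
    (src / "sub").mkdir()
    (src / "sub" / "notes.txt").write_text("hello")
    assert restore_test(src)
```

A test like this belongs on a schedule, and its failures should be as loud as a production outage -- a silently broken backup is exactly the bug this step exists to catch.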

When your business processes change, reassess your plans. An acquisition, new operating system installation or reorganisation can trigger this. Also, when you change an underlying system and migrate data over make sure you can recover to the old system for as long as may be necessary -- it's no good having old data you desperately need if you no longer have a system that will read it.

Make sure your critical suppliers also have strict disaster recovery plans. There's no point in having your data in the hands of a company that is itself struggling to get back on its feet after a problem. And keep your own dull stuff up to date -- lists of employees with addresses and mobile phone numbers, supplier contacts -- and make everyone's role in the recovery plan part of their basic training.
