Disaster avoidance and recovery

Bloody drills under realistic conditions lead to bloodless, automated processing continuity when the real thing hits
Written by Paul Murphy, Contributor
It's my belief that almost all commercial and personal IT deployments can, and should, be planned and implemented to minimise the risk of processing interruptions and data loss. Sadly, however, it's my experience that most people simply don't do the job.

If you've got nothing more to worry about than a personal PC or Mac, that's kind of a non-issue - you make the effort, or you don't: it's your choice, your money, and, eventually, your loss.

But it's not your choice, your money, or your loss as soon as you become responsible for someone else's data or job. Put your local political constituency association's books on your machine and you'd better have a safe backup - and one you routinely test at that.
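Routinely testing a backup means actually restoring it and confirming the restored bytes match the live copy - not just checking that the job ran. Here's a minimal sketch of that idea in Python; the function names and paths are my own illustration, not any particular backup product's API:

```python
import hashlib


def sha256_of(path):
    """Checksum a file in chunks so large backups don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(original_path, restored_path):
    """A backup only counts if a restore reproduces the original exactly."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

The point of comparing checksums rather than file sizes or timestamps is that it catches silent corruption - the failure mode that only surfaces the day you actually need the restore.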

Everyone I know in IT has a favourite story or two about how not to do this - here's a personal favourite, a pair of "Dear Member" emails from CIPS (the Canadian Information Processing Society, an organization I had long since resigned from in protest over their Windows-only website policy) that together illustrate many PC-related management problems:


Date: Mon, 12 May 2003 13:07:23 -0400
From: aff@cips.ca
Subject: Break in Communication

In the early hours of May 8, 2003 there was a break in and entry at the CIPS National Office. Two servers and one computer system were stolen.

One server was used as our mail server and contained cips.ca email and mailing listserv addresses. The second server hosted section and provincial web sites as well as, membership reports that are used by the Sections and Provinces. The second server also had a back up drive installed on it. The membership database is backed up overnight and the back up tape was in the server at the time it was stolen. The membership server that houses the membership system was not stolen.


The second one said (among other things):


After further review I am now in a position to verify with you that the on-line membership renewal process is a secured process. Any credit card information provided is encrypted. This is different from what was reported yesterday.

While the missing back-up tape is not readily accessible, members who selected the automatic annual renewal process potentially remain at risk in having their credit card numbers compromised. We will be attempting to contact these members directly.

I've never been able to decide whether this was risible or tragic - but the phrasing about the stolen backup tape not being readily accessible would tip the balance toward the lol response if the reality that this kind of thing is dirt common didn't tilt the issue toward tragedy.

So what can you, or anyone with responsibility for data and applications, do? There are three steps to the magic solution:


  1. believe that a disaster will happen, that it will be total, that all of your applications will turn out to be about equally critical (not to mention inter-related), and that it will strike while you're mountain biking in Tahiti, your acting head is in hospital, in jail, or simply terminally drunk, and the big boss has some ego-involving, application-dependent, time-critical deal going down just when the system goes dead.


  2. consciously decide either to live with the certainty of disaster or to set up the entire infrastructure - machines, power, applications and staff - as at least two completely separate, and completely redundant, systems with automated, near-instantaneous fail-over between them.


  3. hold drills: randomly pull the power plugs out of key machines or order everybody out of a data center "right now" and then cut off its power.
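The fail-over half of step 2 can be sketched in a few lines: probe each redundant system in order of preference and route to the first one that answers. This is a toy illustration under my own assumptions - the host names are hypothetical, and real fail-over also has to handle replication and in-flight state, not just a TCP probe:

```python
import socket

# Hypothetical redundant pair - stand-ins, not real hosts.
PRIMARY = ("primary.example.invalid", 5432)
STANDBY = ("standby.example.invalid", 5432)


def is_alive(host, port, timeout=1.0):
    """Crude liveness check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_server(candidates):
    """Return the first live (host, port) pair, in order of preference."""
    for host, port in candidates:
        if is_alive(host, port):
            return (host, port)
    raise RuntimeError("no live server - the disaster is total")
```

Note what the drill in step 3 actually tests: not this routing logic, which is trivial, but whether the standby really can carry the load, and whether the people on shift know it exists.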

The technology part of this is much easier than it used to be - it just costs money.

The hard part here is carrying out realistic, no-warning drills. Try it: the next time someone brags to you about their failsafe, mutually redundant, fully clustered system, reach for the power plug and watch the panic set in.

But here's the real bottom line on disaster recovery: bloody drills under realistic conditions lead to bloodless, automated processing continuity when the real thing hits - and nothing else works.
