Disaster avoidance and recovery

Summary: bloody drills under realistic conditions lead to bloodless, automated, processing continuity when the real thing hits

TOPICS: Servers

It's my belief that almost all commercial and personal IT deployments both should and can be planned and implemented to minimise the risk of both processing interruptions and data loss. Sadly, however, it's my experience that most people simply don't do the job.

If you've got nothing more to worry about than a personal PC or Mac, that's kind of a non-issue - you make the effort, or you don't: it's your choice, your money, and, eventually, your loss.

But it's not your choice, your money, or your loss as soon as you become responsible for someone else's data or job. Put your local political constituency association's books on your machine and you had better have a safe back-up - and one you routinely test at that.
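As a rough illustration of what "routinely test" can mean, the sketch below backs a directory up to a zip archive, restores it into a scratch directory, and compares SHA-256 digests file by file. The function names (`file_digests`, `backup`, `restore_matches_source`) and the choice of a zip archive are assumptions made for the example, not a recommendation of any particular backup tool.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def backup(source: Path, archive: Path) -> None:
    """Write every file under source into a compressed zip archive.

    The archive is assumed to live outside the source directory.
    """
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(source).as_posix())


def restore_matches_source(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it with source."""
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(scratch)
        return file_digests(Path(scratch)) == file_digests(source)
```

Run on a schedule, a check like `restore_matches_source(books_dir, archive_path)` only proves something if the restore really happens somewhere other than the original machine; a back-up that sits in the same box as the data shares its fate.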

Everyone I know in IT has a favourite story or two about how not to do this - here's a personal favourite, a pair of "Dear Member" emails from CIPS (the Canadian Information Processing Society, an organization I had long since resigned from in protest over their Windows-only website policy) that together illustrate many PC-related management problems:

 

Date: Mon, 12 May 2003 13:07:23 -0400
From: aff@cips.ca
Subject: Break in Communication

In the early hours of May 8, 2003 there was a break in and entry at the CIPS National Office. Two servers and one computer system were stolen.

One server was used as our mail server and contained cips.ca email and mailing listserv addresses. The second server hosted section and provincial web sites as well as, membership reports that are used by the Sections and Provinces. The second server also had a back up drive installed on it. The membership database is backed up overnight and the back up tape was in the server at the time it was stolen. The membership server that houses the membership system was not stolen.

 

The second one said (among other things):

 

After further review I am now in a position to verify with you that the on-line membership renewal process is a secured process. Any credit card information provided is encrypted. This is different from what was reported yesterday.

While the missing back-up tape is not readily accessible, members who selected the automatic annual renewal process potentially remain at risk in having their credit card numbers compromised. We will be attempting to contact these members directly.

I've never been able to decide whether this was risible or tragic - the phrasing about the stolen backup tape not being readily accessible would tip the balance toward the lol response, if the reality that this kind of thing is dirt common didn't tilt the issue back toward tragedy.

So what can you, or anyone with responsibility for data and applications, do? There are three steps to the magic solution:

 

  1. believe that a disaster will happen, will be total, that all of your applications will turn out to be about equally critical (not to mention inter-related), and that this will happen while you're mountain biking in Tahiti; your acting head will be in hospital, in jail, or simply terminally drunk; and the big boss will have some ego-involving, application-dependent, and time-critical deal going down just when the system goes dead.

     

  2. consciously decide to either live with the certainty of disaster or set up the entire infrastructure - machines, power, applications and staff - to have at least two completely separate, and completely redundant, systems with automated, near-instantaneous fail-over between them.

     

  3. hold drills: randomly pull the power plugs out of key machines or order everybody out of a data center "right now" and then cut off its power.

The technology part of this is much easier than it used to be - it just costs money.

The hard part here is carrying out realistic, no-warning drills. Try it: the next time someone brags to you about their failsafe, mutually redundant, fully clustered system, reach for the power plug and watch the panic set in.
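The logic of such a drill can be rehearsed in miniature before anyone touches a real plug. The sketch below is a toy simulation, not a real clustering setup: `Node`, `FailoverPair` and `run_drill` are hypothetical names, and "power" is just a flag. It cuts power to one randomly chosen node at a random moment and counts the requests that go unanswered.

```python
import random


class Node:
    """One of the redundant systems; 'powered' stands in for the plug."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.powered = True

    def handle(self, request: str) -> str:
        if not self.powered:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class FailoverPair:
    """Send each request to the first node that answers."""

    def __init__(self, nodes: list[Node]) -> None:
        self.nodes = nodes

    def handle(self, request: str) -> str:
        for node in self.nodes:
            try:
                return node.handle(request)
            except ConnectionError:
                continue  # automated fail-over: try the next node
        raise RuntimeError("total outage: no node answered")


def run_drill(pair: FailoverPair, rng: random.Random, requests: int = 100) -> int:
    """Pull the plug on one random node at a random moment.

    Returns the number of requests that nobody answered.
    """
    victim = rng.choice(pair.nodes)
    cut_at = rng.randrange(requests)
    failures = 0
    for i in range(requests):
        if i == cut_at:
            victim.powered = False  # no warning given
        try:
            pair.handle(f"request-{i}")
        except RuntimeError:
            failures += 1
    return failures
```

With two nodes the drill should report zero unanswered requests whichever one loses power; with a single node it reports at least one. The real thing differs in every detail that matters (state replication, network partitions, people), which is exactly why the simulation is no substitute for pulling the actual plug.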

But here's the real bottom line on disaster recovery: bloody drills under realistic conditions lead to bloodless, automated, processing continuity when the real thing hits - and nothing else works.

Talkback

  • Disaster Recovery or High Availability?

    DR and HA have different requirements - you were "mixing metaphors". DR is only concerned with recovery - and timing does not necessarily mean immediate. HA means no disruption to running processes. A DR plan could be as simple as keeping backup tapes in a safe location. An HA system needs redundant everything - CPU, disk, network, and location.

    So if the requirement of your DR plan was to have complete backups so that you could rebuild a machine in, say, 24 hours, then pulling the plug on that server would not be greatly appreciated.
    Roger Ramjet
    • Ouch, and yes, my mistake

      I guess I should call the right approach
      disaster avoidance - and then claim, (as I do this Friday) that DR isn't relevant anymore.
      murph_z
  • Sorry Murphy

    but:

    " Any credit card information provided is encrypted"

    There are some people out there in the real world who break codes between breakfast and their first cup of coffee.

    And I'm not writing that this happened with the code mentioned in the email that you've got, but you'd never know.
    Arnout Groen
    • I think that it wasn't encrypted

      except during HTTPS transmission, and/or that the
      keys were on the PC.
      murph_z
  • But when you pull the plug...

    ... and the machinery turns off, you will end much boasting about machinery whose uptimes are measured in decades.

    Perhaps the enterprises which come closest to following your advice run Windows. In that case, we would have a reason for the occasional interruption on machines running Windows, no?!

    There are no doubt many who could say that they turned on a Windows machine on their first day of work and that they saw it powered down for the first time on the day of their retirements as a tribute.

    But the forward-thinking nature of many Windows-only shops means they are the most likely to drill appropriately for disaster recovery (or high availability).

    Finally, a credible explanation for why Windows machines have been turned off. Thank you.
    Anton Philidor
    • Are you for real?

      [There are no doubt many who could say that they turned on a Windows machine on their first day of work and that they saw it powered down for the first time on the day of their retirements as a tribute.]

      Mike Cox you're not.
      Roger Ramjet
      • You scoff?

        The point of the folderol is that many people would not enjoy testing disaster recovery (or HA if you like) because of the disruption created, in their own sense of order as well as in the normal course of the day.


        Similarly, there are people whose sense of rightness would be diminished if testing proved Windows not to be as problematic as they have come to believe. They avoid any such testing because of the possible interference with a comfortable belief.

        Your reluctance to make a single accusation against Windows in the absence of personal experience shows that not everyone is strongly attached to attitudes, but others are not as rigorous, as you know.


        One should never underestimate people's determination not to interfere with something that gives them comfort and satisfaction.
        Anton Philidor
        • What I will give you

          Every Windoze product introduced PRIOR to W2K3 was pathetic in terms of uptime/reliability. W2K3 comes close to where UNIX has been for the last 20+ years. We have not had the same problems with our Windoze servers since W2K3. There is STILL the threat of viruses, but it seems that M$ has finally created a working OS.
          Roger Ramjet
    • True

      "Perhaps the enterprises which come closest to following your
      advice run Windows."

      Any enterprise relying on windows infrastructure would have a
      basic DR plan in place, the regular failure of the OS demands it;-)
      Richard Flude