The (traditional) disaster recovery plan

Summary: all of the plans I've reviewed have had one thing in common: a lack of testing, or even testability, under realistic conditions.

Aka the business continuity plan and rather less well known as the "risk action plan," this is a document whose existence and table of contents are subject to audit - but which, like most data processing control artifacts, doesn't have to bear much resemblance to reality. In theory, of course, it does: after all, the primary control, the user service level agreement, will specify how long data processing has to bring a list of critical applications back on-line, and the risk action plan documents are intended to describe just how those commitments will be met.

Unfortunately, all of the plans I've reviewed have had one thing in common: a lack of testing, or even testability, under realistic conditions.

In theory a disaster recovery plan consists of a list of possible disaster scenarios together with a proven method (including staffing and technology) for overcoming the consequences of each one. Typically, therefore, they'll start with a hypothetical event that closes or wrecks the data center, and then focus on who does what and where to bring a carefully prioritised list of applications back up as quickly as possible.
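
To make the shape of such a plan concrete, here is a minimal sketch - purely illustrative, in Python, with invented application names, priorities, responders, and recovery-time figures - of the scenario-to-procedure mapping and prioritised application list just described; no real plan is anywhere near this tidy.

    # A purely illustrative sketch of the "if this, then that" structure:
    # disaster scenarios mapped to recovery procedures, plus a prioritised
    # application list with SLA recovery deadlines. All names, priorities,
    # and hour figures below are invented for illustration.
    from dataclasses import dataclass, field


    @dataclass
    class Application:
        name: str
        priority: int       # 1 = bring back first
        rto_hours: int      # recovery time promised in the service level agreement


    @dataclass
    class Scenario:
        description: str
        responders: list[str]               # who does what...
        recovery_site: str                  # ...and where
        steps: list[str] = field(default_factory=list)


    # The prioritised list of critical applications (hypothetical).
    applications = [
        Application("order-entry", priority=1, rto_hours=4),
        Application("payroll", priority=2, rto_hours=24),
        Application("reporting", priority=3, rto_hours=72),
    ]

    # The scenario catalogue: one entry per hypothetical disaster.
    scenarios = {
        "data_center_fire": Scenario(
            description="fire closes or wrecks the primary data center",
            responders=["operations manager", "network lead", "DBA on call"],
            recovery_site="standby site B",
            steps=[
                "declare the disaster and call out the responders",
                "retrieve the off-site backup tapes",
                "restore applications in priority order",
            ],
        ),
    }


    def recovery_order(apps: list[Application]) -> list[Application]:
        """Return applications in the order the plan says to restore them."""
        return sorted(apps, key=lambda a: a.priority)


    if __name__ == "__main__":
        plan = scenarios["data_center_fire"]
        print(f"Scenario: {plan.description} (recover at {plan.recovery_site})")
        for app in recovery_order(applications):
            print(f"  restore {app.name} within {app.rto_hours} hours")

Even a toy like this makes the underlying bet plain: the scenario catalogue has to contain the disaster you actually get.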

In reality, of course, the disasters rarely fit the scenarios, the people listed as responsible for each action are rarely reachable, and the senior managers who get rousted out when the brown stuff hits the fan usually throw the best laid plans into total chaos by overruling the rule book within minutes of arriving on site.

That disconnect between plans and reality is perfectly normal and people usually just muddle through, but the abnormal can be even more fun. Two favourite stories:

  1. This organization had its disaster recovery plans professionally prepared by high-powered consultants from a major international firm. After several weeks of intensive effort they handed over a very pretty piece of work - the PowerPoints were works of art and the embedded "emergency adaptive organisational call-out" process masterful.

    Everything had been considered, all contingencies covered - except that when an unhappy employee spent $29.95 for a butane torch at Home Depot, disabled the halon system, and then sloshed around some gasoline to really get those rack mounts running hot, it turned out that the only copies the company had of its disaster recovery plans were stored on those servers - along with the readers and encryption keys for the back-up tapes carefully stored off-site.

    Worse, the police closed the entire data center to all traffic for about ten days while they conducted their investigation and the health department refused access for another week because of the chemicals released before and during the fire.

  2. In the second case a government agency wanted to take control of its own data processing from the IBM-dominated group providing government-wide services. Negotiations having failed, local management invoked its right to opt out and hired the Canadian franchise holder for a large American consulting group to implement a non-IBM mainframe solution (an HDS), complete with disaster recovery document preparation and appropriate staff recruitment and training - but compromised by agreeing to use the central agency's ultra-safe, temperature controlled vaults for off-site data storage.

    A few years later a contractor's employee working on the tunnel system two floors below the data center is thought to have unknowingly punctured a gas line sometime before leaving work on a Friday night. The inevitable happened early Sunday morning - turning that Hitachi into just so much shredded metal and taking the disks and on-site tape vault with it to some otherwise unreachable digital heaven.

    On Tuesday, messengers arriving at the central organisation's off-site storage facility to pick up Thursday's tapes were turned away - and by late Wednesday local management had got the message: the central agency had put itself in charge of certifying disaster recovery sites, had not certified the Hitachi partner providing standby processing support for the agency, and "quite properly" refused to release the tapes to an uncertified site.

The bottom line message here should be clear: a formal disaster recovery plan of the traditional "if this, then that" style only makes sense if you can count on being able to control both the timing and the nature of the disaster - and doesn't if you can't. In other words the only things that are really predictable about data center recovery are that the plan won't apply to what actually happens, the recovery process will take longer and cost more than expected, and the whole thing will be far more chaotic and ad hoc than anyone ever wants to admit afterward.

So what do you do instead? That's tomorrow's topic, but here's the one-word answer: drill.

Talkback

  • Well...

    How about putting the physical systems and the backup in the same room... ;-), if you're looking for trouble.

    Nice to know you're taking testing into account somewhere, Murphy ... Pity it's at the end of the line...
    Arnout Groen
  • Scope problems

    Most DR plans are for individual servers or projects (and their servers). Rarely is the entirety of the data center considered. What happens when you migrate all of your data to IBM SVC, and it takes a dump? 200 different projects with ALL the same SLAs (30 day recovery) screaming at ULM (upper level management) to get their machines back up because they are critical (but weren't willing to pay for a better SLA).

    A DRP is not the sum of all of its parts.
    Roger Ramjet
  • In my experience...

    ... disaster recovery plans have been developed for a meteor collision similar to that which had a major effect on the dinosaurs, and for an atomic war which essentially eliminates civilization as we know it, except for the backup generators.

    These plans have been devised in such detail that I feel every confidence they will work. And better, should they fail, I do not believe that the reproaches by administrators will be excessively severe.
    Anton Philidor
    • Sounds familiar (LOL) (NT)

      murph_z
      • As you'll also recognize, then...

        ... part of the plan was to store electronic records in underground facilities about as solidly reinforced as those protecting Federal officials.

        The general population is left to its own devices.

        If a situation occurs in which the only survivors are Federal officials and my old emails, I hope that the group includes enough young bureaucrats to repopulate.
        Anton Philidor
  • Drill is right

    You have to have management that is dead set on doing it correctly, i.e., committing the business and not just the IT division to it.

    The ultimate trial is to run the company from the DR site for a time. Every year I take a role doing just that, failing over Oracle/SAP systems to run at another site. You have to test: even when the procedures are thoroughly practical and battle-tested, there's always some entropic change in the environment that has to be accounted for. Failing to address that by regular checking and adjustment is a guarantee of failure.

    jcawley
  • I remember one story ...

    ... of a company getting in a consultant to audit the effectiveness of the newly written "ultimate" disaster recovery plan.

    The consultant turned up with a large box. He went through an overview of the plan, which "covered all eventualities". He then opened the box, which contained a large axe, and said "Righto gentlemen - let's go downstairs and trash some equipment with this axe. If your DR plan works you can send me the bill for the equipment, if not then you pay me my fees".

    He didn't get any takers - which spoke volumes about their "ultimate" DR plan.
    bportlock
    • I've been tempted...

      And I own an axe (not yet a prohibited weapon in Canada - but just wait until the liberals get in again..)

      But I've never quite had the nerve.
      murph_z
  • Ownership, drill, and review

    Your 'traditional' plans fail because the assumption is that a list of procedures equals business resumption. The procedures are lost, or forgotten, or not known to the people with the backup servers, data or location, and 'disaster recovery' gets tubed along with 15-35% of your customers.

    To succeed, the entire enterprise must be encompassed in a disaster recovery plan, and every functional manager must own a copy of that total plan -- including their own specific bits of it -- at some other location. Each manager has to be involved in writing the plan, establishing the recovery priority for each part of each critical process -- which only they would know -- and they are required, at least once a year, to revisit their documentation and keep it updated.

    Because the entire enterprise will test the plan, by functional unit, throughout the year on a random schedule. At some point the boss is going to come in, turn the key, and the plant will go dark. "Right," he'll say, "what's the first thing you do?" The DRP is thus part of the job of each manager.

    Moreover, at least once a year all the functional managers get together and review the overall disaster recovery plan. Does it still make sense for the enterprise? Are there better solutions? What did the last test reveal and how do we incorporate the lessons?

    Done right, you can make the DRP a regular part of doing business and transition to it -- and back out of it to normal business -- as effortlessly as possible.

    Because you really don't know where or when the next disaster will come, or what it will look like. My own location suffered burst pipes over critical communications relays one year, a dropped carboy of ammonia in the blueprint shop another. All the equipment was fine -- but no one could get near it. Without a plan to keep the business going in another location, on equipment whose terms of use were already spelt out, these would have been paralysing moments. People would have gone running to look for their DRP, alerted the wrong management, or simply packed it in, had there not been a DRP tried, tested, and known to all.

    If you rely on plans drawn up by 'experts' to live on the shelves of the facility you cannot enter, then you deserve the result. Own your DRP. Drill it into the managers and their direct reports. Review the plan and testing results regularly. Then you'll have a disaster RECOVERY plan and not simply a disaster plan.

    It's as simple as that.
    progan01@...
  • Penny wise, pound foolish managers

    who won't spend the money to provide backup sites and equipment, who don't want to run tests because it 'takes too much time' from the business, who don't do a good enough job of calculating true total business losses for lack of IT, or who don't listen to the experts when we tell them what they really need.

    It's a crying shame that even when you put your job on the line they still won't listen. Of course natural selection eventually takes over; hopefully after you've gotten a job in a more intelligent environment.
    Dr_Zinj
    • There's a more intelligent environment?

      Where, where? Are they hiring?
      murph_z
  • Wag-the-Dog

    Once again many misunderstand the "nature of the beast." It is not "disaster recovery," it is "business continuity," or BCP. No one can possibly anticipate every contingency that could disrupt a business. The business unit managers within an organization need to intimately understand their respective business processes to know what is needed to run them, how long they can be without them, what can be done manually and at what cost, and finally, what the cost of being down is. That last calculation alone will more than justify most, if not all, of the IT expenditures for the equipment and services needed to meet the business unit's needs.

    Basically, what you need is a user requirements specification BEFORE you (as an IT person) do any planning. That requirements spec is an initial "buy-in" from management to foot the bill. If the requirements, at logical business unit levels, are properly specified, really needed, and backed by management, then upper levels of management can determine who gets what and when. With that done, it is a fairly simple and straightforward process for IT professionals to put the right resources in place. But, as always, test, test, and test. srk
    srk-once@...