Dodging data-centre disasters

What plans should you have in place to make sure your critical systems can cope with the unexpected?

Written by Jonathan Bennett, Contributor. July 17, 2007 at 4:34 a.m. PT

If you have a data centre, by definition it's critical to your business. If you've done your job properly, the sort of small-scale mishaps all companies encounter from time to time shouldn't result in any real disruption to your business. Fault-tolerant, high-availability servers, redundant storage, duplicated communications lines into the data centre and backup power systems can all ensure continuity of service.

Accidental data loss through human error can also be planned for, and the data restored in minutes once the problem becomes known. Malicious actions by disgruntled employees are harder to deal with quickly, particularly if they have extensive administrative rights, but it's possible to contain the damage. Company termination procedures should be designed to ensure that, if someone is being shown the door, their access to all systems is suspended as part of the process.

Occasionally, however, a rare but truly disruptive event will come along that threatens normal resiliency plans: in other words, a disaster. A disaster is sometimes defined as an event that can be predicted but not prevented, at least as far as an individual organisation is concerned. Once you've predicted that a disaster can happen, you need to estimate how likely it is to occur. Losing an entire data centre, functionally if not physically, is a very rare event: we have few earthquakes in the UK, and they're rarely strong enough to cause disruption to utilities, let alone physical damage to buildings. Terrorism, while high on the news agenda at the moment, is still incredibly unlikely to affect you, especially if you're not located in central London. Research last year by Gartner showed that few companies are interested in high-level disaster planning of the kind that's needed to cope with this level of event.

Timothy Coats, business continuity practice lead for EMC Infrastructure Consulting, says that this kind of complete disaster is highly unlikely: "Generally, the loss of an entire data centre is a rare event. In many instances, a data centre will cease functioning. That's a more common occurrence."

Planning for the worst-case scenario
Disaster-recovery planning for a data centre cannot take place in isolation: it has to be part of an overall business continuity plan for the whole company. Coats believes the rest of the business needs to be involved in the decision-making process: "The responsibility of the chief information officer is to make known to the stakeholders the risk. The business should know where its vulnerabilities are. Unless you know, you're rolling the dice and closing your eyes."

If an event serious enough to take an entire data centre offline occurs, the chances are the business has been affected in other ways as well. There's no point in restoring a business-support function, like a data centre, if there's no business left to support, particularly if your data centre is co-sited with your core operations: a manufacturing company's ability to produce products may well be disrupted or destroyed in a disaster; service companies could have no viable office space left for people to work in and no way of acquiring any in a reasonable period of time. Worst of all, a true disaster may involve loss of life. Including this thinking in any disaster-recovery plan may be horrific, but it's necessary.

Sometimes the disaster may not even affect an organisation's own operations. "You need to plan for the recovery of the loss of other business services, such as a major supplier," says Coats. While IT systems have their part to play in enabling a quick switch of suppliers, this kind of event falls outside the scope of a data-centre disaster-recovery plan.

Decisions about disaster avoidance and recovery strategy are therefore based on economics, not on what's technically possible. Choosing the right technology to help the recovery from any problems is vital, of course. But since the options range from simple data-recovery tools to a parallel-computing facility, the crucial step is deciding what is reasonable and prudent for the organisation to invest in.

The more a business relies on its data centre, the easier decisions become. For a company in the finance sector, downtime means losing millions of pounds an hour. In that case, spending a few million on a shadow data centre will pay for itself in just one short power outage. Services being unavailable for the amount of time it would take to rebuild the data centre is unthinkable for this industry.
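
The break-even reasoning above can be put into numbers. The following is a minimal sketch, not a formal risk model: the function names and all of the figures are hypothetical, chosen only to illustrate the arithmetic of comparing the cost of mitigation with the loss it avoids.

```python
def break_even_hours(mitigation_cost: float, loss_per_hour: float) -> float:
    """Hours of avoided downtime needed before the mitigation pays for itself."""
    if loss_per_hour <= 0:
        raise ValueError("loss_per_hour must be positive")
    return mitigation_cost / loss_per_hour


def expected_annual_loss(annual_probability: float,
                         outage_hours: float,
                         loss_per_hour: float) -> float:
    """Probability-weighted cost of an outage over one year."""
    return annual_probability * outage_hours * loss_per_hour


def mitigation_is_justified(annual_mitigation_cost: float,
                            annual_probability: float,
                            outage_hours: float,
                            loss_per_hour: float) -> bool:
    """True when the expected loss exceeds what the mitigation costs per year."""
    loss = expected_annual_loss(annual_probability, outage_hours, loss_per_hour)
    return loss > annual_mitigation_cost


if __name__ == "__main__":
    # Hypothetical finance firm: a 3m shadow site against 2m lost per hour.
    print(break_even_hours(3_000_000, 2_000_000))  # 1.5

    # Hypothetical smaller firm: same style of check, far lower hourly loss.
    print(mitigation_is_justified(500_000, 0.02, 48, 5_000))  # False
```

With figures like these, the finance firm recovers its outlay after an hour and a half of avoided downtime, while the smaller firm's expected loss falls well short of the annual cost of the facility.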

The number of companies that can afford this type of backup facility is limited, though. For the rest of us, the cost isn't justified by the risk faced: a second data centre will never pay for itself. Coats says EMC's non-finance clients usually don't adopt this approach. "About 50 to 60 percent [of our clients] are in the financial sector. Of those not in the financial sector, about 20 percent have a second data centre," he explains.

There are alternative, less costly strategies that accept a longer downtime. A short recovery time may not be necessary for every system within an organisation: it may make sense to provide one for "only one or two critical systems that need shorter downtime", according to Coats. Gartner claims around 10 percent of applications are considered critical at present, although this figure is expected to rise to 25 percent by 2010, as businesses become more dependent on their information systems.

Restoring these crucial business services can be achieved far more easily than trying to recreate the whole data centre at once. Having just enough spare hardware available — in secure storage, for example — becomes more cost-effective. For less critical services, a company may be able to tolerate downtime long enough for new machines to be delivered. Checking the typical lead time for hardware vendors and investigating any fast-delivery services they may have will pay off if this is the case. Using virtualisation will also make restoring individual applications easier, as virtual machines can be deployed to whatever hardware is available in the shortest time.
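
One way to picture this tiering is to compare each service's tolerable downtime with how long each recovery route takes. The sketch below is illustrative only: the `Service` class, the strategy labels and the hour figures are assumptions, not taken from any vendor or from Coats.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    tolerable_downtime_hours: float  # how long the business can do without it


def recovery_strategy(service: Service,
                      spare_restore_hours: float,
                      vendor_lead_hours: float) -> str:
    """Pick the cheapest route that still meets the tolerable downtime."""
    if service.tolerable_downtime_hours >= vendor_lead_hours:
        # The business can wait for new machines to be delivered.
        return "order replacement hardware"
    if service.tolerable_downtime_hours >= spare_restore_hours:
        # Spare kit held in secure storage can be brought up in time.
        return "restore onto stored spare hardware"
    # Neither route is fast enough; something already running is needed.
    return "needs a standby facility"


if __name__ == "__main__":
    # Hypothetical services and timings.
    services = [
        Service("payroll", 240),
        Service("order processing", 24),
        Service("trading", 1),
    ]
    for s in services:
        print(s.name, "->", recovery_strategy(s, spare_restore_hours=8,
                                              vendor_lead_hours=120))
```

Checking the vendor's real lead time matters here, because it sets the threshold between the first two routes.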

Doing nothing is sometimes an option. For some companies, after making the risk calculations and figuring out the cost of disaster mitigation, it simply may not be worth spending the money for what it will achieve. The board's responsibility to the shareholders is to ensure the maximum return on their investment, not necessarily to keep the business going at all costs. If this means winding up the company and selling off any remaining assets after a disaster, then that's what they need to do but, as with any decision of that nature, it's one the board has to make.

Even if this is the company's plan, a certain amount of work still needs to be done to ensure that the receivers of the company have enough information to do their job properly: financial records need to survive the disaster, even if the business doesn't.

Rehearsals are key to successful recovery
Whatever your company's size and disaster-recovery strategy, you must plan and budget for one extra factor: rehearsals. Regular testing of your plan will not only ensure that everyone involved knows how the strategy will work in practice, but will also allow you to make changes in the plan to reflect changes in the business environment since it was first written.

IBM's Redbook on disaster-recovery planning recommends testing your procedures at least once a year, over and above regular testing of backups and spare hardware. It also points out that you'll never get a truly realistic test of your strategy, since, in the real event, some of your staff may not be available and those that are available are likely to be distracted and under greater stress than during a test.

The Redbook also recommends changing staff roles during tests — making your database administrator deal with network configuration, for example — to reflect what may happen in a real disaster, but also to give an idea of how much of your plan is truly documented and how much is held only in the heads of your staff. Document everything that goes wrong during your test, then adjust the plan accordingly.

Disaster-recovery strategy is as much about finance and business relationships as it is about technology. Not having a strategy is certainly negligent, but spending too much time and money on trying to plan for highly unlikely events doesn't do much for shareholder value either. A good data-centre recovery plan doesn't treat the data centre as an isolated, monolithic lump, but takes into account what services it provides to the rest of the business, and how the users of those services are likely to be affected both by downtime and by the cause of the problem. Combining a sensible assessment of the likelihood of a disaster with a realistic set of targets should ensure your disaster-recovery plan doesn't become a burden and stays in proportion to the risks your company faces.