Levels of availability and disaster recovery

I visited a Kusnetzky Group client, Racemi, several weeks ago. We had a fantastic discussion of disaster recovery and all of the different approaches organizations may choose to deploy to achieve the level of availability needed for their workloads. Racemi, for those who don't know the company, offers the DynaCenter family of products that are targeting cost-effective automated rapid server recovery - even on dissimilar hardware.

How much is enough - levels of availability

Availability is a good thing; after all, automated work only gets accomplished if the IT infrastructure is working and available. (As my grandson would say, "Well, duh.")

As is always the case with IT-based solutions, there are several ways to look at this concept. One is based upon what percentage of uptime is required to fulfill an organization's needs. Let's look at percent uptime and then calculate the downtime that would be experienced.

Level of availability    Downtime/year
90%                      36.5 days
95%                      18.25 days
99%                      3.65 days
99.9%                    8.76 hours
99.99%                   52.6 minutes
99.999%                  5.26 minutes
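The table above is simple arithmetic: downtime per year is just the fraction of the year the system is *not* available. A quick sketch of that calculation (the function name is my own, for illustration):

```python
# Minutes in a (non-leap) year: 365 days * 24 hours * 60 minutes
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct):
    """Expected yearly downtime, in minutes, for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (90, 95, 99, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% available -> {minutes:,.1f} minutes/year "
          f"({minutes / 60:,.2f} hours, {minutes / 1440:,.2f} days)")
```

Running this reproduces the table: 90% availability works out to 36.5 days of downtime a year, while each added "nine" cuts downtime by a factor of ten.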
It is pretty obvious that 36 days of downtime would be unacceptable for some workloads. Some workloads can never see a failure, or the organization might face significant losses. On the other hand, availability can be too much of a good thing. As an organization increases the level of availability, it also significantly increases the cost of both hardware and software. The level of complexity in the datacenter increases dramatically as well.

What's a person to do?

Today's systems, storage and networks are pretty reliable (provided the power remains available). So, achieving 90% availability may not require a great deal of additional software or hardware.

Moving up to 95% availability is likely to require system or application clustering combined with redundant hardware. It may also be possible to meet this level of availability with redundant systems combined with something much simpler - the backup/archiving/disaster recovery tools made available by a number of vendors (including Racemi, of course).

Moving up the next step, to 99% availability, almost always requires redundant systems, power supplies, storage and networking. Some form of multi-system clustering is an absolute requirement at this point. There are many ways to attack this problem - clustering (at the operating system or application level), virtual system movement tools (XenMotion, VMotion, Live Migration), or virtual system orchestration/automation tools (Cassatt, VMLogix, Novell, Scalent Systems and Surgient all play here).

Going much beyond that often requires complex planning and datacenter design. I would bet that your suppliers would just love to sell planning and implementation services to your organization.

What is your organization doing to make sure that applications are available when needed?