Over the past two months, I've seen an increase in end-user inquiries about high availability (HA) and, almost more importantly, how to measure it. HA means something different depending on whom you're talking with, so it's worth a quick definition. I define HA as:
Focused on the technology and processes to prevent application/service outages at the primary site or in a specific IT system domain.
This is in contrast to disaster recovery or IT service continuity (ITSC), which is about preventing or responding to outages of the entire site.
Why so many inquiries about HA recently? I believe that, due to our increasing reliance on IT and the 24X7 operating environment, companies of all sizes and industries are becoming more and more sensitive to application and system downtime. The interest in measurement is driven by the need to continuously improve IT services and to justify IT investments to senior management, especially now.
So where to start? First, focus on the entire IT service, not just the individual infrastructure components. The availability of a service is the aggregate of the availability of every architectural component supporting it. Most components (networks, servers, storage, operating systems, etc.) are specified at 99.95%, or about 4.4 hours of downtime per year (based on 24X7). However, because the service depends on all of these components at once, the combined service or IT system availability falls below 99.95%. Given the increasing reliability of the components, most IT organizations still measure availability at 99.9% or above.
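To make the aggregation concrete, here is a minimal sketch. It assumes the components fail independently and that the service needs every one of them up at the same time (no redundancy), which is a simplification. The four-component service and the function names are hypothetical, chosen only for illustration.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365  # 24X7 operation, non-leap year


def serial_availability(component_availabilities):
    """Availability of a service that needs every component up at once.

    Assumes independent failures and no redundancy, so the individual
    availabilities simply multiply.
    """
    return prod(component_availabilities)


def annual_downtime_hours(availability):
    """Expected downtime per year implied by an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical service: network, server, storage, and operating system,
# each specified at 99.95%.
components = [0.9995] * 4
service = serial_availability(components)

print(f"single component: {annual_downtime_hours(0.9995):.1f} h/year")
print(f"combined service: {service:.2%}, {annual_downtime_hours(service):.1f} h/year")
```

Under these assumptions, four components at 99.95% each yield a service at roughly 99.80%, or about 17.5 hours of downtime per year instead of 4.4.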
However, a 99.9% availability figure can be misleading, depending on how the IT organization reports it. Does it cover only unplanned downtime, or does it include planned downtime as well? Raw availability counts both unplanned and planned downtime, while adjusted availability counts only unplanned downtime. Organizations often keep track of both. Also, when did the outages or disruptions to the services occur? Consider the difference between:
- 1 PM to 5 PM M-F; and
- Weekly outages of 30 min to 60 min at 4 AM local time or on the weekend
In many cases, timing and duration are more important than total downtime. Is the 99.9% availability based on 24X7 operations or on business hours? If you were down 4 minutes last month, was it during business hours or over the weekend?
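The sketch below shows how the same month of outages produces different numbers depending on what is counted. All of the data is hypothetical: a 30-day month, a business-hours window assumed to be 22 weekdays of 8 hours, and three made-up outages.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float           # duration of the outage
    planned: bool            # True for scheduled maintenance
    in_business_hours: bool  # True if it fell inside the business-hours window


def availability(total_minutes, downtime_minutes):
    """Fraction of the measurement window during which the service was up."""
    return 1.0 - downtime_minutes / total_minutes


MINUTES_24X7 = 30 * 24 * 60      # hypothetical 30-day month
MINUTES_BUSINESS = 22 * 8 * 60   # assumed window: 22 weekdays x 8 hours

outages = [
    Outage(minutes=45, planned=True, in_business_hours=False),   # 4 AM maintenance
    Outage(minutes=45, planned=True, in_business_hours=False),   # 4 AM maintenance
    Outage(minutes=40, planned=False, in_business_hours=True),   # afternoon failure
]

all_down = sum(o.minutes for o in outages)
unplanned_down = sum(o.minutes for o in outages if not o.planned)
business_down = sum(o.minutes for o in outages if o.in_business_hours)

raw = availability(MINUTES_24X7, all_down)             # planned + unplanned
adjusted = availability(MINUTES_24X7, unplanned_down)  # unplanned only
business = availability(MINUTES_BUSINESS, business_down)

print(f"raw 24X7:       {raw:.3%}")
print(f"adjusted 24X7:  {adjusted:.3%}")
print(f"business hours: {business:.3%}")
```

With this made-up data, the single 40-minute afternoon failure still reports as better than 99.9% on an adjusted 24X7 basis, but drops to roughly 99.6% when measured against business hours only.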
It's not as widely adopted as, say, incident management, but there is an ITIL availability management process. In ITIL v3, the suggested key performance indicators (KPIs) for availability management are:
- Availability of IT services compared to agreed-upon service-level agreements (SLAs)
- Duration of disruptions to IT services
- Number of disruptions to IT services
- Number of infrastructure components with availability monitoring
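As a rough illustration, the four indicators above could be derived from a simple disruption log and a component inventory. Everything here is an assumption made for the example: the log entries, the inventory, the 30-day reporting period, the 99.9% target, and the `kpis` function itself. ITIL does not prescribe this calculation.

```python
from datetime import datetime, timedelta

# Hypothetical disruption log for one service: (start, end) pairs.
disruptions = [
    (datetime(2024, 4, 4, 13, 0), datetime(2024, 4, 4, 13, 25)),
    (datetime(2024, 4, 17, 4, 0), datetime(2024, 4, 17, 4, 10)),
]

# Hypothetical inventory: component name -> availability monitoring in place?
monitored = {"network": True, "server": True, "storage": False, "os": True}


def kpis(disruptions, monitored, period, sla_target):
    """Summarize the availability KPIs for one reporting period."""
    durations = [end - start for start, end in disruptions]
    total_down = sum(durations, timedelta())
    achieved = 1.0 - total_down / period  # timedelta / timedelta gives a float
    return {
        "availability": achieved,
        "sla_met": achieved >= sla_target,
        "disruption_count": len(durations),
        "total_disruption": total_down,
        "longest_disruption": max(durations, default=timedelta()),
        "monitored_components": sum(monitored.values()),
    }


report = kpis(
    disruptions,
    monitored,
    period=timedelta(days=30),  # assumed reporting period, measured 24X7
    sla_target=0.999,           # assumed agreed-upon target
)

for name, value in report.items():
    print(f"{name}: {value}")
```

Reporting the count and the longest disruption alongside the percentage keeps the timing and duration questions visible instead of hiding them behind a single availability number.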
Developing and agreeing upon the SLAs is going to be the toughest part, but I think these KPIs are a good starting point toward metrics that matter to the business. And while your organization is unlikely to spend huge sums of money on proprietary fault-tolerant systems or high-end clustering solutions, there are cost-effective solutions that provide a rapid restart of IT systems or leverage virtualization technologies. The HA discussion is no longer an all-or-nothing discussion; it's a discussion about providing a range of offerings that deliver the required level of availability at a cost justified by the risk and cost of downtime.
Automating and measuring HA and ITSC will be a major focus of my research over the next several quarters. I'm very interested to hear from companies how they're approaching this today. What's working, what's not working?