When organizations put a lot of eggs in one basket

Virtual processing and fault tolerant systems are a good match.

When organizations move into virtual processing environments, that is, environments offering capabilities that range from making many machines look like a single computing resource to making a single machine look like many, it is wise to match the machines used to the business requirements for availability and reliability. When many workloads are deployed on a single computer, that computer must be selected to meet the organization's availability requirements. There are, of course, a number of ways to meet those requirements. Let's look at them, shall we?

Clustering

Cluster a number of machines together and then improve availability and reliability by using virtual machine movement technology combined with automation and orchestration software. This approach may meet the minimum availability requirements, but it also means that the organization must purchase multiple physical machines, a storage virtualization solution, the VM motion technology, the management software that determines whether service levels are being met, and the automation/orchestration software that moves workloads around to meet those objectives. This approach can be complex, can require several different types of expertise, and may not offer sufficient levels of availability.
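Why might a cluster with redundant machines still fall short? A rough way to reason about it is the standard reliability arithmetic: redundant nodes combine in parallel, while every shared layer the cluster depends on combines in series. The sketch below assumes independent failures and instantaneous failover, and every availability figure in it is a hypothetical placeholder, not a measurement of any product.

```python
from math import prod


def parallel_availability(node_availability: float, nodes: int) -> float:
    """Availability of redundant nodes when any one node can carry the workload.

    Assumes failures are independent and failover is instantaneous.
    """
    return 1.0 - (1.0 - node_availability) ** nodes


def series_availability(component_availabilities) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(component_availabilities)


# Hypothetical figures, chosen only for illustration.
node_pool = parallel_availability(0.99, nodes=2)   # two redundant hosts
shared_storage = 0.999                             # storage virtualization layer
orchestration = 0.999                              # management/automation layer

cluster = series_availability([node_pool, shared_storage, orchestration])
print(f"redundant hosts alone: {node_pool:.4%}")
print(f"whole cluster stack:   {cluster:.4%}")
```

Under these assumptions the redundant hosts look very good on their own, but the stack as a whole can be no better than its weakest shared layer, which is one way to read the caution above.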

Fault tolerant systems

Another approach is to purchase special purpose machines that have been designed to be fault tolerant. That is, the failure of a single component (or in some cases many components) will not cause a workload failure. These machines are more expensive than a single traditional machine, but once all of the other costs of a clustered configuration are included (costs for additional operating systems, hypervisors, management tools, VM movement tools, automation/orchestration software and the like), they can be significantly less costly than a cluster. They also create a much simpler, easier to manage environment.

Which to choose?

Both of these approaches are useful. If very high levels of availability and uptime are required, the hardware-based approach, that is, deploying fault tolerant computers, may be the best choice. If a workload is important but not critical to the organization's survival, clustering might be an acceptable choice.

Snapshot Analysis

It appears that Stratus Technologies and NEC Corporation of America are battling to be the supplier of choice for fault tolerant systems.

NEC recently launched a new system, the NEC Express5800 R320a FT, and claims this server provides 99.999 percent uptime for mission-critical applications in both virtualized and on-premises data centers.

Stratus Technologies has long offered hardware and software availability solutions. The hardware systems can offer up to 99.9999 percent uptime. The software solutions offer slightly less than that. The company is so proud of the level of availability its customers are experiencing that it posts that number on its website. At this moment, Stratus customers are experiencing 99.99989 percent uptime.
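Uptime percentages like these are easier to compare once they are converted into allowed downtime. The short sketch below does that conversion for the three figures quoted above, assuming a 365-day year and treating each percentage as a simple yearly average (the vendors may measure it differently). The function name is just an illustrative choice.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into expected downtime per year."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


# The three uptime figures quoted in the text.
for pct in (99.999, 99.9999, 99.99989):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime -> {minutes:.2f} minutes ({minutes * 60:.0f} seconds) per year")
```

By this arithmetic, 99.999 percent works out to a little over five minutes of downtime a year, while the six-nines class of figures comes to roughly half a minute.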

Those needing this level of availability would be wise to examine the offerings of both suppliers.