Stratus: The real cost of fault tolerance

In April 2007 I posted "Fault Tolerant and Fail Over is There a Difference?" In that post I explored the differences between a failover environment and an environment that cannot appear to fail. Here's a segment of that post.

All of these are well and good. What happens, however, when the requirement is that failures are never seen? This is the realm of FT systems. In this case special purpose, redundant hardware configurations are deployed that are run in lock-step. If one component of the system fails, the others continue working and the application does not fail. Historically, FT solutions were quite expensive. After all, every component of the system had to be replicated enough times to handle all expected failure scenarios. More recent solutions, offered by suppliers such as Stratus and Marathon, are based upon industry standard systems and components. The use of off-the-shelf hardware significantly reduces the price of these solutions.

Later, I published another post, "Conversation with Marathon Technology", that summarized an interesting conversation I had with the company. Here's a snippet from that post.

Fault tolerant hardware typically was more costly than general purpose systems having similar processor, memory and storage configurations because every component was duplicated at least once. If an organization stood to lose enormous amounts of revenue due to a failure, it would purchase these systems regardless of the cost of the hardware. Since the cost of these configurations was high and these systems had to be treated as a single computer, many organizations turned to other types of virtualization, such as clustered systems, when their need for constant availability was not quite as high. While clusters took longer to deal with a "state change", the systems involved could all be productive rather than being treated as merely a "hot" backup. Marathon's everRun™ FT creates a true fault tolerant environment using general purpose industry standard systems connected by Gigabit Ethernet. This means applications hosted in an everRun environment do not see failures. Processing "fails through" to remaining resources when something fails. Marathon is supporting Windows-based applications today and will support Linux-based applications in the future.

I wanted to speak with someone from Stratus Technologies immediately after my conversation with the folks from Marathon Technologies. Unfortunately, that was not to be. The representative of Stratus Technologies is in the UK and it took some effort to find a time that would work.

I spoke with Andy Bailey, a representative of Stratus Technologies, yesterday. It was fascinating to say the least. Thanks, by the way, for being so generous with your time, Andy. During this discussion we covered quite a bit of ground. Here's a brief summary:

  • Are fault tolerant machines too expensive? Although the initial acquisition cost is somewhat higher for truly fault tolerant machines, once the real costs of running a system for 3 or 5 years are considered, a fault tolerant machine could actually cost less than maintaining multiple independent machines, clustering software and virtual systems management software. Andy pointed out that using a fault tolerant machine would, in all probability, reduce management costs and also reduce the chance that the organization would lose revenue due to either planned or unplanned downtime.
  • What types of applications benefit most from a fault tolerant infrastructure? Gaming, financial, manufacturing, military, public safety, telephony and any other application that simply cannot be seen to fail.
  • Are any trends leading to the adoption of fault tolerant systems? As organizations increasingly turn to virtual machine software or partitioned operating systems for workload consolidation, fault tolerant systems look more and more attractive. After all, if the organization is putting a large number of its "IT eggs" in one basket, that basket had better be very sturdy.
  • Why is a system from Stratus a better choice than installing industry standard systems and linking them together with clustering or some other form of virtualization software? A system designed from the ground up to be fault tolerant would offer better performance. Andy speculated that other approaches to offering true fault tolerance on industry standard systems would carry a significant amount of overhead, because something has to keep things working in lock step. While the system is doing that, it's not working on the organization's applications. Hardware-based fault tolerance does not impose that type of overhead. It is also considerably less complex from a software point of view. Fewer moving software parts usually result in higher levels of reliability. Other solutions can be made to do the job, but it is likely that the "fail through" time will be greater and organizations would need to acquire expertise in clustering software, management software for virtualized environments and the like.
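
The cost argument in the first point is easier to weigh with numbers plugged in. What follows is only a rough sketch of that comparison: the `Option` fields, the `total_cost` function and every figure in it are hypothetical placeholders of my own, not numbers supplied by Andy, Stratus or Marathon. It assumes total cost is simply acquisition, plus yearly management, plus revenue lost to expected downtime.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One way of meeting an availability requirement (all inputs hypothetical)."""
    name: str
    acquisition: float              # one-time hardware and software purchase
    annual_management: float        # yearly administration, licences, support
    downtime_hours_per_year: float  # expected planned plus unplanned downtime


def total_cost(option: Option, years: int, revenue_per_hour: float) -> float:
    """Acquisition + management over the period + revenue lost to downtime."""
    management = years * option.annual_management
    lost_revenue = years * option.downtime_hours_per_year * revenue_per_hour
    return option.acquisition + management + lost_revenue


# Placeholder figures chosen only to illustrate the shape of the comparison.
fault_tolerant = Option("fault tolerant server", 60_000, 10_000, 0.5)
cluster = Option("cluster of standard servers", 40_000, 20_000, 4.0)

for years in (3, 5):
    for option in (fault_tolerant, cluster):
        cost = total_cost(option, years, revenue_per_hour=5_000)
        print(f"{years} years, {option.name}: {cost:,.0f}")
```

Substitute your own organization's acquisition, staffing and downtime figures. With different inputs the cheaper option can easily flip, which is why the comparison is worth running over the full 3 or 5 years rather than on purchase price alone.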

Marathon Technologies is saying many of the same things and is offering fault tolerant systems based upon off-the-shelf industry standard systems. It will take some time for me to fully digest everything I've heard. Does your organization use either Stratus or Marathon Technologies as part of the IT infrastructure? Why would one approach win out over the other?