Why server users should be fault-intolerant

Peter Judge: Every server maker is focused on making Intel servers better datacentre citizens. Why not go all the way and make them fault-tolerant?

For more than 20 years, there has been a significant market for fault-tolerant servers. When businesses started to automate, it pretty soon became clear that some applications were so important that the company could not afford to have them fall over.

If you have to reboot a Las Vegas casino, it can cost $10m. If you have to reboot a stock exchange, it can cost $100m.

Typically, fault-tolerant systems use twice as many processors as the application needs, but run them in lock-step. The same answers should come out of both halves of the system at all times. Any disagreement, and it is obvious one board has failed. A quick look at some checksums, and the system kills the one that failed, and alerts the IT manager with a request for a replacement board.
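For readers who like to see the idea in code, here is a minimal sketch of that lock-step check. It is a toy model, not how Stratus or Tandem actually built their hardware: the Board class, the CRC-based self-check and the lockstep_step function are all hypothetical names invented for illustration, and a real system does this comparison in hardware on every cycle.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Board:
    """One half of a hypothetical lock-step pair."""
    name: str
    memory: bytearray   # stand-in for the board's internal state
    checksum: int       # CRC recorded while the board was known to be good
    online: bool = True

    def compute(self, x: int) -> int:
        # Toy workload: the answer depends on the input and the board's state.
        return (x + sum(self.memory)) % 256

    def self_check(self) -> bool:
        # A corrupted board no longer matches its recorded checksum.
        return zlib.crc32(self.memory) == self.checksum


def make_board(name: str, contents: bytes = b"\x01\x02\x03\x04") -> Board:
    return Board(name, bytearray(contents), zlib.crc32(contents))


def lockstep_step(a: Board, b: Board, x: int) -> int:
    """Run x on every online board and return the agreed answer.

    On disagreement, any board that fails its self-check is taken
    offline and an alert is printed (assumed behaviour for this sketch).
    """
    boards = [brd for brd in (a, b) if brd.online]
    if not boards:
        raise RuntimeError("no boards left online")
    results = {brd.name: brd.compute(x) for brd in boards}
    if len(set(results.values())) > 1:
        for brd in boards:
            if not brd.self_check():
                brd.online = False
                print(f"ALERT: board {brd.name} failed, please replace it")
        survivors = [brd for brd in boards if brd.online]
        if len(survivors) != 1:
            raise RuntimeError("boards disagree but the fault cannot be isolated")
        return results[survivors[0].name]
    return results[boards[0].name]


if __name__ == "__main__":
    a, b = make_board("A"), make_board("B")
    print(lockstep_step(a, b, 5))   # both boards agree
    b.memory[0] ^= 0xFF             # simulate a hardware fault on board B
    print(lockstep_step(a, b, 5))   # B is retired, A's answer is used
    print(b.online)
```

The point of the design is that the application only ever sees one answer: the caller of lockstep_step never learns that a board was lost, which is why such a system looks like a single server that simply keeps running.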

This made for expensive systems (think twice as much as a regular one) that could only be justified by very valuable applications. In the early days, the industry was prepared to support variegated hardware and software. Suppliers, such as Tandem (founded 23 years ago) and Stratus (founded 25 years ago), chose the processors they wanted and created their own hardware and software for the purpose. The systems sold well for jobs like running banks and telecoms services.

Things are different now. Hardware has become more uniform: there are a limited number of Risc processors with a future, and the majority of servers use Intel and standard operating systems. At the same time, conventional servers have become more reliable -- think of IBM's eLiza initiative, and the growth of clusters with failover.

Users would have to be very convinced of the merits of a fault-tolerant system to pay over the odds and buy a non-standard system. The server might run non-stop, but what if the supplier went out of business, or the range got canned? Whether the systems were from niche-y vendors, or from outposts of big companies, their future has looked unsure, and the market for those fault-tolerant systems has become less obvious.

In these tough years, the two front-runners in fault-tolerant systems all but disappeared, suffering in the late-90s epidemic of dodgy mergers. Tandem was swallowed up by Compaq, suffered some years of confusion, and is now the NonStop division of HP. After migrating to the Digital Alpha processor, the NonStop range is now in the process of moving on to Itanium, with all the implied upheaval to users' applications.

If anything, Stratus' prospects seemed even worse during this time. It got caught up in a messier merger than Compaq/Digital. It was bought by Ascend, which went on to be bought by Lucent.

Why did networking companies want those servers? Because Stratus was big in the telecoms markets, and therefore apparently key to the dot-com boom. So Lucent tore Stratus up, kept the bit that sold to telecoms, and hived the rest off.

Despite this kind of corporate dismemberment, Stratus claims to have thrived. It has made the next sensible step for fault-tolerant systems, which is to bring reliability to bear on the weakest link in the datacentre: the Intel server. Stratus recently reabsorbed the remains of its telecoms division from a chastened Lucent.

The fault-tolerant market that Stratus is aiming for is almost the opposite of the one it started in. Twenty years ago, there were specialised applications that needed to be reliable and could justify costly, specialised hardware. Now the applications that need to be reliable are generic, horizontal ones -- email servers and the like. People won't pay double for hardware to run them, and won't port apps to outlandish systems.

Stratus' idea is to run two Intel servers in lock-step, with error-checking. To the application, and the user, it looks like a conventional server -- albeit one that doesn't stop working. "We want our hardware to be comparable in price with a two-node failover cluster," says David Chalmers, director of products and technology EMEA for Stratus.

He moves on from there to that essential of tech business in 2003 -- the total cost of ownership (TCO) sale. If your systems won't save the IT manager money, you can forget about trying to sell them.

Stratus' hardware might cost 15 percent more, he concedes, but the effort in managing the server is vastly smaller. And -- good news -- software companies generally don't worry that a fault-tolerant box is running two copies of their software, so you don't have to pay a second licence fee.

Interestingly, Microsoft went with this arrangement very quickly -- perhaps because it wanted to add to its reputation for reliability -- while Oracle has only acquiesced in the last few weeks. Until then, running Oracle on a fault-tolerant box cost you twice as much for the same performance, because each of the lock-step processors needed a separate licence. Perhaps this had something to do with the "Unbreakable" campaign the company was running last year, emphasising how reliable Oracle was in the first place...

The problem for Stratus is that -- as we already noted -- conventional servers are getting more reliable. Systems like the IBM x440 have dual power supplies, and processor boards that can be hot-swapped. With massive marketing budgets for IBM's autonomic computing, Stratus and its ilk will have to work very hard to convince users they have a margin of reliability large enough to justify a price margin over those servers.

And they will have to fight even harder to keep that price margin manageable, thanks to industry standardisation. Everyone is using the same Intel processors, and bigger companies will be able to make it very hard for the smaller ones.

"The balance between small and big Intel players is controlled partially by the discounts," points out analyst Martin Hingley, vice president of IDC's European systems group. "I'm not sure how Stratus can compete against Dell and HP on margins for chips. Let's hope their clustering has a strong value proposition for users. We need the smaller players to bite the heels of the global ones."

I'd certainly like to think that smaller vendors with ideas can continue to do more than the big guys -- and if the net result is better-behaved servers, then we all benefit.
