IBM pitches reliability for Regatta

Big Blue plans to bring back some old tricks to create its forthcoming Regatta server, the first of a new line of Unix servers due this fall. IBM says the new servers will provide mainframe-like features.
Written by John G. Spooner, Contributor
IBM intends to apply some old routines to make its new servers more reliable.

On Tuesday at Stanford University's Hot Chips conference, the company revealed several new technologies under a blanket theme called "Who's on First".

The technologies will go into IBM's forthcoming Regatta server, the first of a new line of Unix servers due this fall, which the company says will provide mainframe-like features. It aims to compete with the likes of Sun Microsystems' forthcoming StarCat server.

IBM is shooting for the machine to provide mainframe stability on the Unix operating system.

"Applying the Who's on First technique, when something starts going bad...we can figure out who the cause was and what the effects were," said Joel Tendler, program director for technical assessment in IBM's Server Group. "Otherwise, you end up in the Abbott and Costello routine, going around in circles."

Regatta also borrows from IBM's Project eLiza, a set of self-management technologies designed to keep systems running in the face of errors by taking affected components offline.

The new approach combines eLiza ideas with new Regatta server features.

It's no secret that IBM's forthcoming Power4 processor will power Regatta servers, in configurations of up to 32 processors. The chip, based on IBM's PowerPC RISC processor, combines two processor cores running at 1GHz or faster.

But IBM says the chip packs a number of new features that will help build a more reliable server.

The Power4 will provide, among other things, extra room in its Level 3 cache memory, so errors can be recognized and purged if necessary. The chip will also be able to bypass Level 2 or Level 3 cache, should the cache contain bad data. Similarly, the chip has been fitted with features that can alert applications to the presence of bad data.

At the same time, Power4 includes a feature that allows a processor to be decoupled from the machine, which will continue to run at reduced performance.

The idea behind the features is to increase the amount of uptime the server delivers to customers, Tendler said.

"To talk about reliability, you can't just talk about one component," he added. "You have to talk about the whole system."

As a result, Regatta will figure out who's on first with a number of new hardware-software reliability features, starting with support for error-correcting code memory.

Regatta will also be fitted with its own nervous system: a network of some 5,600 sensors that IBM says will monitor performance and overall system health.

The sensors "allow me to implement this reliability and bring availability of servers up to new levels," Tendler said. If there is a failure, the sensors "contain the failure as much as possible, so you can keep running."

Another new feature is called "PCI retry", which allows the server to retry an operation involving the PCI bus if it fails initially. Normally, a server would freeze and have to be rebooted, according to IBM.
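
The general idea of retrying a failed operation rather than halting can be sketched in a few lines of Python. This is only an illustration of the retry pattern, not IBM's implementation: `BusError`, `with_retry`, `make_flaky_read` and the three-attempt limit are all hypothetical names and values chosen for the example.

```python
class BusError(Exception):
    """Hypothetical error raised when a simulated bus operation fails."""


def with_retry(operation, max_attempts=3):
    """Run operation, retrying on BusError.

    Returns (result, attempts_used). Re-raises the last BusError if
    every attempt fails, which is the point where a real system would
    have to escalate.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except BusError as err:
            last_error = err  # transient fault: try again
    raise last_error


def make_flaky_read(failures_before_success, value):
    """Build a simulated read that fails a fixed number of times first."""
    state = {"remaining": failures_before_success}

    def read():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise BusError("transient bus fault")
        return value

    return read


if __name__ == "__main__":
    result, attempts = with_retry(make_flaky_read(1, 0xAB))
    print(hex(result), "after", attempts, "attempts")
```

A read that fails once succeeds on the second attempt; one that keeps failing past the limit still raises, so persistent faults are not hidden.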

Another feature--"memory scrubbing"--checks for errors in memory to note bad data and prevent its use or reuse.
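
As a rough picture of what a scrubbing pass does, the Python sketch below walks a simulated memory and flags words whose stored check bit no longer matches their contents. It is an assumption-laden toy: it uses a single parity bit per word for simplicity, whereas real servers rely on error-correcting codes, and the names `parity`, `store` and `scrub` are invented for this example.

```python
def parity(word):
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def store(words):
    """Simulate writing words to memory alongside a parity bit each."""
    return [(w, parity(w)) for w in words]


def scrub(memory):
    """Walk memory and return the addresses whose parity check fails."""
    return {
        addr
        for addr, (word, stored_parity) in enumerate(memory)
        if parity(word) != stored_parity
    }


if __name__ == "__main__":
    memory = store([0b1010, 0b0111, 0b0001])
    # Simulate a single-bit flip in the word at address 1.
    word, check = memory[1]
    memory[1] = (word ^ 0b0100, check)
    print("bad addresses:", scrub(memory))
```

Addresses reported by the pass could then be marked so the data is not used or reused, which is the behavior the feature is described as providing.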

Finally, a feature called "error safeguard" can sniff out problems early, preventing them from escalating. A database failure, for example, could be limited to a single process or a partition.

Though IBM is bringing many new features to Regatta, the idea of creating a mainframe-like server has been pursued for some time. Sun, for example, introduced new eight-processor Sun Fire 3800 servers last March that support a feature called partitioning, borrowed from mainframes. Partitioning allows a company to virtually divide a server into several smaller machines that can run independent jobs, such as testing software.

As competition among hardware makers such as IBM and Sun intensifies, it's likely they will bring additional technologies down the ladder to high-end and midrange servers.

"We're saying, 'Keep it working. Don't go down. And if you have a problem, minimize the impact,'" Tendler said.
