- System keeps running during almost all types of hardware failure
- Current VMware support has a few limitations, such as not supporting the use of internal hard disks
Launched in May, the Stratus ftServer 6210 sits at the high end of Stratus's range of fault-tolerant Intel-based servers. The ftServer 6200 came with 2.6GHz quad-core processors, while the 6210 has 3GHz quad-core chips and an increased maximum RAM configuration of 32GB (up from 24GB). In January, Stratus released a bundle comprising the ftServer 6200 and a pre-installed copy of VMware ESX Server.
As VMware has no built-in cluster support, the Stratus VMware bundle is the only way to build what is in effect an active/active cluster for VMware ESX Server.
We tested the VMware bundle using an ftServer 6200 fitted with 8GB RAM, two internal SAS hard disks, an Emulex Fibre Channel host bus adapter (HBA) and VMware VI3. The price for this configuration was £26,015 (ex. VAT). We also tested the same hardware running Windows Server 2003 Enterprise Edition.
Like other servers in the Stratus range, each ftServer 6200 system actually comprises two identical sets of server hardware, each fitted in a 2U rack enclosure and linked by Stratus hardware that ensures that if a component in one server fails, the other server automatically picks up the workload. So although our review system is described as having two processor sockets, 8GB RAM and one HBA, the whole system actually has four CPU sockets, 16GB RAM and two HBAs. Local storage is configured as a set of mirrored disks, with disk 0 from one half of the server pair mirrored to disk 0 in the other half. SAN storage is assumed to be highly fault tolerant in its own right and so is not mirrored in this way. With or without SAN, the result is a high-availability server system that does not rely on complex clustering software.
Enterprise Editions of Windows Server include Microsoft Cluster Service and so could be clustered for high availability. However, the Stratus approach has several advantages compared to this type of Windows cluster. For example, Stratus kit ensures there is no application downtime even if one half of the server pair fails completely. By contrast, only applications that are fully cluster aware offer that degree of high availability using Microsoft Cluster Service.
Moreover, the Stratus hardware is clever enough to keep working even if both halves of the server pair suffer hardware faults, provided that each half of the server pair suffered a fault in a different core element. For example, the system would keep working if there was a CPU/RAM fault in one half and an I/O failure in the other. Although dual failures of this nature are extremely rare, the Stratus system would survive them while a simple dual cluster server configuration would not.
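The survival rule described above can be expressed as a small check. The following is a minimal sketch, not Stratus's implementation: it assumes each half consists of just two core elements, labelled here `cpu_ram` and `io` (hypothetical names), and that the pair keeps running as long as each element is still healthy in at least one half.

```python
# Illustrative model only: the element names and the two-element split are
# assumptions made for this sketch, not Stratus's actual design.
CORE_ELEMENTS = ("cpu_ram", "io")


def pair_survives(faults_a, faults_b):
    """Return True if every core element is still healthy in at least one half.

    faults_a and faults_b are sets naming the failed elements in each half.
    """
    return all(
        element not in faults_a or element not in faults_b
        for element in CORE_ELEMENTS
    )


if __name__ == "__main__":
    # CPU/RAM fault in one half, I/O fault in the other: still running.
    print(pair_survives({"cpu_ram"}, {"io"}))
    # The same element has failed in both halves: the pair goes down.
    print(pair_survives({"cpu_ram"}, {"cpu_ram"}))
```

In this model a conventional two-node cluster corresponds to treating each half as a single element, which is why it would not survive the mixed-fault case.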
Given that the ftServer 6200 fitted with 8GB RAM and running VMware ESX Server could host eight or more Windows server systems, the Stratus VMware bundle could prove cheaper than running eight Windows Server 2008 Enterprise Edition systems in clustered configurations.
In our tests we found it difficult to tell the difference between ESX Server running on a normal host and ESX running on Stratus. In both cases, server administrators would normally manage the ESX Server environment using VMware VirtualCenter running on a separate Windows workstation. For its part, VirtualCenter is unaware that ESX Server is running on Stratus hardware. About the only hint that ESX could be running on Stratus kit is that two sets of network interface cards (NICs) and storage HBAs are visible for each connection in the relevant VirtualCenter configuration page.
ESX Server administrators wanting to inspect the Stratus hardware would normally use the optional Virtual Technician Module (VTM) lights-out management interfaces, which cost £821 per pair. These provide dedicated Ethernet connections to Java-based management consoles that can be accessed via a web browser. We used the VTMs to perform high-level functions such as inspecting the system event log, accessing the system power button and activating a graphical remote-control session. For lower-level management information, we made an SSH connection to the ESX Service Console and used ftsmaint, a Stratus command-line tool. In our tests we used ftsmaint to list the hardware fitted in our environment and check its status.
As mentioned earlier, the VMware bundle was launched in January so it's early days for Stratus running VMware. Stratus currently supports ESX Server 3.0.2ft, and there are a few shortcomings. For example, the main ESX keyboard interface is disabled, so system administrators must connect to the ESX Service Console using a serial interface or an SSH session rather than by using the keyboard and monitor attached to the ESX Server hardware.
USB devices and the internal disks are also not currently supported, so we needed to boot ESX Server from SAN storage. And currently there's a delay of around five or ten seconds while the RAM in both halves of the server is resynchronised. This resync operation occurs only when an administrator has replaced faulty components in one half of the ftServer system and wants to reactivate the server pair to restore full fault tolerance.
In our tests we simulated a component failure. During the resync the virtual machines froze and then automatically resumed from where they left off. We set up two virtual machines running Windows, and configured each to ping an external system. During the resync four pings timed out before resuming at the normal rate. Our diagnostic software reported that it took 21,245 milliseconds to resync all the RAM. However, by September Stratus will support ESX Server 3.5.x, and this release will remove the above limitations. The resync delay should also have been removed by October.
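One way to quantify the freeze as seen from outside is to work from the ping log. The following is a minimal sketch, assuming pings are sent at a fixed interval and recorded as a list of booleans (True for a reply, False for a timeout). The function name and the sample log are hypothetical and are not output from our test rig.

```python
def longest_outage_seconds(replies, interval_s=1.0):
    """Estimate the longest outage from a sequence of ping results.

    replies: iterable of booleans, True for a reply and False for a timeout.
    interval_s: assumed gap between consecutive pings, in seconds.
    """
    longest = current = 0
    for ok in replies:
        # Count consecutive timeouts; any reply resets the run.
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest * interval_s


if __name__ == "__main__":
    # Hypothetical log: replies, then four consecutive timeouts, then replies.
    log = [True, True, False, False, False, False, True, True]
    print(longest_outage_seconds(log))
```

The estimate is only as good as the assumed interval: a ping tool with a multi-second timeout spaces failed probes further apart than successful ones, so `interval_s` should be set to match the tool in use.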
We also tested the ftServer 6200 running Windows Server 2003 Enterprise Edition and found more comprehensive support for Windows environments than there currently is for VMware.
Stratus supplies a plug-in for Microsoft Management Console that allows Windows server administrators to interrogate the Stratus hardware, update settings and manage failover operations. As you would expect of an MMC plug-in, it provided a hierarchical view of the system. For example, there were top-level folders for each CPU enclosure, and we could drill down into the enclosure hierarchy to see a list of DIMM slots, CPU sockets and chassis sensors inside each enclosure.
Likewise, there was another top-level folder for I/O enclosures listing devices such as Ethernet controllers, USB controllers, PCI slots and disk storage controllers.
We simulated a system failure by disconnecting the power supply to one half of the server pair. Using the MMC plug-in, we confirmed that the CPU in that half was no longer available. For the next 30 minutes, any data written to the local hard disks would actually be written to a journal on the disk rather than directly to the filesystem, a technique similar to disk snapshots. Having reconnected the power lead, we watched as the CPU module was initialised and booted back into service. As part of the resynchronisation process, the disk journal was replayed onto the restored system's disk. This brought it back into sync more quickly than if the mirrored disk had been resynchronised without a journal. It took around 20 seconds to resynchronise the disks in our tests.
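The journal-and-replay behaviour described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Stratus's implementation: the class and method names are hypothetical, the disks are modelled as in-memory lists of blocks, and the 30-minute journal window is not modelled.

```python
class MirroredDisk:
    """Toy two-way mirror that journals writes while one side is offline."""

    def __init__(self, num_blocks):
        self.primary = [b""] * num_blocks
        self.secondary = [b""] * num_blocks
        self.secondary_online = True
        self.journal = []  # (block_index, data) pairs written while degraded

    def write(self, index, data):
        self.primary[index] = data
        if self.secondary_online:
            self.secondary[index] = data
        else:
            # Remember the write so it can be replayed later.
            self.journal.append((index, data))

    def fail_secondary(self):
        self.secondary_online = False

    def restore_secondary(self):
        """Replay the journal onto the returning disk; return writes replayed."""
        replayed = 0
        for index, data in self.journal:
            self.secondary[index] = data
            replayed += 1
        self.journal.clear()
        self.secondary_online = True
        return replayed


if __name__ == "__main__":
    disk = MirroredDisk(num_blocks=1000)
    disk.write(0, b"boot")
    disk.fail_secondary()
    disk.write(10, b"log entry")
    disk.write(11, b"more data")
    replayed = disk.restore_secondary()
    # Only the journaled writes are replayed, not all 1000 blocks.
    print(replayed, disk.primary == disk.secondary)
```

The point of the design is that resynchronisation cost scales with the number of writes made during the outage rather than with the size of the disk.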
We also simulated a network failure by removing the lead from one of the system's NICs. This was immediately reported by the MMC plug-in. Having reconnected the network lead, the system automatically resynchronised itself within a few seconds.
Finally, we simulated a double disk failure by removing disk 0 from one half of the server pair and disk 1 from the other. We then replaced the disks and watched as the system was automatically restored to full fault tolerance. The standard warranty provides a two-day swap service for faulty parts. However, many customers would probably buy an extended warranty and monitoring services from Stratus. A typical extended warranty would include 24/7 monitoring and 24-hour replacement of faulty parts.
We were impressed by the ftServer architecture. It seems Stratus has pretty much covered all the bases to ensure as little unplanned server downtime as possible. Virtual server environments could potentially host several mission-critical applications on a single piece of server hardware. Although the high-availability (HA) features in VMware VirtualCenter would automatically restart failed servers, the ftServer 6200 could reduce the amount of downtime to almost zero, a target that VMware is still some way off reaching.
Stratus does not currently support ftServers running Windows Server 2008. Support is expected to be added before the end of the year.