For the past year, Sun Microsystems Inc. has struggled to solve a mysterious fault that can cause its high-end servers to crash unexpectedly, an embarrassing problem for a computer maker that routinely refers to its servers as "rock solid" reliable.
The Palo Alto, Calif., company said the problem, which was eventually traced to a memory flaw, is rare and probably affects fewer than 1% of all computers it has sold. Since the problem was first identified in its servers, the large computers often used to manage databases and handle e-commerce tasks, Sun engineers have put together a variety of hardware and software fixes that appear to reduce the risk of spontaneous crashes. Sun has also revamped internal quality programs designed to prevent reliability problems in the first place.
"Sometimes in life, a problem becomes an answer," says John Shoemaker, Sun's executive vice president for system products, who says the problem has pushed Sun to think harder about ways to make its systems more reliable. "We're feeling fairly confident we have this thing covered."
Some critics, however, argue that the problem is more serious than Sun is willing to admit. Paul McGuckin, an analyst with Gartner Group who deals regularly with major corporate customers, said that roughly 60 major Gartner clients have reported problems with as many as several hundred Sun servers.
"There are a lot of unhappy Sun customers out there," says McGuckin, who notes that many Gartner clients complained that Sun took too long to acknowledge the problem's significance and that some believe the computer maker tried to squelch open discussion of the issue.
Shoemaker denies any coverup, saying only that Sun initially required customers who reported the problem to sign a nondisclosure agreement because of the large quantity of internal technical information the computer maker opted to share in an attempt to solve the problem. Eight or nine months ago, when Sun realized the spontaneous-crash problem was more common than it first thought, Shoemaker says, it stopped requiring customers to sign such agreements.
While the problem can't have helped Sun's image, it hasn't appreciably harmed the company, either.
The memory fault has been the subject of much online discussion as well as articles in both the trade press and Forbes magazine. Sun officials from Chairman Scott McNealy on down have recently discussed the issue in public appearances before technical audiences. Throughout, Sun's sales have continued to accelerate, rising 60% in the third quarter compared with a year earlier.
The main culprit involves so-called cache memory, a type of semiconductor memory that sits on the same circuit board as a server microprocessor and is used to speed its calculations. Every so often, the cache memory in these servers develops so-called parity errors in which a single bit of data is flipped from a 1 to a 0 or vice versa for reasons that aren't entirely understood. (Sometimes a cosmic ray can be at fault.) Such errors aren't any more common in Sun machines than in others, Shoemaker says.
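To illustrate what a parity error is, here is a minimal Python sketch of a single parity bit guarding one stored word. It is not Sun's hardware logic; the function names and the 8-bit sample value are assumptions chosen for illustration. It shows that one flipped bit changes the parity and so can be detected, while a second flip in the same word cancels out and goes unnoticed.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word holds an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def flip_bit(word: int, position: int) -> int:
    """Return the word with the bit at `position` inverted (0 = lowest bit)."""
    return word ^ (1 << position)


# Hypothetical 8-bit value held in cache, with its parity recorded at write time.
stored_word = 0b10110010
stored_parity = parity_bit(stored_word)

# A single bit flips for whatever reason (for example, a cosmic ray).
corrupted = flip_bit(stored_word, 3)
single_flip_detected = parity_bit(corrupted) != stored_parity

# A second flip in the same word restores the original parity, so the
# check passes even though the data is still wrong.
twice_corrupted = flip_bit(corrupted, 0)
double_flip_detected = parity_bit(twice_corrupted) != stored_parity

print("single flip detected:", single_flip_detected)
print("double flip detected:", double_flip_detected)
```

Parity alone only reports that something is wrong; it cannot say which bit flipped, so a machine relying on it has to recover the data some other way or stop.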
Difficult to duplicate
But Sun's current servers lack sophisticated error-correction software that can often catch such errors on the fly, and as a result can crash when they occur. Because parity errors are essentially random, however, the problem was difficult to duplicate, an essential first step toward solving it.
At online auctioneer eBay Inc., a major Sun customer, the memory problem initially resulted in one to two big-system crashes every month, says Maynard Webb, president of eBay Technologies. The memory problem wasn't behind eBay's most serious outage, a 22-hour failure in 1999, but it has caused other service outages, Webb says. He adds that with Sun's fixes in place, eBay can now go "months" without seeing a problem.
Sun's newest servers, based on its UltraSparc-III processor, contain more-sophisticated error-checking and aren't prone to the memory error, Shoemaker says. Those machines, however, won't be fully available until the middle of next year. For existing machines, Sun has offered a variety of software patches and has replaced the main circuit boards on some machines. Soon, Sun plans to offer a new circuit-board replacement designed to catch parity errors by using twice as much cache memory in a so-called mirror configuration that cross-checks stored data.
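The mirror idea described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Sun's actual board design: the `MirroredCache` class, its method names, and the choice to repair the bad copy on read are all hypothetical. Each word is stored twice with a parity bit, and a read falls back to the second copy when the first fails its parity check.

```python
from dataclasses import dataclass, field


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word holds an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


@dataclass
class MirroredCache:
    """Toy cache that keeps two copies of every word, each with a parity bit."""
    primary: dict = field(default_factory=dict)
    mirror: dict = field(default_factory=dict)

    def write(self, address: int, word: int) -> None:
        entry = (word, parity_bit(word))
        self.primary[address] = entry
        self.mirror[address] = entry

    def read(self, address: int) -> int:
        # Try the primary copy first, then the mirror.
        for source, other in ((self.primary, self.mirror),
                              (self.mirror, self.primary)):
            word, parity = source[address]
            if parity_bit(word) == parity:
                other[address] = (word, parity)  # overwrite the other copy
                return word
        # Neither copy passes its check; the data cannot be trusted.
        raise RuntimeError("both copies failed the parity check")


cache = MirroredCache()
cache.write(0x10, 0x5A)

# Simulate a single-bit flip in the primary copy only.
word, parity = cache.primary[0x10]
cache.primary[0x10] = (word ^ 0x04, parity)

recovered = cache.read(0x10)
print(hex(recovered))
```

The cost of this approach is the one the article names: twice as much cache memory for the same usable capacity, in exchange for being able to survive a parity error instead of crashing.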