Why computers fail



Good failure data for PCs is hard to find: who knows how many times PC users are told to reinstall Windows? But in a recent paper, Bianca Schroeder and Garth Gibson of CMU found some surprising results in 10 years of large scale cluster system failures at Los Alamos National Labs.

Among the surprises: new hardware isn't any more reliable than the old stuff. And even wicked smart LANL physicists can't figure out the cause for every failure.

Special problems of petascale computing

Despite the incredible performance of Roadrunner, LANL's new petaflop computer, the jobs it runs often take months to complete. With 3,000 nodes, failures are inevitable.

What to do? LANL's strategy is to stop the job and checkpoint it. When a node fails they can roll the job back to the last checkpoint and restart, preserving the work already done - but losing the work done after the checkpoint.

Even using massively parallel high-performance storage, the checkpoints take time away from getting the answer. The paper, Understanding Failures in Petascale Computers, uses LANL's data to better manage the tradeoffs and to suggest new strategies.
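The checkpoint/restart idea can be sketched in a few lines. This is a minimal, hypothetical illustration - the file name, the checkpoint interval, and the "computation" are stand-ins, not LANL's actual scheme:

```python
import os
import pickle

CHECKPOINT = "job.ckpt"  # hypothetical checkpoint file name

def save_checkpoint(state):
    """Persist the job's state so a failed run can resume here."""
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint():
    """Resume from the last checkpoint, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "result": 0}

state = load_checkpoint()
while state["step"] < 1000:
    state["result"] += state["step"]  # stand-in for real computation
    state["step"] += 1
    if state["step"] % 100 == 0:
        # The interval is the tradeoff: checkpoint often and you pay more
        # I/O time; checkpoint rarely and a failure loses more work.
        save_checkpoint(state)
```

If the process dies mid-run, restarting it picks up from the last multiple of 100 steps rather than from zero.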

But it's the failure data itself - and what it suggests about our own computers - that I found most interesting.

Failure etiology

Hardware accounts for over 50% of all LANL failures - with software about 20%. Given all the PhDs at LANL you'd hope human error would be low on the list - and it is.

Here's the graph:

Root cause analysis of system failures

Is reliability improving?

Nope. LANL hasn't seen any improvement over the years - even with hardware from a decade ago.

Failures per year per processor

The key metric

The research showed that

. . . the failure rate of a system grows proportional to the number of processor chips in the system.

Which is a big problem for massive multi-processor systems.
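A back-of-the-envelope sketch shows why. If each chip fails independently at some rate, the system's mean time between failures shrinks in proportion to chip count. The rate below is illustrative, not a number from the paper:

```python
# If each chip fails independently at rate r (failures/year),
# a system of n chips fails at roughly n * r, so MTBF shrinks as 1/n.
def system_mtbf_years(per_chip_failures_per_year, n_chips):
    """Mean time between failures, in years, for an n-chip system."""
    return 1.0 / (per_chip_failures_per_year * n_chips)

# Illustrative numbers, not LANL's measurements:
one_node = system_mtbf_years(0.25, 1)     # one failure every 4 years
cluster = system_mtbf_years(0.25, 3000)   # roughly half a day
```

At a hypothetical quarter-failure per chip per year, a single node runs for years between failures, while a 3,000-node machine fails about every 12 hours - hence the checkpointing.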

The Storage Bits take

Extrapolating these results to our desktop systems is straightforward - with one big caveat: most desktop system crashes are software, not hardware.

Otherwise the Blue Screen of Death would be the No Screen of Death.

The biggest finding is that we shouldn't expect our system hardware to get more reliable. Improvements get balanced out by increased complexity.

Those of us with multi-processor systems can expect lower reliability - though with just a few systems you won't see any trends. It's a classic "glass half full" situation: our systems won't get better, but at least they won't get worse.

Comments welcome, of course.



Talkback

43 comments
  • A good UPS...

    will decrease failure rates tremendously. I use Belkin AVR UPS's and the only systems I've had go down in the last year were ones installed by my predecessor without UPS's. There is no doubt in my mind that bad power is a PC's worst enemy. I also use Proxy servers with whitelists and malware is non-existent on my network.

    Most computer failures can be avoided in the workplace. Just do it right.
    bjbrock
    • Power Surges

      Have been responsible for my power supply and harddisk failures; so I would have to agree.

      Also, at my company, they installed an enterprise spyware protection client on every machine. That has really helped keep desktop system failures to a minimum -- far more than AV software. All the additional email filtering systems really do a ton of work, too, to keep threats outside the network.

      Yes, if your admins / network engineers do their jobs, your desktop systems will be far better off.

      Still, there's no getting around the fact that servers running 24/7 will have various hardware failures: NIC cards, hard-disks, memory, etc., but that should be expected with such systems.
      Spats30
      • Just a thought

        But could not your company fit surge protection especially for 24/7 systems,I use one ok cheapy which will or not work?but there are some mighty expensive one's out there for music systems etc so do not no whether to expensive or useful for company's.
        morrigen
        • ...another thought

          Not to be mean, but is English your second language? I can't understand what it is you're even suggesting...?
          isotla
    • You may want to rethink your "good" UPS

      From my understanding, the Belkin UPS line are Stand By UPSes. This means they run equipment from Wall power and only switch to battery when a surge or drop occurs. This is an issue because that means there is a small window of exposure. There are other brands out there (MGE being the one that comes to mind for the price) that are Online UPSes. These run off the battery 24x7 and continuously charge the battery. For server and networking equipment it's a must. Desktops are usually fine with a Standby.
      LiquidLearner
  • ...the failure rate of a system grows with the number of processors.

    This is analogous to the problems that started showing up about 20-25 years ago in memory chips. Manufacturing yields were so low for the then-new and then astonishingly dense (:-)) 64k-bit chips that IBM was about to build an entire new factory in Vermont to make largely useless chips. And despair was in the air when they figured out that the passage of a single alpha particle through a memory chip was enough to randomly change the state of bits--how do you stop an alpha particle?

    On the manufacturing side, they solved a lot of problems by building a lot of extra bits into the chips and then, in testing, internally configuring them to omit bad bit blocks--you could have bit failures as high as, IIRC, about 20% and still have enough bits left to make a working 64k device. (A friend of mine invented that technique and saved IBM the billion bucks or whatever of a new factory--they were quite grateful...)

    On the run-time side, they got very creative with error detection/correction stuff with Hamming codes and the like--another form of redundancy.

    My guess is that if MPP failure rates are high due to statistical factors, they're going to have to figure out a means of incorporating computational redundancy in the processing similar to the bit redundancy built into memory devices.
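    A minimal sketch of that bit-redundancy idea, using the classic Hamming(7,4) code (four data bits plus three parity bits; any single flipped bit can be located and corrected):

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Fix any single flipped bit in a 7-bit codeword, return the 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3  # position of the flipped bit (0 = clean)
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

    Flip any one of the seven bits - say, an alpha particle strikes - and the syndrome points straight at it. ECC memory uses longer codes of the same family.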
    Henrik Moller
  • Software crashes

    It's interesting that you say, "most desktop system crashes are software (crashes), not hardware". As an IT pro, I take great pains with my home Windows box to remove crapware that might come in an OEM system, to never install unnecessary software bundles that come with new hardware, to always install the latest drivers from the manufacturer's website, to upgrade firmware periodically, keep virus definitions up to date, run a hardware and software firewall, run anti-spyware and anti-rootkit checks, etc.

    I haven't had a software failure in about eight years.

    You mentioned how difficult it is to get good failure data, but I would love to see a breakdown of failures of home systems by owner's profession, to see how much of a difference a technical background makes in home system stability. I'd also love to see a breakdown of technical vs. non-technical households with regards to "best practices" type implementations.

    We say things like, "but the average user won't understand that" all the time in regards to certain computing concepts. I would be interested in some concrete information about just who "the average user" is, and what he or she is able to understand.
    RationalGuy
    • Just to clarify...

      I just re-read my post, and it sounds a little "oh I'm in IT and I'm so smart."

      I'm really interested in this idea to see how technologists can make technology better with an eye to how regular people are actually going to be using it. "The average user" can't be expected to have as much knowledge as a pro, and the technology should not require them to have such knowledge before the tech is useful.
      RationalGuy
    • Software failure?

      "I haven't had a software failure in about eight years."
      How do you define software failure? I've been in the industry for 30 years and have yet to nail down exactly what defines "software failure" or "broken software". Does that mean corruption and damaged code (which may be caused by either hardware or software issues)? Or unintentional consequences of running software a certain way in certain situations? Or attacks upon software by other software?

      I highly doubt your claim that you haven't had a software failure in about eight years. That's simply too vague a statement.
      jvenezia
      • By no software failure, I mean ...

        ... I haven't had a system failure (crash, lockup, blue screen) that I wasn't able to trace back to a hardware problem in the last eight years.

        [i]Does that mean corruption and damaged code (which may be caused by either hardware or software issues)?[/i]
        I haven't had any major data corruption issues in that time, and certainly none that have caused a system failure, except in two instances of complete hard drive failures.

        [i]Or unintentional consequences of running software a certain way in certain situations?[/i]
        Certainly pieces of software don't live up to the brochure in terms of functionality sometimes, but that hasn't caused major instability in any of my systems.

        [i]Or attacks upon software by other software?[/i]
        No major malware attacks. A couple of spyware problems that I was able to clear up quickly.

        [i]I highly doubt your claim that you haven't had a software failure in about eight years. That's simply too vague a statement.[/i]
        I hope this cleared up your questions. If not, I don't know what else to say, except what I'm saying is true.
        RationalGuy
      • Most software failures can be avoided.

        Well he replied to it, but I guess I have my own opinion as well =).

        Bugs I've seen a lot, but that usually just means avoiding buggy software. If you know where to look, you can find plenty of software that is well behaved and stable, so it's certainly possible to set up a system that almost never crashes due to software.

        Corruption via hardware I've only seen on cheap brand name computers, or if I make an obvious mistake when building a computer. Once I've set up a good hardware configuration, however, I never see hardware corruption except for the occasional hard drive failure.

        "unintentional consequences of running software a certain way in certain situations" can be a problem, but that's something totally avoidable. It only happens when I'm tinkering around with stuff.

        Malicious software I haven't seen in years. It's not really that difficult to be secure, it just takes a bit of effort. I'm not going to say it's impossible to get infected, but it is certainly possible to get a system secure enough that the time between infections can be measured in years. I run an antivirus regularly just in case something falls through the cracks.
        CobraA1
    • Background on home users

      "You mentioned how difficult it is to get good failure data, but I would love to see a breakdown of failures of home systems by owner's profession, to see how much of a difference a technical background makes in home system stability. I'd also love to see a breakdown of technical vs. non-technical households with regards to "best practices" type implementations."

      I'll use myself and my mother as examples. I am computer savvy (no tech degree, and I only went to college for auto mechanics). I've had a few problems at home: a faulty video driver upgrade that I had to roll back, and a couple of system freezes due to Norton AV, which I uninstalled and used AVG instead. I run XP, Office 03 and a few proprietary programs I need for my job.

      My mother, on the other hand, is an administrator at a hospital, and barely knows how to open and send email. Her home system is an out-of-the-box Sony with crapware removed, and AVG installed. She frequently has little problems such as not being able to print, no internet, etc., that I can solve for her by going over to her house to fix.

      I will admit that if I didn't live so close, she'd have to pay someone ELSE to do this, though I seriously think she would get rid of the computer and go without first.

      I'm not in the technical field, I'm in the non-profit field, and the office-appointed computer guru. I have to take on any computer type challenges as they come as well as my data entry/graphics work.

      At home I just like to do research on my favorite subjects. I am a graphics designer and have a large graphics collection on my home computer.
      hsec2@...
      • I agree

        I myself am a high school student; I don't have a tech degree. Yes, I have had problems with my desktop, but nothing huge. Most Windows users will get viruses; I myself have gotten a few, but have resolved them all. Also I'd like you to consider that not everyone who fixes computers as a job is all that monstrously tech savvy. All you really need to do to get an IT job in some places is 1) know how to format a hard drive, 2) know how to back up a hard drive, 3) know how to hit the "scan" button in various programs (antivirus, registry cleaners, anti-spyware/malware, etc.).
        For hardware issues many places just send the computer out. The funny thing is that people still take their computers to "these places" (trust me =) )
        The average user really should consider learning these basic things; I have personally known how to format a hard drive since I was in grade 4. There is no reason that a middle-aged adult cannot learn the basics, if they actually use a computer. (I know a few people who never use their computers; kudos to them: I just cannot do that)

        -emen
        EmenbladE
    • You're confusing able and wishes to

      The problem with your theory is that it assumes a lot of these people lack the intelligence to know how to keep a computer running optimally. Usually that's not the issue; it's that they can't be bothered with it. Do you think of your car mechanic as a genius? Nope. You could probably figure it out if you wanted to. It's much easier, and because of time saved often more cost effective, to call an expert out to take care of the "problem" for them.
      LiquidLearner
      • I posted a clarification to my original post

        Sorry, my first post, when I looked back at it, had a tone that could be misconstrued.

        I agree with what you're saying here. It has nothing to do with intelligence. My brother is a really smart person, but he just wants a computer that works. He could figure it out, in fact back in the day he taught me how to program in BASIC. It's not that he lacks the ability, he lacks the interest. He doesn't like working on computers.

        I'm interested in finding out where, on average, the line really is. I don't think that the "average user" is unwilling to do [i]any[/i] of this kind of maintenance, but it's clear that on the Windows front, MS and other software companies are asking for too much user involvement at too low a level. I'd like to find out where the sweet spot in between actually is.
        RationalGuy
  • RE: Why computers fail

    "new hardware, to always install the latest drivers from the manufacturer's website, to upgrade firmware periodically, keep virus definitions up to date, run a hardware and software firewall, run anti-spyware and anti-rootkit checks"

    I've known people who have done that but still got burned. All it takes is one very nasty security hole in the browser that a vendor knew about for 10 months but didn't bother to fix to ruin everything, and there goes fortress desktop.
    Telix
  • Clarification, please!

    "the failure rate of a system grows [b][i]proportional[/i][/b] to the number of processor chips in the system."

    Does that mean that with the same amount of memory, same number of hard disks, same disk space, etc., a motherboard with two processors will have twice as many failures as one processor?

    Does that mean that if a system increases from one motherboard to three, it will have three times as many failures, or nine times?

    It seems logical that quadrupling the overall amount of hardware would quadruple the number of failures. But that is a linear progression, so in a sense it is not "proportional".
    Rick_R
    • A proportion is ...

      ... an equation of two ratios (e.g., 1/2 = 2/4)

      If you create a ratio of the number of processors (p) to the likelihood of failure (f), and you can show that the likelihood of failure grows by the same factor as the number of processors, you've shown the two ratios are proportional:

      1p/2f = 2p/4f
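      In other words, proportional here means linear, not quadratic: three motherboards means three times the failures, not nine. A tiny sketch with an illustrative rate (not a number from the paper):

```python
# Linear (proportional) scaling: doubling processors doubles the
# expected failure count - it does not square it.
FAILURES_PER_CPU_YEAR = 0.25  # illustrative rate, not from the paper

def expected_failures(n_cpus, years=1.0):
    """Expected number of failures for n_cpus over the given span."""
    return FAILURES_PER_CPU_YEAR * n_cpus * years

one_board = expected_failures(1)     # baseline
three_boards = expected_failures(3)  # 3x the failures, not 9x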
      RationalGuy
  • BSOD

    I don't think this statement is true:
    "Otherwise the Blue Screen of Death would be the No Screen of Death."

    I have on several occasions gotten a BSOD when the root issue was hardware related. A BSOD can occur when an attempt to load a driver for a malfunctioning piece of hardware fails. Simply replacing the bad hardware often fixes the issue.
    t_mohajir
    • Agreed

      Yeah, hardware failures can definitely cause BSODs. It all depends on what failed and where it failed.
      CobraA1