# Making sense of "mean time to failure" (MTTF)

Last week researchers at Carnegie Mellon University published a paper which examined the real-world reliability of hard drives. It concluded that the hard drive failure rate was much higher (by a factor of about fifteen) than expected based on the mean time to failure (MTTF) information supplied by manufacturers. So why is there such a huge difference between MTTF ratings and the real world?

The difference between real-world reliability and MTTF ratings is down to how MTTF is worked out.  Obviously, when a drive has an MTTF rating of 1,000,000 hours (around 114 years), no manufacturer has actually invested that amount of time in testing a series of drives to see if they last that long.  In fact, MTTF has nothing to do with how long a single drive, or series of drives, is expected to last.  MTTF is a statistical trick that can be adequately summarized by the formula below:

([a short time period] * [number of items tested]) / [number of items tested which failed within time period] = MTTF

So, let's say that a hard drive manufacturer tested a sample of 1,000 drives for a period of 1,000 hours (just over 41.5 days) and one hard drive failed within that period.  This would give us:

(1,000 * 1,000) / 1 = 1,000,000 hours

I'm simplifying the process here quite a bit because nothing is ever that clear cut, but this gives you the basics of how MTTF is worked out without bogging the discussion down in statistics, which isn't a strong point of mine!
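For anyone who prefers code to arithmetic, here's a minimal Python sketch of the simplified formula above. The function name `mttf_hours` and its parameter names are my own labels for illustration, and this is the same back-of-the-envelope sum, not the procedure any particular manufacturer actually follows.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def mttf_hours(test_hours: float, units_tested: int, failures: int) -> float:
    """Simplified MTTF: total unit-hours accumulated divided by failures seen."""
    if failures <= 0:
        # With zero failures this simple estimate is undefined (division by zero).
        raise ValueError("need at least one observed failure")
    return (test_hours * units_tested) / failures


# The example from the text: 1,000 drives, 1,000 hours, 1 failure.
estimate = mttf_hours(test_hours=1_000, units_tested=1_000, failures=1)
print(f"MTTF: {estimate:,.0f} hours (about {estimate / HOURS_PER_YEAR:.0f} years)")
# prints: MTTF: 1,000,000 hours (about 114 years)
```

Notice that a second failure in the same test would halve the rating to 500,000 hours, which shows how sensitive the figure is to a small sample.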

But don't read this as meaning that a drive will last for 1,000,000 hours or 114 years.  No, the way to read this is that if you took 114 drives and ran them for a year, you'd expect one drive to fail.  That's it.  That's all that it means.  It's a figure worked out from a small sample over a short period of time.  The "hours" bit at the end of the rating is there because the only unit used in the calculation is time.
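That "114 drives for a year" reading can be checked with a quick sketch. The helper `expected_failures` is a hypothetical name of mine, and it assumes the failure rate stays constant over the period, which is the same simplification the MTTF figure itself relies on.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def expected_failures(drives: int, hours_each: float, mttf_hours: float) -> float:
    """Expected number of failures when `drives` units each run for `hours_each`.

    Assumes a constant failure rate of 1 / mttf_hours per drive-hour.
    """
    return drives * hours_each / mttf_hours


# 114 drives running around the clock for one year at a 1,000,000 hour MTTF.
fleet = expected_failures(drives=114, hours_each=HOURS_PER_YEAR, mttf_hours=1_000_000)
print(f"Expected failures: {fleet:.2f}")
# prints: Expected failures: 1.00
```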

So, what's wrong with MTTF?  Well, first off, unless you know how it's worked out, it can paint a picture that's far from what you can expect in reality.

Another problem with MTTF is that it ignores the fact that most devices become less reliable towards the end of their life because of wear (the rising right-hand side of what's known as the "bathtub curve").  Some would argue that this is balanced out by the fact that early failures weigh heavily against the final rating.  That might be the case, but failures during the wear-out period (after the 5 to 7 year mark) still outweigh early failures (up to the 1 year mark).

Note that some hard drive manufacturers now use annualized failure rate (AFR).  This is the reciprocal of the MTTF in years, expressed as a percentage.  So, for an MTTF of 1,000,000 hours, this gives:

(1,000,000 hours / 8,760 hours/year) = 114.16 years

(1 failure / 114.16 years) * 100% = 0.88%
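The same conversion as a short Python sketch, in case you want to plug in a different rating. The names `afr_percent` and `mttf_from_afr` are mine, made up for illustration, and the sums are just the simple reciprocal described above.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def afr_percent(mttf_hours: float) -> float:
    """Annualized failure rate (%) as the reciprocal of the MTTF in years."""
    mttf_years = mttf_hours / HOURS_PER_YEAR
    return (1 / mttf_years) * 100


def mttf_from_afr(afr_pct: float) -> float:
    """Go the other way: MTTF in hours implied by an AFR percentage."""
    return HOURS_PER_YEAR / (afr_pct / 100)


print(f"AFR: {afr_percent(1_000_000):.2f}%")
# prints: AFR: 0.88%
```

Running it backwards is handy too: calling `mttf_from_afr` with a failure rate you've actually observed shows what MTTF rating that rate would correspond to.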

My policy with hard drives on desktop PCs goes something like this.  When I get a new hard drive I'm suspicious of it for the first few weeks.  I might burn it in or I might not, but I like to get it settled in before relying on it too much.  During that period I might not store important data on it without making sure that I have a backup somewhere else.  After a few weeks have passed I feel better about the drive and put it into normal service.

Personal note: I'm pretty sure that I've only ever received one drive that was DOA and I've had maybe two die on me within the first week.

Even then, I'm aware that the drive still has the potential to fail rapidly and without warning.  After about 5 to 7 years of use, if the drive is still going I'll probably retire it from handling important data and put it to work somewhere where a fault isn't going to cause me too many headaches, for example by moving it to a test bed system, sticking it in an external hard drive case for transporting data about, or giving it to one of my kids so they can have more space for games and music.

Personal note: I think that I've only had a small handful of drives die on me within the quite generous warranty period that most hard drive manufacturers offer.  If I get 5 years from a drive, I'm happy.

Over at the PC Doc HQ, the most common cause of PC failure that I see is hard drive failure.  Seeing these kinds of failures makes me more fanatical about backing up than the average PC user, although there are times when I do push my luck.  Hard drive failures usually come quickly and with little or no warning, so having a backup that you can rely on is vital in my opinion - you might not get a chance to make that "next" backup.  Treat every backup as if it's the last one that you'll do on that drive.

Personal note: All dead hard drives get the same treatment - I open them up, remove the really powerful magnets from the head actuator assembly (because these are really cool and come in handy), destroy the platters (glass ones are smashed, aluminum ones hammered) and then the components are disposed of ethically at a recycling center.

After hard drive failures, the next most common hardware failures that I see are fan-related failures (fans get noisy or just pack up altogether), PSU failures and optical drive failures.

What do you make of MTTF ratings?  What kind of lifespan do you get from your hard drives?  Do you run drives until they croak or do you move older drives into less critical areas?  What kind of failures do you most commonly see?