Making sense of "mean time to failure" (MTTF)


Summary: Last week researchers at Carnegie Mellon University published a paper which examined the real-world reliability of hard drives. It concluded that hard drive failure rates were much higher (by a factor of about fifteen) than those expected based on mean time to failure (MTTF) information supplied by manufacturers. So why is there such a huge difference between MTTF ratings and the real world?


Last week researchers at Carnegie Mellon University published a paper which examined the real-world reliability of hard drives.  It concluded that hard drive failure rates were much higher (by a factor of about fifteen) than those expected based on mean time to failure (MTTF) information supplied by manufacturers.

So why is there such a huge difference between MTTF ratings and the real world?

The difference between real-world reliability and MTTF ratings is down to how MTTF is worked out.  Obviously, when a drive has an MTTF rating of 1,000,000 hours (around 114 years), no manufacturer has actually invested that amount of time in testing a series of drives to see if they last that long.  In fact, MTTF has nothing to do with how long a single drive, or series of drives, is expected to last.  MTTF is a statistical trick that can be adequately summarized by the formula below:

([a short time period] * [number of items tested]) / [number of items tested which failed within time period] = MTTF

So, let's say that a hard drive manufacturer tested a sample of 1,000 drives for a period of 1,000 hours (just over 41.5 days), and within that period of time one hard drive failed.  This would give us:

(1,000 * 1,000) / 1 = 1,000,000 hours
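
If it helps to see it as code, here's the same toy calculation as a minimal Python sketch, using the hypothetical numbers from the example above rather than any real vendor test data:

# Total unit-hours of testing divided by the number of failures seen
# during the test -- the simple MTTF formula described above.
def mttf_hours(test_hours, units_tested, units_failed):
    if units_failed == 0:
        raise ValueError("no failures observed, so this simple formula doesn't apply")
    return (test_hours * units_tested) / units_failed

print(mttf_hours(test_hours=1000, units_tested=1000, units_failed=1))  # -> 1000000.0 hours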

I'm simplifying the process here quite a bit because nothing is ever that clear cut, but this gives you the basics of how MTTF is worked out without bogging the discussion down in statistics, which isn't a strong point of mine!

But don't read this as meaning that a drive will last for 1,000,000 hours or 114 years.  No, the way to read this is that if you took 114 drives and ran them for a year, you'd expect that one drive would fail.  That's it.  That's all that it means.  It's a figure worked out from a small sample over a short period of time.  The "hours" bit at the end of the rating is there because the only unit used in the calculation is time.

So, what's wrong with MTTF?  Well, first off, unless you know how it's worked out, it can paint a picture that's far from what you can expect in reality. 

Another problem with MTTF is that it ignores the fact that most devices become less reliable towards the end of their life because of wear (the rising tail of what's known as the "bathtub curve").  However, some would argue that the MTTF rating is balanced out by the fact that early failures weigh heavily against the final rating.  That might be the case, but failures during the wear-out period (after the 5 to 7 year mark) still outweigh early failures (those up to the 1 year mark).
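
To make the "bathtub curve" idea concrete, here's a rough Python sketch.  The Weibull-style components and every parameter in it are made up purely to illustrate the general shape; they aren't figures from any manufacturer or from the CMU paper:

# Illustrative "bathtub" failure-rate curve over a drive's life (made-up parameters).
def weibull_hazard(t, shape, scale):
    # Weibull hazard rate: decreasing when shape < 1, increasing when shape > 1.
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t_years):
    infant_mortality = weibull_hazard(t_years, shape=0.5, scale=2.0)  # early failures, tails off
    random_failures = 0.01                                            # small constant background rate
    wear_out = weibull_hazard(t_years, shape=5.0, scale=8.0)          # climbs steeply after 5-7 years
    return infant_mortality + random_failures + wear_out

for year in range(1, 11):
    print(f"year {year:2d}: relative failure rate {bathtub_hazard(year):.3f}")

The printed rates dip through the middle years and climb sharply towards the end, which is the shape the name describes.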

Note that some hard drive manufacturers now use annualized failure rate (AFR).  This is the reciprocal (expressed as a percent) of the MTTF expressed in years.  So, for an MTTF of 1,000,000 hours, this gives:

(1,000,000 hours / 8,760 hours/year) = 114.16 years

(1 failure / 114.16 years) * 100% ≈ 0.88%
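
Here's a minimal Python sketch of that conversion (the function name is my own, and the 8,760 hours assumes a non-leap year):

# Converting an MTTF rating into an annualized failure rate (AFR).
HOURS_PER_YEAR = 24 * 365  # 8,760

def afr_percent(mttf_hours):
    mttf_years = mttf_hours / HOURS_PER_YEAR
    return (1.0 / mttf_years) * 100.0  # reciprocal of MTTF-in-years, as a percent

print(round(afr_percent(1_000_000), 2))  # -> 0.88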

My policy with hard drives on desktop PCs goes something like this.  When I get a new hard drive I'm suspicious of it for the first few weeks.  I might burn it in or I might not, but I like to get it settled in before relying on it too much.  During that period I might not store important data on it without making sure that I have a backup somewhere else.  After a few weeks have passed I feel better about the drive and put it into normal service.

Personal note: I'm pretty sure that I've only ever received one drive that was DOA and I've had maybe two die on me within the first week. 

Even then, I'm aware that the drive still has the potential to fail rapidly and without warning.  After about 5 to 7 years of use, if the drive is still going I'll probably retire it from handling important data and put it to work somewhere where a fault isn't going to cause me too much of a headache, for example by moving it to a test bed system, sticking it in an external hard drive case for transporting data about, or giving it to one of my kids so they can have more space for games and music.

Personal note: I think that I've only had a small handful of drives die on me within the quite generous warranty period that most hard drive manufacturers offer.  If I get 5 years from a drive, I'm happy.

Over at the PC Doc HQ, the most common cause of PC failure that I see is hard drive failure.  Seeing these kinds of failures makes me more fanatical about backing up than the average PC user, although there are times when I do push my luck.  Hard drive failures usually come quickly and with little or no warning, so having a backup that you can rely on is vital in my opinion - you might not get a chance to make that "next" backup.  Treat every backup as if it's the last one that you'll make on that drive.

Personal note: All dead hard drives get the same treatment - I open them up, remove the really powerful magnets from the head actuator assembly (because these are really cool and come in handy), destroy the platters (glass ones are smashed, aluminum ones hammered) and then the components are disposed of ethically at a recycling center.

After hard drive failures, the next most common hardware failures that I see are fan-related (fans get noisy or just pack up altogether), followed by PSU failures and optical drive failures.

What do you make of MTTF ratings?  What kind of lifespan do you get from your hard drives?  Do you run drives until they croak, or do you move older drives into less critical areas?  What kind of failures do you most commonly see?



Talkback

14 comments
  • NEVER replace a POWER SUPPLY FAN!

    I worked as a component-level repair technician for 10 years (i.e., find the capacitor or IC on the board and replace it, don't just replace the board).

    Later, on two occasions I had PC supplies where the fan got noisy. I figured, heck, I have the technical/soldering skills to replace a [i]fan[/i]. (This was when supplies cost a lot more.)

    BIG MISTAKE! The fan is the weakest part of the supply. Once you fix that, eventually the next weakest thing goes--[i]the voltage regulator[/i]. And that has a tendency to cause instability, trash memory, etc., before it finally dies. In one case it trashed a memory stick and the hard disk, and I had to spend 13 hours restoring from a backup to a new disk.
    Rick_R
    • Chances are the fan did not die.

      Usually the bearing dries out and just needs to be oiled. I just don't know how you can stand the noise before it seizes up. I just oil them before they lock up and they last for years afterward. A stopped fan would allow your components to overheat and weaken. I hear that Papst cooling fans are the best but they cost a lot more. Cheap power supplies are no bargain, and I believe that cooling fans that stop are the number one reason PCs fail. I think CD-R/RW drives are the shortest-lived components after fans, and a big cause of hard drive failure is not mounting the drives near the air flow. Also, MTTF is another variant of MTBF (Mean Time Between Failures), which is more popular in engineering.
      osreinstall
    • Correct approach.

      I have the same careful approach to hard drives.
      The MTBF value is only an indication of a possible Gaussian failure distribution.
      It can be seen as a sort of median value of that distribution.
      The problem is that in that perfect description of failure events, the conditions change with time, and they also differ from test to reality. So nothing like a bit of caution.
      So one has to count on time-varying parameters, and this could distort the Gaussian distribution parameters.
      Several factors influence the drive's life: one is no doubt the electrical conditions inside the PC, and the others are temperature and, of course, drive activity.
      Especially because those drive tests are done under "perfect/standard" conditions, and even so they differ between brands.
      And real motherboard conditions can be very different.
      Another bothersome factor is the difference in testing between brands.
      I obviously cannot test HDDs for MTBF, but at least in terms of speed they are different.
      I once had 5 IBM SCSI 75GB drives installed in a server of a friend of mine.
      One day he was experimenting with removing one disk from the live system and dropped a drive ... this stuff happens when we least expect it. To add to all this bad luck, the drive hit the ground on a hard surface and broke one of the small chips under the HDD ... gone.
      I only had a Maxtor with the same specs: 10000RPM, 75GB, same access time, everything was equal.
      So I had to help him out as it was not possible to find a replacement in time.
      The IBM and Maxtor specs were exactly the same but ... guess what, the speed of the IBMs was far superior!
      How do I know that? Easy: the RAID 5 controller's (Adaptec) access LEDs were lit (green) for much longer on the Maxtor position than on all the other IBMs.
      The IBMs were faster; the Maxtor was slowing the RAID ... but their specs were the same!
      This has to do with how they measure the speed.
      If a test program reads only data on the outer edge of the platter (away from the center), where the linear disk speed is greater, the results change a lot.
      So this means that results can be .... "obtained" in many ways.
      Only a detailed description of the tests performed can show how reliable the MTBF data is.
      (Is the test done with constant random disk access, or is the disk just running with no access? What read/write sequence is used, if any?)
      It is said that "the Devil is in the details" ...
      Nothing like a careful approach.

      Regards,
      Pedro
      p_msac@...
      • That was involved.

        Yes we do not know for sure how they tested the drives for a stress test. I believe dropping them wasn't part of the test. I believe that is a very small sampling of a "G" force test. To make it short, I just mount Seagates in front of the front cooling fan for max reliability. I also back the data up to a server and burn to disk. Nothing like hi-tech hot potato with data.
        osreinstall
  • I believe the correct term is MTBF.

    Mean Time Between Failures. The current figures advertised are for drives that are left in continuous operation. If you are powering the drives up/down then the numbers would be lower.
    ShadeTree
    • Both MTTF and MTBF ...

      ... seem interchangeable. However, there is a subtle difference.

      MTTF is a basic measure of reliability for non-repairable systems, while MTBF is a basic measure of reliability for repairable items. Hard drives are throwaway items, so we use MTTF.

      There's more ...

      When the time needed to repair or replace an item is much shorter than the MTTF, MTBF is roughly equivalent to MTTF. However, where there is a significant repair time, you have to take MTTR (Mean Time To Repair) into account:

      MTBF = MTTF + MTTR
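
      As a purely illustrative check (the 24-hour repair time below is a made-up number), even a generous repair time barely moves the figure, which is why the two terms get used interchangeably for drives:

      # Illustrative only: MTBF = MTTF + MTTR with made-up numbers.
      mttf_hours = 1_000_000  # the MTTF figure used in the article above
      mttr_hours = 24         # hypothetical time to swap in a replacement drive
      print(mttf_hours + mttr_hours)  # -> 1000024, effectively the same as the MTTF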
      Adrian Kingsley-Hughes
      • Must be a new term. MTBF is used in engineering.

        The rolling bearing industry uses MTBF for projected life expectancy, and those are throwaway items too, or non-repairable just like a hard drive. Besides, when something repairable breaks and has to be fixed, it is still considered a failure. Otherwise it comes under routine preventative maintenance.

        http://www.google.com/search?hl=en&q=ball+bearings+mttf&btnG=Search
        Ball Bearing & MTTF
        609 hits

        http://www.google.com/search?hl=en&q=ball+bearings+mtbf&btnG=Search
        Ball Bearing & MTBF
        74,100 hits

        118,000 & 648,000 hits respectively for hard drives. Must be the computer industry reinventing terminology. The ratios are not as drastic.
        osreinstall
  • smart

    If you've got SMART diagnostics running (on by default in openSUSE 10.2), there's a lot of information to be had about your hard drive, and it can in some cases predict a failure, not to mention record the history of self-diagnostics! Here's sample output from the drive in one of my vintage laptops (a 5-year-old IBM/Hitachi Travelstar) -- like the Timex commercial, it "takes a lickin' and keeps on a tickin'":

    ~~~~~~~~~~~~~~~~~
    linux:~ # smartctl -a /dev/hda
    smartctl version 5.37 [i686-suse-linux-gnu] Copyright (C) 2002-6 Bruce Allen
    Home page is http://smartmontools.sourceforge.net/

    === START OF INFORMATION SECTION ===
    Model Family: IBM/Hitachi Travelstar 60GH and 40GN family
    Device Model: IC25N030ATCS04-0
    Serial Number: CSH308DHM838UB
    Firmware Version: CA3OA71A
    User Capacity: 30,005,821,440 bytes
    Device is: In smartctl database [for details use: -P show]
    ATA Version is: 5
    ATA Standard is: ATA/ATAPI-5 T13 1321D revision 3
    Local Time is: Sun Feb 11 22:05:55 2007 EST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status: (0x00) Offline data collection activity
    was never started.
    Auto Offline Data Collection: Disabled.
    Self-test execution status: ( 0) The previous self-test routine completed
    without error or no self-test has ever
    been run.
    Total time to complete Offline
    data collection: ( 645) seconds.
    Offline data collection
    capabilities: (0x1b) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new
    command.
    Offline surface scan supported.
    Self-test supported.
    No Conveyance Self-test supported.
    No Selective Self-test supported.
    SMART capabilities: (0x0003) Saves SMART data before entering
    power-saving mode.
    Supports SMART auto save timer.
    Error logging capability: (0x01) Error logging supported.
    No General Purpose Logging support.
    Short self-test routine
    recommended polling time: ( 2) minutes.
    Extended self-test routine
    recommended polling time: ( 37) minutes.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate 0x000b 100 100 062 Pre-fail Always - 0
    2 Throughput_Performance 0x0005 100 100 040 Pre-fail Offline - 0
    3 Spin_Up_Time 0x0007 108 108 033 Pre-fail Always - 1
    4 Start_Stop_Count 0x0012 098 098 000 Old_age Always - 4585
    5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
    7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
    8 Seek_Time_Performance 0x0005 100 100 040 Pre-fail Offline - 0
    9 Power_On_Hours 0x0012 076 076 000 Old_age Always - 10937
    10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
    12 Power_Cycle_Count 0x0032 098 098 000 Old_age Always - 4505
    191 G-Sense_Error_Rate 0x000a 100 100 000 Old_age Always - 0
    192 Power-Off_Retract_Count 0x0032 094 094 000 Old_age Always - 1341
    193 Load_Cycle_Count 0x0012 078 078 000 Old_age Always - 227570
    194 Temperature_Celsius 0x0002 122 122 000 Old_age Always - 45 (Lifetime Min/Max 13/61)
    196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 51
    197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 5
    198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
    199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

    SMART Error Log Version: 1
    ATA Error Count: 166 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 166 occurred at disk power-on lifetime: 10926 hours (455 days + 6 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    01 59 07 ca bd 39 e2 Error: AMNF at LBA = 0x0239bdca = 37338570

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    20 00 07 ca bd 39 e2 00 20:08: 18.600 READ SECTOR(S)
    10 00 3f 00 00 00 e0 00 20:08:18.600 RECALIBRATE [OBS-4]
    91 00 3f 3f ff 3f ef 00 20:08:18.600 INITIALIZE DEVICE PARAMETERS [OBS-6]
    20 00 07 ca bd 39 e2 04 20:08:18.600 READ SECTOR(S)
    20 00 07 ca bd 39 e2 00 20:08:14.200 READ SECTOR(S)

    Error 165 occurred at disk power-on lifetime: 10926 hours (455 days + 6 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    01 59 07 ca bd 39 e2 Error: AMNF at LBA = 0x0239bdca = 37338570

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    20 00 07 ca bd 39 e2 00 20:08: 14.200 READ SECTOR(S)
    20 00 07 ca bd 39 e2 00 20:08:09.800 READ SECTOR(S)
    10 00 3f 00 00 00 e0 00 20:08:09.700 RECALIBRATE [OBS-4]
    20 00 07 ca bd 39 e2 00 20:08: 05.400 READ SECTOR(S)
    20 00 07 ca bd 39 e2 00 20:08:00.900 READ SECTOR(S)

    Error 164 occurred at disk power-on lifetime: 10926 hours (455 days + 6 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    01 59 07 ca bd 39 e2 Error: AMNF at LBA = 0x0239bdca = 37338570

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    20 00 07 ca bd 39 e2 00 20:08: 09.800 READ SECTOR(S)
    10 00 3f 00 00 00 e0 00 20:08:09.700 RECALIBRATE [OBS-4]
    20 00 07 ca bd 39 e2 00 20:08:05.400 READ SECTOR(S)
    20 00 07 ca bd 39 e2 00 20:08: 00.900 READ SECTOR(S)
    10 00 3f 00 00 00 e0 00 20:08:00.900 RECALIBRATE [OBS-4]

    Error 163 occurred at disk power-on lifetime: 10926 hours (455 days + 6 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    01 59 07 ca bd 39 e2 Error: AMNF at LBA = 0x0239bdca = 37338570

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    20 00 07 ca bd 39 e2 00 20:08: 05.400 READ SECTOR(S)
    20 00 07 ca bd 39 e2 00 20:08:00.900 READ SECTOR(S)
    10 00 3f 00 00 00 e0 00 20:08:00.900 RECALIBRATE [OBS-4]
    91 00 3f 3f ff 3f ef 00 20:08: 00.900 INITIALIZE DEVICE PARAMETERS [OBS-6]
    20 00 07 ca bd 39 e2 04 20:08:00.900 READ SECTOR(S)

    Error 162 occurred at disk power-on lifetime: 10926 hours (455 days + 6 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    01 59 07 ca bd 39 e2 Error: AMNF at LBA = 0x0239bdca = 37338570

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    20 00 07 ca bd 39 e2 00 20:08: 00.900 READ SECTOR(S)
    10 00 3f 00 00 00 e0 00 20:08:00.900 RECALIBRATE [OBS-4]
    91 00 3f 3f ff 3f ef 00 20:08:00.900 INITIALIZE DEVICE PARAMETERS [OBS-6]
    20 00 07 ca bd 39 e2 04 20:08:00.900 READ SECTOR(S)
    20 00 07 ca bd 39 e2 00 20:07:56.500 READ SECTOR(S)

    SMART Self-test log structure revision number 1
    No self-tests have been logged. [To run self-tests, use: smartctl -t]


    Device does not support Selective Self Tests/Logging
    linux:~ #

    ~~~~~~~~~~~~~~~~~~

    5 years and still going! So much for MTBF!
    D T Schmitz
    • Anyone else think ...

      ... that these words landed with a thud:

      "5 years and still going! So much for MTBF!"

      That's really tempting fate!
      Adrian Kingsley-Hughes
      • I guess I am tempting fate! :)

        nt
        D T Schmitz
    • Unfortunately the SMART system...

      Isn't always reliable; the drive can fail without warning, and you can also get a warning of drive failure and have the drive continue to work for years.
      mrlinux
  • 5,000,000 hrs MTBF: Intel stomps into flash memory

    [url=http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9012941]fyi[/url]
    D T Schmitz
  • Hate the way people use statistics

    You're absolutely right about the MTTF only meaning that out of 100 drives, X number of drives can be expected to fail within the first year.

    And you're also right that it requires a good amount of statistics to give a more accurate description of the phenomenon.

    Personally, I deplore the use of a naked statistic such as just a mean, as it really doesn't do a very good job of describing the range of variability of failures. Inclusion of the standard deviation, or even the median and mode values, can help provide a better picture of how often and when failures occur. Sure, 1 out of a hundred drives might fail during the first year, and you might find a drive still running fine at the 10-year point; but if the other 98 drives failed at the 13-month point, that would certainly indicate that this was an HDD you shouldn't buy.
    Dr_Zinj
  • Floppy drives #1 fail points per operating hours

    In my experience the list goes like this:

    #1 for failures per operating hour has to be the lowly floppy drive - the less you use them, the quicker they seem to die. Mostly due to environmental issues like dust and dirt infiltration. Keep them clean and they will last much longer (the same goes for CD-ROM drives). Floppy drives are becoming increasingly obsolete nowadays though.

    #2 is indeed the hard drive. Older drives used to "seize up" when being powered down after running for a while (a condition known as stiction). I've experienced the dreaded "pinging" sound of the head hitting the platter, the squeal of a failing bearing, and the eerie quiet of a completely dead drive.

    #3 any and all manner of cooling fans - bearing issues mostly.

    #4 power supplies

    Basically - anything mechanical will usually fail first.

    I have rarely replaced circuit boards due to failure - mostly just due to upgrades.
    Labrat636