How long do disk drives really last?

Summary: It is one of the mysteries of storage: how long do disk drives last? An online backup vendor with 75 petabytes spills the beans.

TOPICS: Storage, Cloud, Hardware

Well, the data isn't quite ready. It turns out that Backblaze – of open source Storage Pod fame – is only five years old and doesn't have enough failed drives to give us a definitive answer yet.

But the answers they do have are worth considering. 75 PB is a large sample.

The stats
Backblaze currently has a total population of approximately 27,000 drives. Five years ago that number was about 3,000. However, they've kept track of all the drives and found some interesting things that at least partly contradict earlier research from Google and Carnegie Mellon University. (See Everything you know about disks is wrong.)

They measured annual failure rates: if you have 100 drives for a year and five of them fail, that's a 5 percent annual failure rate.
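In fleet terms, the annual failure rate is just failures divided by drive-years of service. A tiny illustrative sketch (the numbers are made up, not Backblaze's):

```python
# Annual failure rate (AFR) = failures / drive-years of service.
# Illustrative numbers only -- not Backblaze's actual data.
drives = 100       # drives in service
period = 1.0       # observation window, in years
failures = 5       # drives that failed during the window

drive_years = drives * period
afr = failures / drive_years
print(f"AFR: {afr:.1%}")  # 5.0%
```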

In the first 18 months drives failed at the rate of 5.1 percent per year. For the next 18 months drives failed at the rate of about 1.4 percent per year. But after three years failures went up to 11.8 percent per year.

While that sounds bad, the good news is that after four years nearly 80 percent of drives are still working – which explains why they don't yet have an answer to the question of how long drives last.
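Compounding the piecewise annual rates above gives a back-of-the-envelope check on the cumulative survival figure, assuming each rate applies uniformly across its window:

```python
# Cumulative survival from the piecewise annual failure rates above.
# Assumption: each annual rate applies uniformly within its window.
phases = [
    (1.5, 0.051),  # first 18 months: 5.1% per year
    (1.5, 0.014),  # next 18 months: 1.4% per year
    (1.0, 0.118),  # beyond 3 years: 11.8% per year (year 4 shown)
]

survival = 1.0
for years, annual_rate in phases:
    survival *= (1 - annual_rate) ** years

print(f"Surviving after 4 years: {survival:.1%}")  # roughly 80%
```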

But extrapolating from their experience, they believe the median lifespan of a consumer drive will turn out to be six years.

The Storage Bits take
It is refreshing to see a large-scale user of hard drives break the industry's code of silence and share its experience with a large population of disks. Lots of companies have this information - I'm looking at you, Google and Amazon - and simply refuse to share it.

But if, like me, you only buy a few drives a year, this information may not apply to you. You can get drives from a batch that were marginal, dropped during shipping, or poorly handled, and you might see several drive failures from a single batch.

Or you might have drives that last for 10 years. The important thing to keep in mind is that in five years you can expect at least one in five drives to fail.

The bottom line: backup, backup, backup. Accept no substitutes.

Comments welcome, of course. Backblaze would like you to back up with them. I have no commercial dealings with Backblaze, but I like what they do. Read their entire blog post here.


  • That's not very good, is it.

    That's a very high failure rate.

    It shows that mechanical disks should not be trusted at all. Once the glue wears out, you're done.
    • SSD

      I've had about the same failure rate on SSD drives, and the speed drops quite a bit after a couple years. Unfortunately, when an SSD goes, there's no way of recovering any data. At least I can have someone swap out a platter for me if the data is that important.
      • Re SSD recovery

        I don't know if maybe it's just the brand (both were OCZ Vertex 64GB, two different models) but I have had a number of SSD installs crash (about 4-5). I have never been able to recover a backup to an SSD -- even though I run two unrelated backups, Acronis and EaseUS. In each case I wound up having to do a full reinstall.

        The problem is definitely not the backup software. In one case I wound up recovering the backup to a PATA disk that works fine.

        The first time it happened I tried OCZ's suggestions, none of which worked, and wound up having to RMA the SSD. When I tried doing the backup recovery to the new drive the recovery failed. I wound up doing a fresh install of Ubuntu (a lot faster than reloading Windows) and that works fine.
        • Perhaps corrupted backups

SSDs behave exactly like a spinning disk. Apart from some S.M.A.R.T. attributes and the much shorter access times, there is no way for anything to tell the difference. So your backup was probably just not OK. With backups, you should always test them before declaring the backup successful. Sometimes this is difficult in small installations.

          SSDs have improved tremendously over the past few years. However, they are still improving, both in performance and in reliability.

For server usage, I have declared drives consumables. They will die one day, no matter what. The question is not if, but when. Therefore, I run no server without storage redundancy, and when a drive dies, I just replace it (of course, backups are an entirely different and unrelated task). I started using SSDs for caches and transaction logs (I am a heavy user of ZFS) and treat them the same way: as consumables. When one wears out, just replace it. Thing is... with the improvements in technology, it is becoming increasingly difficult to wear out an SSD these days.

Recovery from an SSD is harder, for many reasons. The technology is also very much in flux, and not many invest in tooling to make that process easier.
        • My experience w/ OCZ SSDs is similar

          I have a pair of OCZ 120GB Agility3 SSDs. I was having issues using them in a RAID1 mirror for a Win7 install. After jumping through more hoops than I thought was necessary, they finally RMA'd them. I received a pair of refurbed Agility 3s. These behave a bit better, but I still have the odd issue now and then even with these.

          I'm on the fence about requesting an RMA again. I'd rather not have another pair of OCZs. But, I'm stuck with these for now. I definitely won't be buying SSDs from OCZ again.
        • A little different

I usually set things up a little differently. I use a CF-to-SATA or CF-to-PATA adapter (depending on the era) and install an 8GB, 16GB, or 32GB (depending on the era) CF module with a fast read time. I don't care about write time because I load the OS there and almost never write to it (noatime set). I then install either a spinning-platter hard drive or a second CF module with a fast write time, and partition that for swap, /var/log, and /tmp. If it's the spinning-platter type, I usually also add /home.

I have at least one system running on an 8GB CF module with a 4GB CF swap module. (One of the programs demands a swap partition. Shame on those developers.) The 8GB module has never needed to be replaced in the four years that system has been running. The 4GB swap CF has a lifetime of about 18 months. That system is on top of a mountain, at the base of a radio tower. It was the first system we tried this combo with. We had tried hard drives there, but no hard drive ever lasted more than 14 months. They don't like the weather, power outages, or power fluctuations, even with our power-hardened UPS.

          When the swap drive fails, the system still boots and runs (except for the one program that demands the swap partition which fails with all sorts of error messages filling the console screen), and I can log into it and diagnose it from a distance. And then I know what I need when I head up there. It's turned out to be such a great system, I'm now implementing it on almost all of my linux and BSD systems.

          It would have been nice if MS had made it easier to place the Swap file and the user directories on separate partitions. (I know it can be done, but I said "easy".) [And it would be nice if the Linux/Unix community had organized their file system differently based on expected writes.] Of course, no one thought about that kind of thing in the early days of computers so we're stuck with the legacy for now.
        • SSD predictability

          You can predict when it is going to fail, meaning that you'll have a chance to back up just before it fails. HDD failure is completely unpredictable.

Interesting. I think most SSD makers claim an MTBF of 250,000 hours ...

        ... makes you wonder.
        M Wagner
The six-year figure sounds reasonable to me.

There are roughly 8,760 hours in a year (running 24/7). That comes out to about 52,500 hours over a six-year lifespan. Most HDD makers claim an MTBF of 50,000 hours, so this seems fairly consistent - and I have replaced enough hard drives in the last 33 years to feel very comfortable with that figure.
      M Wagner
  • I think it depends on usage scenario as well.

In a server array, drives run at a fairly constant temperature and are more likely to be spinning constantly, as opposed to the many cold starts in a desktop/laptop scenario. Remember, these things are made of metal, and metal loves to expand and contract with temperature. I would expect lots of cold starts to be more detrimental to drive life than constant hours on, so I would expect a desktop/laptop drive used eight hours a day and turned off overnight to fail even sooner.
    Alan Smithie
    • My experience aligns with your suggestion

I have servers still running 15-20 year old drives. Yes! The key is to avoid restarting the drives frequently. I have lost more drives during shutdowns and restarts than during normal operation. I believe this is why in recent years drives have been rated for a number of load/unload operations, or spin-ups.

Drives generally fail when they are new, perhaps because of manufacturing defects or damage during shipping.

      But most frequently, drives have died on me during power fluctuations or because of faulty power contacts. Quality power supplies and connectors are therefore essential if you have large arrays.

Because of considerations like this, I run most of my desktops and workstations completely diskless. Today's Gigabit Ethernet is plenty fast (and dirt cheap). Disks sit on servers, where they spin all the time under controlled conditions - power, temperature, no reboots, etc. I can also leverage the large number of spindles and SSD caches/accelerators to get negligible seek times.

      Funny enough, in the last few years I have seen more failures from server grade SAS disks, mostly Toshiba drives, than from consumer grade SATA drives.
  • The good news

The good news is that hard drives are now cheaper per GB than dual-layer DVDs for data archiving. I can purchase two 3TB hard drives and mirror the same archived data on both, and it's still cheaper than burning it to dual-layer discs once. The risk of both drives failing is - in my opinion - less than that of a dual-layer disc becoming unreadable.
    It's more convenient to access the data as well.
    • Absolutely

I gave up archiving on DVDs years ago. Not only are HDDs cheaper and provide random access, they also occupy much less space for the same amount of data.

Most DVDs become unreadable in just a few years... With HDDs, you can always rewrite the data.

Plus, you can use redundant storage like ZFS, where multiple drives form a single redundant volume, so you always know when data is corrupted and it self-heals from the redundancy. You just need a compact multi-bay storage box.
    • Except that ...

Online solutions like RAID and non-RAID mirroring are great. I use them extensively. But they do not protect against a whole array of possibilities: accidental deletion, malware vandalism, intrusion, corruption caused by things like RAM failure, etc. I admit these are remote possibilities, but they ARE possibilities. Which is why I also back up to Blu-ray. Optical disc backups may indeed go bad, but they don't simultaneously corrupt or suffer data loss along with the system drives. As long as you can read the disc, it is a guaranteed accurate snapshot in time, no matter what has transpired on the system itself. It is, like a backup on tape, a true backup.
      George Mitchell
  • There is weight in numbers.

It is interesting to see how vendors' answers change when you have statistically meaningful failure rates for their product.
Of course, I'm never sure if the person I'm talking to is just too lazy to give me an accurate answer or is just toeing the company line.

Years ago we purchased 96 identical PCs in one order and used them in exactly the same way. A third failed within two weeks. I called the vendor and asked if they'd had production issues or anything negative with their new machine design. They swore everything was perfect.
When I mentioned I was looking at a dead machine, they insisted I was the only one in Australia and it was most unusual.
When I gave them the complete statistics and said I'd lost confidence in their products, I heard a sigh, and then honesty: vibrations in shipping from Singapore were causing hairline fractures in the motherboards and daughter-boards. Due to poor mounting-point design, all units were likely to have their lifespans compromised. They were hoping to avoid a million-dollar product recall by just replacing those that died within warranty.

I could give other examples. My suspicion is that the first-year 5% failure rate reflects saving money by not testing every aspect of every product, but just sampling. The real failure rate for "successfully manufactured" drives would be a fraction of 1%, rising to ~1% in the first year, and so on.
  • What kind of drives

It would be helpful to know what kind of drives were used in the study. Assuming they were not all the same manufacturer, size, etc., what were the failure rates broken down across that matrix? For example, SATA drives typically have a lower MTBF than nearline SAS drives.
    • Stay tuned...

      We did use a variety of drive manufacturers, sizes, etc. We're digging in to see if we have captured the data in fine enough detail to be able to share results at this level.
      • Detailed stats would be very helpful

Obviously, make and model statistics would be helpful to the world. It would also be interesting to know whether there is a correlation between infant mortality and long-term reliability. One of the problems with picking drives is that you can't assume one manufacturer has better reliability based on experience with older drives, because each generation of drives can behave very differently from its predecessors. If it turns out that there is a correlation between infant mortality and long-term reliability, we could make some assumptions about current-generation drives based on short-term statistics. Of course, if the two turn out to be unrelated, then the short-term stats won't predict anything about the long term; they would still be the best way to choose a drive, though, because the short-term failure rates are so high. (BTW, my experience is worse than you've reported, but my numbers aren't statistically significant because I only buy a handful of drives each year.)
  • I can't recall the last drive I've replaced.

Not saying my anecdote means anything or is a usable sample size, but I can't recall the last hard drive I've replaced. That said, I am very much in the MULTIPLE (meaning one off-site) backup camp.
    • Not too worried...

I've owned several hard drives since 1988 and have had maybe three fail, with only one in such a state that I could not get anything off it (I have backups anyway). I had to replace a hard drive in my NAS last winter - the first time in a while. It had been running 24/7 in the NAS for nearly three years, and had started its life in a PC I had for about five years before moving it to the NAS.