What is the primary determinant of drive life? I've read the latest research and talked to insiders. There are so many variables that the best answer is just . . . luck. Why is that? Is there anything you can do?
- Enterprise drives are no more reliable than consumer drives. The extra money seems to go for margin and warranty costs. These are mass-produced products. There is no secret to making a disk last three times as long.
- SMART drive status reporting is pretty useless. If SMART is telling you there is a problem, there probably is a problem. But if it reports no problem, that means nothing.
- Workload has no effect on long-term drive life, so use the drive all you want. Google did find that a heavy workload increases infant mortality, so when you install a new drive, work it hard. If it is going to fail early, you can then replace it while it is still economical to do so.
- Ambient temperature has very little effect on drive life until it gets above 104°F (40°C). Even then the effect is slight.
So what does affect drive life? The research shows that drives are much less reliable than vendors commonly claim. There are two major reasons for this discrepancy:
- Vendor numbers are based on accelerated testing, which means high-temperature operation. That just isn't a very good simulation of real life, especially real consumer life. But it may explain why drives aren't much affected by temperature.
- A high percentage of drives returned as failed come back "no trouble found" from vendor testing. This probably says more about the quality of the testing than about the drives.
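To put a number on what a vendor claim implies, here is a minimal sketch. It assumes the claim is stated as an MTBF in hours, as spec sheets usually do, and that failures occur at a constant rate over the year. The 1,000,000-hour figure and the function names are hypothetical, chosen only for illustration, and are not taken from any particular data sheet or study.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 power-on hours, assuming the drive runs all year


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate implied by an MTBF, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """Inverse conversion: the MTBF that would produce a given annualized failure rate."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


if __name__ == "__main__":
    claimed_mtbf = 1_000_000  # hypothetical spec-sheet value, in hours
    print(f"Implied annual failure rate: {afr_from_mtbf(claimed_mtbf):.2%}")
```

Under these assumptions a 1,000,000-hour MTBF works out to an annual failure rate a little under 1%. Running `mtbf_from_afr` on a failure rate you actually observe in your own records gives a figure you can compare directly against the vendor's claim.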
The top issues in drive life:
- Drive age. There is some infant mortality, but not much. The big issue is that once drives are three years old their annual failure rates skyrocket.
- Handling. Dropping a drive is a bad idea, even a couple of inches onto a table. I saw evidence in the 1990s that reducing drive handling to the minimum required to install it improved reliability by 20% or more. There have been many improvements in shock specs since, so this may be less valid, but it still makes sense. Drives are mechanical devices. Don't knock them around.
- Early production quality. Can't wait to get your hands on that new 4 TB, 15K drive? You could be buying a problem. Drives built in the first three months of a new model's production typically have a higher failure rate. After that the factory line settles down and quality goes up.
- Statistical variation. Google and CMU looked at 100,000 drives each in their studies. Most of us have very small sample sizes and don't keep very good records. But the data shows that drives fail for no apparent reason at all ages and in all environments. A drive can fail at any time without warning.
That's where the luck comes in. Here at Chez Mojo I had four working disk drives at the beginning of last week. By the end of last week I had two. Different vendors, environments, enclosures, ages, everything.
It was just my bad luck. And normal statistical variation.
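For a rough sense of how unlucky that is, here is a back-of-the-envelope sketch using the binomial distribution. It assumes the drives fail independently of each other and that each has the same constant annual failure rate. The 5% rate below is a hypothetical placeholder, not a figure from the Google or CMU studies, and the function names are mine.

```python
from math import comb


def weekly_failure_prob(annual_failure_rate: float) -> float:
    """Chance that one drive fails in a given week, assuming a constant rate all year."""
    return 1.0 - (1.0 - annual_failure_rate) ** (1.0 / 52.0)


def prob_at_least(k: int, n: int, p: float) -> float:
    """Binomial probability that at least k of n independent drives fail."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    assumed_afr = 0.05  # hypothetical annual failure rate, for illustration only
    p_week = weekly_failure_prob(assumed_afr)
    print(f"Chance a given drive fails this week: {p_week:.4%}")
    print(f"Chance at least 2 of 4 fail this week: {prob_at_least(2, 4, p_week):.6%}")
```

With these made-up inputs the chance for any one owner in any one week comes out to a few in a million. That is tiny for an individual, but spread across millions of drive owners and 52 weeks a year it happens to somebody all the time. The independence assumption also ignores shared causes such as a power event, so treat the result as a floor rather than a prediction.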
What about vendors? I think there are differences, but the conspiracy of silence among big drive consumers, including Google, means data is sparse. But I have some ideas on that for a future post. Stay tuned.
Comments welcome, of course. In a moment of brain cramp I left out the point about early drive production. I added it Friday morning.