HDD warming: global data threat?

Summary: If there's one thing I hate, it's unsettled science. For instance: the effect of temperature on disk drives. Shorten their life or not? Most studies say no - including a new one - but Microsoft researchers disagree. Can’t we all just get along?


The folks at Backblaze published a detailed blog post on observed effects of temperature on disk drives. Like most studies, they didn't find one:

After looking at data on over 34,000 drives, I found that overall there is no correlation between temperature and failure rate.

But then they ruined it - damn you, Backblaze! - by linking to a study by Microsoft and UVA researchers who DID find an issue. That blew my day as I had to, you know, look at the data and THINK.

Hate that. But here goes.

The Backblaze data
Backblaze looked at 17 drive models from Seagate, WD, Hitachi and Toshiba. Author Brian Beach used a point-biserial correlation coefficient on drive average temperatures and whether drives failed.

He found one drive - a Seagate 1.5TB Barracuda LP - that had a weak but statistically significant correlation between failure rate and higher temperature. The Annual Failure Rate (AFR) doubled from cool drives to warm (above average temperature) drives. But because so many continued to work fine at any temperature, the correlation was weak.

Two more models, a Seagate Barracuda 3TB and a Hitachi Deskstar, showed weaker correlations - but in opposite directions. The Hitachi failed slightly more often at 21°C than at 31°C, while the Seagate failed slightly more often at the higher temperature.

Oh great! Now too cold is bad too.
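If you want to try the same kind of test on your own drive logs, here is a minimal sketch of a point-biserial correlation in plain Python. It is not Backblaze's code: the function name, the input layout (one average temperature and one failed/not-failed flag per drive) and the sample numbers are all assumptions made up for illustration.

```python
import math


def point_biserial(temps, failed):
    """Point-biserial correlation between a continuous variable (temps)
    and a binary one (failed: truthy = drive failed, falsy = still working).

    Hypothetical helper for illustration; inputs are parallel sequences.
    """
    if len(temps) != len(failed):
        raise ValueError("temps and failed must have the same length")

    n = len(temps)
    temps_failed = [t for t, f in zip(temps, failed) if f]
    temps_ok = [t for t, f in zip(temps, failed) if not f]
    n1, n0 = len(temps_failed), len(temps_ok)
    if n1 == 0 or n0 == 0:
        raise ValueError("need at least one failed and one working drive")

    # Population standard deviation of all temperatures.
    mean_all = sum(temps) / n
    s_n = math.sqrt(sum((t - mean_all) ** 2 for t in temps) / n)
    if s_n == 0:
        raise ValueError("all temperatures are identical")

    mean_failed = sum(temps_failed) / n1
    mean_ok = sum(temps_ok) / n0

    # r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2)
    return (mean_failed - mean_ok) / s_n * math.sqrt(n1 * n0 / n ** 2)


if __name__ == "__main__":
    # Made-up sample: average temperature in Celsius, 1 = failed.
    sample_temps = [24.0, 25.5, 27.0, 28.0, 29.5, 31.0]
    sample_failed = [0, 0, 1, 0, 0, 1]
    print(point_biserial(sample_temps, sample_failed))
```

A value near zero means warm and cool drives fail at about the same rate; the sign tells you which direction any relationship runs. If SciPy is available, `scipy.stats.pointbiserialr` computes the same coefficient and also returns a p-value.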

Microsoft/UVA study
The 2010 Microsoft study, Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures by Sriram Sankar, Mark Shaw and Kushagra Vaid of Microsoft and Sudhanva Gurumurthi, U of Virginia, came to very different conclusions:

1) We show strong correlation between temperature observed at different location granularities and failures observed. . . .

2) Although average temperature shows a correlation to disk failures, we show that variations in temperature or workload changes do not show significant correlation to failures observed in drive locations.

3) We . . . show that Chassis design knobs (disk placement, fan speeds) have a larger impact than tuning Workload knobs (intensity, different workload patterns), on disk temperature.

4) With the help of Arrhenius based temperature models and the datacenter cost model, we . . . show that datacenter temperature control has a significant cost advantage over increased fan speeds.

Here's a couple of relevant tables:

[Table: MS/UVA study, temperature vs. AFR]
[Table: MS/UVA study, temperature acceleration factor]
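For a feel of what an Arrhenius-based model implies, here is a rough sketch of the standard Arrhenius acceleration factor, AF = exp((Ea/k) * (1/T_ref - 1/T)), with temperatures in kelvin. The function name and the default activation energy of 0.46 eV are placeholders chosen for illustration, not values taken from the MS/UVA paper; plug in the figures from the study or from your drive vendor.

```python
import math

# Boltzmann constant in electronvolts per kelvin.
BOLTZMANN_EV_PER_K = 8.617333e-5


def acceleration_factor(temp_c, ref_temp_c, activation_energy_ev=0.46):
    """Arrhenius acceleration factor of temp_c relative to ref_temp_c.

    Both temperatures are in degrees Celsius. The default activation
    energy is an assumed placeholder, not a value from the study.
    """
    temp_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / ref_k - 1.0 / temp_k)
    return math.exp(exponent)


if __name__ == "__main__":
    # How much faster does the model say drives age at each temperature,
    # relative to a 30 C baseline?
    for temp in (30, 35, 40, 45, 50):
        print(temp, round(acceleration_factor(temp, 30), 2))
```

A factor above 1 means the model predicts faster wear than at the reference temperature, a factor below 1 means slower. The result is very sensitive to the activation energy you assume, so treat the output as a what-if, not a forecast.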

Drive vendors have their say
Most drives today are spec'd at a 60°C (140°F) or even 70°C (158°F) operating temperature. Per the MS-UVA study, it is the average temperature, not variations in temperature, that affects drive life the most. If drives get really hot once in a while, not a big deal.

And hey, they say they'll operate, not that they'll last.

Reconciliation, to a point
Look at the data: Backblaze temps stop at 31°C while the MS/UVA study showed that AFRs are relatively flat up to 33°C and then start climbing. Not much disagreement between Backblaze and MS/UVA.

The Storage Bits take
One of the most popular myths about disk drives is that they are very sensitive to temperature. That may have been true 20 years ago, but it is clearly less so now. The drive vendors seem unconcerned as well.

Given that most users have a few dozen drives of mixed age, vendor and chassis at most, these statistical musings have little predictive value. If you are running a data center and have thousands of drives, you should do a more careful analysis of the tradeoff between energy costs and increased disk failures.

The hidden storage market - between the 3 drive vendors and 8 or so Internet giants - is driving storage requirements now, not PCs or the enterprise. These warehouse-scale systems are designed to tolerate drive failures gracefully, much more so than most enterprise infrastructures.

Eyeballing the stats from these and other studies, most enterprises should aim for about 35°C (95°F) disk temps in temperature controlled data centers. Save money and reduce global warming.

Comments welcome, as always. Scientists, always picking at each other: feature, not a bug. People who say "settled science" don't understand science.


Talkback

13 comments
  • Piffle.

    There are two places disk drives can fail

    1. bearing.
    2. heads.

    Spindle bearings fail if the lubrication fails - and the oil used is designed to work across a fairly wide range. As oil ages though, it has a tendency to cause failures when cool... Which is why disks fail to spin up after flawless operation for years.

    Read/write heads have a problem not directly due to temperature, but due to deformations of the surface they are flying over caused by uneven temperatures. The lubrication on the arm can add to the failures - but this oil gets very little use compared to the spindle.

    Most (not all) of the disks I have have worked for over 10 years without an error... The only time they have been off is when there is a lengthy power failure (short term failures are covered by the 1KVA UPS unit, which has enough power for about 1/2 hour for this particular system).
    jessepollard
    • Yes, and . . .

      Roughly half of disk failures are electrical, not mechanical, in nature. That's why data recovery firms keep a copy of every disk model they can, because if they replace the electronics they have a good chance of a very quick recovery.

      Robin
      R Harris
  • This would be a concern if global warming were real

    “The fact is that we can’t account for the lack of warming at the moment and it is a travesty that we can’t. The … data published in the August … 2009 supplement on 2008 shows there should be even more warming [sic. - data actually shows cooling for past decade]: but the data are surely wrong. Our observing system is inadequate.”
    - ClimateGate email

    “Climate change [provides] the greatest chance to bring about justice and equality in the world.”
    “No matter if the science is all phony, there are still collateral environmental benefits”
    - Christine Stewart, Canada’s former env. Minister

    French ex-President Jacques Chirac: the Kyoto Protocol represents “the first component of an authentic global governance.”
    harvey_rabbit
    • Completely off topic

      Robin is talking about HDD warming, which is a different subject.
      John L. Ries
  • Uh-huh.

    Global warming is real. CO2 is the major reason. But deniers got to deny.

    Gotta love my fellow baby boomers: leaving the world worse for our kids and we don't care.

    Robin
    R Harris
  • This is hardly a statistically rigorous study

    The range of temps reported is quite narrow and clearly there are other factors that are driving failure before temps become an issue.

    I think the conclusions would be a lot different if the range of temps reported were btw -100 C and 100 C. This highlights why carefully designed studies are much more informative than just having a bunch of operations data at your disposal.

    IOW the only conclusions that Backblaze can draw apply to them only. I have read their other reports on drive reliability, and sadly, as much as I would like to, I just can't apply their conclusions to my own personal use case.
    CornheadsBack
    • Is this still true? That sounds like old information.

      Back when I was working in data recovery (Mace Utilities then owned by Fifth Generation Systems) that was definitely true but electronics tend to be more reliable now even if more complicated. New soldering, better part placement and SMT have improved things quite a bit. Moving parts and high precision tend to work against each other.

      Did Blaze give any indication of root cause of failure?
      MeMyselfAndI_z
      • IIRC

        The problem that Backblaze highlighted as the most severe is vibration when they have so many drives spinning in a rack. But that ignores infant mortality.

        My recommendation is head over to their blog and read up urself. I mean, kudos to Backblaze for doing what they do. But only you can determine if their experience has any relevance to ur use case.
        CornheadsBack
    • That range is clearly outside the design range.

      The tests should only cover the spec'd range not extremes. And the range of temps has little to do with statistics. You either have correlation or you don't. That said the range of temps seem quite reasonable as they mirror real-world ranges.

      They cover a lot of drives and recorded the metrics. I'll take that over guesses any day of the week.
      MeMyselfAndI_z
  • I've run HDDs in heated environments and have...

    a program that notifies me if a drive goes past 50C. I've recently gotten a tower with built-in fans, so now I can keep the drives down to 25C. And I haven't seen too many desktop drives go bad, although I have a few older ones that are having issues; their remaining life is probably short now that they are experiencing problems.

    But I think hard drives fail because they work too hard, get bumped, or are exposed to excessive heat or cold. If you're blowing ice-cold air on them, or not blowing any at all, and swing the temperature too much, like going from 10C to 80C, you could wreck the components quite a bit. Don't start a cold hard drive that is going to go up to high temps, and drives that run at high temps should not have their temperature lowered. I'm pretty sure the materials inside a hard drive contract and expand as the temperature fluctuates. That shouldn't be a problem, though, if you try to maintain the temp the drive has had since startup.

    I haven't had too many drives fail on me, but generally I think they fail due to temperature fluctuations. If you take a PC from the garage and try to boot it up before letting the drive get to room temp, that could cause problems. There are always fluctuations, like from 20C to 30C, but the fewer extremes you have, the better the chance your drive will make it out alive. Still, I have found it best to keep a few drives handy and to buy a new one within a 10 year period to use as a backup. If you've got a 5+ year old drive and it fails, you'll want another one keeping the data safe. And if the backup drive is 10+ years old, it does no good when your main drive dies, because you chose to use the newer drive and store your info on the older one.
    MidnightDistortions
  • The data shows three distinct data ranges

    Here's my 2 cents, as an ex-statistician.

    From the Microsoft chart, there seem to be three very different temperature ranges involved in the HDD failure rate:

    (1) Below an average of about 23 C, failure rates of HDDs are negligible
    (2) At about 23 C, the failure rate jumps, but is independent of specific temperature in the range of 23 to 33 C. This is totally consistent with the Backblaze results.
    (3) Above about 33 C, the failure rate climbs linearly with temperature.

    All this suggests the two studies are consistent. It also suggests three things:

    (1) Keep your average HDD temperature below about 23 C, and you are just fine.
    (2) If you are unable or unwilling to do that, keep them below 33 C.
    (3) If you are unable or unwilling to keep them below 33 C, you are an idiot, and deserve everything that you get.
    Ian Easson
    • Didn't know there was such a thing as an ex-statistician

      There are inactive statisticians and retired statisticians, but the word "statistician" describes expertise, not occupation.
      John L. Ries
  • Of course Microsoft doesn't want people

    To use local HDDs. Microsoft wants you to store everything "to the cloud". Never mind that you'll actually still be storing it on an HDD, just that Microsoft will have unfettered access to it 24/7/365.
    I hate trolls also