Enterprise disks: worth the money?

Summary: Enterprise drives claim million-hour-plus MTBFs at a hefty price premium. Consumer drives make do with ≈400,000-hour MTBFs. But is there really a difference? New research says no.

TOPICS: Storage, Cloud, Hardware

The enterprising folks at Backblaze continue to surprise. In a blog post released this morning, Backblaze talks about its experience with enterprise drives.

Annual Failure Rate (AFR) is the preferred metric for measuring drive life. Unlike MTBF, it is easy to measure and readily translates into expected behavior. AFR is the number of drive failures divided by the number of drive-years.

One drive running for 12 months is one drive-year. So is 12 drives running for one month.
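A minimal sketch of that arithmetic, using Backblaze's enterprise numbers (the function names are illustrative, not from the post):

```python
def drive_years(drives: int, months: float) -> float:
    """One drive running for 12 months is one drive-year."""
    return drives * months / 12

def afr(failures: int, years: float) -> float:
    """Annual Failure Rate: failures divided by accumulated drive-years."""
    return failures / years

# One drive for 12 months equals 12 drives for one month:
assert drive_years(1, 12) == drive_years(12, 1) == 1.0

# Backblaze's enterprise fleet: 17 failures over 368 drive-years.
print(f"{afr(17, 368):.1%}")  # 4.6%
```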

Backblaze currently runs about 25,000 consumer SATA drives. They have accumulated almost 15,000 drive-years and have replaced 613 drives.

They have a smaller number of enterprise drives: six shelves of Dell PowerVault storage and one EMC storage system with 124 enterprise drives. They also have one Backblaze storage pod – 45 drives – running with enterprise drives for experimental purposes.

The enterprise drives are a mix: SAS 15k RPM drives, SATA 7200 RPM, SAS 10k RPM, and a few SSDs thrown in for good measure. With the exception of the Backblaze storage pod, all of the drives are installed in enterprise-class enclosures with excellent cooling, vibration dampening and high-quality power supplies.

Enterprise RAID drives are designed to limit retries so a single failing drive doesn't drag an entire LUN down. But this doesn't seem to make a difference in observed drive failure modes.

As Gleb Budman, Backblaze CEO and co-founder put it in an email to me:

We have limited visibility into the drive stats in the commercial storage systems as the vendors have chosen to not expose the SMART stats. We do get errors in our logs from the enterprise drives just as with the consumer drives and these are typically read/write timeout errors that do not appear to be qualitatively different between the drives.

They have accumulated 368 enterprise drive-years and had 17 failures. That is an AFR of 4.6%, vs. the 4.2% AFR observed on consumer drives.
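For perspective, a claimed MTBF implies an expected AFR via the rough steady-state conversion AFR ≈ hours-per-year / MTBF (a back-of-the-envelope sketch, valid only while failure rates are small):

```python
HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours

def implied_afr(mtbf_hours: float) -> float:
    """Rough AFR implied by a vendor's claimed MTBF."""
    return HOURS_PER_YEAR / mtbf_hours

print(f"{implied_afr(1_000_000):.1%}")  # enterprise claim: 0.9%
print(f"{implied_afr(400_000):.1%}")    # consumer claim: 2.2%
```

Both implied rates are well below the 4.6% and 4.2% actually observed, which is the heart of the article's argument.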

The Storage Bits take
There are probably a couple of thousand engineers in the storage industry who know the facts about the actual failure rates of different drives and different manufacturers. That's because both vendors and OEM buyers track it.

But until Backblaze came along, no one was willing to talk about it. Since Backblaze buys its drives on the open market and likes publicity, it can talk freely.

Backblaze's conclusions are not surprising. The only other major study (see Everything you know about disks is wrong) also found no significant difference in drive life.

So what does this mean to you? Simply this: focus on the cost and performance of drives, not their putative MTBF or warranty periods.

Disk drives are incredibly complex precision devices that make the finest Swiss watch look like a dump truck in comparison. And they cost a lot less.

Yes, drives fail. But you should always have at least two copies of your data on different drives. Disk storage is cheap: buy plenty!

Comments welcome, as always. Read the Backblaze post here. What has your experience been with enterprise drive reliability?




  • It's not apples to apples

    SAS 10k and 15k drives spin much faster and likely get constant use. A big SATA drive just sits there with archived data being unused.

    Kinda like a race car engine that lasts only one or a few races.
    • But the bottom line is

      how is the MTBF difference justified? Enterprise drives are designed for heavy use and have a much higher MTBF, and yet - somehow - they have the same failure rate as consumer drives.

      Something isn't right!

      R Harris
      • Lies, damned lies and statistics.

        Along with inches of display and "GB" vs "GiB" storage capacity, MTBF is manipulated by evil marketers - the "F" in MTBF in particular. MTBF calculations de-emphasize "infant mortality" cases, and manufacturers' testing may focus on the performance of the moving parts (how many read or write errors from or to the platters, bearing failures, head failures, etc.) and not adequately replicate real-world conditions. On a bench-top the temperature might be lower than in a fully loaded server rack, there could be power harmonics, and a server might encounter an unforeseen bug in the firmware of a drive that might brick the unit.

        Multiple studies have shown that a good rule of thumb with MTBF numbers is to divide by three to account for proper weighted averages, more realistic environments, etc. Good mitigations for failure risk are to use RAID 6 or some other configuration that can tolerate 2 or more failures, to use a larger number of smaller drives in RAID so rebuild times are shorter, or to use JBOD instead of traditional RAID and handle data replication/fault tolerance at the software level (LVM, ZFS, Gluster, etc.).
        Mark Hayden
      • Even less reliable in my experience...

        I made a decision to upgrade the specification of the hard drives in all new machines by cloning the originals to WD 250GB RE3 "Enterprise" disks and keeping the original Seagate ones for emergency recovery. What a mistake that was - nearly every one of about 20 has failed within 12 months of installation. Luckily Western Digital has honoured the 5-year warranty and sent me replacement disks. The machines are EPOS tills which run 24/7, 365 days a year, under shop counters which often get too warm, but still that is what I thought "Enterprise" meant - in other words, industrial quality. Other consumer disks from Seagate running in the same scenario last 5 or 6 years. Some old 20GB 5400 RPM Seagates have been running for 10 years non-stop in the same environment! I know it is a small sample of about 60 machines in total, but that is my 2 cents worth.
        • EPOS systems a typical enterprise environment?

          Those EPOS systems may not be a typical environment for enterprise drives, and maybe that is the problem. The designers of enterprise drives make assumptions about the environment in which they are intended to run. Look at a server room. Good ventilation. System chassis built to minimize vibration. Compare with a retail environment. System shoved under a counter and who knows how hot it gets under there. And then employees bump and kick the computers by accident. Plus Enterprise drives with their faster RPM and data transfer rates are certainly designed with much smaller tolerances than garden-variety drives (or 20GB Seagates!)
    • It sure is "apples to apples"

      "Spartan to Red Delicious" perhaps, but still apples-to-apples.

      The typical use case is irrelevant here; the testing methodology is the same for both in this study. All the drives, enterprise-grade or not, have been placed in a controlled data centre environment and run constantly over a long period of time, under similar loads. So what if the average purchaser wouldn't do that, if the test itself is consistent across all drives?

      The only difference I see that might justify the cost is the RPMs. All drives seem to have roughly the same annual failure rate, but SAS enterprise drives spin up to twice as fast, so their parts could be considered "twice as good" as they will have moved twice as much over the same period.

      However, this gives purchasers some food for thought. If drive throughput is not the bottleneck in the application and storage capacity is the prime requirement, then why go with enterprise drives? These two studies make clear that the price differential does not meaningfully increase reliability. That is good evidence that in applications where capacity is paramount and the bottleneck is elsewhere (such as network bandwidth), the cost of enterprise drives is not justified.
      Mark Hayden
      • Agree 100%

        Mark - your points are spot-on. In our case, all the drives are running 24x7 as designed for their use case.

        Also, we specifically want "slow" drives for most of our storage since they are lower power and properly architected backup does not require high-speed from any single drive.
      • agreed, but

        You need to account for the different warranties manufacturers give for enterprise and consumer drives. Enterprise drives usually have a 5-year warranty; consumer drives have 1-3 years. Taking this into account, I always weigh costs against benefits, since out-of-warranty drives live on borrowed time - so I have started treating drives as consumables: they deliver a certain capacity/lifespan at a certain cost. If they live longer, that is a bonus.

        As for speed, in large installations with hybrid storage, adding a few SSDs can make an array of consumer disks perform better than an array of enterprise disks with no SSDs. And hybrids with enterprise disks and SSDs do not perform much better.
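The consumables framing above reduces to a simple cost-per-warranty-year comparison (the prices below are hypothetical, purely for illustration):

```python
def cost_per_warranty_year(price: float, warranty_years: float) -> float:
    """Treat the drive as a consumable: the cost of each in-warranty year."""
    return price / warranty_years

# Hypothetical prices, for illustration only:
enterprise = cost_per_warranty_year(250.0, 5)  # $50 per warranty-year
consumer   = cost_per_warranty_year(120.0, 3)  # $40 per warranty-year
```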
  • Hmmm

    I did recently purchase a Seagate Constellation for a home machine, but I based the decision more on its excellent ratings at NewEgg and not-too-ridiculous price than the mere fact that it was an enterprise-class drive. The five-year warranty is nice, though, as I don't make a lot of drive purchases and would really rather not shell out for a new one for as long as possible.

    It seems to me that a good drive is a good drive regardless of how it's categorized, and if a particular model has a reputation for performing well and reliably in real-world scenarios then it may be worth a price premium on that basis alone.
    • No drive is a "good" drive

      Ginerva - looking at the ratings of a drive on NewEgg is a perfectly valid way of deciding which one to buy. However, no drive is "good" in the sense that it won't die - they all will at some point. I, obviously, would recommend a good backup ;-)
      • Well, yes.

        Everything breaks down eventually, and I keep redundant backups even of hard drives whose integrity I trust (within reason). But I do draw a significant distinction between models whose owners report that they go belly-up in large numbers within a few months or a year, and models conspicuously lacking such reports. Obviously it's possible to strike it unlucky with any hard drive you purchase (and I have before), but to whatever extent it's possible to minimize one's risk, doesn't it make sense to do so?
        • Yes, pick the best drives & backup.

          Ginerva - glad to hear you do backups. And yes, totally agree that when buying a drive, having some evidence from people that their experience has been good is excellent.

          Stay tuned also; we're hoping to post an analysis by drive vendor/type, showing which ones work better, based on the 25,000 drives at Backblaze.
  • Huh... I'm scratching my head.

    I've been a storage admin for about 14 years. I've had everything from EMC NS120s, VNX, Symmetrixes (Sym5, DMX, VMAX, etc.), DataDomains, and Centerras, to over 30 physical servers. I can tell you without hesitation that consumer-grade drives fail at least 4 times as much as enterprise-grade drives.

    However, I can also tell you that 15k drives really like to be cool. If you expose them to rougher conditions (heat), the gap narrows, as the slower-spinning 7200 RPM drives tolerate that better.
    • The data seems to disagree.

      Hey John, interesting to hear about your experience. However, based on running 25,000 consumer drives and the (definitely more limited) Dell & EMCs full of enterprise drives, the data shows the consumer drives to work better.

      Also, as for heat, Google did a 100,000-drive study which found that heat generally helps drives (rather than hurting them) as long as the temperature is within the drive's specifications. That was based on consumer drives, so it is possible that heat affects enterprise drives more, but that would point even further to consumer drives being better.
  • SMART data ain't so smart

    My experience with all brands of drives indicates that many vendors pay lip service to SMART, which is supposed to be an industry standard, but often comes up short in defining the current operational state of a running drive. This is not good for the disk drive industry. They gotta get better at SMART reporting.
    • SMART coverage isn't that helpful

      SMART only looks at ≈50% of the issues that signal imminent drive failure - those related to reading and writing. But the rest of the drive is subject to sudden component failure that SMART can't predict.

      Bottom line: SMART isn't much help.

      R Harris
      • SMART isn't any help on SAS...

        Many of the enterprise drives you're talking about above are likely SAS interfaces, and SAS does not specify what SMART parameters (other than temperature) the drive manufacturer must support. There is a SMART (it's called IEC on SAS) log page, but the only standards based reporting there is temperature (although at least some of the drive OEMs require expanded reporting in the IEC log page).

        That said, the SMART reporting that happens in SATA drives is virtually useless too - the 100 or so defined parameters were chosen with little information about what their effects on drive failure actually were, and there hasn't been an update that I know of since SMART was defined. That's one of the reasons for the new drive status reporting system in the latest SATA specs (SATA 3.1, I believe), which has a different set of parameters reported.

        And even after all that, if you ask the drive manufacturers what they should be monitoring to report drive failure reliably, if they're being honest they'll say "we don't know, but we're trying to figure that out". The best guess I've seen is the rate of block failures (and redirection) in the drives, and even that isn't that great a measure.