Could drive vendors end transient drive failures?

Summary: Drive vendors have complained for years that half of all "failed" drives have no problems when tested. Maybe they need to build better drives. Here's a suggestion.


For several decades, drive vendors have defended the reliability of disks by noting that half of all returned drives worked just fine when tested. But why would customers go to the trouble of returning a perfectly good drive?

The problem is so widespread that there is a storage company — X-IO — whose products can do a factory-level drive format on in-place drives to recover from transient drive failures. As a result, X-IO guarantees their storage performance, capacity, and availability for 5 years at no extra charge.

Obviously there was, and is, a disconnect somewhere. But where?

It's in the stack

Most industry people assume that the transient failure problem lies in the hardware and software stack. Disk drives have powerful microcontrollers running hundreds of thousands of lines of code and are expected to connect to thousands of versions of operating systems and file systems.

Every system vendor maintains drive qualification groups whose sole job is to ensure that a particular version of drive firmware works reliably with that vendor's OS and I/O stack. Once qualified, the vendor will insist that all drives continue to use that particular version of the drive firmware.

Is it any wonder, the thinking goes, that this causes problems?

A different view

Yesterday I came across an excellent blog post about false disk failures that takes a different view. Its author, Robert Ober, is a processor and system architect at LSI who holds dozens of patents.

In relating the experience of a large internet datacenter, he wrote:

... about 40 percent of the time with SAS and about 50 percent of the time with SATA, the drive didn't actually fail. It just lost its marbles for a while. When they pull the drive out and put it into a test jig, everything is just fine. And more interesting, when they put the drive back into service, it is no more statistically likely to fail again than any other drive in the datacenter. Why?

He went on to relate his experience working on engine controllers, which is, he said:

... a very paranoid business. If something goes wrong and someone crashes, you have a lawsuit on your hands. If a controller needs a recall, that's millions of units to replace, with a multi-hundred dollar module, and hundreds of dollars in labor for each one replaced...

So we designed very carefully to handle soft errors in memory and registers. We incorporated ECC like servers use, background code checksums and scrubbing, and all sorts of proprietary techniques, including watchdogs and super-fast self-resets that could get [the controller] operational again in less than a full revolution of the engine.

Disk drives don't include such protections. But maybe they should.
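The protections Ober describes — background checksums, scrubbing, and fast self-resets — can be illustrated with a toy sketch. This is hypothetical Python, not actual drive or controller firmware; the class name and structure are invented purely for illustration:

```python
import zlib

class ScrubbedState:
    """Toy model of firmware state guarded by a checksum, mimicking the
    background checksums and scrubbing Ober describes for engine controllers."""

    def __init__(self, values):
        self.values = list(values)
        self.checksum = zlib.crc32(bytes(self.values))

    def scrub(self):
        """Background pass: detect soft errors by re-checking the CRC."""
        return zlib.crc32(bytes(self.values)) == self.checksum

    def reset(self, known_good):
        """Fast self-reset: reload state from a known-good copy."""
        self.values = list(known_good)
        self.checksum = zlib.crc32(bytes(self.values))

golden = [1, 2, 3, 4]          # known-good state kept in protected storage
state = ScrubbedState(golden)

state.values[2] = 99           # simulate a soft error (a flipped register)
if not state.scrub():          # the scrubber notices the corruption...
    state.reset(golden)        # ...and a watchdog-style reset restores it

assert state.scrub() and state.values == golden
```

A real controller would do this in hardware and interrupt handlers, fast enough (per Ober) to recover within one engine revolution; the point here is only the detect-then-restore pattern.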

The Storage Bits take

The cost of transient drive failures is huge for both drive vendors and customers. Pulling and returning a "failed" drive is expensive in technician time, money, and lost compute capacity. I'm sure vendors don't like the hassle either.

If a vendor could reduce their transient failure rate from 50 percent to 5 percent — still way worse than engine controllers — they could save themselves and their customers millions of dollars every year. The competitive advantage would be huge.
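To put rough numbers on that claim, here is a back-of-envelope sketch. Apart from the 50-to-5-percent rates from the article, every figure below is an assumed placeholder, not vendor data:

```python
# All quantities marked "assumed" are illustrative placeholders.
drives_shipped = 100_000_000      # annual units for a large vendor (assumed)
annual_return_rate = 0.02         # fraction of shipped drives returned (assumed)
transient_share_now = 0.50        # from the article: ~half of returns test fine
transient_share_goal = 0.05       # the hypothetical improved rate
cost_per_return = 25.0            # handling + shipping per RMA, USD (assumed)

returns = drives_shipped * annual_return_rate
avoided = returns * (transient_share_now - transient_share_goal)
savings = avoided * cost_per_return
print(f"Avoided returns: {avoided:,.0f}; savings: ${savings:,.0f}")
```

Even with these modest per-unit costs the avoided returns run well into the hundreds of thousands, and the savings into the tens of millions of dollars a year, consistent with the article's "millions of dollars" claim.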

So why don't they? Do they not know any better? Or is the task of rewriting all that code to incorporate failsafe technologies just too daunting?

If I were running Seagate or WD, I'd want to know what the engineers could do to make transient failures much more rare. And while it would be a 3-5 year slog to make it happen, I'd do it.

Storage users deserve the best vendors can do. And right now, it doesn't look like we're getting it.

Comments are welcome, as always.

Robin Harris is currently doing some work with LSI and Robert Ober, which led to Harris reading Ober's post. Harris also previously worked with the CTO of X-IO 20 years ago.



  • Good article

    As someone who spent last night rebuilding a system with a failed drive, you've got my vote! It's funny how we don't get the same tools anymore either. What happened to the low level format tools? Most of us would be just as happy with the drive once the bad sectors are marked.
  • Probably a financial balancing act

    Regarding "why don't they do better": it's probably a financial balancing act. Do they keep up with competitors putting more and more data in the same physical space, or do they spend the money and effort on something that by its very nature is almost impossible to test? Since these failures are intermittent and not repeatable, how can they be certain a host of new precautions is in fact making any difference?

    And which is more likely to impress customers? "This drive holds double what a 3-year-old drive could hold," or "There's literally a one-in-a-trillion or lower chance of a non-repeatable momentary failure on our drives, and we improved that to one-in-a-quadrillion. Of course, to store the amount of data you want you'll have to buy TWO of our drives rather than one of our competitor's, since we were concerned with improving reliability rather than storage density."
    • I think you nailed it.

      Writing fault-tolerant code is tough. I mean really tough. I have been doing avionics Level A safety-critical code for 15+ years, and I would say 70% of our logic and debugging effort goes into fault logic. Often just trying to create the fault will take days of setup. This leads to fault code that gets significantly less run time, which carries its own risks.

      Great write up.
    • Definitely financial, but they don't want to pay for their mistakes

      I've worked in the Windows kernel as a consultant for close to 20 years. In that time I have seen the major driver and controller vendors ignore problems that cause transients. My favorite was a major vendor's VP who sent out an email after developers claimed that a drive's microcontroller had a bug in its code that caused the Windows driver stack to crash. The email said: "This would cost us money to fix, and besides, it is Windows; we will just blame it on Microsoft's code." What makes it really funny is that the bozo did not know the email system well, so he sent it to a few hundred people outside the company. What makes it sad is that years later, with changes in drive capacity and so on, the bug is still not fixed!

      Most of the big drive manufacturers do not care until their reputation is really hurt.
  • In the early years...

    We have been relying on fault-tolerant data processing for decades.
    I wrote format utilities in the early PC years. Back then the drives from IBM had lots of errors, partly from floating electrical grounds and other noise sources. The IBM format program would not count a formatted sector as bad until more than 11 bits failed when reading back and verifying the sector. Mind you, 11 bits sounds like a lot, and it is, but built-in error-correcting algorithms can fix them.

    The drives my company was using did not have the noise errors and rarely, if ever, had more than 3 bits bad in any sector across an entire drive. If I set the format utility to mark a sector as bad when it had more than 3 bits in error, over 99% of our drives were error free, while IBM's were under 75%. But if I set the threshold at 11 bits, those IBM drives suddenly went over 99% as well.
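The threshold check this commenter describes can be sketched in a few lines. This is hypothetical Python, not the original utility; the 11-bit and 3-bit limits come from the comment above, and the function names are invented:

```python
def bit_errors(written: bytes, read_back: bytes) -> int:
    """Count how many bits differ between the pattern written and what was read."""
    return sum(bin(a ^ b).count("1") for a, b in zip(written, read_back))

def sector_ok(written: bytes, read_back: bytes, threshold: int = 11) -> bool:
    # IBM's format program tolerated up to 11 bad bits per sector (per the
    # comment); a stricter utility might use threshold=3.
    return bit_errors(written, read_back) <= threshold

pattern = bytes([0xA5] * 512)                  # test pattern for one 512-byte sector
noisy = bytearray(pattern)
noisy[0] ^= 0x07                               # flip 3 bits in the first byte

assert sector_ok(pattern, noisy, threshold=3)            # passes even the strict check
assert not sector_ok(pattern, bytes(512), threshold=11)  # hopeless sector fails
```

The design choice the commenter highlights is exactly the `threshold` parameter: set it at the limit of what ECC can correct and nearly everything "passes"; set it tighter and marginal media is caught at format time.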
  • If everyone moved off of Windows, all storage failures would cease

    "How Microsoft puts your data at risk"

    Sorry Robin but you permanently tainted your reputation when you wrote that piece. To your credit, you attempted as best you could to be more factual in the body of your blog but even there you messed it up:

    "Microsoft's NTFS (used in XP & Vista) with its de facto monopoly is the worst offender."

    Got it, so NTFS is the worst.

    "But Apple and Linux aren't any better."

    Then why didn't you say that HFS and ext3 / ext4 are the worst? After all, if they aren't any better than the worst, they are also the worst.

    Oh right, because you show your extreme bias.