Could drive vendors end transient drive failures?
Summary: Drive vendors have complained for years that half of all "failed" drives have no problems when tested. Maybe they need to build better drives. Here's a suggestion.
For several decades, drive vendors have defended the reliability of disks by noting that half of all returned drives worked just fine when tested. But why would customers go to the trouble of returning a perfectly good drive?
The problem is so widespread that there is a storage company — X-IO — whose products can do a factory-level drive format on in-place drives to recover from transient drive failures. As a result, X-IO guarantees their storage performance, capacity, and availability for 5 years at no extra charge.
Obviously there was, and is, a disconnect somewhere. But where?
It's in the stack
Most industry people assume that the transient failure problem lies in the hardware and software stack. Disk drives have powerful microcontrollers running hundreds of thousands of lines of code and are expected to connect to thousands of versions of operating systems and file systems.
Every system vendor maintains drive qualification groups whose sole job is to ensure that a particular version of drive firmware works reliably with that vendor's OS and I/O stack. Once qualified, the vendor will insist that all drives continue to use that particular version of the drive firmware.
Is it any wonder, the thinking goes, that this causes problems?
A different view
Yesterday, I came across an excellent blog post about false disk failures that takes a different view, written by Robert Ober, a senior LSI technologist. Ober is a processor and system architect at LSI and holds dozens of patents.
In relating the experience of a large internet datacenter, he wrote:
... about 40 percent of the time with SAS and about 50 percent of the time with SATA, the drive didn't actually fail. It just lost its marbles for a while. When they pull the drive out and put it into a test jig, everything is just fine. And more interesting, when they put the drive back into service, it is no more statistically likely to fail again than any other drive in the datacenter. Why?
He went on to relate his experience working on engine controllers, a business which is, he said:
... a very paranoid business. If something goes wrong and someone crashes, you have a lawsuit on your hands. If a controller needs a recall, that's millions of units to replace, with a multi-hundred dollar module, and hundreds of dollars in labor for each one replaced...
So we designed very carefully to handle soft errors in memory and registers. We incorporated ECC like servers use, background code checksums and scrubbing, and all sorts of proprietary techniques, including watchdogs and super-fast self-resets that could get [the controller] operational again in less than a full revolution of the engine.
Disk drives don't include such protections. But maybe they should.
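To make the idea concrete, here is a minimal sketch in C of two of the techniques Ober names: a background checksum scrub over critical state, plus a watchdog-style fast self-reset that rebuilds that state and keeps running instead of declaring the device dead. This is my illustration under those assumptions, not any vendor's actual drive or engine-controller firmware; the state layout and the simulated bit flip are invented for the example.

/* Minimal sketch (illustration only, not real firmware) of a background
 * checksum scrub plus a fast self-reset over critical controller state. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define STATE_WORDS 64

static uint32_t critical_state[STATE_WORDS]; /* registers/config we protect */
static uint32_t good_checksum;               /* last known-good checksum    */

/* Simple additive checksum; real firmware would use ECC or a CRC. */
static uint32_t checksum(const uint32_t *buf, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* "Super-fast self-reset": rebuild state from defaults and keep running. */
static void reinit_state(void)
{
    memset(critical_state, 0, sizeof critical_state);
    good_checksum = checksum(critical_state, STATE_WORDS);
}

/* Background scrub: detect a soft error and recover in place. */
static void background_scrub(void)
{
    if (checksum(critical_state, STATE_WORDS) != good_checksum) {
        puts("scrub: soft error detected, self-resetting");
        reinit_state();
    }
}

int main(void)
{
    reinit_state();
    for (int tick = 0; tick < 5; tick++) {
        if (tick == 2)
            critical_state[7] ^= 0x4; /* simulate a single-bit flip */
        background_scrub();           /* would run from a timer or idle loop */
    }
    puts("still operational");
    return 0;
}

The point of the sketch is the recovery path: a corrupted register bank is detected and repaired in microseconds, so the device never has to present itself to the host as failed.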
The Storage Bits take
The cost of transient drive failures is huge for both drive vendors and customers. Pulling and returning a "failed" drive wastes time, money, and compute capacity. I'm sure vendors don't like the hassle either.
If a vendor could reduce their transient failure rate from 50 percent to 5 percent — still way worse than engine controllers — they could save themselves and their customers millions of dollars every year. The competitive advantage would be huge.
So why don't they? Do they not know any better? Or is the task of rewriting all that code to incorporate failsafe technologies just too daunting?
If I were running Seagate or WD, I'd want to know what my engineers could do to make transient failures much rarer. And while it would be a 3-5 year slog to make it happen, I'd do it.
Storage users deserve the best vendors can do. And right now, it doesn't look like we're getting it.
Comments are welcome, as always.
Robin Harris is currently doing some work with LSI and Robert Ober, which led to Harris reading Ober's post. Harris also previously worked with the CTO of X-IO 20 years ago.
Talkback
Good article
Probably a financial balancing act
And which is more likely to impress customers? "This drive holds double what a drive 3 years old could hold," or "There's literally a one-in-a-trillion or lower chance of a non-repeatable momentary failure on our drives, and we improved that to one-in-a-quadrillion. Of course, to store the amount of data you want to store you'll have to buy TWO of our drives rather than one of our competitors', since we were concerned with improving reliability rather than storage density."
I think you nailed it.
Great write up.
Definitely financial, but they don't want to pay for their mistakes
Most of the big drive manufacturers do not care until their reputation is really hurt.
In the early years...
I wrote format utilities in the early PC years. Back then the drives from IBM had lots of errors, partly from floating electrical grounds and other noise sources. The IBM format program would not count a formatted sector as being in error until more than 11 bits were bad when reading/verifying the sector. Mind you, 11 bits sounds like a lot, and it is, but there are built-in error-correcting algorithms to fix them. The drives my company was using did not have the noise errors, and rarely, if ever, did an entire drive have more than 3 bad bits in any one sector. If I set the format utility to mark a sector as bad when it had more than 3 bits in error, over 99% of our drives were error-free; IBM's drives were under 75%. But if I set the threshold at 11 bits, those IBM drives suddenly went over 99% as well.
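For illustration, here is a minimal sketch in C of the thresholding this reader describes: count the bit errors seen when a sector's test pattern is read back, and mark the sector bad only when the count exceeds what the ECC could be expected to correct. The per-sector error counts are invented for the example; this is not the original format utility.

/* Minimal sketch (invented data, not the original utility): mark a sector
 * bad only when its bit-error count exceeds a chosen threshold. */
#include <stdio.h>

int main(void)
{
    /* Hypothetical bit-error counts from verifying six formatted sectors. */
    int bad_bits[] = { 0, 2, 5, 0, 12, 3 };
    int nsectors = sizeof bad_bits / sizeof bad_bits[0];
    int thresholds[] = { 3, 11 };   /* the two policies discussed above */

    for (int t = 0; t < 2; t++) {
        int marked = 0;
        for (int i = 0; i < nsectors; i++)
            if (bad_bits[i] > thresholds[t]) /* mark bad only past threshold */
                marked++;
        printf("threshold %2d bits: %d of %d sectors marked bad\n",
               thresholds[t], marked, nsectors);
    }
    return 0;
}

Run against the same verify data, the 3-bit policy flags more sectors than the 11-bit policy, which is exactly why the same physical drives looked far worse or far better depending on where the threshold was set.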
If everyone moved off of Windows, all storage failures would cease
"How Microsoft puts your data at risk"
Sorry, Robin, but you permanently tainted your reputation when you wrote that piece. To your credit, you attempted as best you could to be more factual in the body of your blog, but even there you messed it up:
"Microsoft's NTFS (used in XP & Vista) with its de facto monopoly is the worst offender."
Got it, so NTFS is the worst.
"But Apple and Linux aren't any better."
Then why didn't you say that HFS and ext3 / ext4 are the worst? After all, if they aren't any better than the worst, they are also the worst.
Oh right, because you show your extreme bias.