For several decades, drive vendors have defended the reliability of disks by noting that half of all returned drives worked just fine when tested. But why would customers go to the trouble of returning a perfectly good drive?
The problem is so widespread that there is a storage company, X-IO, whose products can do a factory-level format on drives in place to recover from transient drive failures. As a result, X-IO guarantees their storage performance, capacity, and availability for 5 years at no extra charge.
Obviously there was, and is, a disconnect somewhere. But where?
It's in the stack
Most industry people assume that the transient failure problem lies in the hardware and software stack. Disk drives have powerful microcontrollers running hundreds of thousands of lines of code and are expected to connect to thousands of versions of operating systems and file systems.
Every system vendor maintains drive qualification groups whose sole job is to ensure that a particular version of drive firmware works reliably with that vendor's OS and I/O stack. Once qualified, the vendor will insist that all drives continue to use that particular version of the drive firmware.
Is it any wonder, the thinking goes, that this causes problems?
A different view
Yesterday, I came across an excellent blog post on false disk failures by a senior LSI technologist, Robert Ober, that takes a different view. Ober is a processor and system architect at LSI and holds dozens of patents.
In relating the experience of a large internet datacenter, he wrote:
... about 40 percent of the time with SAS and about 50 percent of the time with SATA, the drive didn't actually fail. It just lost its marbles for a while. When they pull the drive out and put it into a test jig, everything is just fine. And more interesting, when they put the drive back into service, it is no more statistically likely to fail again than any other drive in the datacenter. Why?
He went on to relate his experience working on engine controllers, which is, he said:
... a very paranoid business. If something goes wrong and someone crashes, you have a lawsuit on your hands. If a controller needs a recall, that's millions of units to replace, with a multi-hundred dollar module, and hundreds of dollars in labor for each one replaced...
So we designed very carefully to handle soft errors in memory and registers. We incorporated ECC like servers use, background code checksums and scrubbing, and all sorts of proprietary techniques, including watchdogs and super-fast self-resets that could get [the controller] operational again in less than a full revolution of the engine.
Disk drives don't include such protections. But maybe they should.
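For readers who want a concrete picture of what Ober's "background code checksums and scrubbing" might look like, here is a minimal Python sketch. It is not drive or engine-controller firmware: the `ScrubbedRegion` class, its fields, and the choice of CRC-32 are all assumptions made for illustration. The idea is simply to keep a checksum of a known-good image, check the working copy against it periodically, and restore the working copy when a soft error shows up.

```python
import zlib


class ScrubbedRegion:
    """Hypothetical stand-in for a firmware region guarded by a checksum."""

    def __init__(self, golden: bytes):
        # Known-good copy, e.g. what real hardware might keep in ROM or flash.
        self._golden = bytes(golden)
        self._expected_crc = zlib.crc32(self._golden)
        # Working copy that a soft error (bit flip) could corrupt.
        self.live = bytearray(golden)
        # Count of restores, useful for spotting a part that is really failing.
        self.restores = 0

    def scrub(self) -> bool:
        """Return True if the live copy was intact, False if it was restored."""
        if zlib.crc32(self.live) == self._expected_crc:
            return True
        # Checksum mismatch: reload the known-good image instead of giving up.
        self.live[:] = self._golden
        self.restores += 1
        return False


region = ScrubbedRegion(b"firmware image bytes")
region.live[3] ^= 0x10           # simulate a single flipped bit
first_pass = region.scrub()      # detects the corruption and restores
second_pass = region.scrub()     # live copy is clean again
```

In a real controller a pass like `scrub()` would presumably run in the background on a timer, alongside the watchdogs and fast self-resets Ober mentions. The sketch shows only the detect-and-restore step.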
The Storage Bits take
The cost of transient drive failures is huge for both drive vendors and customers. Pulling out and returning a "failed" drive is expensive in time, money, and lost compute time. I'm sure vendors don't like the hassle either.
If a vendor could reduce their transient failure rate from 50 percent to 5 percent — still way worse than engine controllers — they could save themselves and their customers millions of dollars every year. The competitive advantage would be huge.
So why don't they? Do they not know any better? Or is the task of rewriting all that code to incorporate failsafe technologies just too daunting?
If I were running Seagate or WD, I'd want to know what the engineers could do to make transient failures much rarer. And while it would be a 3-5 year slog to make it happen, I'd do it.
Storage users deserve the best vendors can do. And right now, it doesn't look like we're getting it.
Comments are welcome, as always.
Robin Harris is currently doing some work with LSI and Robert Ober, which led to Harris reading Ober's post. Harris also previously worked with the CTO of X-IO 20 years ago.