Before tackling my Thanksgiving Day's supper (which came at my nephew's apartment in the form of a steak this year instead of a turkey), I read through an article in the November ACM Queue about potential data reliability issues with high-capacity hard disks. The concerns over these new HDDs — certain to be a hit with storage-hungry, content-creating Mac users — made me rethink my backup and archive strategy as well as what kind of storage I will need for Leopard's Time Machine snapshots.
The article in question, Hard Disk Drives: The Good, The Bad and The Ugly, is a technical article by Jon Elerath, a manager of reliability engineering at enterprise storage company Network Appliance. It discusses a range of reliability issues inherent in drives using super-low-flying read/write heads and perpendicular magnetic recording technology to increase capacities. These technologies are now in use on desktop drives with a capacity of 1TB and in the highest-capacity notebook drive on the market.
The new technology required to achieve these capacities is not without concern. Are the failure mechanisms or the probability of failure any different from predecessors? Not only are there new issues to address stemming from the new technologies, but also failure mechanisms and modes vary by manufacturer, capacity, interface, and production lot.
Of course, this article is aimed at storage professionals and managers of RAID systems and servers. That market segment increasingly includes Mac customers. However, the lure of high-capacity storage is felt by a much larger group of Mac users, whose business and personal workflows handle large content files and valuable content.
The article starts out with an interesting decision tree on potential read problems with HDDs. Some of these can be "operational," with an electrical cause or a misalignment between the head assembly and the data on the disk surface. The other branch of the tree concerns "latent" failures, which I found the most troubling.
Failures in which the data itself is still good and uncorrupted, such as those caused by impaired electrical, mechanical, or magnetic function, can be more easily detected and accommodated. But the next generation of high-capacity mechanisms could be more susceptible to data corruption. Worse, data corruption is the leading problem with HDDs, according to research described in the report.
Elerath says: "Hard-disk drives don't just fail catastrophically. They may also silently corrupt data."
Part of the problem is one of scale, by my reading. Hard disk technology is really a miracle of manufacturing and engineering. However, issues that weren't much of a problem with lower-capacity media and previous read/write head technologies may crop up when way more data is packed into the same small space. According to the paper, a complex mix of factors can increase the chance of "latent" defects.
Latent defects are the most insidious kinds of errors. These data corruptions are present on the HDD but undiscovered until the data is read.
Of course, drives have algorithms in firmware to attempt to recover missing and corrupted data. The mechanism looks "off-track" or around the place it thinks the data should be. In a RAID system, the controller reconstructs the missing data using the parity information. If you don't have redundancy (like most of us), then you can only hope that the drive can recover the data.
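To make the parity idea concrete, here is a minimal sketch of XOR parity, the general technique behind parity-based RAID levels. It is not how any particular controller is implemented; the block contents and the xor_blocks helper are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks, one per drive (illustrative contents only).
data = [b"MAC!", b"DISK", b"RAID"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# Simulate an unreadable block on the second drive: XOR-ing the surviving
# data blocks with the parity block gives back the missing one.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

The rebuild works because XOR-ing a value with itself cancels out, so everything except the missing block drops away. It also shows the limit of the scheme: the controller has to know which block is bad before it can rebuild it.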
Depending on the size of the media defect, this may be a few blocks or hundreds of blocks. As the areal density of HDDs increases, the same physical size of defect will affect more blocks or tracks and require more time for re-creation of data. One tradeoff is the amount of time spent recovering corrupted data. A desktop HDD (most ATA drives) is optimized to find the data no matter how long it takes. In a desktop there is no redundancy and it is (correctly) assumed that the user would rather wait 60 seconds and eventually retrieve the data than have the HDD give up and lose data.
But will the OS or the application or even the user wait 60 seconds? Most applications will time out in that span. And just because the drive tries to find the data doesn't mean that it will eventually find it, which is why servers use RAID.
So, what does all this mean for Mac users? Here's my first take:
RAID Level 1. To protect against data rot, a single drive isn't good enough anymore. I suggest that users consider a mirrored array for a Time Machine drive. The hope would be that at least one of the drives will store your data correctly and then retrieve it. This is the most expensive choice for data; however, it is a simple and effective solution.
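One caveat about the mirror: if the two copies ever disagree, the mirror alone cannot say which one is right. A checksum recorded when the data was written can settle it. The sketch below illustrates that idea only; it is an assumption on my part, not a description of how Time Machine or any RAID hardware behaves, and pick_good_copy and the sample data are hypothetical.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def pick_good_copy(copy_a: bytes, copy_b: bytes, recorded_digest: str):
    """Return the first copy whose digest matches the recorded one, else None."""
    for copy in (copy_a, copy_b):
        if sha256_hex(copy) == recorded_digest:
            return copy
    return None

original = b"family photos, 2007"
recorded = sha256_hex(original)   # digest saved when the backup was written

rotted = b"family photos, 2oo7"   # simulated silent corruption on one drive
recovered = pick_good_copy(rotted, original, recorded)
```

If neither copy matches the recorded digest, the function returns None, which is the honest answer: both copies have rotted and it is time to reach for another backup.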
Coddle your drives. With more data packed onto a platter, modern hard drives are more susceptible to physical problems. Make sure that you have a good, cushioned case for your notebook.
I'm always amazed that people will spend $1K or $2K on a notebook and then brag about the cheap bag they have for it. And I don't put my notebook in an overhead bin anymore after a guy yanked his bag out and pulled my briefcase along with it.
This goes for desktop drives as well. Please treat 3.5-inch drives with respect and handle them with care — they are more fragile than notebook drives.