Most disk drives include a feature named SMART - Self-Monitoring, Analysis and Reporting Technology - intended to tell you if your drive is dying. Can you rely on it? Sadly, no. Here's why.
What is SMART?
SMART is a protocol for passing information from a disk drive to the CPU. The protocol is part of the ATA and SCSI standards and is based on work by IBM, Seagate and others done in the '90s. The vendors generously placed their work in the public domain.
The protocol mandates a consistent structure for presenting drive data, but the data that gets measured is up to the drive vendor. Typically, SMART will present information on
- head flying height
- spin-up time
- bad block count
- seek time
- drive calibration retries
and more. SMART looks at the trends in these and other measures to determine if the drive is headed for failure.
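To make that concrete, here is a minimal Python sketch of how a threshold check over SMART-style attributes might look. It assumes the common ATA convention that each attribute carries a normalized value (higher is healthier) and a vendor-set threshold, and that an attribute counts as failed once its value drops to or below that threshold. The class, the function and the sample numbers are all invented for illustration; no real drive firmware is being quoted.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    """One SMART-style attribute (illustrative layout, not a vendor's)."""
    attr_id: int
    name: str
    value: int      # normalized current value; higher means healthier
    worst: int      # lowest normalized value seen so far
    threshold: int  # vendor-chosen failure threshold

    def failed(self) -> bool:
        # ATA convention: at or below the threshold counts as failed.
        return self.value <= self.threshold


def drive_status(attrs):
    """Return ("FAILING", [names]) if any attribute tripped, else ("PASSED", [])."""
    tripped = [a.name for a in attrs if a.failed()]
    return ("FAILING", tripped) if tripped else ("PASSED", [])


# Hypothetical readings for a single drive.
sample = [
    SmartAttribute(3, "Spin_Up_Time", value=95, worst=90, threshold=21),
    SmartAttribute(5, "Reallocated_Sector_Ct", value=30, worst=30, threshold=36),
    SmartAttribute(7, "Seek_Error_Rate", value=80, worst=75, threshold=30),
]

print(drive_status(sample))
```

Note that everything interesting hides in the `threshold` column, and that column is the vendor's call.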
Does SMART work?
According to Google's review of more than 100,000 drives, the answer is a qualified no. They found that enough drives failed without any SMART warning to make SMART useless for predicting the life of an individual drive. But they also found that if SMART said there was a problem, the drive was much more likely to fail.
So if SMART says you have a problem, you probably do. But if SMART says you don't have a problem, you can't trust it.
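The asymmetry is easier to see with numbers. The sketch below uses made-up counts (they are NOT Google's figures) to show how a warning can be a strong signal while silence remains a weak one. The function name and every count are assumptions chosen only to illustrate the arithmetic.

```python
def warning_stats(failed_warned, failed_silent, healthy_warned, healthy_silent):
    """Summarize how informative a SMART warning is for a drive population."""
    warned = failed_warned + healthy_warned
    silent = failed_silent + healthy_silent
    failed = failed_warned + failed_silent
    return {
        # Chance a drive fails, given SMART raised a warning.
        "p_fail_given_warning": failed_warned / warned,
        # Chance a drive fails, given SMART stayed quiet.
        "p_fail_given_silent": failed_silent / silent,
        # Share of all failures that arrived with no warning at all.
        "share_of_failures_unwarned": failed_silent / failed,
    }


# Hypothetical population of 1,000 drives, 50 of which fail.
stats = warning_stats(
    failed_warned=25, failed_silent=25,
    healthy_warned=10, healthy_silent=940,
)

for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

In this invented population a warned drive is far more likely to fail than a quiet one, yet half of the failures still come with no warning. That is the pattern described above: trust the alarm, not the silence.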
Why doesn't SMART work better?
Drives are complex pieces of equipment with many failure modes. The drive vendors decide which parameters are measured and where the failure thresholds sit. Since roughly 40% of the drives returned to vendors are NTF - No Trouble Found - vendors set the thresholds to ignore piddling errors. Tighter thresholds might catch more failing drives, but only at the cost of even more NTF returns.
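The threshold trade-off can be sketched in a few lines of Python. The drive list below is invented: each entry pairs an error count with whether the drive is really on its way out. One failing drive shows zero errors, standing in for failures that no counter sees coming. The function name `evaluate` and the rule "flag a drive when its error count reaches the threshold" are assumptions for illustration, not any vendor's actual policy.

```python
# (error_count, actually_failing) for eight hypothetical drives.
DRIVES = [
    (0, False), (1, False), (2, False), (5, False),
    (3, True), (8, True), (13, True),
    (0, True),  # fails with no measurable warning signs
]


def evaluate(threshold, drives=DRIVES):
    """Return (caught, false_alarms, missed) for a given alarm threshold."""
    flagged = [bad for errors, bad in drives if errors >= threshold]
    caught = sum(1 for bad in flagged if bad)
    false_alarms = len(flagged) - caught  # healthy drives sent back: NTF returns
    missed = sum(1 for _, bad in drives if bad) - caught
    return caught, false_alarms, missed


for threshold in (1, 2, 8):
    caught, false_alarms, missed = evaluate(threshold)
    print(f"threshold={threshold}: caught={caught} "
          f"false_alarms={false_alarms} missed={missed}")
```

With this toy data, a strict threshold of 8 produces no false alarms but misses two failing drives. Loosening it to 1 catches one more, at the price of three healthy drives being flagged. No setting catches the drive that dies with a clean record.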
Another issue is that many drive failures can't be predicted. SMART mostly looks at mechanical trends, but disk drives are also electronic. A cracked capacitor, power surge or interface failure can kill a drive even if the data is still safely on the disk.
Finally, there are problems in storing and interpreting the SMART data. The data is stored in a small amount of RAM, so a) the drive starts from scratch each time it is powered on and b) trends may be missed if the RAM fills up and is purged partway through an event.
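Here is a toy model of that storage problem, assuming a small volatile log that is wiped when it fills up or when the drive is power-cycled. The `VolatileLog` class, its capacity and the retry counts are all hypothetical; real firmware will differ. The point is only to show how a purge in mid-event shrinks the trend the drive can see.

```python
class VolatileLog:
    """A tiny RAM-like log: purged when full and on every power cycle."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = []

    def record(self, reading):
        if len(self.samples) >= self.capacity:
            self.samples.clear()  # purge: the older history is gone
        self.samples.append(reading)

    def power_cycle(self):
        self.samples.clear()      # nothing survives a restart


def rise(samples):
    """Crude trend measure: last reading minus first reading."""
    return samples[-1] - samples[0] if len(samples) >= 2 else 0


# Hypothetical calibration-retry counts climbing during one event.
readings = [0, 1, 2, 4, 7, 11, 16, 22]

log = VolatileLog(capacity=5)
for r in readings:
    log.record(r)

print("true rise:   ", rise(readings))
print("visible rise:", rise(log.samples), "from", log.samples)

log.power_cycle()
print("after power cycle:", log.samples)
```

In this run the log is purged when the sixth reading arrives, so only the last three samples survive and the visible rise is half the true one.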
The Storage Bits take
The drive vendors are doing the best they can while creating larger drives with higher data rates. The intentions behind SMART are good, but its limitations mean that a "good" drive can go "bad" without warning.
The really smart answer: only regular backups can protect your data from sudden drive failure. Accept no substitutes.
Comments welcome, of course.