Apologies to Paul Simon
Disk drives are marvelous devices. Especially when they go "clunk" and stop working. I'm not kidding: at least you know your data is hosed. I prefer that to the silent data corruption you don't find out about until you can't access a file or your OS starts freezing. Or a RAID rebuild fails.
Silent data corruption is common
You just don't know it. Many low-end RAID controllers don't report problems, figuring you'll never notice. If you do notice, months later, what is the chance that you'll know it was the controller's fault?
Backup is better than insurance
Insurance is designed to protect you against damaging but uncommon events. But data loss is very common. Backup isn't insurance. It is simple digital hygiene. You'll use it again and again.
What are disks made of?
Hard drives sit at the bottom of a stack of hardware and software that usually gets your data from your CPU to the disk and back. But there are a lot of places where things can go wrong.
Here's a partial list:
Media: those beautifully plated silver disks are subject to a couple of major problems:
Wear out: disks have a lot of moving parts. In a 7200 RPM drive the platters spin 120 times per second, compared with about 8 times per second for the 500 RPM of a CD drive. After a few years the motor can start to go. It may become slightly erratic, so some bits get squeezed and others get smeared.
The arm that moves the heads can move dozens of times per second. When the bearings get loose it can go off track and corrupt data on adjacent tracks.
Electrical: if the drive power supply fails your drive will shut down. But if it is slowly degrading it can create extra heat or power surges that affect already marginal components. Component failures leading to sudden death are not seen by SMART reporting, which is one reason why SMART isn't much use.
Software: drives contain small computers that run on several hundred thousand lines of code. Is that code bug free? Need you ask? Among the more common bugs - and let's not get started on the less common ones - are:
- New code that fixes a problem and accidentally breaks old code.
- Putting the right data in the wrong place.
- Phantom writes that are reported as written but, oops!, aren't.
- Cache management bugs that munge data, or return correct data to the wrong place.
- OK, this is less common, but sometimes the on-disk ECC miscorrects the data. ECC is software, right? How do you know it always works correctly? You don't.
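Since the drive won't tell you about any of these, the usual defense is to keep your own checksum and compare it later. Here is a minimal Python sketch of that idea. The helper names (file_digest, is_intact) are made up for illustration, and it assumes you record the digest somewhere other than the disk you are worried about. It also won't prove the bits on the platter are good at the moment you write them, since a read-back can be served from cache. What it catches is a file that changed when nobody changed it.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, recorded_digest):
    """True if the file still hashes to the digest recorded when it was written."""
    return file_digest(path) == recorded_digest


if __name__ == "__main__":
    # Write a scratch file and record its digest (hypothetical workflow).
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"important data" * 1000)
        path = f.name
    recorded = file_digest(path)
    print(is_intact(path, recorded))   # True

    # Simulate silent corruption: flip one bit in the middle of the file.
    with open(path, "r+b") as f:
        f.seek(5000)
        byte = f.read(1)
        f.seek(5000)
        f.write(bytes([byte[0] ^ 0x01]))
    print(is_intact(path, recorded))   # False
    os.remove(path)
```

One flipped bit out of 14,000 bytes is enough to fail the check, which is exactly the kind of damage you would otherwise find months later.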
Bus controllers: whether managing IDE, ATAPI, SATA, SSA or FC, controllers are small computers running code. Bugs in controller code have corrupted data in the past and will no doubt do so again.
RAID controllers: again, small computers running code subject to bugs, as well as all manner of electrical, connector and cable problems. One insidious problem is corruption of RAID 5 parity data. It is pretty simple to check a file by reading it and matching the metadata. Checking parity data is much more difficult, so you typically won't see parity errors until a rebuild. Then, of course, it is too late.
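To see why bad parity hides until a rebuild, here is a toy Python model of a single RAID 5 stripe. It is a sketch under simplifying assumptions, not how any real controller is implemented: three data blocks, one XOR parity block, and made-up function names (xor_blocks, rebuild, parity_is_consistent).

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_blocks, parity):
    """Reconstruct the one missing data block from the survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity])


def parity_is_consistent(data_blocks, parity):
    """What a scrub does: recompute parity from the data and compare."""
    return xor_blocks(data_blocks) == parity


data = [b"AAAA", b"BBBB", b"CCCC"]   # one stripe spread over three data disks
parity = xor_blocks(data)            # stored on a fourth disk

# Healthy parity: losing the second disk is recoverable.
assert rebuild([data[0], data[2]], parity) == data[1]

# Silently corrupt one bit of parity. Ordinary reads never touch parity,
# so every file on the array still reads back correctly.
bad_parity = bytes([parity[0] ^ 0x01]) + parity[1:]

# A rebuild with the bad parity returns wrong data and raises no error.
rebuilt = rebuild([data[0], data[2]], bad_parity)
print(rebuilt == data[1])                       # False
print(parity_is_consistent(data, bad_parity))   # False
```

The only thing that catches the problem before a disk dies is the explicit consistency check, which is why arrays that never scrub find out at the worst possible time.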
The Storage Bits take
While this list is admittedly incomplete - and fewer than 50 if you're counting - I'm hoping it will help readers understand why backing up your data is worth the time and money. Modern data storage is a miracle of mass-produced high technology, but it isn't perfect. Disks will fail. Power will surge. Bugs will surface. You can't avoid them.
What you can avoid is losing your data. If you don't already have a cheap external USB drive, go buy one and at least store your documents and email on it. You won't regret it.
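If you would rather script the copy than drag folders around, a rough Python sketch might look like the following. The function name backup_tree and the paths in the comment are hypothetical, and it assumes the destination is not inside the source folder. It copies each file, then compares the copy to the original byte for byte, so a bad copy shows up now rather than on the day you need it.

```python
import filecmp
import os
import shutil
import tempfile


def backup_tree(src_dir, dest_dir):
    """Copy every file under src_dir into dest_dir, then compare byte for byte.

    Returns the relative paths whose copy did not match the original.
    Assumes dest_dir is not located inside src_dir.
    """
    mismatched = []
    for root, _dirs, files in os.walk(src_dir):
        rel_root = os.path.relpath(root, src_dir)
        target_root = os.path.normpath(os.path.join(dest_dir, rel_root))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            shutil.copy2(src, dst)  # copies contents plus timestamps
            # shallow=False forces a comparison of the actual file contents.
            if not filecmp.cmp(src, dst, shallow=False):
                mismatched.append(os.path.normpath(os.path.join(rel_root, name)))
    return mismatched


if __name__ == "__main__":
    # Demo with scratch folders. A real run might look like (hypothetical paths):
    #   backup_tree(os.path.expanduser("~/Documents"), "/Volumes/USB_BACKUP/Documents")
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        with open(os.path.join(src, "letter.txt"), "w") as f:
            f.write("Dear backup, thank you.")
        print(backup_tree(src, dst))  # an empty list means every copy matched
```

This only verifies the copy at backup time. It says nothing about whether the original was already damaged, which is where keeping checksums comes in.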
Next: some more ways our systems lose data and what vendors can do - and I know at least one of them is doing - to protect our data from silent data corruption.
Comments welcome, of course. As I was writing this a friend called me in a panic saying "I think my hard drive is going out!"
"Good thing you have it backed up" I said. Of course, he didn't. He's out buying a USB drive this very minute.