Backblaze, the cloud backup provider, pulled almost 5,000 3TB drives out of service after the annual failure rate reached a startling 47 percent. Which drive? Why? And what can we learn from this debacle?
The 2011 Thai floods devastated global hard drive production. An estimated 25-50 percent of global HDD production was lost while workers shoveled mud out of their clean rooms and factories.
Meanwhile, storage demand continued unabated, especially at Backblaze, as they report in a blog post published today. They embarked on an epic quest to secure enough drives, eventually buying thousands of external and internal drives to meet their rapid growth.
A 3TB Seagate drive - with an identical model number (ST3000DM001) for internal and external versions - was available and economical. Beginning in November 2011 Backblaze bought almost 5,000 - about 15PB worth - roughly half internal and half external.
At first the drives worked well, but by mid-2013 the annual failure rate started rising. Backblaze replaced failed drives and started finding that additional drives in the same storage pod failed during the rebuild.
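For readers who want to see how an annual failure rate is worked out, here is a minimal sketch. It assumes the common drive-days convention: failures divided by drive-years of service, expressed as a percentage. The function name and the sample numbers are hypothetical, chosen only to show the arithmetic. They are not Backblaze's data.

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """Return the annual failure rate as a percentage.

    Assumes the drive-days convention: failures per drive-year of service.
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return 100 * failures / drive_years


# Hypothetical example: 1,000 drives each in service for 90 days, 25 failures.
example_afr = annualized_failure_rate(failures=25, drive_days=1_000 * 90)
print(f"{example_afr:.1f} percent")
```

Counting drive-days rather than drives keeps the rate honest when drives are added or retired partway through the period, which is how a fleet like this one grows.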
After three rebuild failures, Backblaze started removing and testing the remaining drives in the failed pods and found that roughly 75 percent failed the Backblaze tests. Those drives were not put back in service.
The Seagate drive purchases saved Backblaze over a million dollars in storage costs at a time when drive prices reached historic highs. But there was an unexpected price to pay: by the end of March 2015, 32 percent of the deployed drives had failed and even more had been removed from service after failing external testing.
As good engineers do, Backblaze attempted a root cause analysis. Were the v2.0 storage pods the cause? The shucking of external drives? Or the drive itself?
But other drive models fared much better in the same v2.0 enclosures. Both internal and external versions of the Seagate drive failed at similar rates, while other vendors' external drives did not fail at a higher rate.
Which leaves the drive itself as the likely culprit.
The Storage Bits take
What went wrong? The 2011 floods stressed the entire drive supply chain, not just final assembly. It may be that in an effort to meet demand, Seagate cut corners on incoming component QA. Or perhaps the final assembly process was compromised.
Whatever the problem, it seems to be solved. Backblaze found that Seagate 4TB drives are fine, with a 2.6 percent annual failure rate (AFR) as of the end of 2014, across more than 17,000 4TB Seagate drives purchased.
As the owner of a Seagate 3TB external drive myself, am I concerned? No.
Why? Because after decades of experience in the storage industry I know that any single device can fail at any time and that all - excepting nearly immortal M-discs - will eventually fail.
All the data on my 3TB Seagate is backed up - just as all your data should be. While we can't eliminate all data loss - see The universe hates your data - regular backups, including cloud backup services such as Backblaze and CrashPlan, can dramatically reduce your risk.
Hard disk drives are absolute marvels of engineering and mass production. While they aren't perfect - and neither are SSDs - they've made our always-on digital culture possible.
Despite the Seagate 3TB drive debacle, I, Backblaze, and you can confidently buy current drives. Just remember: all persistent data needs backup. If you back yours up, you'll be fine.