SATA disk drives are normally connected one drive per SATA controller port. But in recent years, a device known as the port multiplier has made it possible to attach several drives to a single port.
Researchers Peng Li and David J. Lilja of the University of Minnesota, and James Hughes and John Plocher of FutureWei Technologies reported on SATA port multiplier behavior in a poster (PDF) presented at FAST '13. They conclude that port multipliers work well when the disks are working well, but not so well when a drive fails.
Inducing disk drive failure
Their first problem was figuring out how to induce a disk drive failure that would look like a normal disk drive failure. Simply disconnecting or powering down a disk drive makes it disappear too quickly to resemble a drive that fails in service.
Their solution was to remove the cover from the disk drive while it was under load. This typically resulted in the drive's failure within 3 to 4 minutes. They tested both Seagate and Western Digital hard drives, in both enterprise and consumer versions.
The researchers tested drive failures on a system running Linux with two SATA controllers. In the first testbed, there was one drive connected to each SATA controller. In the second testbed, there was one drive connected to one SATA controller and a port multiplier with two drives on the other SATA controller.
In the first setup, with no port multiplier, the failure of one drive had no impact on the other drive on the system. The test workload, the fio program, always completed.
In stark contrast, when a drive on the port multiplier failed, the second drive on the port multiplier also failed without completing the fio workload. This was true on both Seagate enterprise and Western Digital consumer drives.
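For readers who want to try something similar, here is a minimal sketch of how a concurrent fio load on several drives might be launched and checked for completion. The poster's actual fio job settings are not given here, so the job parameters (random 4k reads, direct I/O, the libaio engine, a 600-second run) and the helper names `build_fio_command`, `workload_completed` and `run_concurrently` are all assumptions made for illustration.

```python
import subprocess


def build_fio_command(device, runtime_s=600):
    """Build an fio argument list for a time-based read job on one device.

    The job settings are illustrative assumptions, not the researchers' own.
    Reads are used so the sketch does not overwrite data on the device.
    """
    job_name = "load-" + device.rsplit("/", 1)[-1]
    return [
        "fio",
        "--name=" + job_name,
        "--filename=" + device,
        "--rw=randread",
        "--bs=4k",
        "--direct=1",
        "--ioengine=libaio",
        "--time_based",
        "--runtime=" + str(runtime_s),
    ]


def workload_completed(returncode):
    """Treat a zero exit status from fio as a completed workload."""
    return returncode == 0


def run_concurrently(devices, runtime_s=600):
    """Start one fio process per device, wait for all, report completion."""
    procs = {
        dev: subprocess.Popen(build_fio_command(dev, runtime_s))
        for dev in devices
    }
    return {dev: workload_completed(p.wait()) for dev, p in procs.items()}
```

Calling `run_concurrently(["/dev/sdb", "/dev/sdc"])` with the device names of the drives under test (these two are placeholders) would return a per-drive completion flag, which is the kind of signal the two testbeds were compared on.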
The Storage Bits take
This research is not conclusive, and the authors hope to do more. Only a small number of drives on a single Linux platform were tested.
But it suggests that caution is in order. Using RAID software across a port multiplier array may result in an unrecoverable failure when a single drive fails.
It is possible, using advanced erasure coding or a high-end file system like Gluster, to use a large number of disk drives on port multipliers in such a way that even several failures will not compromise data integrity or availability. But this is not something the average SOHO user could implement.
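The reasoning can be made concrete with a small model. The sketch below assumes the worst case the poster observed: when one drive behind a port fails, every drive behind that port is lost. A directly attached drive shares its port with nothing, so only it is lost. The function names, port labels and drive names are hypothetical, and real hardware may behave less severely than this model.

```python
def drives_lost(ports, failed_drive):
    """Return every drive that goes down when failed_drive fails.

    ports maps a controller port label to the drives behind it. A port with
    several drives stands for a port multiplier, modeled here as shared fate.
    """
    for drives in ports.values():
        if failed_drive in drives:
            return set(drives)
    raise ValueError("unknown drive: " + failed_drive)


def array_survives(ports, members, failed_drive, tolerance):
    """True if the array loses no more members than its redundancy tolerates."""
    lost_members = drives_lost(ports, failed_drive) & set(members)
    return len(lost_members) <= tolerance


# Layout of the second testbed: one direct drive, two behind a port multiplier.
ports = {"ctrl0": ["sda"], "ctrl1-pm": ["sdb", "sdc"]}

# A two-way mirror (tolerates 1 loss) built entirely on the port multiplier.
mirror_on_pm = array_survives(ports, ["sdb", "sdc"], "sdb", tolerance=1)

# The same mirror split across the two controllers.
mirror_split = array_survives(ports, ["sda", "sdb"], "sdb", tolerance=1)

# A code that tolerates 2 losses, spread over all three drives.
coded = array_survives(ports, ["sda", "sdb", "sdc"], "sdb", tolerance=2)
```

Under this model the mirror built entirely on the port multiplier does not survive a single drive failure, while the mirror split across controllers and the scheme that tolerates two losses both do. The practical point is that redundancy has to be sized against the number of drives sharing a port, not against single drives.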
Because disks are marvels of engineering and precision manufacturing, many people will have a port multiplier where no drive fails for years. But when one does, it could be brutal.
This points to a larger issue in IT: We have few independent sources of underlying technology evaluation. We are all guinea pigs.
Comments welcome, as always. Have you experienced a disk failure on a port multiplier? Please share what you learned.
Update: Below is a video of a running drive being taken apart. A rough process, but how else can you create a head crash on demand?