It's a truism in storage: consumers' average files are getting bigger, making bandwidth more important than IOPS. But new research shows that isn't true - among other interesting results.
A recent paper, A Study of Practical Deduplication (pdf) by William Bolosky of Microsoft Research and Dutch Meyer of the University of British Columbia looked at how Windows file systems have evolved in the last decade. The paper was presented at the Usenix FAST '11 conference and won the Best Paper award.
- Median file sizes aren't changing. Yes, the largest files are larger - think audio and especially HD video - but small files continue to proliferate, keeping the median file size unchanged for 30 years.
- Mean file sizes are larger. While median sizes remain the same, the mean file size has tripled in 10 years to 318 KB.
- Average file system capacities have tripled. In 2000 few Windows machines had more than 50 GB of file system capacity. The new study found an average of 194 GB.
- The variety of file types is increasing. The 10 most popular file extensions account for less than 45% of capacity vs over 50% in 2000. Files with no extension are now the most common.
- Defrag works. The researchers found fewer than 4% of all files fragmented - evidence that the background defrag built into Windows does its job.
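The median/mean split in the findings above is worth a moment's thought: a handful of huge media files can triple the mean while leaving the median untouched. A minimal sketch, with illustrative numbers of my own (not the paper's data):

```python
from statistics import mean, median

def size_stats(sizes):
    """Return (mean, median) of a list of file sizes in bytes."""
    return mean(sizes), median(sizes)

# A skewed distribution like the one the study describes: many small
# files plus a handful of huge media files.
sizes = [4_096] * 95 + [2_000_000_000] * 5  # 95 small files, 5 HD videos

m, med = size_stats(sizes)
print(f"mean = {m:,.0f} bytes, median = {med:,.0f} bytes")
# The mean is pulled up to ~100 MB by five big files,
# while the median stays at 4 KB.
```

Add or remove a couple of large files and the mean swings wildly; the median barely moves, which is why it has held steady for decades.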
The Storage Bits take
The popularity of SSDs isn't just because they're cool: the proliferation of small files - and the IOPS needed to access them - demands the fast random read performance of SSDs. Seagate is on the right track with its hybrid flash/disk drives.
While the amount of stored data isn't growing as fast as storage capacity, the tripling of file system capacity underscores the need for higher data integrity: the more data you store, the more likely our crummy file systems are to corrupt some of it.
And finally, it's good to see that background defrag - built into both Windows and Mac OS, though the latter wasn't included in the study for some reason - actually works. Sometimes problems do get solved.
Comments welcome, of course. BTW, the paper also found that simple whole-file deduplication combined with sparse file support is an effective - and much simpler - alternative to block-level deduplication.
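For the curious, the core of whole-file deduplication is simple enough to sketch: hash each file's full contents and store only one copy per unique hash. A toy in-memory version (the dict-based "store" is my own illustration, not the paper's implementation):

```python
import hashlib

def dedup_whole_files(files):
    """Whole-file dedup sketch: files with identical contents share
    one stored copy, keyed by the SHA-256 of the content."""
    store = {}   # digest -> content (one stored copy per unique file)
    index = {}   # filename -> digest (which copy each name points at)
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # keep first copy only
        index[name] = digest
    return store, index

files = {
    "a.doc": b"quarterly report",
    "b.doc": b"quarterly report",   # byte-for-byte duplicate of a.doc
    "c.doc": b"something else",
}
store, index = dedup_whole_files(files)
print(len(files), "files,", len(store), "unique copies")
# prints: 3 files, 2 unique copies
```

No chunking, no rolling hashes, no fingerprint index per block - which is exactly the simplicity argument the paper makes for whole-file dedup.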