Facebook's SSD findings: Failure, fatigue and the data center

SSDs revolutionized data storage, even though we know little about how well they work. Now researchers at Facebook and Carnegie Mellon share millions of hours of SSD experience.
Written by Robin Harris, Contributor

Millions of SSDs are bought every year. It's easy to be impressed by fast boots and app starts. But what about 24/7 data center operations? What are the common problems that admins should be concerned about?

Those are the questions that A Large-Scale Study of Flash Memory Failures in the Field, by Justin Meza and Onur Mutlu of Carnegie Mellon University and Qiang Wu and Sanjeev Kumar of Facebook, helps answer.

Basic methodology

Facebook was an early adopter of SSDs. For years it was the biggest customer of Fusion-io, the pioneering PCIe SSD developer, so its SSD experience runs deeper than most: the study covers millions of device-days.

Unfortunately, the study doesn't break results out by vendor. Instead, SSDs are classified by age, which means the oldest are roughly first gen devices while the newer ones are second gen.

More important is the team's definition of failure: an uncorrectable read error (URE) leading to data loss. That doesn't mean the SSD was dead, but they did find that SSDs that had one URE were much more likely to have another.

Unlike you, Facebook favors maximum-capacity enterprise SSDs: the most recent generation is 3.2TB. These aren't 35¢/GB SATA notebook drives, but over-provisioned PCIe SSDs designed for high duty cycles.

Furthermore, since SSDs don't relay internal read errors that the controller can correct, the only read errors the study captured were those reported to the server. Servers can sometimes reconstruct data that SSD controllers can't, so this is device-level reporting, not media-level.

What they found

The good news: some issues that worry people aren't issues. The bad news: there's other stuff to worry about.

Temperature

SSDs are sensitive to temperature - more so than hard drives. When an SSD gets hot, it may throttle back performance. If some servers show unexplained slowdowns, check the temperature.

The first gen SSDs failed more often as temperature rose, possibly due to a lack of throttling. Some second gen SSDs throttled aggressively enough to reduce failure rates, while others kept the failure curve flat.
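
If you want to watch for this on your own servers, here's a minimal sketch in Python - my own illustration, not tooling from the study - that polls a drive's reported temperature using smartctl's JSON output (smartmontools 7.x or later). The device path, the JSON key layout and the 70 C threshold are assumptions that vary by drive and setup.

import json
import subprocess

def ssd_temperature_c(device="/dev/nvme0"):
    """Return the drive's reported temperature in degrees C, or None."""
    out = subprocess.run(                      # typically needs root
        ["smartctl", "--json", "-A", device],
        capture_output=True, text=True, check=True,
    ).stdout
    data = json.loads(out)
    # NVMe drives expose a "temperature" object; SATA drives differ by model.
    return data.get("temperature", {}).get("current")

if __name__ == "__main__":
    temp = ssd_temperature_c()
    if temp is not None and temp > 70:         # illustrative threshold only
        print(f"SSD at {temp} C - throttling likely, check airflow")
    else:
        print(f"SSD temperature: {temp} C")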

Bus power

SSDs are thirsty. PCIe v2 SSDs drew anywhere from 8 to 14.5 watts, a high and surprisingly wide range. The team found that as power consumption rose, so did failure rates.

Write fatigue

The team found that the level of system write activity correlated with SSD failure, probably because flash writes require a lot of power. Disks could be a better choice for heavy write applications such as logging.

SSD failures

SSD failures - i.e. UREs - are relatively common: 4.2 to 34.1 percent of the SSDs reported uncorrectable errors. In fact, 99.8 percent of the SSDs reporting an error in one week reported another error in the next week.
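
To make that recurrence figure concrete, here's a toy Python calculation - hypothetical log format, not Facebook's tooling - that measures how often a device reporting an error in one week reports another the following week:

def recurrence_rate(weekly_errors):
    """weekly_errors maps a device id to its uncorrectable-error count per week."""
    errored = recurred = 0
    for counts in weekly_errors.values():
        for this_week, next_week in zip(counts, counts[1:]):
            if this_week > 0:
                errored += 1
                if next_week > 0:
                    recurred += 1
    return recurred / errored if errored else 0.0

sample = {
    "ssd-a": [0, 3, 5, 2],   # errors keep recurring once they start
    "ssd-b": [0, 0, 1, 0],   # a one-off error
}
print(f"{recurrence_rate(sample):.1%} of error weeks were followed by another")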

Life cycle and failures

The SSD failure profile differs from that of disks: disks exhibit infant mortality, then a few years of good reliability, before age catches up with them. SSDs have an early period of UREs while faulty cells are identified and retired, then improving reliability, until cell wear-out leads to increasing read failures.

The data layout surprise

Disk drives aren't much affected by data layout - unless it involves lots of random seeks. But SSDs are very different.

Sparse logical data layouts - non-contiguous data - lead to higher SSD failure rates, as do very dense data structures. My reading: problems in the logical-to-physical address logic in SSD controllers. Update: Alert reader Wilback noted that the paper theorized that "Such behavior is potentially due to the fact that sparse data allocation can correspond to access patterns that write small amounts of non-contiguous data, causing the SSD controller to more frequently erase and copy data compared to writing contiguous data."
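
A rough way to see why small, scattered writes hurt: flash is erased in whole blocks, so updating a few pages inside an otherwise-valid block forces the controller to copy the untouched pages elsewhere before it can erase. The back-of-the-envelope Python model below - my own worst-case illustration, assuming 4 KiB pages and 64-page erase blocks - shows how that extra copying (write amplification) grows as updates get smaller and more scattered.

PAGE_KB = 4
PAGES_PER_BLOCK = 64   # i.e. a 256 KiB erase block

def write_amplification(update_kb):
    """Flash pages physically written per page the host logically updates,
    assuming each update lands in a different, otherwise-full erase block."""
    pages_updated = max(1, update_kb // PAGE_KB)
    if pages_updated >= PAGES_PER_BLOCK:       # whole-block rewrites need no copying
        return 1.0
    copied = PAGES_PER_BLOCK - pages_updated   # valid pages the controller must relocate
    return (pages_updated + copied) / pages_updated

for kb in (4, 16, 64, 256):
    print(f"{kb:>4} KiB scattered updates -> ~{write_amplification(kb):.0f}x flash writes")

Real controllers do far better than this worst case thanks to over-provisioning and garbage collection, but the trend - more erase-and-copy work for small, non-contiguous writes - matches the paper's explanation.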

The Storage Bits take

Props to the CMU/FB team for an important paper. We all knew that SSDs should be different from disks - solid state vs mechanical - but exactly how they would differ was not predictable.

PC SSDs probably see higher error rates, but users - like me - don't often notice. And if there is a data problem - like the one I had last week on a MacBook Air's 500GB SSD - we have no idea where the problem originated. The SSD? The pathetic HFS+ file system? Malware? Cosmic rays?

If you manage servers that use SSDs, you should read this paper. It offers an evidence-based view of SSD behavior, with empirical specifics available nowhere else.

Comments welcome, as always. What surprises you about these results?
