Baidu chooses dumb SSDs. Should you?

SSDs have powerful processors and hundreds of thousands of lines of code to make them smart. But after installing over 300,000 smart SSDs, Chinese giant Baidu has opted for a different model: software defined flash. Here's how it works.

SSDs are fast, but their capacity is expensive. Due to over provisioning, parity ECC and other overhead, typically only 50 to 70 percent of that costly raw capacity is available for user data.

Not a big deal if you only have a few hundred. But when you have over 300,000 that overhead adds up to a significant cost.
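To put a rough number on that, here is a back-of-envelope sketch. The 50 and 70 percent figures and the 300,000-drive fleet come from the text above; the 1,000 GB of raw flash per drive is purely an assumption for illustration, and the function name is hypothetical.

```python
# Back-of-envelope: raw flash lost to overhead across a fleet.
# ASSUMPTION: 1,000 GB of raw flash per drive (not a figure from the article).
RAW_GB_PER_DRIVE = 1_000


def lost_capacity_gb(drives: int, raw_gb_per_drive: int, usable_pct: int) -> int:
    """Raw capacity (GB) that is NOT available for user data."""
    return drives * raw_gb_per_drive * (100 - usable_pct) // 100


if __name__ == "__main__":
    # 50% and 70% usable are the two ends of the range quoted in the article.
    for usable_pct in (50, 70):
        lost = lost_capacity_gb(300_000, RAW_GB_PER_DRIVE, usable_pct)
        print(f"{usable_pct}% usable: {lost:,} GB of raw flash unavailable")
```

Whatever drive size you plug in, the unavailable share scales linearly with the fleet, which is why the overhead matters far more at 300,000 drives than at a few hundred.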

In an ASPLOS '14 paper, SDF: Software-Defined Flash for Web-Scale Internet Storage Systems, researchers Jian Ouyang, Shiding Lin, Zhenyu Hou, Yong Wang, and Yuanzheng Wang of Baidu, and Song Jiang of Wayne State University, discuss the problem and their solution.

Why over provision?

Unlike hard drives, whose spare blocks replace bad blocks, SSDs over provision for performance. A busy SSD has to maintain a pool of fresh blocks to handle random writes and the process of garbage collection.

In an enterprise SSD with a mixed read/write workload, an extra 8 percent of free blocks can increase throughput by more than 400 percent. Over provisioning is key to high performance SSDs, because the SSD has no way of knowing what the application and OS are going to require.

Software defined flash

The researchers' key insight for software defined flash:

By exposing the channels in commodity SSD hardware to software and requiring the write unit to be of the flash erase-block size, we can match the workload concurrency with the hardware parallelism provided by the multiple channels and minimize the interference in a flash channel . . . thereby exploiting individual flash's raw bandwidth.

Instead of relying on the SSD's processor to handle garbage collection and maintain a pool of empty blocks, with SDF only the host initiates block erasures. The host also schedules those erasures, which can take dozens of milliseconds, during the flash's idle times to minimize performance impact.
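The idea of host-scheduled erasure can be illustrated with a toy model. This is a minimal sketch, not Baidu's implementation: the `Channel` class, its fields, and `run_idle_erases` are all hypothetical names, and a real erase would be a device command taking milliseconds rather than a list operation.

```python
from collections import deque


class Channel:
    """Toy model of one flash channel that the host manages directly."""

    def __init__(self, channel_id):
        self.channel_id = channel_id
        self.pending_erase = deque()  # blocks the host has marked as stale
        self.free_blocks = []         # erased blocks, ready for new writes
        self.busy = False             # True while a read or write is in flight

    def mark_stale(self, block):
        """Host decides this block's data is dead; queue it for erasure."""
        self.pending_erase.append(block)

    def erase_if_idle(self):
        """Erase one queued block, but only when the channel is idle."""
        if self.busy or not self.pending_erase:
            return None
        block = self.pending_erase.popleft()
        self.free_blocks.append(block)  # stand-in for the real erase command
        return block


def run_idle_erases(channels):
    """One host-side pass: erase at most one block on every idle channel."""
    erased = []
    for ch in channels:
        block = ch.erase_if_idle()
        if block is not None:
            erased.append((ch.channel_id, block))
    return erased
```

The point of the sketch is where the decision lives: the device never erases on its own, so a slow erase cannot land in the middle of a burst of reads unless the host lets it.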

They also save flash capacity by eliminating parity coding across flash channels, relying on system level replication for data persistence and integrity. I presume that their object store implements other features as well to ensure data integrity.

Workloads

Web-scale data centers have some special workloads. As their spiders walk the web, they have thousands of servers generating index data in large sequential writes. Small writes are commonly merged into multi-megabyte sequential writes as well in a log structured object store.

Baidu's system uses an 8MB write block and an 8KB read block to optimize both read and write performance for flash.
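Merging small writes into erase-block-sized units can be sketched as a simple buffer. The 8MB write unit is from the article; the `WriteCoalescer` class and the `flush` callback are hypothetical, and a real log structured store would also track object offsets and handle a final partial block.

```python
WRITE_UNIT = 8 * 1024 * 1024  # 8MB write block, per the article


class WriteCoalescer:
    """Accumulate small appends and emit only full write-unit-sized chunks."""

    def __init__(self, flush, unit=WRITE_UNIT):
        self.flush = flush      # callback that receives each full chunk
        self.unit = unit        # size of one device write, in bytes
        self.buf = bytearray()  # data waiting to fill the next chunk

    def append(self, data: bytes) -> None:
        """Add data to the log; flush every complete unit that results."""
        self.buf += data
        while len(self.buf) >= self.unit:
            self.flush(bytes(self.buf[:self.unit]))
            del self.buf[:self.unit]


if __name__ == "__main__":
    chunks = []
    log = WriteCoalescer(chunks.append)
    for _ in range(3):
        log.append(b"\0" * (3 * 1024 * 1024))  # three 3MB writes
    print(len(chunks), "full write(s),", len(log.buf), "bytes still buffered")
```

Because the device only ever sees whole write units, it never has to relocate live data to free up a partially filled erase block.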

Simplification

Due to Baidu's control of the entire hardware and software stack, the researchers could do more to simplify their custom SSD. They removed the DRAM cache common in SSDs, since the data could be kept in the server's main memory.

They also found it helpful to bypass the Linux I/O stack, replacing it with an ultra-light-weight user-space interface and thin driver. This cut total service request time by another 15 percent.

Combined with the reduction in over provisioning, flash translation layer and parity elimination, and simplified controller circuitry, Baidu cut the cost of an SSD by 50 percent. In addition, they've found that the SDF delivers about 95 percent of the raw flash bandwidth and 99 percent of the flash capacity for user data.

They've deployed several thousand of the new SSDs, with more planned.

The Storage Bits take

Baidu's SDF will not become a commercial product, but the principles embodied in this work could be adapted to more general purpose systems. Effectively what they've done is adapt the workload and I/O stack to flash, much as the current I/O stack is adapted to disks.

That is a strategy that could be much more widely adopted. As we've seen with the growth in object storage and commodity-based storage - both pioneered in web-scale systems - these ideas will trickle down to the enterprise and perhaps even PC users.

The larger picture is that this is another skirmish in the continuing war between "smart" and "dumb" architectures. We also see this in Shingled Magnetic Recording, with HGST opting for dumb storage with host management and Seagate going with intelligent devices that manage SMR issues without host involvement.

Who wins? We do, thanks to more flexible architectures and more cost-effective products. If you think you've seen everything in IT, wait a minute.

Comments welcome, as always. Question: if SSDs cost half what they do now, how many more applications for them would you find? A hat tip to Jim Handy, the SSD guy, and alert reader Eric for pointing this paper out.