
What is a 'usable gigabyte'?

Flash vendors make a big deal about 'usable gigabytes' as they struggle to show they are more cost-effective than disks. But is it a realistic metric?
Written by Robin Harris, Contributor

Flash storage, with its high-performance reads and low power consumption, is remaking the storage industry. But raw flash is expensive, roughly 10-20 times the cost of raw disk, making the economic case more difficult.

A slew of flash vendors, including Pure Storage and, lately, HP, have invoked the idea that, with proper techniques, usable flash capacity can be competitive with the per-gigabyte cost of disk arrays. But are these comparisons fair and realistic?

Flash vendors have a point: Traditional enterprise disk arrays are notoriously under-utilized, with only 30-40 percent of their expensive capacity storing user data. That's because they are over-provisioned and conservatively managed to allow for end-of-quarter spikes and years of application growth.
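To make the arithmetic concrete, here is a minimal sketch of a "cost per usable gigabyte" calculation. The 35 percent disk utilization and the 10x raw cost come from the ranges quoted above. The flash utilization of 80 percent and the 4:1 data-reduction ratio are purely hypothetical inputs, not vendor figures, and the function name is my own.

```python
def cost_per_usable_gb(raw_cost_per_gb, utilization=1.0, reduction_ratio=1.0):
    """Raw cost divided by the user data actually stored per raw gigabyte."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    return raw_cost_per_gb / (utilization * reduction_ratio)


# Disk raw cost is normalized to 1.0; flash is taken at 10x raw cost.
disk = cost_per_usable_gb(1.0, utilization=0.35)

# Hypothetical flash array: 80% utilized, 4:1 data reduction (assumed values).
flash = cost_per_usable_gb(10.0, utilization=0.80, reduction_ratio=4.0)

print(f"disk:  {disk:.2f} cost units per usable GB")
print(f"flash: {flash:.2f} cost units per usable GB")
```

Under those assumed inputs the two figures land in the same neighborhood, which is the whole pitch. Change the reduction ratio to 1.5:1 and the comparison looks very different.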

But there's a technical issue too: Since flash can handle tens of thousands of I/Os per second, data structures — especially metadata — and compression techniques can be optimized in ways that aren't feasible for disk-based storage. While disk systems can use some of these techniques — and one has to wonder if the goal of selling more gigs caused them not to be employed — the fact is they haven't been, leaving a clear field for flash vendors.

The techniques flash vendors employ include some old standbys as well as more modern technologies. They all are a form of compression, although what gets compressed and how it is compressed vary widely.

Compression techniques

  • Data compression. Lempel-Ziv-Welch (LZW) compression, widely used in tape drives for decades, reduces character data by approximately 50 percent. Hardware LZW is very fast and easily handled inline.
  • De-duplication. Enterprises typically store many copies of very similar documents (think of a presentation where only a client name is changed). The shared content can be stored once, with only the differences noted, and the full document is restored when it is read.
  • Advanced erasure codes. RAID5 is an erasure code, but in the last 25 years rateless erasure codes have enabled much higher levels of protection with much lower overhead.
  • Thin provisioning. Traditional provisioning dedicates capacity to an app whether used or not. Thin provisioning "tells" the file system that the capacity is dedicated, but only allocates it when data is written.
  • Snapshots. Like a backup, but old data is saved only when it is updated (the copy-on-write algorithm), so snapshots are typically very compact.
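The copy-on-write idea behind snapshots can be sketched in a few lines. This is a toy model, not any vendor's implementation: the `Volume` class and its methods are hypothetical names, and for simplicity every open snapshot keeps its own copy of an overwritten block.

```python
class Volume:
    """Toy block volume with copy-on-write snapshots."""

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks
        self.snapshots = []  # each snapshot: {block_index: old_contents}

    def snapshot(self):
        """Taking a snapshot copies nothing; it only starts an empty log."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, data):
        # Save the old block in every snapshot that has not saved it yet.
        for saved in self.snapshots:
            saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def read_snapshot(self, snap_id, index):
        """Return the block as it was when the snapshot was taken."""
        return self.snapshots[snap_id].get(index, self.blocks[index])


vol = Volume(4)
vol.write(0, b"v1")
snap = vol.snapshot()
vol.write(0, b"v2")  # only now is the old block 0 preserved

print(vol.read_snapshot(snap, 0), vol.blocks[0], len(vol.snapshots[snap]))
```

The snapshot holds one saved block out of four, which is why snapshots of slowly changing data take so little space.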

HP also promotes something called Thin Clones. I haven't delved into it, but I assume it's simply a file that stores only the differences, with pointers back to the original for unchanged blocks.

The Storage Bits take

So, are the "usable gigabyte" claims legitimate? Yes, if you understand some caveats.

All these techniques make assumptions about data and/or usage that may not always apply. For example, LZW assumes that data is compressible (i.e., that roughly half of it is redundant), but if you give it already-compressed data, it's stuck and your "available" capacity suddenly drops.
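You can see the effect with any general-purpose compressor. The sketch below uses Python's `zlib`, which implements DEFLATE rather than LZW, so treat it as a stand-in for the behavior rather than for the algorithm an array actually runs. The sample text is invented, and random bytes stand in for already-compressed data.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data))


# Highly repetitive character data: the friendly case.
text = b"quarterly sales report for client ACME\n" * 1000

# Random bytes approximate data that has already been compressed or encrypted.
random_like = os.urandom(len(text))

print(f"repetitive text: {compression_ratio(text):.1f}x")
print(f"random bytes:    {compression_ratio(random_like):.2f}x")
```

The repetitive text shrinks dramatically, while the random bytes do not shrink at all. An array sized on the first number and fed the second kind of data runs out of room early.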

De-duplication just keeps one copy of your data, plus a list of pointers and changes. If that list is corrupted, so is your data, maybe lots of data. So those data structures need to be bulletproof. I wouldn't rely on RAID5 to protect them.
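Here is a minimal sketch of why that pointer list matters. It assumes fixed-size blocks indexed by SHA-256 digest, which is one common approach but not necessarily what any given array does. The `DedupStore` class and the file names are hypothetical.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: each unique block is kept once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # SHA-256 digest -> block contents (single copy)
        self.files = {}   # file name -> list of digests (the pointer list)

    def put(self, name, data):
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if new
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        # Raises KeyError if any referenced block is missing.
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


store = DedupStore(block_size=8)
store.put("deck_acme", b"AAAAAAAABBBBBBBBclient=1")
store.put("deck_bolt", b"AAAAAAAABBBBBBBBclient=2")

logical = sum(len(store.get(n)) for n in store.files)
print(f"logical bytes: {logical}, stored bytes: {store.stored_bytes()}")
```

Two 24-byte files occupy 32 bytes because the first two blocks are shared. The flip side is that deleting or corrupting one shared entry in `store.blocks` makes both files unreadable at once.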

Thin provisioning assumes that all apps aren't going to want all their provisioned capacity all at once. A pretty safe bet, but a bet nonetheless.
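The bet is easy to model. In this sketch the `ThinPool` class, the volume names, and the sizes are all made up for illustration. The point is only that provisioning is a promise and that writes are what consume physical capacity.

```python
class ThinPool:
    """Toy thin-provisioned pool: promised up front, allocated on write."""

    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        self.provisioned = {}  # volume name -> promised GB
        self.written = {}      # volume name -> GB actually allocated

    def provision(self, name, size_gb):
        # No physical space is consumed here; this is only a promise.
        self.provisioned[name] = size_gb
        self.written[name] = 0

    def write(self, name, gb):
        if self.written[name] + gb > self.provisioned[name]:
            raise ValueError(f"{name} exceeds its provisioned size")
        if self.used_gb() + gb > self.physical_gb:
            raise RuntimeError("pool exhausted: the overcommit bet was lost")
        self.written[name] += gb

    def used_gb(self):
        return sum(self.written.values())

    def overcommit_ratio(self):
        return sum(self.provisioned.values()) / self.physical_gb


pool = ThinPool(physical_gb=100)
for app in ("erp", "mail", "web"):
    pool.provision(app, 100)  # 300 GB promised on 100 GB of flash

pool.write("erp", 40)
pool.write("mail", 40)
print(f"overcommit {pool.overcommit_ratio():.1f}x, used {pool.used_gb()} GB")
```

With 80 GB written, a further `pool.write("web", 30)` raises `RuntimeError` even though "web" was promised 100 GB. That is the scenario the monitoring and alerting in a real array exist to prevent.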

In the main, then, the flash vendors are correct. For the most part, they've built in these features and others to work inline at wire speed so they don't impact performance. The array vendors could have done something similar, but chose not to.

As the declining sales of traditional enterprise RAID attest, they are now paying the price.

Courteous comments welcome, of course. I'm currently a guest at HP's Discover conference in Las Vegas. Question: Do you have any horror stories due to these "usable gigabyte" technologies? Be as specific as possible.
