What is a 'usable gigabyte'?

Summary: Flash vendors make a big deal about 'usable gigabytes' as they struggle to show they are more cost-effective than disks. But is it a realistic metric?

Flash storage, with its high-performance reads and low power consumption, is remaking the storage industry. But raw flash is expensive, roughly 10 to 20 times the cost of raw disk, which makes the economic case more difficult.

A slew of flash vendors, including Pure Storage and, lately, HP, have invoked the idea that, with proper techniques, usable flash capacity can be competitive with the per-gigabyte cost of disk arrays. But are these comparisons fair and realistic?

Flash vendors have a point: Traditional enterprise disk arrays are notoriously under-utilized, with only 30-40 percent of their expensive capacity storing user data. That's because they are over-provisioned and conservatively managed to allow for end-of-quarter spikes and years of application growth.

But there's a technical issue too: Since flash can handle tens of thousands of I/Os per second, data structures — especially metadata — and compression techniques can be optimized in ways that aren't feasible for disk-based storage. While disk systems can use some of these techniques — and one has to wonder if the goal of selling more gigs caused them not to be employed — the fact is they haven't been, leaving a clear field for flash vendors.

The techniques flash vendors employ include some old standbys as well as more modern technologies. All of them are, in a broad sense, forms of compression, although what gets compressed and how it is compressed vary widely.

Compression techniques

  • Data compression. Lempel-Ziv-Welch (LZW), widely used in tape drives for decades, typically reduces character data by approximately 50 percent. Hardware LZW is very fast and easily handled inline.

  • De-duplication. Enterprises typically store many copies of very similar documents — think a presentation where a client name is changed — that can be stored with only the differences noted and restored when the document is read. 

  • Advanced erasure codes. RAID5 is an erasure code, but in the last 25 years rateless erasure codes have enabled much higher levels of protection with much lower overhead.

  • Thin provisioning. Traditional provisioning dedicates capacity to an app whether used or not. Thin provisioning "tells" the file system that the capacity is dedicated, but only allocates it when data is written.

  • Snapshots. Like a backup, but old data is saved only when it is updated (the copy-on-write algorithm), so snapshots are typically very space-efficient. A minimal sketch of the idea follows this list.
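
To make the copy-on-write idea concrete, here is a minimal sketch in Python. Everything in it (the Volume class, the block-number-to-bytes model) is my own illustration, not any vendor's implementation: a snapshot costs nothing until a block is overwritten, at which point the old contents are preserved for the snapshot.

```python
# Minimal copy-on-write snapshot sketch (illustrative only; names and
# structures are my own, not any vendor's design).

class Volume:
    def __init__(self):
        self.blocks = {}          # live data: block number -> bytes
        self.snapshots = []       # each snapshot: dict of preserved old blocks

    def snapshot(self):
        # A new snapshot starts empty -- it consumes no space until writes occur.
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, block_no, data):
        # Copy-on-write: preserve the old contents in every snapshot that
        # has not yet saved this block, then overwrite in place.
        old = self.blocks.get(block_no)
        if old is not None:
            for snap in self.snapshots:
                snap.setdefault(block_no, old)
        self.blocks[block_no] = data

    def read_snapshot(self, snap_id, block_no):
        # A snapshot read returns the preserved old block if one exists,
        # otherwise falls through to the live (unchanged) block.
        snap = self.snapshots[snap_id]
        return snap.get(block_no, self.blocks.get(block_no))

vol = Volume()
vol.write(0, b"original")
sid = vol.snapshot()              # snapshot takes no space yet
vol.write(0, b"updated")          # old contents preserved on first overwrite
assert vol.read_snapshot(sid, 0) == b"original"
assert vol.blocks[0] == b"updated"
```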

HP also promotes something called Thin Clones. I haven't delved into it, but I assume it's simply a file that records only the differences, with pointers back to the original for unchanged blocks.
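
On that assumption, and purely as a sketch of the general clone-with-deltas idea rather than HP's actual design, a thin clone might look like this: an overlay of changed blocks plus a reference back to a shared, read-only base for everything unchanged.

```python
# Hypothetical thin-clone sketch: the clone stores only blocks it has
# changed and reads everything else from the shared base image.

class ThinClone:
    def __init__(self, base: dict):
        self.base = base          # shared, read-only original (block -> bytes)
        self.overlay = {}         # only the blocks this clone has modified

    def write(self, block_no, data):
        self.overlay[block_no] = data        # the base is never touched

    def read(self, block_no):
        # Changed blocks come from the overlay; everything else from the base.
        return self.overlay.get(block_no, self.base.get(block_no))

base_image = {0: b"boot", 1: b"config", 2: b"data"}
clone = ThinClone(base_image)
clone.write(1, b"customized config")         # clone stores just this one block
assert clone.read(0) == b"boot"              # unchanged blocks read from base
assert clone.read(1) == b"customized config"
```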

The Storage Bits take

So, are the "usable gigabyte" claims legitimate? Yes, if you understand some caveats.

All these techniques make assumptions about data and/or usage that may not always apply. For example, LZW assumes the data is compressible, i.e., that it contains substantial redundancy. Feed it already compressed data and it gains nothing, and your "available" capacity suddenly shrinks.
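
You can see the effect in a few lines of Python. This uses the standard zlib module (DEFLATE, a different dictionary coder than LZW, but the point is the same): redundant text shrinks dramatically, while already-compressed input gains essentially nothing.

```python
# Illustration of the compressibility assumption, with zlib standing in
# for hardware LZW.
import zlib

text = b"Quarterly revenue by region, quarterly revenue by product line. " * 1024
once = zlib.compress(text)
twice = zlib.compress(once)       # feed it already-compressed data

print(f"original      : {len(text):>8} bytes")
print(f"compressed    : {len(once):>8} bytes ({len(once)/len(text):.1%} of original)")
print(f"re-compressed : {len(twice):>8} bytes ({len(twice)/len(once):.1%} of first pass)")
# The redundant text collapses; the already-compressed output barely changes
# (it can even grow slightly), so the "effective" capacity gain vanishes.
```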

De-duplication just keeps one copy of your data, plus a list of pointers and changes. If that list is corrupted, so is your data, maybe lots of data. So those data structures need to be bulletproof. I wouldn't rely on RAID5 to protect them.
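
A toy content-addressed store shows both the savings and the risk. The structure below (4 KB blocks keyed by SHA-256 digests, with a per-file "recipe" of digests) is an illustrative sketch, not any vendor's design: two nearly identical 400 KB files consume just two unique blocks, but corrupt the recipe or the digest-to-block map and the files are unrecoverable.

```python
# Minimal block-level de-duplication sketch (illustrative only).
import hashlib

BLOCK_SIZE = 4096
block_store = {}                  # digest -> block bytes (each stored once)

def write_file(data: bytes) -> list[str]:
    """Store a file; return its recipe: an ordered list of block digests."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)   # duplicate blocks stored once
        recipe.append(digest)
    return recipe

def read_file(recipe: list[str]) -> bytes:
    """Reassemble a file from its recipe; fails if the recipe is corrupted."""
    return b"".join(block_store[d] for d in recipe)

original = b"A" * BLOCK_SIZE * 100
near_copy = b"B" * BLOCK_SIZE + original[BLOCK_SIZE:]   # one block differs

r1 = write_file(original)
r2 = write_file(near_copy)
print("unique blocks stored:", len(block_store))        # 2, not 200
assert read_file(r1) == original and read_file(r2) == near_copy
```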

Thin provisioning assumes that not all apps will want all of their provisioned capacity at the same time. A pretty safe bet, but a bet nonetheless.
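
Sparse files are the everyday analogue. The snippet below (POSIX-only, and assuming a filesystem such as ext4 or XFS that supports sparse allocation; the file name is just for the demo) promises a 1 GiB file but allocates only the handful of blocks actually written. Thin-provisioned LUNs make the same promise at array scale.

```python
# Thin provisioning in miniature: apparent size vs. actually allocated blocks.
import os

path = "thin_volume.img"          # hypothetical demo file
with open(path, "wb") as f:
    f.truncate(1024 ** 3)         # apparent size: 1 GiB, nothing allocated yet
    f.write(b"real data")         # only this write consumes actual blocks

st = os.stat(path)
print(f"apparent size : {st.st_size:>12} bytes")          # 1 GiB
print(f"allocated     : {st.st_blocks * 512:>12} bytes")  # a few KB at most
os.remove(path)
```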

In the main, then, the flash vendors are correct. For the most part, they've built in these features and others to work inline at wire speed, so they don't impact performance. The array vendors could have done something similar, but chose not to.

As the declining sales of traditional enterprise RAID arrays attest, they are now paying the price.

Courteous comments welcome, of course. I'm currently a guest at HP's Discover conference in Las Vegas. Question: Do you have any horror stories due to these "usable gigabyte" technologies? Be as specific as possible.

Talkback

6 comments
  • I'm not comfortable with those characterizations

    as a lot of us are already storing data in compression algorithms, like LZW, which is in the internals of a lot of image and metaimage file formats. What your real storage turns out to be could be less than what the vendor is telling you, if your file mix is different than the scenario they are making their claims with.
    Mac_PC_FenceSitter
    • The LTO media sales I've read about always give multiple numbers...

      raw (uncompressed) storage (like 1.5TB)
      compressed storage (such as 2TB)

      If your data is already compressed/encrypted then you use the raw uncompressed value.

      If the flash vendors aren't doing that, then they are lying.
      jessepollard
  • Agree!

    1st commandment for storage admins: know theyself - and thy data!

    Robin
    R Harris
    • Darn auto-correct!

      1st commandment for storage admins: know thyself - and thy data!

      Robin
      R Harris
  • Thin provisioning - first time I heard it said that way . . .

    "Traditional provisioning dedicates capacity to an app whether used or not. Thin provisioning 'tells' the file system that the capacity is dedicated, but only allocates it when data is written."

    First time I heard it said that way - usually I see it referred to as "copy on write."

    . . . although to be honest, none of those features are really unique to one format or another. You can implement them on a platter drive equally as well on solid state. In fact, I'm pretty sure many of those features (save for RAID 5) are already implemented on consumer PCs (NTFS and many Linux file systems support them), so I couldn't imagine something enterprise level not doing it.
    CobraA1
    • It is also called "oversubscription"...

      And is a very old process.

      It is also a guaranteed way to deadlock a system.
      jessepollard