Flash vs Cash

Summary: What really distinguishes the use of flash in Sun's second generation thumpers from its use by competitors like EMC? It's a cache versus cash thing.


There are two enormous differences between Sun's latest storage offerings and those from traditional RAID vendors like EMC:

  1. the 74XX series makes use of Sun's Dtrace technology to offer storage analytics software no one else can match.

    As I've said elsewhere, the storage product is faster, cheaper and more reliable - but the star component in this product series is the analytics package, because the radically clear thinking on information presentation it implements could reasonably form the basis for a future data center management package.

  2. the 74XX series is very much a second generation "thumper" design exploiting the use of Sun's ZFS file system to provide a balance between cost and performance that, again, no one else can match.

One of the ways in which Sun exploits ZFS to provide high performance at relatively low cost involves the use of flash devices to extend and support ZFS caching - an approach that differs radically from that taken by traditional vendors who, lacking both ZFS and Solaris, are mostly constrained to selling flash in a disk replacement role.
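The idea behind that caching approach can be sketched in a few lines. What follows is a hypothetical Python illustration of a two-tier read cache - a small DRAM tier in front of a much larger flash tier, with disk remaining the authoritative copy. The class and method names are mine for illustration, not ZFS internals.

```python
# Hypothetical sketch of a hybrid read cache: DRAM in front of flash,
# with disk as the authoritative store. Illustrative only - not how
# ZFS's ARC/L2ARC are actually implemented.
from collections import OrderedDict

class TieredReadCache:
    def __init__(self, dram_blocks, flash_blocks):
        self.dram = OrderedDict()            # tier 1: small, fastest
        self.flash = OrderedDict()           # tier 2: large flash read cache
        self.dram_cap = dram_blocks
        self.flash_cap = flash_blocks

    def read(self, key, disk_read):
        if key in self.dram:                 # DRAM hit
            self.dram.move_to_end(key)
            return self.dram[key], "dram"
        if key in self.flash:                # flash hit: promote to DRAM
            value = self.flash.pop(key)
            self._put_dram(key, value)
            return value, "flash"
        value = disk_read(key)               # miss: read from disk, cache it
        self._put_dram(key, value)
        return value, "disk"

    def _put_dram(self, key, value):
        self.dram[key] = value
        self.dram.move_to_end(key)
        if len(self.dram) > self.dram_cap:   # demote LRU block to flash
            old_key, old_val = self.dram.popitem(last=False)
            self.flash[old_key] = old_val
            if len(self.flash) > self.flash_cap:
                self.flash.popitem(last=False)
```

The key property mirrors the article's point: data evicted from DRAM falls to flash rather than disappearing, so a re-read that misses DRAM can still be served at flash latency instead of disk latency.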

Here's how Adam Leventhal, one of "the Dtrace three", described the differences in a Dec 01/08 blog entry:

Casting the shadow of the Hybrid Storage Pool

The debate, calmly waged, on the best use of flash in the enterprise can be summarized as whether flash should be a replacement for disk, acting as primary storage, or it should be regarded as a new, and complementary tier in the storage hierarchy, acting as a massive read cache. The market leaders in storage have weighed in on the issue, and have declared incontrovertibly that, yes, both are the right answer, but there's some bias underlying that equanimity.

Chuck Hollis, EMC's Global Marketing CTO, writes that "flash as cache will eventually become less interesting as part of the overall discussion... Flash as storage? Well, that's going to be really interesting."

Standing boldly with a foot in each camp, Dave Hitz, founder and EVP at Netapp, thinks that "Flash is too expensive to replace disk right away, so first we'll see a new generation of storage systems that combine the two: flash for performance and disk for capacity." So what are these guys really talking about, what does the landscape look like, and where does Sun fit in all this?

Flash as primary storage (a.k.a. tier 0)

Integrating flash efficiently into a storage system isn't obvious; the simplest way is as a direct replacement for disks. This is why most of the flash we use today in enterprise systems comes in units that look and act just like hard drives: SSDs are designed to be drop-in replacements. Now, a flash SSD is quite different from a hard drive - rather than a servo spinning platters while a head chatters back and forth, an SSD has floating gates arranged in blocks... actually it's probably simpler to list what they have in common, and that's just the form factor and interface (SATA, SAS, FC). Hard drives have all kinds of properties that don't make sense in the world of SSDs (e.g. I've seen an SSD that reports its RPM telemetry as 1), and SSDs have their own quirks with no direct analog (read/write asymmetry, limited write cycles, etc). SSD vendors, however, manage to pound these round pegs into their square holes, and produce something that can stand in for an existing hard drive. Array vendors are all too happy to attain buzzword compliance by stuffing these SSDs into their products.

Storage vendors already know how to deal with a caste system for disks: they stratify them in layers with fast, expensive 15K RPM disks as tier 1, and slower, cheaper disks filling out the chain down to tape. What to do with these faster, more expensive flash devices? Tier-0 of course! An astute Netapp blogger asks, "when the industry comes up with something even faster... are we going to have tier -1?" - great question. What's wrong with that approach? Nothing. It works; it's simple; and we (the computing industry) basically know how to manage a bunch of tiers of storage with something called hierarchical storage management. The trouble with HSM is the burden of the M. This solution kicks the problem down the road, leaving administrators to figure out where to put data, what applications should have priority, and when to migrate data.

Flash as a cache

The other school of thought around flash is to use it not as a replacement for hard drives, but rather as a massive cache for reading frequently accessed data. As I wrote back in June for CACM, "this new flash tier can be thought of as a radical form of hierarchical storage management (HSM) without the need for explicit management." Tersely, HSM without the M. This idea forms a major component of what we at Sun are calling the Hybrid Storage Pool (HSP), a mechanism for integrating flash with disk and DRAM to form a new, and - I argue - superior storage solution.

Let's set aside the specifics of how we implement the HSP in ZFS - you can read about that elsewhere. Rather, I'll compare the use of flash as a cache to flash as a replacement for disk, independent of any specific solution.

The case for cache

It's easy to see why using flash as primary storage is attractive. Flash is faster than the fastest disks by at least a factor of 10 for writes and a factor of 100 for reads, as measured in IOPS.

Replacing disks with flash, though, isn't without nuance; there are several inhibitors, primary among them cost. The cost of flash continues to drop, but it's still much more expensive than cheap disks, and will continue to be for quite a while. With flash as primary storage, you still need data redundancy - SSDs can and do fail - and while we could use RAID with single- or double-device redundancy, that would cleave the available IOPS by a factor of the stripe width. The reason to migrate to flash is performance, so it wouldn't make much sense to hand the majority of that performance back with RAID. The remaining option, therefore, is to mirror SSDs, whereby the already high cost is doubled.
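A back-of-envelope sketch makes the tradeoff concrete. All the numbers below are my own illustrative assumptions, not vendor specs or figures from the article:

```python
# Illustrative numbers only - assumed, not measured.
ssd_write_iops = 10_000      # assumed small-write IOPS per SSD
ssd_cost_per_tb = 2_000      # assumed dollars per TB of flash
capacity_tb = 10             # usable capacity we want

# Parity RAID across an 8-wide stripe: a small random write involves the
# whole stripe, so delivered write IOPS drop by roughly the stripe width.
stripe_width = 8
raid_iops = ssd_write_iops / stripe_width
raid_cost = capacity_tb * ssd_cost_per_tb * stripe_width / (stripe_width - 1)

# Mirroring preserves per-device IOPS but doubles the flash bill.
mirror_iops = ssd_write_iops
mirror_cost = capacity_tb * ssd_cost_per_tb * 2

print(f"parity RAID: {raid_iops:,.0f} IOPS for ${raid_cost:,.0f}")
print(f"mirrored:    {mirror_iops:,.0f} IOPS for ${mirror_cost:,.0f}")
```

Under these assumptions the parity layout costs less but surrenders most of the IOPS you bought flash for, which is the article's point about mirroring being the only real option.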

It's hard to argue with results: all-flash solutions do rip. If money were no object that may well be the best solution (but if cost truly weren't a factor, everyone would strap batteries to DRAM and call it a day).

Can flash as a cache do better? Say we need to store 50TB of data. With an all-flash pool, we'll need to buy SSDs that can hold roughly 100TB of data if we want to mirror for optimal performance, and maybe 60TB if we're willing to accept a far more modest performance improvement over conventional hard drives. Since we're already resigned to cutting a pretty hefty check, we have quite a bit of money to play with to design a hybrid solution.

If we were to provision our system with 50TB of flash and 60TB of hard drives, we'd have enough cache to retain every byte of active data in flash while the disks provide the necessary redundancy. As writes come in, the filesystem would populate the flash while writing the data persistently to disk. The performance of this system would be epsilon away from the mirrored flash solution, as read requests would only go to disk in the case of faults from the flash devices. Note that we never rely on the flash for correctness; it's the hard drives that provide reliability.
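Under assumed per-TB prices, the arithmetic from the two paragraphs above looks like this (the prices are hypothetical, chosen only to show the shape of the comparison):

```python
# Hypothetical per-TB prices - illustrative, not from any vendor.
flash_per_tb = 2_000
disk_per_tb = 100

# All-flash, mirrored for performance: ~100TB of flash for 50TB of data.
all_flash_mirrored = 100 * flash_per_tb

# Hybrid: 50TB of flash cache plus 60TB of (redundant) hard drives.
hybrid = 50 * flash_per_tb + 60 * disk_per_tb

print(f"all-flash mirrored: ${all_flash_mirrored:,}")
print(f"hybrid flash+disk:  ${hybrid:,}")
```

At any plausible price ratio the hybrid design comes in well under the mirrored all-flash design while serving nearly all reads from flash.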

The hybrid solution is cheaper, and it's also far more flexible. If a smaller working set accounted for a disproportionately large number of reads, the total IOPS capacity of the all-flash solution could be underused. With flash as a cache, data could be migrated to dynamically distribute load, and additional cache could be used to enhance the performance of the working set. It would be possible to use some of the same techniques with an all-flash storage pool, but it could be tricky. The luxury of a cache is that the looser constraints allow for more aggressive data manipulation.

Building on the idea of concentrating the use of flash for hot data, it's easy to see how flash as a cache can improve performance even without every byte present in the cache. Most data doesn't require 50μs random access latency over the entire dataset; users would see a significant performance improvement with just the active subset in a flash cache.
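The point about partial caching can be made with a one-line expected-latency model. The device latencies below are assumptions (roughly 50μs for flash and 5ms for disk), not measurements:

```python
def avg_latency_us(hit_rate, flash_us=50.0, disk_us=5_000.0):
    """Expected read latency when a fraction of reads hit the flash cache."""
    return hit_rate * flash_us + (1 - hit_rate) * disk_us

# Even a 90% hit rate cuts the average read latency by nearly 10x.
for h in (0.0, 0.5, 0.9, 0.99):
    print(f"hit rate {h:.0%}: {avg_latency_us(h):,.1f} us")
```

Because disk latency dominates the average, every point of hit rate the cache captures on the active subset pays off disproportionately.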

Of course, this means that software needs to be able to anticipate what data is in use, which probably inspired this comment from Chuck Hollis: "cache is cache - we all know what it can and can't do." That may be so, but comparing an ocean of flash for primary storage to a thimbleful of cache reflects fairly obtuse thinking. Caching algorithms will always be imperfect, but the massive scale to which we can grow a flash cache radically alters the landscape.

Even when a working set is too large to be cached, it's possible for a hybrid solution to pay huge dividends. Over at Facebook, Jason Sobel (a colleague of mine in college) produced an interesting presentation on their use of storage (take a look at Jason's penultimate slide for his take on SSDs). Their datasets are so vast and sporadically accessed that the latency of actually loading a picture, say, off of hard drives isn't actually the biggest concern; rather, it's the time it takes to read the indirect blocks, the metadata. At Facebook, they've taken great pains to reduce the number of dependent disk accesses from fifteen down to about three. In a case such as theirs, it would never be economical to store or cache the full dataset on flash, and the working set is similarly too large, as data access can be quite unpredictable. It could, however, be possible to cache all of their metadata in flash. This would reduce the latency of an infrequently accessed image by nearly a factor of three. Today in ZFS this is a manual, per-filesystem setting, but it would be possible to evolve a caching algorithm to detect a condition where this was the right policy and make the adjustment dynamically.
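To see why caching only the metadata pays off, assume (hypothetically) 8ms per disk access and 50μs per flash access for a read that needs two dependent metadata lookups before the data block:

```python
disk_ms = 8.0      # assumed per-access disk latency
flash_ms = 0.05    # assumed per-access flash latency

# Three dependent accesses, all on disk: metadata, metadata, then data.
all_disk = 3 * disk_ms

# Metadata cached in flash: only the final data read touches disk.
metadata_in_flash = 2 * flash_ms + disk_ms

print(f"all on disk:        {all_disk:.1f} ms")
print(f"metadata in flash:  {metadata_in_flash:.2f} ms")
```

The cached metadata is a tiny fraction of the dataset, yet it removes two of the three serialized disk accesses, which is where the "nearly a factor of three" comes from.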

Using flash as a cache offers the potential to do better, and to make more efficient and more economical use of flash. Sun, and the industry as a whole, have only just started to build the software designed to realize that potential.

The bottom line, in other words, is that both Sun and its competitors are selling flash - but with Sun your cash gets you cache, and with the others it just gets you flash.



  • Global Marketing CTO = oxymoron?

    Technical marketing? Is that taught in schools?

    Having Dtrace able to scan through the entire stack from OS to storage is the game changer. EVERYONE is interested in system performance, and to do an accurate assessment you need good tools. These do NOT exist today in the UNIX world. My teams have struggled to do this with AIX - because there are so many virtualized layers. When you virtualize your system, your problems also become virtual - and are MUCH harder to track down.

    We had a giant financial package running at Ford on a large P690. It was connected to a SAN and EMC storage. The filesystems were VERITAS. We had a problem with I/O slowness, so we tried to figure out what was wrong. The AIX tools barely showed any problem. VERITAS reported everything was A-OK. EMC showed all disks were functioning within spec and the cache-hit ratios were good. Where's the problem? At least Oracle reported that the filesystems had long wait times - so we had SOME evidence of what we all could "feel". It's very hard to track down I/O problems through the layers - and if Dtrace + ThumperII can solve that, I see a mass migration to Sun ;)
    Roger Ramjet
    • Been there, suffered that

      AIX is notorious for this kind of thing. What I found as "the right answer" was to internalize all primary disk storage - i.e. manage all the disks it used directly from AIX with no EMC or IBM external RAID. That got about a 5x speed-up - on a P680.
    • Been there, suffered that

      An operations dept. claimed for years that the very evident "stalls" were not there, that performance was stellar and that the hw was A-OK.

      In this case it was Solaris on Fujitsu. To my knowledge they still haven't solved it, although they have recognized that there may be a problem (there bloody well is when even the simplest queries through Oracle can freeze for 10+ seconds).
      • Long Oracle stalls on Solaris?

        Did you read the wintel case study a few weeks back?
      • Oracle has a nice stats package

        It will show you the I/O performance for each filesystem - and it will show slowdowns or stalls. But that just confirms what you already know by seat-of-the-pants reasoning. FIXING the problem is another story . . .
        Roger Ramjet
  • Format errors

    Sorry about the format errors - I think I've caught most of them now.

    (The excuse - which is what it is - is that there's no preview on posting and wordpress HTML isn't terribly reliable.)
    • Re: Format errors

      Curious to know what wintel line of thinking is causing the lack of preview. I am pretty sure UNIX had preview since the 70s.
      tick tock
      • wysiwyg

        They're moving from an html editor for which preview makes sense to a wintel style wysiwyg editor for which (they think) it doesn't.
  • Good piece

    Nice to see that you really can bring some interesting information forward.

    ZFS does look nice.
  • So it's just Sun marketing blogs now eh Murph?

    Things must be going really badly for Sun...
  • Boring.............Where's the Microsoft hate?

    You'll get more talkbacks if you just do the usual. How about flash memory doesn't work in Windows or how about TCO increases if you try using flash memory in Windows?
  • RE: Flash vs Cash

    I attended a Sun product seminar for the 74xx and it looks pretty nice.

    Couple of comments:
    Sun is taking a long time to make Fibre Channel available; I was told the reason is the large number of FC HBAs, switches, etc. that need to be tested.

    Also, I think the "just one RAID configuration for the whole array" will be a sales turnoff -- I want to be able to have more granular control, especially on the larger arrays -- i.e., one set of drives in a high-performance (mirrored) configuration, say for Oracle, and another set of drives in RAID 5 or 6 to use for file sharing (like a NAS). I suspect that a good number of potential customers may think likewise. I might not NEED the control, but my gut reaction is that I want it.

    • Well, you actually have that fine grained control

      Just make a different file system on different drives for each one and away you go.
      • Not on the 7xxx appliances

        Note that my comments were about the 7xxx Amber Road storage appliances.

        During the Sun "Amber Road" seminar that I attended, the Sun rep demonstrated the configuration process for the 74xx storage appliance, and one of the questions is how to configure all the storage in that appliance. There is no option to configure part of the storage one way (say Raid 5 or 6) and part another way (like mirrored disks).

        Of course, if you build your own storage server with Solaris/Open Solaris and ZFS, you will have more control.

    • RE: RAID Control

      I agree. There are benefits to multiple RAID types as each has its pros/cons for certain workloads. Who knows what's in the future for this line but that would be nice to see. Right now though this can be achieved by having a different RAID type on the storage pool for each node in a cluster. Here are two interesting notes though: 1) NetApp *really* only lets you choose RAID-DP and they sell like hotcakes 2) SSD as cache will typically more than compensate for any performance concerns that a RAID type can introduce.
  • Probably your most interesting topic for quite some time.

    I've only just got round to considering Solaris (or OpenSolaris). Mainly for ZFS.

    Just one question:
    When using ZFS, is hardware RAID (Level 5/50) still relevant?
    • Usually not

      ZFS does raid in software - typical hw raid controllers just slow things down and add cost, i.e. JBODs are usually the way to go.

      However.. some multiport raid controllers that do mirroring extremely efficiently can be useful. Use ZFS on half the disks and (usually) two controllers. Have HW mirroring on the array, use two controllers to connect the mirrored jbod image to the backup computer.

      You can generally get the same effect more efficiently just using Solaris/ZFS, but doing it this way has the effect of letting you make the backup disks and computer completely invisible - and inaccessible to the root wielders on the production side. i.e. This sucks as a backup or cluster "solution" but is great for protecting against some classes of internal risk.
      • Thanks Paul.


  • dtrace

    You claim that dtrace is the number one argument for buying Sun storage. Dtrace is for troubleshooting application and kernel problems.

    Sounds like lot of fun debugging those problems, but I would rather put my money in a storage system where dtrace is not needed.
    • RE: dtrace

      I suggest you watch Sun's demos of how dtrace is used in their storage - it's unbelievable. Dtrace can be used as you mentioned OR it can be used to give a whole new level of insight into performance. Most storage vendors provide some sort of indicators of how performance is going. Sun with Dtrace takes it to the Nth degree. It has nothing to do with troubleshooting kernel problems but it does work extremely well to determine if the storage is a bottleneck, is bandwidth limited, what's the latency on a LUN, how a particular VM is performing, etc. etc. etc. Imagine unprecedented, real-time insight into the complete health and performance of your storage - far beyond what anyone else can do today. To say dtrace is only for troubleshooting problems is a complete misunderstanding and a lack of even looking at what it does. Visit Sun's website and see the online demo - or download the simulator and run your own workload against it. It's lots of fun and pretty darn amazing. Take 5-10 minutes and visit their website and watch the demos - well worth the time.