Sun's ZFS/Flash initiative

Summary: Sun's forthcoming ZFS/Flash products will be interesting enough, and important enough, that most companies will be justified in exploring the technology a bit and perhaps experimenting with it ahead of time using a ramdisk to emulate the flash components.

Up to now, the coolest thing about ZFS, besides really making RAID cheap and easy to implement, has been its ability to silently correct the bit errors that creep in as data is stored, read, and written - a facility that's been particularly important to the raidz implementation.

In the near future, however, that's going to change: the coolest thing about ZFS is going to be its ability to make intelligent use of large amounts of flash memory to speed disk I/O while letting you lower platter rotational speeds (and thus both wear and power use) to something on the order of 5400 RPM.

One of the core developers, Adam Leventhal, has an interesting article in the July ACM on how this is going to work - the technology is non-obvious, but the bottom line is simple: much faster, cheaper, and more reliable storage for big installations.

Properly configured systems using ZFS with flash in the storage hierarchy ahead of traditional disk should offer dramatic (order of magnitude) throughput gains on things like database transactions - and virtually eliminate some processing crises that, I'd guess, nearly all serious sysadmins have had to face.

Disk reconstruction delays and risks will, for example, essentially disappear - and if you mirror on two of the new JBOD arrays, layer in flash, and run something like Oracle or PostgreSQL, almost all of your backup and recovery delays will disappear too.

More interestingly, there are oddball RDBMS admin problems that will get easier to resolve: a lot of production systems, for example, get constrained when databases grow beyond the point that backup and PC-style table inversions (aka "cube" computation) can be done in the time available. In the past the right answer (switch to something more modern that queries the production system directly) has usually been administratively impossible, the fast answer (dump to text and use Perl) usually produces howls from outraged PC people, and the wrong answer (recreate the database schema you need on /tmp and run the inversion in memory-mapped space) often becomes the only one that both works and doesn't create extensive conflicts with the MCSE crowd.

No more - with the ZFS/Flash layer in your storage hierarchy you can flash-freeze the database without stopping production, and then do both your off-line backups and inversions at production I/O rates using whatever cycles are available - whether the users are on-line or not.
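A sketch of what that "flash freeze" looks like in practice, using standard ZFS snapshot commands (the pool and dataset names here are made up for illustration):

```shell
# Take an instantaneous, consistent snapshot of the database filesystem;
# production keeps running - ZFS copy-on-write means this costs almost nothing.
zfs snapshot tank/oradata@backup-20080815

# Run the off-line backup against the snapshot, not the live data:
# here, streaming it to another host with zfs send.
zfs send tank/oradata@backup-20080815 | ssh backuphost "cat > /backup/oradata-20080815.zfs"

# When the backup completes, discard the snapshot.
zfs destroy tank/oradata@backup-20080815
```

The same snapshot can equally well feed a table-inversion job: mount a clone of it read-write and crunch away while production carries on against the live dataset.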

Most sysadmins will have run into this problem at some point - but there's a special variation too: one that's rare in general business but important to Sun's military and telco markets. What happens in these cases is that business needs make any delay on some transaction data unacceptable; so you set things up to cache the critical stuff in memory while keeping only a few indexes to the collateral data there - and then the database grows faster than you can get more memory. As a result you find yourself continually re-optimizing your indexing and caching to keep up - and that activity itself then causes more problems (especially when your boss brings in consultants "to help"). With ZFS managing the storage hierarchy and flash at the front, however, the distinctions disappear from view - and your system self-adapts as volumes change.

You can't buy ZFS/flash yet - and I'm guessing that when you do (early next year? - but note that it took Sun three years from the introduction of ZFS to its first new JBOD products) you'll pretty much need to run a workload measurement utility whose results determine the custom configuration you'll be ordering from Sun. Timing aside, however, I think that the bottom line on this for anyone now using Solaris for larger applications is that this stuff is going to be important - and that getting a running start by reading what you can and experimenting with the key ideas now (using ramdiskadm with ZFS) will pay off for you.
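For those who want that running start, here's one way the ramdiskadm experiment might look on Solaris (requires root; pool and device names are examples, and a ramdisk vanishes on reboot, so this is strictly for performance experiments, never for a production log device):

```shell
# Create two ramdisks to stand in for the flash parts:
# one to emulate a fast write log, one a large read cache.
ramdiskadm -a fakelog 256m
ramdiskadm -a fakecache 1g

# Add them to an existing pool as the separate intent-log
# and cache (L2ARC) devices.
zpool add tank log /dev/ramdisk/fakelog
zpool add tank cache /dev/ramdisk/fakecache

# Check the layout, then run your workload and watch the numbers.
zpool status tank
zpool iostat -v tank 5
```

Compare the iostat results against the same workload on the unmodified pool and you get a rough, free preview of what the real flash hardware should buy you.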




  • Flash sizes?

    What size flash are we talking about here? There is no way we are flash freezing a 200G database in any flash implementation that I know of.
    • Please see the note from jamesd.wi, below

      As he says, it's the combination of flash with ZFS that works the miracle - we're not talking about a simple SSD implementation here.
  • Leading to what?

    Will this work - as soon and as well - for ZFS on Linux? (and Mac?)

    (If so cue Anton)

    How is it expected to affect the man in the street?

    Seems big organisations will be able to run their databases more efficiently / quickly / cheaply

    Will there be an element of to them that hath, shall be added to, etc? I.e. smaller firms and those who don't take it up lose out?

    More outsourcing of data centre services?

    And Big Brother be he helpful or otherwise will be more capable?
  • RE: Sun's ZFS/Flash initiative

    You need to understand how ZFS works: it uses COW (copy-on-write) technology. So you can create a snapshot in a matter of seconds using normal drives - it literally just stores a few K of data that is the current state of the filesystem, then the filesystem just stores changes to the data from that point on until the snapshot is removed/destroyed.

    With flash, the few seconds becomes nearly instantaneous snapshotting, because it can be written at the speed of flash and isn't slowed by random writes. Once it is committed to flash it's done; in a few seconds it will be committed to standard disk drives. No need for a large flash drive - I haven't done much testing, but a fast 8GB flash drive should be enough to store all but the largest working sets.
  • RE: Sun's ZFS/Flash initiative

    You can get flash SSDs in the range of 32GB to 128GB that's enterprise calibre.
  • RE: Correcting spelling

    1) I corrected the spelling on Adam Leventhal's name just now; and,

    2) I need to point out that adding flash to a ZFS storage hierarchy does not make the reconstruction of individual disks go faster.

    Note, however:

    2.1 - by itself ZFS does generally make reconstruction faster - it depends on how much of the disk is used; and,

    2.2 the combination of ZFS/flash does largely eliminate the likelihood that you'll have to tell users to wait until a disk reconstruction completes - especially if you mirror on JBODs.
  • Question

    How well does ZFS/NFS play together? Can it be cached (a la cacheFS)?
    Roger Ramjet
    • It depends


      For some discussion on one subset of issues.

      However, in general I doubt that using a separate cache with ZFS makes much sense. I know you like to NFS mount things as a way of enforcing standardization on workstations. I have not tried this, so you'll have to experiment a bit, but my guess is that the right answer there is to get something like a Thumper as an NFS server. That gives you ZFS with lots of smart cache (up to 128GB!) to share without the additional software (or cache) on the client or anywhere else.