The end of RAID

Summary: Low-latency storage, fast multi-core CPUs, high-bandwidth interconnects and larger disk capacities are ending the reign of costly RAID controllers in favor of more elegant data protection. A report from the front lines of storage innovation.

A report from the front lines of storage innovation at the OpenStorage Summit.

Storage is at an inflection point. Low-latency mass storage (flash and DRAM), faster interconnects (including PCIe 3.0) and multi-core CPUs have broken pieces of the old storage stack. That means new opportunities for a fundamental rethinking of storage architectures.

What has broken? At a high level, today's storage stack latency is too high. When 4 ms was fast, who cared about another millisecond of latency? But with sub-millisecond flash and NVDRAM storage, that is no longer acceptable.

At a lower level, the storage stack architecture is broken. Kernel-level locking and context switching that were "fast" compared to disks are too slow today.

Just as CPUs are going multi-core to improve concurrency, so must storage. Instead of a single kernel-level lock, we need application-level locking that maintains multiple lock queues.
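
Here is a minimal sketch of that idea (an illustration only; the names and queue count are assumptions, and a real implementation would live in the I/O stack rather than application code). The point is simply that each core submits to its own queue under its own lock, so requests rarely contend:

```python
import threading

# Illustrative only: one lock per I/O submission queue instead of a single
# global lock, so submissions from different cores rarely contend.
NUM_QUEUES = 8

class ShardedSubmissionQueues:
    def __init__(self, num_queues=NUM_QUEUES):
        self.locks = [threading.Lock() for _ in range(num_queues)]
        self.queues = [[] for _ in range(num_queues)]

    def submit(self, core_id, request):
        q = core_id % len(self.queues)   # each core maps to its own queue
        with self.locks[q]:              # contention is limited to one queue
            self.queues[q].append(request)

qs = ShardedSubmissionQueues()
qs.submit(core_id=3, request="write block 42")
```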

And this affects RAID how? Back when RAID was taking off, 1 GB disks were the rule and rebuilds didn't take very long. But now SATA drives are 1,000x, 2,000x and even 3,000x larger.

As disks get larger, the time it takes to rebuild a failed disk gets longer too: many hours at a minimum, often a day or more, and sometimes a week or more.
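
The back-of-the-envelope arithmetic is simple. Assuming a rebuild can sustain 100 MB/s with no competing workload (an optimistic assumption; loaded arrays do far worse), the floor on rebuild time scales directly with capacity:

```python
# Lower bound on rebuild time at an assumed 100 MB/s sustained rate.
REBUILD_MB_PER_SEC = 100

for size_tb in (1, 2, 3):
    seconds = size_tb * 1_000_000 / REBUILD_MB_PER_SEC  # decimal TB to MB
    print(f"{size_tb} TB drive: ~{seconds / 3600:.1f} hours minimum")
# 1 TB: ~2.8 h, 2 TB: ~5.6 h, 3 TB: ~8.3 h, before parity reads or foreground I/O
```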

During rebuilds the system slows because every disk is seeking while the controller reconstructs and writes the lost data. That hurts application availability.

RAID was designed to solve 2 problems: data availability using cheap, unreliable disks; and improved performance with slow, also cheap, disks. There was no flash, DRAM cost $25/MB and cheap SCSI drives cost a fraction of what an enterprise 9-inch drive cost.

So we need to simplify. We can use flash or NVDIMMs to handle metadata requests quickly. If we can afford it, we can even move hot files to non-volatile storage on very low latency PCIe buses.
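
A toy sketch of that placement policy, with hypothetical tier names and an arbitrary threshold (real systems decide placement from much richer heat statistics):

```python
# Toy tiering policy: metadata and hot data go to low-latency non-volatile
# storage, everything else to capacity disks. Threshold is illustrative only.
HOT_ACCESSES_PER_DAY = 100

def choose_tier(is_metadata: bool, accesses_per_day: int) -> str:
    if is_metadata:
        return "nvdimm"          # metadata always served from the fastest tier
    if accesses_per_day >= HOT_ACCESSES_PER_DAY:
        return "pcie_flash"      # hot files live on PCIe flash
    return "sata_disk"           # cold data stays on big, cheap disks

print(choose_tier(is_metadata=False, accesses_per_day=500))  # pcie_flash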

Which means that disks are storing our less active data. For what a RAID controller costs we can buy several terabytes of capacity and store 2 or more copies of everything. When a disk fails, making a new copy is much faster than a rebuild.
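
A minimal sketch of that copy-based protection, assuming a naive modulo placement of two replicas (illustrative only; production systems use smarter placement and persistent metadata). Recovering from a failed disk is just re-copying the surviving replicas, a sequential read and write rather than a parity reconstruction across the whole array:

```python
# Keep two full copies of every object on different disks; recovery is a copy,
# not a parity rebuild. Placement is a naive modulo scheme for illustration.
NUM_DISKS = 6

def place(obj_id: str, copies: int = 2) -> list[int]:
    first = hash(obj_id) % NUM_DISKS
    return [(first + i) % NUM_DISKS for i in range(copies)]  # distinct disks

def objects_to_recopy(failed_disk: int, placements: dict) -> list[str]:
    # Every object with a replica on the failed disk gets its surviving copy
    # read once and written once to a spare disk.
    return [obj for obj, disks in placements.items() if failed_disk in disks]

placements = {f"obj{i}": place(f"obj{i}") for i in range(10)}
print(objects_to_recopy(0, placements))
```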

For every pipe a queue

Intel's Sandy Bridge I/O architecture promises 40 1 GB/sec PCIe v.3 lanes per CPU. We can give every app its own I/O queue and fast PCIe storage if we can afford it.

Imagine a 64-core chip running 45 apps with forty 64 GB PCIe storage cards. That's mainframe/supercomputer power in a commodity box.

The Storage Bits take

Just as RAID was a smart play on 90's technology, we'll see new storage architectures taking advantage of today's options. While RAID won't disappear overnight, its days are numbered.

Storing data is cheap and getting cheaper. Moving data is costly in time and lost performance. New architectures will reduce movement by consuming capacity.

Latency reduction is a deeper problem. Replacing several decades of plumbing isn't simple, but it must be done. More on that later.

Comments welcome, of course. I owe a debt of gratitude to Richard Elling and Bill Moore, formerly of Sun, and their presentations at the Summit.

Topics: Storage, Hardware, Web development

Talkback

29 comments
  • details

    The main point of your article seems to be, "For what a RAID controller costs we can buy several terabytes of capacity and store 2 or more copies of everything. When a disk fails a copy is much faster than a rebuild."

    Can you explain how this is implemented? Take a real example: I have a 48-bay server with 2TB drives. Without a RAID controller, how do I physically connect the drives? The motherboard is about 44 SATA ports short.

    Are you suggesting that I buy additional hard drives to store the duplicate data? If so, what do I put the drives in? Do I buy additional hardware? That costs money, and rack space is at a premium. My server is already fully populated. The server has 2 Areca RAID controllers. They cost around $1200, which will only get me eight 2TB drives, not enough to store 2 copies of my data.

    What solutions are available that allow me to seamlessly store 2 copies of every file? Or do I need a clustered solution? This seems like an immature field at the moment: several attempted solutions, most of which (in my price range) are not ready for production or are from small companies that may disappear or get bought at any moment. Are there any that are open source AND ready for production use?

    A RAID rebuild on my server takes a very long 2.5 days, so I'm honestly curious how this would be implemented.
    bjornborg
    • Uh.. You DO realize RH was talking about the future... Right?

      @bjornborg
      Seriously.. I don't think we've quite gotten to the point where we've got 64 core CPUs on the market quite yet:

      [b]"Imagine a 64-core chip running 45 apps with 40 64 GB PCIe storage cards. That?s mainframe/supercomputer power in a commodity box." [/b]

      In other words, he's just dreaming up this technology, and some or all of it may yet come to pass, eventually. Maybe 10 years from now, maybe 20. Hard to say. Intel and AMD's road maps don't quite go out that far.
      Wolfie2K3
  • The answer of course is next gen file systems ...

    File systems like ZFS and btrfs eliminate the need for RAID by emulating RAID on a file-by-file basis. This provides all the functionality of RAID while eliminating the overhead of expensive hardware and tedious re-syncs. Additionally, they continuously protect their data using checksums, background bad-block handling and file system audits. And btrfs is fairly patent-safe as it is based on the longstanding ReiserFS and simply implements features that Hans Reiser envisioned and documented long ago. ZFS, on the other hand, has an Achilles heel as it emulates already-patented methods to achieve its next-gen capabilities. But these two emerging technologies will kill RAID over time.
    George Mitchell
    • RE: The end of RAID

      @George Mitchell There's still the thorny problem of *how* you connect those physical drives to a computer. Many of us use hardware RAID controllers mainly because there simply aren't enough SATA/SAS ports on a motherboard to connect the hard drives to.

      Thus, RAID controllers aren't only about tedious algorithms and complex redundant setups... they're simply there to connect 8 or 12 or even more hard drives to a single computer.

      Unless motherboards start coming with 16+ ports for storage, I don't see RAID controllers going away so soon, even if you use ZFS/btrfs.
      Alexstrasza
      • RE: The end of RAID

        @Alexstrasza You could just use the RAID controllers in JBOD mode. If Harris is right, a lot of companies would just make multi-port controllers without the RAID software and hardware costs (extra memory, faster onboard processors, dedicated logic, etc.).

        If Sandy Bridge has "40 1GB/sec PCIe v.3 lanes per CPU", you could probably do away with the controller altogether with some extra connection hardware on the motherboard. All those lanes have to leave the CPU if nothing else for other parts of the motherboard, so it wouldn't take very sophisticated dedicated logic (by today's standards) to connect them to external storage.
        wilback
  • RE: The end of RAID

    Good.
    james347
  • Keeping copies (aka mirroring) is RAID.

    If you make copies, you have to keep the copies synced. That is what RAID does for you.

    You describe a world where you have volatile memory, non-volatile chips and high capacity drives, the latter used for less active data. But what will determine which data is highly active and which is less active? You cannot rely on the developer/software there. That would require overly complicated software, error-prone and costly to implement.

    You cannot rely on the operating system either; this would make a system too complex, and too dependent on the actual physical configuration. What you will have to do is rely on a controller, exactly what you wanted to remove from the equation.
    s_souche
    • RE: The end of RAID

      @s_souche That's exactly what ZFS does. In fact, throughout its history the whole attitude of Sun was that all storage should be controlled by the central CPUs. I believe even their dedicated storage boxes were controlled by full-blown Solaris software.
      wilback
  • RAID is still here. It's just

    Server virtualization has changed the way we use storage arrays, but it hasn't killed them. Here is the scenario: all of our servers (regardless of what OS they are running) are virtual machines, using the servers for CPU and memory but a Fibre Channel SAN for storage. The SAN itself is broken up into multiple RAID 5 arrays and there is lots of excess disk capacity. If we were to have a disk failure, the VMs would still be running and we would just transfer them to another RAID 5 array within the SAN. The latest versions of the VMware products allow you to do this with the VMs running and zero downtime. When the new disk arrives we would not bother rebuilding the problematic array. We would just delete it, plug in the new drive, then re-create the array. There is no need to do a rebuild as long as there is enough excess capacity on other arrays to hold all the VMs.
    cornpie
    • RE: The end of RAID

      @cornpie That seems like a lot of expensive excess capacity to me. I'm not sure what you mean by deleting the array, and then re-creating it. Even if you copy whole stripes at a time to avoid the RAID5 read-modify-write cycle, it's still a lot of data being shuffled about between RAID5 arrays, instead of just internally to a single one when a disk gets rebuilt.
      wilback
  • Maybe/Maybe Not

    @bjornborg (nice name by the way!). I agree, the author should take a realistic example. Price, say, a Dell server (or another vendor's hardware if preferred) with a Dell 3200 SATA storage array. Then price the same system with some other, optional storage method as he describes and compare the actual costs. Sometimes thought exercises are great, but they may or may not pan out. As costs of newer technology come down, the proposed solution may eventually become realistic, but I am not sure I am sold at this point. Just as everyone said SANs were the only way to go (but if you look at the cost per gigabyte vs. the performance bottlenecks that can arise, they may not be so compelling). One size/solution does not always fit all.
    jkohut
  • Not a well thought out idea

    As others have pointed out, perhaps not as clearly, Robin is proposing a solution that simply shifts the complexity from one place to another; it doesn't eliminate it. If anything, Robin makes a case for an enhanced RAID 1.
    croberts
  • RE: The end of RAID

    Hmmmm, so we are talking about speed and the answer is to do a full copy of all data to a second drive? In comparison, RAID 5 uses one eighth of the capacity to accomplish the same thing. Whether software or hardware, RAID is RAID: you are comparing writing at least two bytes for every one byte stored against writing one bit for every byte stored. Not only are you getting more storage capacity for every disk you buy, but you drastically increase speed and redundancy by combining RAID levels, like striping two RAID 5 arrays.
    So if you are talking speed I see no comparison, if you are talking redundancy you get the same, and if you are talking capacity there is no comparison. Score 2 for RAID, with one tie: I say RAID still wins.
    Another issue you failed to touch on is the uncertainty of flash. Sure, there are compensating features built into flash to handle its extremely high error rate in storing data, but wouldn't it be nice to add a little extra to that by having a parity bit/drive to verify the flash is correct? I say yes, and the best way to do that is to RAID flash drives too; at a minimum you double the speed of flash, add redundancy and save on capacity.
    blittrell
    • RE: The end of RAID

      @blittrell Did you account for the recovery/rebuild time of a failed drive in your pros/cons? I think that was the main point of why RAID isn't as good as it was before, especially since larger drives are more likely to have failures than smaller drives. With the potential for each drive to be 1TB plus, the likelihood of more than one drive having a failure is at a percentage that makes the premise of RAID not entirely attractive. The promise of RAID5 is that you can still work while rebuilding a single failed drive. That's great, as long as there is no possibility that the rebuild will fail due to yet another drive failing during the rebuild time, which could be several hours or days. "Yeah, but that won't happen." OK, I thought so too, but consider that large drives have roughly the same rate of bit failure as small drives. Note: I'm not saying catastrophic failure. I'm saying that the bits lost per GB is roughly the same for large drives as for small drives. That could theoretically (big leap, but...) mean up to 1000x more bits lost in a 1TB drive versus a 1GB drive. How many bits does RAID need to lose in order to fail rebuilding a RAID5? And what if that 1000x more bits lost is also on the other drives that are the source for rebuilding the RAID?
      crythias
      • RE: The end of RAID

        @crythias The problem with your argument is that this is a fundamental problem no matter what technology you use. If you mirror data, whether byte, file or bit based, you run into the same situation: if the drive fails while re-syncing, all the data is lost. The promise of hardware RAID is that the controller offloads the work of rebuilding, and if it was built well it would prioritize new traffic over rebuild traffic. I am not saying this is actually done in new controllers, but it should be.

        Typically a rebuild for a 750GB drive takes about a day; 2 TB would probably take two and a half days. What is the alternative? Run data on single drives and just write to multiple places? You still have the issue of copying that data over again, and if the good drive fails you lose everything, as well as taking a significant performance hit while copying.
        As far as the bit failure in drives goes, you are right that there are bit failures there, but are they as high as they are in flash? Last I heard bit failures in flash were hovering around 20-30%. On-board controllers for drives compensate for the bit loss, which is a lot less than flash, as does flash memory, so I fail to see where the difference is.
        The other concept you forget, or did not think of, is collision: what happens if, as you say, one drive has a failed bit or a few failed bits, and you are writing to several drives? How do you determine which file is right? Are you now looking at human intervention to make the logical choice of which file is good? How about the users: if they open a file and it is corrupt, are they going to want to go to an alternate storage to open it, then manually copy it over to the other storage? How would the software used to copy to multiple locations determine which files are corrupt and which are not? Then what happens if a drive starts corrupting all data; will there be a sync procedure that corrupts the second drive? There are so many questions unanswered here.
        Is RAID perfect? No, but I think it is the best thing we have for now and the near future.
        blittrell
  • I agree

    I would rather have multiple drives store my complete data than use RAID... I always thought it was stupid not to do this. RAID with auto backup should be the preferred option in my opinion.

    First off, my problem with RAID is the vendors that produce RAID cards... They want it to seem like it's some kind of interstellar galaxy-calculating algorithm, and they make it in such a way that if the tiniest rock gets in the way of this calculation, everything is messed up. What's the tiny rock? A different RAID card, connecting drives to the wrong port by accident after a RAID card fails, etc. RAID cards should all use a standard, and the only difference between them should be speed and administration features. Software RAID cards pretty much helped make the problem bigger.

    I'm getting a little bit off topic here... lol... anyway, all RAID cards should support BACKUP. I'm not talking about RAID backup where if N number of drives fail blah blah blah... I'm talking about: if ALL RAID drives fail and the BACKUP drive is fine, the whole thing should still work! And it should be able to rebuild multiple drive failures off of the BACKUP drive.

    So the way I think a backup drive or N number of backup drives should work is...
    #1 It's not a RAID drive, the RAID controller just uses it
    #2 It's not vendor specific
    #3 I should be able to unplug this backup drive and plug it into another machine and it boots up perfectly fine (minus maybe some drivers the OS needs)
    #4 Backup drive should not be dependent on size as long as the backup can fit.
    #5 Multiple full backup drives
    #6 Incremental backup. One of the biggest problems with RAID is that all the drives are used roughly the same amount, so they are likely to fail around the same time. If the backup drive is scheduled to process "what has changed today" every night, then the life of that drive can be extended much longer than any of the other drives in the RAID system. Sure, for a database, losing 1 day of a production system is like deleting 1 day of business. However, for most other systems 1 day of content can be reproduced, or you're just set back 1 day of development or whatever it is.
    #7 The drive should be turned off when not in use to save power and protect against drive wear.
    #8 backup should work off of a "snapshot"
    #9 Restore from backup should be an option, and it builds the array from the backup drive. Basically Copy-2-raid.

    I currently think drives should be a mix of SSD and disk, built into the same drive (I think Seagate is making something like this now), so you should have 10 gigs of SSD or so for your buffer or cache, and then the 90% of your files you don't access daily should be on the disk, and the SSD should be backed up to disk regularly (as a background process of the drive itself).

    My preferred development setup would be something like RAID 0 striped across multiple drives for read speed. An SSD hooked up to the RAID card should be a buffer between the backup and the RAID, and the disk backup should be configurable as to when the "new" information is written to it, for instance hourly, daily or weekly. Of course, if you exceed the amount of cache in changes because you're doing massive changes on the system, it would write early. And it would also have a second backup drive that is weekly. If NTFS is used, it should also store any additional file versions in the backups to revert back to specific other files, etc.

    Maybe I should have written this in an article or something instead of a comment post... LOL Sorry I tried to keep it small

    And yes no one should ever mention that murdering tech guy again and his tech should be discreated as related to him, two wrongs sometimes make a right =\
    x21x
  • Ummmm... RAID 1 anyone?

    Seems the article is claiming "RAID is dead!... Long live RAID 1!"
    dave_helmut
    • RE: The end of RAID

      @dave_helmut

      Yea.. this was my exact thought when I read this. Robin is essentially advocating for software based RAID1. Big whoop.

      What Robin is talking about has been available for years. Take IBM's SystemP for example. The VIO LPARs allow you to carve out system time to service IO for other LPARs. Parity can be handled at the VIO level or deeper on an array or SAN level.

      Where the rubber hits the road in today's enterprise, storage admins leverage storage virtualization tools to keep the actual parity sets to 3+1 then aggregate them with virtualization into a big bucket of extents waiting to be used. Incredible performance, incredible uptime. Long live RAID!
      civikminded
  • RAID isn't going away - at all!

    It's a simple technology, easy to understand and implement. The author is missing the fact that SSDs will allow for a resurgence of RAID 5; the high rebuild times are not going to be a problem until we get into 8, 10 and 12 TB SSDs. As for slow and large SATA drives, higher-parity RAID already exists:
    http://blogs.sun.com/ahl/entry/triple_parity_raid_z
    http://queue.acm.org/detail.cfm?id=1670144

    More efficient technologies, like Reed-Solomon codes, do exist, but they are more difficult to implement, so it will take time for them to become popular.
    packetracer
    • RE: The end of RAID

      @alex.georgiev@... I think Harris was pointing out that even with SSDs, rebuild times are still going to be a multiple of SSD access times; that's the nature of RAID5 rebuilds. If you bought expensive SSDs for their access times, you don't want to slow them down by multiples of that time, even if it is a lot faster than with HDDs.

      Another factor is that RAID5 seriously affects your controller's memory bandwidth. The read-modify-write cycles that must all be done in the central memory of the controller mean that data has to travel many times over your controller's data paths before the operation is complete (for a generic write, you have to read from all the disks in a stripe into memory, XOR all that data together, then write out the new parity and new data). Compare that with a normal write where the data is put into the controller's memory once, and read out once. When your storage was slow HDDs the RAID5 memory bandwidth penalty was relatively small, but with SSDs and NVRAM it starts to become the major roadblock.
      wilback
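
As an aside on the read-modify-write cycle wilback describes above, here is a minimal sketch of RAID5 parity arithmetic over in-memory byte strings. It is an illustrative toy under simplifying assumptions, not any controller's actual data path:

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together (the core of RAID5 parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A 3+1 stripe: three data blocks plus one parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# Small write to d1: read old data and old parity, XOR the old data out and
# the new data in, then write both back. Every byte crosses the controller's
# memory several times, which is the bandwidth penalty described above.
new_d1 = b"bbbb"
parity = xor_blocks(parity, d1, new_d1)
d1 = new_d1

# Rebuilding a lost block is an XOR of all the survivors.
assert xor_blocks(d0, d2, parity) == d1
```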