How one business recovered from a RAID failure

Summary: The backups were out of date when both drives in a mirrored RAID physically failed, and the configuration information for another RAID 5 set was completely lost. Yet, this business bounced back overnight.

Businesses sometimes store their data in a redundant array of inexpensive (or independent) disks, known as RAID, but when one or more drives fail, or the configuration for the array is lost, the business can be put at risk.

There are several ways that a RAID can be configured. One of the most basic ways is in a RAID 1 arrangement, in which two drives are mirrored such that if one drive suffers a failure, the other drive can be used to restore the data.

Another arrangement is RAID 5, which uses at least three drives. In this setup, parity data calculated from the stored information is used both to verify content and to provide a way to restore it. The data and parity are striped across all of the drives, such that the loss of any single drive means the missing information can be rebuilt from the rest, at the cost of some usable storage space.
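
The arithmetic behind this is simple: the parity block in each stripe is typically a bitwise XOR of the data blocks in that stripe. The following minimal Python sketch, which ignores stripe rotation and is not tied to any particular controller, shows how a lost block can be rebuilt from the surviving block and the parity:

# Simplified illustration of RAID 5 parity in a three-drive set: each stripe
# holds two data blocks plus one parity block computed as their XOR.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data_drive_0 = b"guest billing data"
data_drive_1 = b"reservation ledger"   # both blocks are the same length here
parity_drive = xor_blocks(data_drive_0, data_drive_1)

# If the second drive fails, its contents can be rebuilt from the surviving
# data block and the parity block.
rebuilt = xor_blocks(data_drive_0, parity_drive)
assert rebuilt == data_drive_1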

In Uprising Beach Resort's case, it had followed fairly good practices. Its data was stored in a three-drive RAID 5 arrangement, and the operating system (and thus RAID 5 configuration) placed in a separate RAID 1 arrangement — a total of five drives. Additionally, a separate system was used to back up critical information.

However, disaster struck the resort when both drives in the RAID 1 configuration physically failed simultaneously, meaning that, regardless of the state of the RAID 5 drives, it was no longer possible to access the data on them, because the configuration was lost.

The resort called upon Datec Fiji to assist in bringing the operating system on the downed RAID 1 configuration back to life, purely to regain access to the RAID 5 data cluster. After being unable to do so, Datec referred the matter to Kroll Ontrack for a different approach: rebuilding the RAID 5 cluster without the RAID 5 configuration information. During this time, the resort was unable to use its IT systems to track the billing of items to guests, check-ins and check-outs, and other essential processes, making the downtime increasingly damaging and placing growing stress on staff.

Additionally, although the resort had been backing up its data, a failure to test backups meant no one had noticed that for the past month, no such backups had taken place. It would have been possible for the resort to manually enter information, but the process was estimated to take weeks.

With the resort's business under threat, and couriers simply too slow for the business to wait, one of its staff members took the drives from the RAID 5 arrangement and jumped on a flight to Brisbane, where they were imaged in Kroll Ontrack's recovery lab, a recently opened specialist clean-room facility.

Kroll Ontrack's Brisbane facility.
(Image: Kroll Ontrack)

When the drives were imaged, technicians found that one of them had several bad sectors, further complicating attempts to solve the issue, but within two hours, the recovery company was able to confirm that it would be possible to retrieve the information using a combination of intact data and parity data.

With the day ending, Kroll Ontrack's US and European teams took over the recovery process, working overnight on the images of the drives already taken by the Brisbane team, no longer needing direct access to the physical hardware. By taking into consideration the order in which the drives were installed in the RAID 5, the recovery teams were able to calculate where the data and parity stripes on each hard drive should have been.
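
Calculating where the stripes "should have been" means, in broad terms, inferring the array's geometry: the drive order, the stripe size, and the pattern by which the parity block rotates between drives. The sketch below is a heavily simplified illustration of that idea in Python, with an assumed 64KB stripe unit and one common rotation pattern; it is not Kroll Ontrack's actual tooling.

# Reassemble a RAID 5 volume from raw drive images once the geometry has been
# guessed. Stripe size, drive count and rotation pattern are assumptions made
# for illustration, not the resort's real parameters.
STRIPE_SIZE = 64 * 1024   # assumed stripe unit in bytes
NUM_DRIVES = 3

def parity_drive_for(stripe_row: int) -> int:
    # One common rotation: parity starts on the last drive and walks backwards
    # one position per stripe row.
    return (NUM_DRIVES - 1 - stripe_row) % NUM_DRIVES

def reassemble(images: list[bytes]) -> bytes:
    # Concatenate the data stripes in logical order, skipping each row's parity
    # block. A bad sector in a data block could instead be rebuilt by XORing
    # the other drives' blocks in that row, as in the earlier sketch.
    rows = len(images[0]) // STRIPE_SIZE
    volume = bytearray()
    for row in range(rows):
        parity = parity_drive_for(row)
        for drive in range(NUM_DRIVES):      # drive order matters here
            if drive == parity:
                continue
            start = row * STRIPE_SIZE
            volume += images[drive][start:start + STRIPE_SIZE]
    return bytes(volume)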

Critical data was later uploaded via FTP, and, thanks to the earlier confirmation that the data could be recovered, the resort was able to prepare the right environment to begin running its IT systems again.

While the recovery ended up being rather straightforward for the resort once Kroll Ontrack pieced the RAID 5 arrangement back together, senior data recovery lab engineer Tim Black told ZDNet that it isn't always that simple.

He said that when a single drive fails, the increased number of reads and writes places additional workload on the remaining drives.

"Unfortunately, while you have one drive down, you're actually significantly increasing the likelihood that you're going to have a secondary hard drive failure."

It happens commonly enough that, according to Black, a dual-disk failure of a RAID 5 arrangement is the typical RAID-related scenario that Kroll Ontrack sees.

Fortunately, all is not always lost in such a situation, but the possibility of achieving a full recovery often depends on what the customer has done. Black said that a combination of information from the failed drives and remaining healthy drives can be used to piece together a full picture, but the chances are higher if the drives are still "fresh".

This is because once a drive fails and is no longer part of the RAID, its parity data becomes more and more out of sync with the other drives.

"If a drive fails and then one week later, a secondary drive fails, bringing the whole RAID down; that first failed drive, the data on it is one week out of date or out of sync with the remainder of the drives. If we do have to incorporate that drive, it can lead to file corruption."

As a result, Black suggested that if a company finds itself in a situation where one drive has failed, it may be better to immediately back up critical data before proceeding with rebuilding the array, so as to reduce the stress on the remaining drives.

"Backing up a small amount of data is going to be less intensive on the remaining drives than the rebuild process is, so it's less likely that they'll have a failure during that process. Safety first would be to copy off any critical databases prior to restarting a rebuild."

If, however, a second drive fails during the rebuild, Black said that one of the worst things a customer might do is try to press ahead.

"It definitely makes this worse, and in some cases unrecoverable, or poorly recoverable with lots of corruption."

While Black's comments are good advice for any RAID that has suffered a drive failure, as ZDNet's Robin Harris noted, no enterprise storage vendor recommends RAID 5 anymore. Harris opined in 2007 that RAID 5 would be obsolete by 2009, because larger drives increase the probability of hitting a read error during a rebuild, to the point where an array may not be able to be rebuilt at all.
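
Harris's argument comes down to a back-of-the-envelope calculation. Assuming the commonly quoted consumer-drive specification of one unrecoverable read error per 10^14 bits read (actual rates vary by model, and enterprise drives are usually rated better), the chance that a rebuild of a multi-terabyte array completes without hitting such an error shrinks quickly as capacities grow:

# Rough estimate of hitting an unrecoverable read error (URE) while rebuilding
# a degraded RAID 5 array. The error rate below is the commonly quoted
# consumer-drive figure and is an assumption, not a measured value.
URE_RATE = 1e-14  # probability of an unrecoverable error per bit read

def rebuild_failure_probability(data_to_read_tb: float) -> float:
    bits_read = data_to_read_tb * 1e12 * 8      # decimal terabytes to bits
    return 1 - (1 - URE_RATE) ** bits_read      # P(at least one bit fails)

# Rebuilding after one failure in a four-drive array of 2TB disks means
# reading roughly 6TB from the survivors.
print(f"{rebuild_failure_probability(6):.0%}")  # roughly 38% under these assumptions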

The solution that enterprise storage vendors have adopted is RAID 6, which uses double distributed parity to tolerate the failure of two drives at once. Growing drive capacities will eventually catch up with it too, however, and Harris believes that RAID 6's usefulness will become debatable by 2019.

Topics: Data Management, Hardware, Disaster Recovery

Michael Lee

About Michael Lee

A Sydney, Australia-based journalist, Michael Lee covers a gamut of news in the technology space including information security, state Government initiatives, and local startups.

Talkback

10 comments
  • RAID 10 should be the only one used

    I have many years of experience, and my conclusion is that anybody who chooses RAID 5 deserves a disk failure and loss of data.

    For servers: RAID 10.

    For clients: RAID 1.

    RAID 0 should be made illegal.
    anonymous
    • Yeah

      I second that
      daves@...
    • raid0 is for performance not redundancy

      Raid0 striping is purely for performance.
      You can combine striping with mirroring.
      warboat
  • RAID0

    previous to SSD drives, RAID0 had its uses

    for example, I had twin 37GB WD Raptor drives in RAID0 for my OS and programs, which gave me good read speeds and enough capacity.
    Mytheroo
    • raid0 with SSDs

      Have a look at something like the kingspec 8 core SSD, which uses 8 discrete SSDs in raid0 on the pcie bus to get better-than-sata performance by striping msata drives.
      warboat
  • This kind of problem is no fun

    A few years ago I had to deal with a dead RAID system. It wouldn't have happened though if someone knew what the lights meant. Months before the crash I noticed a flashing light and told someone at my main office, but they said it's supposed to do that. The server was working fine since the other drives were okay. Then another drive died. Too bad the RAID system wasn't able to email the people in charge of the servers and tell them a drive died. Maybe that's possible now.
    BrianC6234
  • Root issue was not raid

    There are several failures noted in this case, the biggest being the complete lack of focus on common-sense IT practices. Their biggest problem causing the loss of data was not the fact that they used a raid 5 array, but that they used the OS to set it up and manage it. The article is dead wrong where it says they were following 'good' practices. Unless you're in a lab or just plain don't care about the data, you should never use the OS to configure and control the raid settings. Any business doing this is guaranteed to have a failure. Sounds like they spent more money recovering than they would have if they had just hired and properly funded a real IT department.
    rod@...
  • something's amiss

    What RAID controller was used to create that RAID5? (Most RAID controllers and software RAID systems are able to migrate / import existing RAID sets w/o having to do anything special.)
    Alex Gerulaitis
  • Brand of drive

    You can bet the drives that failed were branded Seagate ;-)
    NewZed
  • Agree About Pressing Ahead

    I think the worst thing anyone can do is press ahead on their own with any RAID recovery case. Here is a checklist of reasons not to try it on your own. http://www.sertdatarecovery.com/raid-recovery/raid-5-data-recovery Of course it usually doesn't matter and they never understand until it's too late.
    BillyMarshall