How one business recovered from a RAID failure

The backups were out of date when both drives in a mirrored RAID physically failed, and the configuration information for another RAID 5 set was completely lost. Yet, this business bounced back overnight.
Written by Michael Lee, Contributor

Businesses sometimes store their data in a redundant array of inexpensive/independent disks arrangement, or RAID, but when one or more drives fail, or the configuration for the RAID is lost, businesses can be put at risk.

There are several ways that a RAID can be configured. One of the most basic ways is in a RAID 1 arrangement, in which two drives are mirrored such that if one drive suffers a failure, the other drive can be used to restore the data.

Another arrangement is RAID 5, which uses at least three drives. In this setup, data is striped across the drives, and parity information calculated from the stored data is used to verify content, as well as to provide a way to restore data. The parity data is distributed over all drives, so that if a single drive is lost, the missing information can be recovered, but at the cost of reduced storage space.
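To make the parity idea concrete, here is a minimal Python sketch of how a lost block can be rebuilt from the surviving block and the parity block. It assumes the XOR parity that RAID 5 implementations conventionally use; the block contents and the helper name `xor_blocks` are made up for illustration and do not come from the resort's systems.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Two hypothetical data blocks, each living on a different drive.
data_a = b"GUEST001"
data_b = b"ROOM 204"

# The parity block, stored on a third drive, is the XOR of the data blocks.
parity = xor_blocks([data_a, data_b])

# If the drive holding data_b is lost, XOR-ing what survives brings it back.
recovered_b = xor_blocks([data_a, parity])
print(recovered_b)
```

The storage cost mentioned above follows from the same arithmetic: one drive's worth of space in every stripe holds parity rather than data, so a three-drive array offers the usable capacity of two drives.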

In Uprising Beach Resort's case, it had followed fairly good practices. Its data was stored in a three-drive RAID 5 arrangement, and the operating system (and thus the RAID 5 configuration) was placed in a separate RAID 1 arrangement, for a total of five drives. Additionally, a separate system was used to back up critical information.

However, disaster struck the resort when both drives in the RAID 1 configuration physically failed simultaneously. Regardless of the state of the RAID 5 drives, it was no longer possible to access the data on them, because the configuration was lost.

The resort called upon Datec Fiji to help bring the operating system on the downed RAID 1 configuration back to life, just to gain access to the RAID 5 data cluster. When that proved impossible, Datec referred the matter to Kroll Ontrack for a different approach: rebuilding the RAID 5 cluster without the RAID 5 configuration information. During this time, the resort was unable to use its IT systems to track the billing of items to guests, handle check-ins and check-outs, or run other essential processes, making the downtime increasingly damaging and placing more stress on staff.

Additionally, although the resort had been backing up its data, a failure to test backups meant no one had noticed that for the past month, no such backups had taken place. It would have been possible for the resort to manually enter information, but the process was estimated to take weeks.

With the resort's business under threat, and couriers simply too slow for the business to wait, one of its staff members took the drives from the RAID 5 arrangement and jumped on a flight to Brisbane, where they were imaged in Kroll Ontrack's recovery lab, a recently opened specialist clean room facility.

Kroll Ontrack's Brisbane facility.
Image: Kroll Ontrack

When the drives were imaged, technicians found that one of them had several bad sectors, further complicating attempts to solve the issue, but within two hours, the recovery company was able to confirm that it would be possible to retrieve the information using a combination of intact data and parity data.

With the day ending, Kroll Ontrack's US and European teams took over the recovery process. They worked overnight on the images of the drives already taken by the Brisbane team, and so no longer needed direct access to the physical hardware. By taking into consideration the order of the drives installed in the RAID 5, the recovery teams were able to calculate where the striping data on the hard drives should have been.
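The following Python sketch shows why drive order matters when an array has to be reassembled without its configuration. It is not Kroll Ontrack's method: it assumes one common parity rotation (often called left-symmetric), a tiny chunk size, and hypothetical helper names, none of which come from the article. Real recovery work also has to determine the chunk size and layout, which this sketch takes as known.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def parity_drive(stripe, n_drives):
    """Assumed rotation: parity starts on the last drive and moves left each stripe."""
    return (n_drives - 1 - stripe) % n_drives

def build_array(data, n_drives, chunk):
    """Spread data over n_drives with rotating parity; returns one image per drive."""
    per_stripe = chunk * (n_drives - 1)
    padded_len = -(-len(data) // per_stripe) * per_stripe  # round up to whole stripes
    data = data.ljust(padded_len, b"\0")
    drives = [bytearray() for _ in range(n_drives)]
    for stripe, start in enumerate(range(0, len(data), per_stripe)):
        chunks = [data[start + i * chunk:start + (i + 1) * chunk]
                  for i in range(n_drives - 1)]
        p = parity_drive(stripe, n_drives)
        for i, c in enumerate(chunks):
            # Data chunks begin on the drive after the parity drive and wrap around.
            drives[(p + 1 + i) % n_drives] += c
        drives[p] += xor_blocks(chunks)
    return [bytes(d) for d in drives]

def read_array(images, chunk):
    """Reassemble the data, trusting that images are listed in their original order."""
    n = len(images)
    out = bytearray()
    for stripe in range(len(images[0]) // chunk):
        p = parity_drive(stripe, n)
        for i in range(n - 1):
            image = images[(p + 1 + i) % n]
            out += image[stripe * chunk:(stripe + 1) * chunk]
    return bytes(out)

original = b"Room 204: minibar, spa, dinner"
images = build_array(original, n_drives=3, chunk=4)

in_order = read_array(images, chunk=4).rstrip(b"\0")
wrong_order = read_array([images[1], images[0], images[2]], chunk=4).rstrip(b"\0")
print(in_order)
print(wrong_order)
```

With the images in the right order the text comes back intact; with two images swapped, the same bytes are read from the wrong places and the result is scrambled. Knowing the drive order lets that guesswork be skipped.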

Critical data was then uploaded via FTP, and because the recovery company had confirmed early on that the data would be retrievable, the resort was able to prepare the right environment to begin running its IT systems again.

While the recovery ended up being rather straightforward for the resort once Kroll Ontrack pieced the RAID 5 arrangement back together, senior data recovery lab engineer Tim Black told ZDNet that it isn't always that simple.

He said that when a single drive fails, the increased amount of reads and writes creates additional workload on the remaining drives.

"Unfortunately, while you have one drive down, you're actually significantly increasing the likelihood that you're going to have a secondary hard drive failure."

It happens often enough that Black described a dual-disk failure in a RAID 5 arrangement as the typical RAID-related scenario that Kroll Ontrack sees.

Fortunately, all is not always lost in such a situation, but the possibility of achieving a full recovery often depends on what the customer has done. Black said that a combination of information from the failed drives and remaining healthy drives can be used to piece together a full picture, but the chances are higher if the drives are still "fresh".

This is because once a drive fails and is no longer part of the RAID, its parity data becomes more and more out of sync with the other drives.
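A small Python sketch can illustrate the out-of-sync effect. It assumes XOR parity on a single stripe of a three-drive array, and the block contents and variable names are invented for the example.

```python
def xor(a, b):
    """XOR two equal-length byte blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

# One stripe: two data blocks plus their parity.
drive_a = b"BILL=100"
drive_b = b"ROOM=204"
parity = xor(drive_a, drive_b)

# Drive A drops out of the array; its platters keep the old contents.
stale_a = drive_a

# The array keeps running degraded. A later write changes block A, but with
# drive A gone only the parity block can record the change.
new_a = b"BILL=175"
parity = xor(new_a, drive_b)

# Later, drive B fails too. Rebuilding B needs block A plus parity.
rebuilt_with_stale = xor(stale_a, parity)   # uses the out-of-date drive
rebuilt_if_current = xor(new_a, parity)     # what an in-sync drive would give

print(rebuilt_with_stale)
print(rebuilt_if_current)
```

Only the in-sync copy of block A reproduces block B correctly. The stale copy yields bytes that were never written to the array, which is the kind of file corruption that can result when an old drive has to be incorporated.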

"If a drive fails and then one week later, a secondary drive fails, bringing the whole RAID down; that first failed drive, the data on it is one week out of date or out of sync with the remainder of the drives. If we do have to incorporate that drive, it can lead to file corruption."

As a result, Black suggested that if a company finds itself in a situation where one drive has failed, it may be better to immediately back up critical data before proceeding with rebuilding the array, so as to reduce the stress on the remaining drives.

"Backing up a small amount of data is going to be less intensive on the remaining drives than the rebuild process is, so it's less likely that they'll have a failure during that process. Safety first would be to copy off any critical databases prior to restarting a rebuild."

If, however, a second drive fails during the rebuild, Black said that one of the worst things a customer might do is try to press ahead.

"It definitely makes this worse, and in some cases unrecoverable, or poorly recoverable with lots of corruption."

While Black's comments are good advice for any RAID that has a drive failure, as ZDNet's Robin Harris noted, no enterprise storage vendor recommends RAID 5 anymore. Harris opined in 2007 that by 2009, RAID 5 would be obsolete, because larger drives increase the probability of a read error during a rebuild, and with it the chance that an array cannot be rebuilt.
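The reasoning behind that prediction can be sketched with a rough calculation in Python. The unrecoverable read error rate of one error per 10^14 bits is an assumption (a figure commonly quoted for consumer drives, not one given in this article), as are the drive sizes, the function name, and the simplification that every bit is read independently.

```python
import math

def rebuild_read_error_probability(surviving_drives, drive_tb, ure_per_bit=1e-14):
    """Chance of at least one unrecoverable read error while reading every
    surviving drive in full during a rebuild (assumed independent bit reads)."""
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

# A three-drive RAID 5 with one failed drive must read the two survivors in full.
for size_tb in (0.25, 1, 4):
    p = rebuild_read_error_probability(surviving_drives=2, drive_tb=size_tb)
    print(f"{size_tb} TB drives: {p:.1%} chance of a read error during rebuild")
```

Under these assumptions, two surviving 1 TB drives give roughly a 15 percent chance of hitting a read error during the rebuild, and the figure climbs quickly as drives grow.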

The solution that enterprise storage vendors have adopted is a RAID 6 arrangement, which uses double distributed parity to allow for the failure of two drives at once. Moore's law will eventually catch up, however, and Harris believes that RAID 6's usefulness will become debatable by 2019.
