Troubleshoot drive failures in seven steps

Knowing how to revive a failed hard drive is a key part of an IT pro's job, and having a reliable set of troubleshooting guidelines can increase the odds of a full recovery.
Written by Faithe Wempen, Contributor

When a hard disk fails and the computer doesn’t boot, a frenzy to save important company data ensues. If you face this problem, don’t panic. Just remember these simple hard drive troubleshooting tips.

Here’s a quick and proven hard disk troubleshooting process. With each point, ask yourself the question(s) that follow.

  • Physical connectivity--Is the drive receiving power? Is it plugged into the PC by a correctly connected ribbon cable? For an IDE drive, are its jumpers set correctly? With a SCSI drive, are its SCSI termination and ID set correctly?
  • BIOS setup--Does the BIOS see the drive?
  • Viruses--Does the drive contain any boot sector viruses that need to be removed before continuing?
  • Partitioning--Does FDISK find a valid partition on the drive? Is it active?
  • Formatting--Is the drive formatted using a file system that the OS can recognize?
  • Drive errors--Is a physical or logical drive error causing read/write problems on the drive?
  • Operating system--Does your OS have a feature that checks the status of each drive on your system? If so, what is that status?

1. Checking physical connectivity
To work properly, a hard drive needs power and a connection via a ribbon cable to the PC. If a drive doesn’t work after moving it to a new PC, after physically moving the PC, or after the cover has been taken off, start your troubleshooting by checking the physical connectivity. It’s possible for plugs to jiggle loose when moving a PC, and it’s easy to uproot a ribbon cable connection when pulling circuit boards or performing other maintenance tasks inside the case.

A hard disk works with any Molex connector from the PC’s power supply. Make sure the plug is fully inserted. Molex connectors require a lot of pressure to fully insert, and even more pressure to remove, so don’t be afraid to push hard or pull, as the case may be. Just make sure you handle the plastic connector, and do not try to push or pull the wires.

As the PC starts up, place the palm of your hand on the flat part of the hard disk. If you can detect any vibration, the drive probably has power. If there’s no movement at all, either the drive’s physical mechanism is shot or the Molex connector you have selected is faulty. Try using a different connector before assuming the drive has a problem.

Systems like the AT/LPX have a small connector that runs from the front of the case to the hard disk. On ATX systems, it runs from the motherboard to the hard disk. This enables the LED on the case to illuminate when the hard disk is in use. Don’t rely on that LED as a positive indicator as to whether the hard disk is receiving power. The light could be burned out, the wire disconnected, or the drive might be receiving power but not be connected correctly to the PC.

The other physical requirement for a drive is a connection to the PC itself. An IDE drive should be connected via a ribbon cable to the IDE interface on the motherboard; a drive can also connect through a SCSI or proprietary expansion card. Seat both ends of the ribbon cable firmly and make sure each connector covers all of the pins. On systems where the pins are bare instead of surrounded by a plastic ridge, it’s easy to offset the connector by a row or two on the pins. If the drive is getting power but the BIOS can’t find it, try a different ribbon cable; the one in use might have a broken wire or other flaw.

Note that there are different types of hard disk ribbon cables. UltraDMA 66 and above drives require 80-wire cables. If you use the 40-wire type, the drive will be limited to UltraDMA 33 performance.
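The cable rule above can be expressed as a small lookup. This is only an illustrative sketch: the function name is hypothetical, and it assumes the standard UltraDMA mode numbering, in which mode 2 is UltraDMA 33, mode 4 is UltraDMA 66, and mode 5 is UltraDMA 100.

```python
# Nominal transfer rates in MB/s for common UltraDMA modes.
UDMA_RATES_MBPS = {2: 33, 4: 66, 5: 100}


def effective_udma_mode(drive_mode: int, cable_wires: int) -> int:
    """Return the UltraDMA mode a drive can actually use on a given cable."""
    if cable_wires not in (40, 80):
        raise ValueError("IDE ribbon cables have either 40 or 80 wires")
    if cable_wires == 40:
        # A 40-wire cable caps the drive at UltraDMA 33 (mode 2).
        return min(drive_mode, 2)
    return drive_mode


# An UltraDMA 100 drive on a 40-wire cable falls back to mode 2.
mode = effective_udma_mode(5, 40)
print(mode, UDMA_RATES_MBPS[mode])
```

A slow but otherwise healthy drive is therefore worth a look at the cable before you suspect the drive itself.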

The red stripe on the ribbon cable must match up with Pin 1 on both the drive and the motherboard or expansion card. Sometimes, though, it’s not easy to locate Pin 1. Look for tiny numbers at one end of the connector. If you see a 1 or 2, that’s the end with which the red stripe should be matched. Some connectors are notched on one side, and the ribbon cable connectors have a tab that fits into the notch, but this isn’t always the case. With a floppy drive, the drive light stays on constantly when the ribbon cable is backward; a hard disk gives no such simple sign. Without notched connectors, your only choice is trial and error.

Checking jumper settings
On an IDE hard disk, one or more jumpers on the drive must be set to determine its Master/Slave status. This setting isn’t usually an issue in an existing hard disk installation that suddenly doesn’t work anymore, but it can cause problems when you move a drive from one PC to another.

Depending on the drive, the following jumper settings may be available:

  • Single--Use this setting when the drive is the only one on that IDE subsystem; that is, the only one on that ribbon cable. Not all drives have a Single setting; if there is none, use the Master setting instead.
  • Master (MS)--When there are two drives on the IDE subsystem and the other drive’s jumpers are set to Slave, or if this is the only drive on the subsystem and it doesn’t have a separate Single setting, use this setting.
  • Slave (SL)--Use this setting when there are two drives on the IDE subsystem and the other drive’s jumpers are set to Master.
  • Cable Select (CS)--If you are using a cable that relies on the device positioning to determine its Slave/Master status, use this setting. This setting is uncommon.
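The jumper rules in the list above lend themselves to a quick sanity check. The sketch below is a hypothetical helper, not part of any real tool: it takes the jumper setting of each drive on one ribbon cable and reports only the conflicts that the list describes.

```python
def check_ide_channel(jumpers: list[str]) -> list[str]:
    """Return the problems found in the jumper settings on one ribbon cable.

    Each entry in `jumpers` is "single", "master", "slave", or "cs",
    one entry per drive attached to the cable.
    """
    problems = []
    if len(jumpers) > 2:
        problems.append("an IDE ribbon cable carries at most two drives")
    if jumpers == ["slave"]:
        problems.append("a lone drive should be set to Single or Master, not Slave")
    if len(jumpers) >= 2 and "single" in jumpers:
        problems.append("Single is only valid when the drive is alone on the cable")
    if jumpers.count("master") > 1:
        problems.append("two drives are both set to Master")
    if jumpers.count("slave") > 1:
        problems.append("two drives are both set to Slave")
    return problems


# A drive moved from another PC often arrives still set to Master.
print(check_ide_channel(["master", "master"]))
```

An empty list means the settings are consistent with the rules above; it does not prove the drive is healthy.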

Checking SCSI termination
If the machine uses a SCSI drive, there are two factors with which to be concerned: termination and ID. These settings are not an issue when troubleshooting a drive that has suddenly gone bad in an existing system, but if you are moving a drive from one system to another and it doesn’t work in the new system, improper SCSI settings may be the culprit.

If this is the last SCSI device in the chain, it must be terminated. Termination methods vary. On some devices, you set termination with an extra jumper; on others, you use a cap or plug over a connector. On most hard disks, you terminate using a jumper setting.

SCSI drives usually have jumpers just like IDE ones, but instead of setting Master/Slave status, they assign a SCSI ID number to the device. Some SCSI devices use a wheel or button, with a little window indicating the setting, instead of jumpers, but this is uncommon on a hard disk.

There can be up to seven SCSI devices on a single narrow SCSI bus, and up to 15 devices on a wide SCSI bus. There are either eight or 16 addresses in total, depending on your system. The host adapter takes one of those addresses, leaving seven or 15 for the remaining drives. Usually, the host adapter claims the highest number for itself.
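The address arithmetic above can be sketched as a short helper that lists the IDs still free on a bus. The function and parameter names are hypothetical. By default it assumes the host adapter holds the highest ID, as described above; pass `host_id` explicitly if your adapter is configured differently.

```python
from typing import Optional


def free_scsi_ids(wide: bool, used_ids: set[int],
                  host_id: Optional[int] = None) -> list[int]:
    """Return the SCSI IDs still available for new devices on one bus."""
    total = 16 if wide else 8          # wide buses have 16 addresses, narrow 8
    if host_id is None:
        host_id = total - 1            # assumption: adapter takes the highest ID
    out_of_range = sorted(i for i in used_ids | {host_id} if not 0 <= i < total)
    if out_of_range:
        raise ValueError(f"IDs out of range for this bus: {out_of_range}")
    if host_id in used_ids:
        raise ValueError(f"ID {host_id} conflicts with the host adapter")
    return [i for i in range(total) if i != host_id and i not in used_ids]


# Narrow bus with drives already at IDs 0 and 1.
print(free_scsi_ids(wide=False, used_ids={0, 1}))
```

Two devices sharing an ID is a common cause of trouble after moving a drive between systems, so checking the new drive's ID against the ones already in use is a sensible first step.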

The SCSI ID comes from a binary reading of the jumpers. For example, on a device with three SCSI ID jumper positions, none of which has a jumper installed, the ID would be 000b (b stands for binary here), or 0. An ID of 001b would be 1; 010b would be 2; and so on.

The problem lies in the fact that some manufacturers set the jumpers to read from left-to-right, while others use right-to-left. So on one drive, the leftmost jumper set would be 1, while on some other drive, the rightmost jumper set would be 1. Check the drive’s label for information about which way the drive works. If all else fails, try the manufacturer’s Web site.
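To make the bit-order issue concrete, here is a minimal sketch that converts a row of jumpers into a SCSI ID. The function name and the `leftmost_is_lsb` flag are illustrative only; which direction applies to a given drive still has to come from its label or documentation.

```python
def scsi_id_from_jumpers(jumpers: list[bool], leftmost_is_lsb: bool) -> int:
    """Convert ID jumpers, listed left to right, into a SCSI ID.

    True means a jumper is installed at that position. `leftmost_is_lsb`
    says whether the leftmost position carries the value 1.
    """
    bits = jumpers if leftmost_is_lsb else list(reversed(jumpers))
    # Position 0 is worth 1, position 1 is worth 2, position 2 is worth 4, ...
    return sum(1 << position for position, installed in enumerate(bits) if installed)


# The same physical setting reads differently on two manufacturers' drives.
setting = [True, False, False]
print(scsi_id_from_jumpers(setting, leftmost_is_lsb=True))
print(scsi_id_from_jumpers(setting, leftmost_is_lsb=False))
```

With only the leftmost jumper installed, the first call yields ID 1 and the second yields ID 4, which is exactly the kind of mismatch that produces an unexpected ID conflict.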

2. Checking BIOS setup (IDE only)
In most modern systems, the BIOS can automatically detect your hard disk, so no special BIOS setup is required. However, if you are working with an older or quirky BIOS, you might need to enter the BIOS setup program and change the setting for the drive’s position on the IDE channel (such as Primary Master or Primary Slave) from None to Auto so the BIOS will attempt to find and identify the drive.

On an old BIOS, you occasionally may need to select User as the drive type and manually enter the drive’s settings. Automatic detection of IDE devices was part of the ATA-3 standard, released more than 10 years ago, though, so doing this should rarely be necessary.

Some BIOSs also have a separate Detect IDE Devices utility built in. If the BIOS contains such a utility, you can use it to prompt the BIOS to detect the new hard disk. This comes in handy when you aren’t sure whether or not the drive is working, because you can get an answer immediately rather than rebooting and waiting to see whether the BIOS finds the drive on startup.

3. Virus checking
If you’ve come this far in the troubleshooting process and the drive still isn’t working, check for viruses. A drive containing a boot-sector virus will not only malfunction, it can spread the virus to the disk you boot from, such as your emergency startup disk.

On a system that you know is good and that has an antivirus program installed, update the virus definitions, and then make a virus-checking boot disk. Write protect it, and then use it to start the system containing the nonworking hard disk and check that disk for viruses. If the drive is not partitioned and formatted, the boot disk might not be able to check the data area of the drive. That’s okay for now; just let it get as far as it can before moving on to the next step, checking the partition.

4. Checking for a valid partition
If the BIOS can see the drive but the drive isn’t working, make sure the drive is partitioned. Use FDISK, a command-line utility you’ll find on a Windows 9x/Me startup disk, to check. Boot from the write-protected startup disk and type FDISK. When asked whether or not you want large disk support, type Y. Then choose option 4 from the main menu to display the partition information.

If the active partition’s type is FAT, FAT32, or NTFS, it should be recognized by the operating system. One exception would be if you put an NTFS drive into a Windows 9x/Me system. The OS wouldn’t recognize the NTFS because it doesn’t support NTFS, not because it was partitioned incorrectly.
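For readers who want to see what FDISK is reading, the following sketch decodes the partition table from the first sector of a disk (the master boot record). It is an illustration, not a replacement for FDISK: the function name is hypothetical, the type-code table covers only the file systems mentioned above, and the input is assumed to be the first 512 bytes of a disk or disk image that you have already obtained.

```python
import struct

# Partition type codes for the file systems discussed here (not exhaustive).
PARTITION_TYPES = {
    0x01: "FAT12", 0x04: "FAT16", 0x06: "FAT16", 0x0E: "FAT16",
    0x0B: "FAT32", 0x0C: "FAT32", 0x07: "NTFS/HPFS",
}


def read_partition_table(sector: bytes) -> list[dict]:
    """Decode the four primary partition entries in a master boot record."""
    if len(sector) < 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("no valid MBR signature; the drive may be unpartitioned")
    partitions = []
    for index in range(4):
        entry = sector[446 + 16 * index: 446 + 16 * (index + 1)]
        boot_flag, type_code = entry[0], entry[4]
        start_lba, sector_count = struct.unpack_from("<II", entry, 8)
        if type_code == 0x00:
            continue  # unused slot in the table
        partitions.append({
            "slot": index + 1,
            "active": boot_flag == 0x80,
            "type": PARTITION_TYPES.get(type_code, f"unknown (0x{type_code:02X})"),
            "start_lba": start_lba,
            "sectors": sector_count,
        })
    return partitions
```

An empty result or a missing signature corresponds to FDISK finding no valid partition, and a table with no entry marked active explains a drive that is visible but will not boot.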

If it is a partition problem, you have two choices: Try to recover the data using a disk recovery program, or give up on the data, delete the partition, and re-create it in FDISK. If you want to try recovery first, see the section below on Advanced Data Recovery Options.

If you want to delete the partition and re-create it, press [Esc] to return to the FDISK main screen and delete the partition (option 3 on the screen). Then return to the main screen again and create a partition (option 1 on the screen). After using FDISK to create or delete partitions, you must reboot the machine before doing anything else.

5. Checking drive formatting
If FDISK recognizes the drive and it has a valid partition type, you should be able to view the drive’s content from a command prompt via your startup disk, or from the Recovery Console in Windows 2000 or XP. Change to that drive by typing its drive letter followed by a colon and pressing [Enter]. Then, display a list of files on the drive with the DIR command.

If you see a message about an invalid media type, the drive is probably not formatted using a file system that your OS recognizes. You can either try a data recovery program, or you can give up on the drive’s data and reformat it with the FORMAT command.
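As a rough illustration of how a formatted volume identifies itself, the sketch below inspects the label strings that FAT and NTFS volumes normally carry in their boot sector. The function name is hypothetical, and the approach is a heuristic: these strings are informational, so treat the result as a hint rather than proof. The input is assumed to be the first 512 bytes of the partition, not of the whole disk.

```python
def detect_file_system(boot_sector: bytes) -> str:
    """Guess the file system from the label strings in a volume boot sector."""
    if len(boot_sector) < 512 or boot_sector[510:512] != b"\x55\xaa":
        return "unformatted or unreadable"
    if boot_sector[3:11] == b"NTFS    ":
        return "NTFS"       # OEM ID field at offset 3
    if boot_sector[82:90] == b"FAT32   ":
        return "FAT32"      # file system type field at offset 82
    label = boot_sector[54:62]
    if label in (b"FAT12   ", b"FAT16   ", b"FAT     "):
        return label.decode("ascii").strip()  # FAT12/FAT16 field at offset 54
    return "unknown"
```

A result of "unknown" or "unformatted or unreadable" lines up with the invalid media type message described above.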

6. Fixing physical and logical drive errors
Let’s assume at this point that your OS finds the drive and can read some files on it, but not all of them. Maybe you’re receiving read or write errors, or certain programs aren’t working right. The problem is likely a physical or logical disk error.

A physical disk error is a bad spot on the drive. It can result from physical trauma to the computer, like knocking it off of a table while it’s running.

A logical disk error is a discrepancy between the two copies of the file allocation table (FAT) on the disk, or a discrepancy between the FAT’s version of what clusters are stored on the drive and the reality of actual storage. Such errors are typically caused by improperly shutting down the PC or abnormal program termination.
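The first kind of logical error, a mismatch between the copies of the FAT, can be illustrated with a short sketch that compares the copies byte for byte. It is a simplified example under stated assumptions: the function name is hypothetical, the input is the raw contents of a FAT volume image starting at its boot sector, and it ignores the FAT32 option that allows mirroring between copies to be turned off.

```python
import struct


def fat_copies_match(volume: bytes) -> bool:
    """Return True if every copy of the FAT on the volume is identical."""
    (bytes_per_sector,) = struct.unpack_from("<H", volume, 11)
    (reserved_sectors,) = struct.unpack_from("<H", volume, 14)
    fat_count = volume[16]
    (sectors_per_fat,) = struct.unpack_from("<H", volume, 22)
    if sectors_per_fat == 0:
        # FAT32 leaves the 16-bit field empty and uses a 32-bit one instead.
        (sectors_per_fat,) = struct.unpack_from("<I", volume, 36)
    fat_bytes = sectors_per_fat * bytes_per_sector
    first_fat = reserved_sectors * bytes_per_sector  # FATs follow the reserved area
    copies = [
        volume[first_fat + n * fat_bytes: first_fat + (n + 1) * fat_bytes]
        for n in range(fat_count)
    ]
    return all(copy == copies[0] for copy in copies)
```

Disk-checking utilities go further than this, also cross-checking the FAT against the directory entries, which is the second kind of discrepancy described above.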

A message about a data error while reading or writing the drive probably indicates a physical error. Logical errors show up in many different ways that are not always obviously attributable to the disk. For example, certain programs might fail to run or might lock up after starting. The same symptoms could also mean a memory parity error or even a bad cooling fan; you never know until you check the system and eliminate the possibilities.

It’s best to try the simplest solution first, so run a disk-checking program. Windows 9x/Me comes with ScanDisk, which will check for both physical and logical errors. Windows 2000 and Windows XP come with a similar utility called Check Disk, which you can access from the Tools tab of the drive’s Properties sheet. In early versions of DOS, a command-line utility called CHKDSK does a similar job. Use it with the /F switch to fix any errors it finds.

7. Checking and reactivating disks in the Windows 2000/XP OSs
Windows 2000 and Windows XP both have a Disk Management feature that checks the status of each drive on your system. This utility allows you to convert to dynamic disks, change space allocation, and much more.

With Disk Management, the most important thing to check is the status of each drive, which the utility displays alongside the drive. If a drive is shown as Offline or reports any status other than Healthy, right-click it and choose Reactivate Disk.

Because so much is stored on hard disks, knowing how to revive a failed hard drive is a critical function for technology professionals. Having an effective guide to the recovery process might mean the difference between a total loss and full recovery. With this seven-step process, you’ll be ready to tackle most hard disk errors that arise.

This article first appeared in TechRepublic's TechProGuild section.
