How do I easily repair a single unreadable block on a Linux disk?

My Linux system has started throwing SMART errors in the syslog. I tracked it down and believe the problem is a single block on the disk. How do I go about easily getting the disk to reallocate that one block? I'd like to know what file got destroyed in the process. (I'm aware that if one block fails on a disk others are likely to follow; I have a good ongoing backup and just want to try to keep this disk working.)

Searching the web leads to the Bad block HOWTO, which describes a manual process on an unmounted disk. It seems complicated and error-prone. Is there a tool to automate this process in Linux? My only other option is the manufacturer's diagnostic tool, but I presume that'll clobber the bad block without any reporting on what got destroyed. Worst case, it might be filesystem metadata.

The disk in question holds the primary system partition, which uses ext3 on LVM. Here's the error log from syslog and the relevant bit from smartctl.

smartd[5226]: Device: /dev/hda, 1 Currently unreadable (pending) sectors

Error 1 occurred at disk power-on lifetime: 17449 hours (727 days + 1 hours)
... Error: UNC at LBA = 0x00d39eee = 13868782

There's a full smartctl dump on pastebin.


Solution 1:

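You could try hdparm --write-sector <LBA> /dev/<device>. This overwrites that one sector with zeros, so hdparm refuses to run it unless you also pass --yes-i-know-what-i-am-doing.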

I don't know of any other way to do this. You need to manually convert the LBA into filesystem blocks (as you've already found).
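
As a rough sketch of that conversion, here is the arithmetic from the Bad block HOWTO for a plain partition: subtract the partition's starting sector from the LBA, then rescale from 512-byte sectors to filesystem blocks. The starting sector (63) and block size (4096) below are assumptions for illustration only; the real values come from fdisk -lu and tune2fs -l. It also ignores LVM, which adds a further offset for the start of the logical volume, so treat the result as a starting point rather than an answer for this particular disk.

```shell
# Convert a drive LBA (512-byte sectors) to a filesystem block number.
# Usage: lba_to_fs_block LBA PARTITION_START_SECTOR FS_BLOCK_SIZE
# Assumes a plain partition (no LVM offset) and 512-byte drive sectors.
lba_to_fs_block() {
    lba=$(( $1 ))      # shell arithmetic accepts hex such as 0x00d39eee
    start=$(( $2 ))    # first sector of the partition (assumed value below)
    bsize=$(( $3 ))    # filesystem block size in bytes (assumed value below)
    echo $(( (lba - start) * 512 / bsize ))
}

# LBA from the SMART log, with a hypothetical partition start of 63
# and a hypothetical 4096-byte block size.
lba_to_fs_block 0x00d39eee 63 4096   # prints 1733589
```

The resulting block number is what debugfs's icheck command takes to find the owning inode, and ncheck then maps that inode to a path.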

Solution 2:

I used to write disk firmware for WD, and I once wrote the firmware which reassigned bad blocks.

First, most bad blocks are detected on reads, not writes. Writes are done blindly, meaning the data is written without being checked. Thus on a write if the media is bad, you won't know it until the host does a read to that sector. There is a small part of the sector (the sector header) which is read on writes to locate the correct sector, so that if there is an error in reading the sector header, the drive will reassign the sector and write it with the data received from the write command. But the vast majority of bad blocks are detected on reads, and just because a write succeeds to a sector doesn't mean the media is good or that the sector has been reassigned.

Now about bad block reassignment (also called reallocation). Yes, normally the drive will attempt to reassign a sector if the error is bad enough (i.e., the ECC failure is bad enough) but the drive still could recover the data after ECC correction. Usually this is done automatically. The only exception is that the host could have previously told the drive not to do automatic reallocations, but this is seldom done.

So what happens if the drive does a read and cannot recover the data? Nothing. The error is reported to the host, but no reassignment is done. The problem is that the drive could reassign the sector, but it doesn't have the slightest idea what data to write in the newly reassigned sector. If it just wrote a bunch of zeros, say, and then the sector was read again, it would return all the zeros without any indication that the data wasn't valid. This is essentially the same thing as data corruption. The drive can't count on the host keeping track of errors for a variety of reasons (for example, what if the drive was moved to a new host?), so the best course of action is to do nothing when the data can't be recovered.

Modern drives, however, will save the location of the bad sector when it can't be reallocated. The number of bad sectors awaiting reallocation can be found in the SMART data. If a write is then done to one of the bad sectors awaiting reallocation, the reallocation is carried out, because the drive now has valid data to put in the new sector. Thus when people say writing to a bad sector will reallocate it, that's really only half the story. The drive must be read first so that it can discover all the bad sectors that can't be reallocated automatically. You can write an entire drive, and the SMART data will say there are no bad sectors awaiting reallocation, but you haven't necessarily cleared the drive of all bad sectors. So if you really want to clear a drive of all bad sectors, the best thing is to read the entire drive first, followed by writing the entire drive (of course, this will destroy all previous data on the drive).
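
To illustrate the "read first" half for a single sector, here is a minimal sketch that asks dd for one 512-byte sector and checks whether all 512 bytes came back. The function name is made up for this example, and it assumes 512-byte sectors. On a real disk it needs root, and a read served from the page cache could hide the problem; GNU dd's iflag=direct is one way around that, though I have left it out so the sketch also works on ordinary files.

```shell
# Try to read one 512-byte sector and report whether all of it came back.
# Usage: sector_readable DEVICE_OR_FILE LBA
# Returns 0 if the full sector was read, non-zero otherwise.
sector_readable() {
    # Count the bytes dd actually delivered; a failed read delivers fewer than 512.
    bytes=$(( $(dd if="$1" bs=512 skip="$2" count=1 2>/dev/null | wc -c) ))
    [ "$bytes" -eq 512 ]
}

# Example (needs root on a real disk; LBA taken from the SMART log):
#   sector_readable /dev/hda 13868782 && echo readable || echo "read failed"
```

A failed read here is what puts the sector on the drive's pending list. It is the later write to that same LBA that gives the drive valid data and lets it reallocate.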

There are other ways of dealing with bad blocks which can't be reallocated. If the drive is part of a redundant RAID configuration (i.e., anything but RAID 0), the RAID software should automatically recover the data for a bad sector from the other drives and write it to the reallocated sector. SCSI disks have an explicit reassign blocks command which the host can use to force the reassignment even when there is no valid data to write to the block, but its use is pretty low-level.

Solution 3:

I think all you have to do is:

e2fsck -c /dev/hda1

assuming /dev/hda1 is the (unmounted) partition. Or:

e2fsck -c -c /dev/hda1

to do a (slower) non-destructive read-write test. It will still have to be unmounted. I don't think this will give you details on any lost data, though.

Solution 4:

Michael has it correct, and in most cases I would say just replace the drive; they are cheap. However, if you don't have backups and can't get important data off the drive, or you just want to attempt to repair it, then you may want to try SpinRite at its highest level.

I had a laptop drive that started making some noises a few years ago. badblocks showed that the drive had 118 or so bad blocks visible to the end user. Since I already had a copy of SpinRite, I decided to give it a try before buying a new drive. After running SpinRite on the drive, badblocks showed 0 bad blocks and the noises stopped. The drive has been working for over two years since then.