Using “badblocks” on modern disks

Solution 1:

Question 1:

With regard to the -b option: this depends on your disk. Modern, large disks have 4 KB blocks, in which case you should set -b 4096. You can get the block size from the operating system, and it's also usually obtainable by reading the disk's information off the label or by searching for the disk's model number. If -b is set to something larger than your block size, the integrity of the badblocks results can be compromised (i.e. you can get false negatives: no bad blocks found when they may still exist). If -b is set to something smaller than the block size of your drive, the speed of the badblocks run can be compromised. I'm not sure, but there may be other problems with setting -b to something smaller than your block size: since badblocks then isn't verifying the integrity of an entire block at once, it might still be possible to get false negatives if it's set too small.
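
For reference, here is roughly how you would check the block size and feed it to badblocks (a sketch only; /dev/sdX is a placeholder for your device):

    # Ask the kernel for the drive's sector sizes (/dev/sdX is a placeholder).
    blockdev --getpbsz /dev/sdX    # physical sector size, e.g. 4096
    blockdev --getss /dev/sdX      # logical sector size, e.g. 512

    # Read-only scan using the physical block size; -s shows progress, -v is verbose.
    badblocks -b 4096 -sv /dev/sdX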

The -c option corresponds to how many blocks should be checked at once: batch reading/writing, basically. This option does not affect the integrity of your results, but it does affect the speed at which badblocks runs. badblocks will (optionally) write, then read, buffer, check, and repeat for every N blocks as specified by -c. If -c is set too low, your badblocks runs will take much longer than they need to, as queueing and processing a separate IO request incurs overhead, and the disk might also impose additional overhead per request. If -c is set too high, badblocks might run out of memory; if this happens, badblocks will fail fairly quickly after it starts. Additional considerations here include parallel badblocks runs: if you're running badblocks against multiple partitions on the same disk (a bad idea), or against multiple disks over the same IO channel, you'll probably want to tune -c to something sensibly high given the memory available to badblocks, so that the parallel runs don't fight for IO bandwidth and can parallelize in a sane way.
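
As a rough illustration (a sketch, not a tuned recommendation; the device name and numbers are placeholders), the buffer memory badblocks needs scales roughly with the block size times the -c value, so you trade memory for fewer, larger IO requests:

    # Test 65536 blocks of 4096 bytes per batch: each buffer is then
    # 65536 * 4096 bytes = 256 MiB (the read-write modes keep more than one
    # such buffer). Raise or lower -c according to the memory you can spare.
    badblocks -b 4096 -c 65536 -sv /dev/sdX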

Question 2:

Contrary to what other answers indicate, the -w write-mode test is neither more nor less reliable than the non-destructive read-write test, but it is twice as fast, at the cost of being destructive to all of your data. I'll explain why:

In non-destructive mode, badblocks does the following:

  1. Read existing data, checksum it (read again if necessary), and store it in memory.
  2. Write a predetermined pattern (overridable with the -t option, though usually not necessary) to the block.
  3. Read the block back, verifying that the read data is the same as the pattern.
  4. Write the original data back to the disk.
    • I'm not sure about this, but it also probably re-reads and verifies that the original data was written successfully and still checksums to the same thing.

In destructive (-w) mode, badblocks only does steps 2 and 3 above. This means that the number of read/write operations needed to verify data integrity is cut in half. If a block is bad, the data will be erroneous in either mode. Of course, if you care about the data that is stored on your drive, you should use non-destructive mode, as -w will obliterate all data and leave badblocks' patterns written to the disk instead.
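
In terms of invocations, the two modes look like this (a sketch; /dev/sdX is a placeholder, and the second command destroys everything on the device):

    # Non-destructive read-write test (steps 1-4 above); existing data is preserved.
    badblocks -b 4096 -nsv /dev/sdX

    # Destructive write-mode test (steps 2 and 3 only); WIPES ALL DATA on /dev/sdX.
    badblocks -b 4096 -wsv /dev/sdX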

Caveat: if a block is going bad but isn't completely gone yet, some read/write verification pairs may work and some may not. In this case, non-destructive mode may give you a more reliable indication of the "mushiness" of a block, since it does two sets of read/write verification (maybe; see the bullet under step 4). Even if non-destructive mode is more reliable in that way, it's only more reliable by coincidence. The correct way to check for blocks that aren't fully bad but can't sustain multiple read/write operations is to run badblocks multiple times over the same data, using the -p option.
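
For example (a sketch; the device, pass count, and output path are placeholders), a multi-pass run looks like this:

    # Keep rescanning until 4 consecutive passes find no new bad blocks,
    # saving any findings to a file for later reference.
    badblocks -b 4096 -nsv -p 4 -o /root/sdX-badblocks.txt /dev/sdX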

Question 3:

If SMART is reallocating sectors, you should probably consider replacing the drive ASAP. Drives that lose a few sectors don't always keep losing them, but the cause is usually a heavily-used drive getting magnetically mushy, or failing heads/motors resulting in inaccurate or failed reads/writes. The final decision is up to you, of course: based on the value of the data on the drive and the reliability you need from the systems you run on it, you might decide to keep it up. I have some drives with known bad blocks that have been spinning with SMART warnings for years in my fileserver, but they're backed up on a schedule such that I could handle a total failure without much pain.
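
To see what SMART is actually reporting (a sketch; /dev/sdX is a placeholder), the reallocation-related attributes are the ones to watch:

    # Print overall health plus the attribute table; watch the raw values of
    # Reallocated_Sector_Ct (5), Current_Pending_Sector (197) and
    # Offline_Uncorrectable (198).
    smartctl -H -A /dev/sdX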

Solution 2:

1) If your modern disk uses a sector size other than 512 bytes, then you need to set that size with the -b option (e.g. -b 4096). Without that option your check will run much slower, as each real sector will be tried multiple times (8 times in the case of a 4K sector). Also, as Olivier Dulac mentioned in a comment on the question, a block is indeed 1 block, and not 1/2 or 1/4th of a block, or even 2 (or more) blocks.
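
For example (a sketch; sdX is a placeholder), lsblk shows both sector sizes at once, so you can see whether the drive exposes 512-byte logical sectors on top of 4K physical ones:

    # LOG-SEC is what the drive presents to the OS; PHY-SEC is the real sector size.
    lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdX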

The -c option determines how many sectors are tried at once. It can have some impact on performance, and the best value may depend on the specific disk model.

2) Write-mode test: in my understanding, it will only check whether you have a hard-bad error or a soft-bad error (a.k.a. silent data degradation, bit rot, decay of storage media, UNC sectors).

3) I would not trust a SMART report from a single point in time. It is more important how the values change over time. Also, here is research by Google, Failure Trends in a Large Disk Drive Population, and here is some discussion of it. Here is a quote from the research:

Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures.
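
One low-tech way to watch how the values change over time (a sketch; the device and log path are placeholders) is to log the attribute table periodically, e.g. from cron, and compare the raw values between snapshots:

    # Append a dated snapshot of the SMART attributes to a log file;
    # diffing successive snapshots shows which counters are growing.
    { date; smartctl -A /dev/sdX; } >> /var/log/smart-sdX.log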

Regarding the suggestions by others to replace the disk: you may not have a hard-bad disk problem but rather silent data degradation (bit rot, decay of storage media, UNC sectors). In that case it makes no sense to replace the disk; instead, it is useful to read the same data and write it back to the disk. You could look here at how that can be resolved.
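
As an illustration only (a sketch I would only run with a good backup in place; /dev/sdX and the block size are placeholders, and this is not necessarily what the linked page suggests), reading the whole device and writing the same data back can be done with dd:

    # Read every block and write the same data back in place, giving the drive
    # a chance to refresh weakly-written sectors. dd stops at the first read
    # error by default. Risky: make sure you have a backup first.
    dd if=/dev/sdX of=/dev/sdX bs=1M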

If you have a hard-bad error, you could try to repartition the drive so that the bad area lies outside of any partition. For me that approach was useful, and such a bad drive was used for a long time without any problems.
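
As an illustration of that approach (a sketch with made-up offsets; here the bad area is assumed to sit between roughly 40% and 45% of /dev/sdX), you simply create partitions that leave the damaged stretch unallocated:

    # Hypothetical layout: partitions before and after the assumed bad region,
    # with the 40%-45% stretch left unallocated.
    parted -s /dev/sdX mklabel gpt
    parted -s /dev/sdX mkpart data1 ext4 0% 40%
    parted -s /dev/sdX mkpart data2 ext4 45% 100%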

Solution 3:

I would leave -b and -c at their defaults unless you have a specific reason to change them. You could probably set -b to 4096 if your disk has a 4K block size.

I would suggest you first run badblocks with the non-destructive read-write test. If it finds any bad sectors, the disk is broken and should be replaced. If it does NOT find any bad blocks with the non-destructive test, but you still suspect it has bad blocks, then run the destructive read-write test.

Lastly how many SMART sector re-allocations are acceptable / should drives with non-zero reallocation counts be immediately replaced?

I would replace the drive as soon as sectors are being reallocated.