How to use hdparm to fix a pending sector?
SMART is reporting one pending sector on one of my server's HDDs. I've read lots of articles recommending hdparm to "easily" force the disk to relocate the bad sector, but I can't find the correct way to use it.
Some info from my "smartctl":
Error 95 occurred at disk power-on lifetime: 20184 hours (841 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 d7 55 dd 02 Error: UNC at LBA = 0x02dd55d7 = 48059863
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 d6 55 dd e2 00 18d+05:13:42.421 READ DMA
27 00 00 00 00 00 e0 00 18d+05:13:42.392 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 18d+05:13:42.378 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 02 18d+05:13:42.355 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 18d+05:13:42.327 READ NATIVE MAX ADDRESS EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 20194 48059863
# 2 Short offline Completed without error 00% 15161 -
With that "bad LBA" (48059863) in hand, how do I use hdparm? What type of address should the "--read-sector" and "--write-sector" parameters take?
If I issue the command hdparm --read-sector 48059863 /dev/sda and the address is right, I should expect an I/O error, right?
Instead, it reads the sector and dumps data:
$ ./hdparm --read-sector 48059863 /dev/sda
/dev/sda:
reading sector 48059863: succeeded
4b50 5d1b 7563 a932 618d 1f81 4514 2343
8a16 3342 5e36 2591 3b4e 762a 4dd7 037f
6a32 6996 816f 573f eee1 bc24 eed4 206e
(...)
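As a cross-check on the address itself, the decimal LBA in the error log can be recomputed from the hex value printed on the same line. A minimal sketch; hex_to_lba is a made-up helper name, not part of hdparm or smartctl:

```shell
# hex_to_lba is a hypothetical helper for illustration only.
# It converts a hex LBA as printed by smartctl (e.g. 0x02dd55d7) to decimal.
hex_to_lba() {
    printf '%d\n' "$1"
}

hex_to_lba 0x02dd55d7   # prints 48059863
```

The result matches the sector number passed to --read-sector above, so the address being tested is the one SMART logged.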
If for whatever reason you prefer to try to clear those bad sectors, and you do not care about the existing contents of a drive, the below shell snippet may help. I tested this on an older Seagate Barracuda drive that is well past its warranty anyway. It might not work right with other drive models or manufacturers, but it should put you on the right path if you must script something. It will destroy any content you have on the drive.
You may prefer just running badblocks, an hdparm Secure Erase (SE) (https://wiki.archlinux.org/index.php/Securely_wipe_disk), or some other tool that is actually designed for this. Or even manufacturer-provided tools like SeaTools (there is a 32-bit Linux 'enterprise' version, google it).
Make sure the drive in question is completely unused/unmounted before doing this. Also, I know, while loop, no excuses. It is a hack, you can make it better...
baddrive=/dev/sdb
badsect=1
while true; do
echo Testing from LBA $badsect
smartctl -t select,${badsect}-max ${baddrive} >> /dev/null 2>&1
echo "Waiting for test to stop (each dot is 5 sec)"
while [ "$(smartctl -l selective ${baddrive} | awk '/^ *1/{print substr($4,1,9)}')" != "Completed" ]; do
echo -n .
sleep 5
done
echo
badsect=$(smartctl -l selective ${baddrive} | awk '/# 1 Selective offline Completed: read failure/ {print $10}')
[ "$badsect" = "-" ] && exit 0
echo Attempting to fix sector $badsect on $baddrive
hdparm --repair-sector ${badsect} --yes-i-know-what-i-am-doing $baddrive
echo Continuing test
done
One advantage of using the 'selftest' method is the load is handled by the drive firmware, so the PC it is connected to is not loaded down like it would be with dd or badblocks.
NOTE: Sorry, I made a mistake. The correct while condition is:
while [ "$(smartctl -l selective ${baddrive} | awk '/^ *1/{print $4}')" = "Self_test_in_progress" ]; do
And the exit condition of the script becomes:
[ "$badsect" = "-" ] || [ -z "$badsect" ] && exit 0
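The two corrected conditions can also be pulled into small helpers, which makes them easy to try without a drive attached. This is only a sketch: the function names are made up, and the sample line is an assumed example of what smartctl -l selective prints for span 1.

```shell
# Hypothetical helpers; they only parse text and never touch a disk.

# Print the CURRENT_TEST_STATUS column for span 1 of the
# `smartctl -l selective` output supplied on stdin.
span1_status() {
    awk '$1 == "1" && NF >= 4 { print $4; exit }'
}

# Succeed when the LBA field says there is nothing left to repair
# (either "-" or an empty string).
no_bad_sector() {
    [ "$1" = "-" ] || [ -z "$1" ]
}

# Assumed sample line, for illustration only.
sample='    1        1  7814037167  Self_test_in_progress [90% left] (2564632-2630167)'
printf '%s\n' "$sample" | span1_status   # prints Self_test_in_progress
```

In the loop above, the wait condition would then read `while [ "$(smartctl -l selective "$baddrive" | span1_status)" = "Self_test_in_progress" ]`, and the exit check would be `no_bad_sector "$badsect" && exit 0`.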
I think hdparm may have read it without error because that sector is not actually bad, and other tools fail on it because of some other behavior (read-ahead that reaches a sector that really is unreadable?).
I found some bad sectors, and if I repair only the ones that are unreadable with "hdparm --read-sector", the other 'bad' sectors are suddenly readable again with things like dd. Interestingly, in the "dmesg" output only the hdparm-unreadable ones are ever reported.
For example, sectors 36589320 to 36589327 and 36589344 to 36589351 were unreadable with dd, but only 36589326 and 36589345 were unreadable with hdparm --read-sector. After I used hdparm --write-sector on those two, all 16 sectors were readable again.
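Both unreadable ranges are exactly 8 sectors long and start on a multiple of 8, which would fit reads being issued in 4 KiB units (8 × 512 bytes), so that one bad sector fails the whole unit. That interpretation is an assumption, not something the output here proves. A small sketch of the arithmetic, using a made-up helper name:

```shell
# Hypothetical helper: print the first and last sector of the aligned
# 8-sector group (4 KiB with 512-byte sectors) that contains a given LBA.
sector_group() {
    lba=$1
    start=$(( lba - lba % 8 ))
    echo "$start $(( start + 7 ))"
}

sector_group 36589326   # prints: 36589320 36589327
sector_group 36589345   # prints: 36589344 36589351
```

Each hdparm-unreadable sector falls inside one of the two ranges that dd could not read.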
Here's a small part of dmesg output:
[30152036.527940] end_request: I/O error, dev sda, sector 36589326
[30152077.363710] end_request: I/O error, dev sda, sector 36589345
And the disk info:
# smartctl -i /dev/sda
...
=== START OF INFORMATION SECTION ===
Device Model: TOSHIBA MK2002TSKB
...
Firmware Version: MT2A
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
...
And apparently this disk's firmware either doesn't properly record reallocated sectors, or the sectors were never reallocated and were merely corrupt (an unrecoverable ECC error on a surface that still works, as if caused by bit rot rather than faulty electronics or bad media):
# smartctl -A /dev/sda | egrep "Reallocated|Pending|Uncorrectable"
5 Reallocated_Sector_Ct 0x0033 100 100 050 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
# smartctl -l error /dev/sda
...
SMART Error Log Version: 1
No Errors Logged
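To pull just the raw counts out of the smartctl -A attribute table in a script, something like the following could work. It is a sketch that assumes the ten-column layout shown above, where the raw value is the tenth field; raw_counts is a made-up name.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print
# NAME=RAW_VALUE for the reallocation-related attributes.
raw_counts() {
    awk '$2 ~ /Reallocated|Pending|Uncorrectable/ { print $2 "=" $10 }'
}

# Two lines copied from the output above, used as sample input.
printf '%s\n' \
    '5 Reallocated_Sector_Ct 0x0033 100 100 050 Pre-fail Always - 0' \
    '197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0' |
    raw_counts
```

On a live system the input would come from `smartctl -A /dev/sda | raw_counts` instead of the sample lines.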
Please note, I ran a --read-sector and a --write-sector. A read may be required to properly reallocate a sector, not just a write. If you don't read first, it might not know the sector is bad.
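The read-then-write order can be scripted so that a write is only proposed for a sector that actually fails to read. A sketch under assumptions: it relies on hdparm exiting non-zero when --read-sector fails, it only prints the destructive command instead of running it, and plan_sector_fix with its optional third argument is invented for illustration.

```shell
# plan_sector_fix DEVICE LBA [READ_CMD]
# READ_CMD defaults to hdparm; the parameter exists only so the logic
# can be tried with a stand-in command instead of a real drive.
plan_sector_fix() {
    dev=$1
    lba=$2
    read_cmd=${3:-hdparm}
    if "$read_cmd" --read-sector "$lba" "$dev" >/dev/null 2>&1; then
        echo "sector $lba on $dev reads fine, leaving it alone"
    else
        # --write-sector overwrites the sector with zeros, so the
        # command is printed for review rather than executed.
        echo "hdparm --write-sector $lba --yes-i-know-what-i-am-doing $dev"
    fi
}
```

For example, `plan_sector_fix /dev/sda 36589326` would try the read with hdparm and print the write command only if that read fails; running the printed command is left as a deliberate manual step.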
Based on @Glenn's answer, you'll find the script fixbad at
http://wiki.bitplan.com/index.php/Bad_Block_Howto
As of 2020-09-10 the content of the script is:
#!/bin/bash
# see http://wiki.bitplan.com/index.php/Bad_Block_Howto
# see https://github.com/hradec/fix_smart_last_bad_sector/blob/master/fix_smart_last_bad_sector.sh
# see https://www.thomas-krenn.com/de/wiki/Analyse_einer_fehlerhaften_Festplatte_mit_smartctl
# WF 2020-10-04
disk=/dev/sdb
mode=short
# verbose
verbose=false
# should commands only be shown?
dry=false
# should write fixes be performed?
fix=false
# range of sectors to modify after bad sector
range=8
# set to sudo if sudo is needed
sudo=sudo
# serial number
serial="-?-"
#ansi colors
#http://www.csc.uvic.ca/~sae/seng265/fall04/tips/s265s047-tips/bash-using-colors.html
blue='\033[0;34m'
red='\033[0;31m'
green='\033[0;32m' # '\e[1;32m' is too bright for white bg.
endColor='\033[0m'
#
# a colored message
# params:
# 1: l_color - the color of the message
# 2: l_msg - the message to display
#
color_msg() {
local l_color="$1"
local l_msg="$2"
echo -e "${l_color}$l_msg${endColor}"
}
#
# error
#
# show an error message and exit
#
# params:
# 1: l_msg - the message to display
error() {
local l_msg="$1"
# use ansi red for error
color_msg $red "Error: $l_msg" 1>&2
exit 1
}
#
# show the usage
#
usage() {
echo "usage: $0 [disk]"
echo " [-c|--check]"
echo " [-d|--dry]"
echo " [-h|--help]"
echo " [-i|--info]"
echo " [[-m|--mode] mode]"
echo " [[-r|--range] range]"
echo " [[-s|--serial] [serial]]"
echo " [-t|--test]"
echo " [[-w|--wait] type]"
echo " [-v|--verbose]"
echo
echo " -h|--help: show this usage"
echo " -c|--check: check the disk"
echo " -d|--dry: dry run - show commands only"
echo " -i|--info: show info about the given disk"
echo " -m|--mode: set mode: default=short"
echo " -r|--range: range of sectors to modify after bad sector"
echo " -s|--serial: get serial number or confirm serial number"
echo " -t|--test: run test for the given type e.g. selective selftest"
echo " -w|--wait: wait for the result of the given test type e.g. selective selftest"
echo " -v|--verbose: set verbose mode"
echo ""
echo "example:"
echo " $0 /dev/sdb -i"
echo ""
echo "for any write operation you need to confirm the serial number"
echo "to get serial number: "
echo " $0 disk -s "
exit 1
}
#
# get a number range from 0 to the given n-1
#
# params
# 1: n
function getRange() {
local l_n="$1"
# print 0..n-1 on one line; this form works with both Python 2 and Python 3
python -c "print(' '.join(str(i) for i in range($l_n)))"
}
#
# read the result of the smartctl test for the given disk
#
# params
# 1: l_disk: the disk under test e.g. /dev/sdb
# 2: l_type: the type of the test e.g. selective
function readResult() {
local l_disk="$1"
local l_type="$2"
$sudo smartctl -l $l_type $l_disk | egrep "^#?[[:space:]]*[0-9]"
}
#
# show the Result
#
function showResult() {
local l_logline="$1"
local l_logstatus="$2"
if [ "$verbose" == "true" ]
then
echo $l_logstatus:$l_logline
else
echo $l_logline | gawk '
/#/ {
print $0; exit
}
{
status=substr($4,1,9)
progress=$5;
gsub("\\[","",progress);
range=$7
printf("\r%s",progress);
}'
fi
}
#
# wait for the result of a running selftest
#
# param 1: l_disk: the disk under test e.g. /dev/sdb
# param 2: l_type: the type of the test e.g. selective
# param 3: l_wait: number of seconds to wait
#
function waitForResult() {
# example
#=== START OF READ SMART DATA SECTION ===
#SMART Selective self-test log data structure revision number 1
#SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
# 1 7814037167 Self_test_in_progress [90% left] (2564632-2630167)
local l_disk="$1"
local l_type="$2"
local l_wait="$3"
local l_logline=""
local l_logstatus=""
color_msg $blue "Waiting for $l_type test of $l_disk to stop (each dot is $l_wait sec)"
while [ "$l_logstatus" != "Completed" ]; do
l_logline=$(readResult "$l_disk" "$l_type" | egrep "^#?[[:space:]]*1")
l_logstatus=$(echo $l_logline | gawk ' /Completed/ { print "Completed"; }')
showResult "$l_logline" "$l_logstatus"
sleep $l_wait
done
}
#
# get the serial number of the device
#
function getSerialNumber() {
local l_disk="$1"
serial=$($sudo smartctl -i $l_disk | grep "Serial Number" | cut -f 2 -d':')
echo $serial
}
#
# get the blocksize of the given file system
#
function getBlockSize() {
local l_fs="$1"
blocksize=$($sudo tune2fs -l $l_fs | grep "Block size:" | cut -f2 -d':')
echo $blocksize
}
#
# get the partition for the given disk
#
function getPartition() {
local l_disk="$1"
fs=$(mount | grep $l_disk | cut -f1 -d' ')
echo $fs
}
#
# get the start sector for the given disk
#
function getStartSector() {
local l_disk="$1"
local l_fs="$2"
startsector=$($sudo fdisk -l $l_disk | grep $l_fs | cut -f4 -d' ')
echo $startsector
}
#
# get Info about the given disk
#
function getInfo() {
local l_disk="$1"
$sudo smartctl -i $l_disk | egrep "(Model|Serial|Rotation|Sector|Capacity)"
$sudo hdparm -I $l_disk | egrep "(Serial Number|Model)"
fs=$(getPartition $l_disk)
if [ "$fs" != "" ]
then
color_msg $blue "Partition: $fs"
blocksize=$(getBlockSize $fs)
color_msg $blue "Blocksize: $blocksize"
else
color_msg $red "couldn't find mounted partition for $l_disk"
fi
}
#
# get the current pending sector count for the given disk
#
function getCurrentPendingSector() {
local l_disk="$1"
# if msg is empty don't show message but only return the current pending sector count
local l_msg="$2"
psector=0
# the assignment keeps the exit status of the pipeline, so $? reflects grep
psectorline=$($sudo smartctl -A $l_disk | grep Current_Pending_Sector)
if [ $? -eq 0 ]
then
if [ "$l_msg" != "" ]; then color_msg $green "$psectorline"; fi
psector=$(echo $psectorline | cut -f 10 -d ' ')
if [ $psector -gt 0 ]
then
if [ "$l_msg" != "" ]; then color_msg $red "Current_Pending_Sector is not zero but $psector"; fi
else
if [ "$l_msg" != "" ]; then color_msg $green "Current_Pending_Sector is zero!"; fi
fi
else
if [ "$l_msg" != "" ]; then color_msg $red "smartctl -A did not output Current_Pending_Sector"; fi
psector=-1
fi
if [ "$l_msg" == "" ]; then echo $psector; fi
}
#
# fix the given bad sector on the given disk with the given range of sectors to fix
#
# param 1: disk e.g. /dev/sdb1
# param 2: defect sector to repair
# param 3: range - range of sectors to repair e.g. 8
#
fixBad() {
local l_disk="$1"
local l_sector="$2"
local l_range="$3"
color_msg $blue "repairing sector $l_sector to $l_sector+$l_range on $l_disk ..."
r=$(getRange $l_range)
for i in $r ; do
let b1=$l_sector+$i
if [ "$dry" == "true" ]
then
echo hdparm --repair-sector $b1 --yes-i-know-what-i-am-doing $l_disk
else
$sudo hdparm --repair-sector $b1 --yes-i-know-what-i-am-doing $l_disk >> /tmp/smart_repaired.log
fi
done
#tail -n 60 /tmp/smart_repaired.log | grep writing | tail -n 20
#grep '#' /tmp/smart | head -5
#hdparm -I $disk > /tmp/hdparm
}
#
# check the needed software
#
checkSoftware() {
for sw in gawk debugfs fdisk hdparm smartctl tune2fs python $sudo
do
bin=$(which $sw)
if [ $? -eq 0 ]
then
if [ "$verbose" == "true" ]
then
color_msg $green "will use $bin as $sw"
fi
else
error "$0 needs $sw to work please install it"
fi
done
}
#
# run a test for the given disk in the given mode
#
# params
# 1: l_disk: the disk under test e.g. /dev/sdb
# 2: l_mode: the mode of the self test e.g. short/long
function runTest() {
local l_disk="$1"
local l_mode="$2"
color_msg $blue "running $l_mode smartctl test for $l_disk ..."
$sudo smartctl -t $l_mode $l_disk > /dev/null
}
#
# check the given disk in the given mode
#
function checkDisk() {
local l_disk="$1"
local l_mode="$2"
local l_serial="$3"
fs=$(getPartition $l_disk)
blocksize=$(getBlockSize $fs)
startsector=$(getStartSector $l_disk $fs)
color_msg $blue "checking Current_Pending_Sector count for $l_disk partition $fs blocksize $blocksize startsector $startsector"
getCurrentPendingSector "$l_disk" show
psector=$(getCurrentPendingSector "$l_disk")
if [ $psector -gt 0 ]
then
runTest $l_disk $l_mode
fi
}
#
# check the lba block
#
function lbaCheck() {
local l_disk="$1"
fs=$(getPartition $l_disk)
blocksize=$(getBlockSize $fs)
startsector=$(getStartSector $l_disk $fs)
diskserial=$(getSerialNumber $l_disk)
readResult "$l_disk" selftest | while read line
do
echo $line | grep "read failure" > /dev/null
if [ $? -eq 0 ]
then
if [ "$verbose" == "true" ]
then
echo $line
fi
index=$(echo $line | cut -f2 -d' ')
state=$(echo $line | cut -f3-4 -d ' ')
progress=$(echo $line | cut -f8 -d ' ')
lba=$(echo $line | cut -f10 -d ' ')
if [ "$lba" == "" ]
then
lba=0
fi
if [ "$lba" -gt 0 ]
then
echo $index $state
echo "progress: $progress"
echo "lba: $lba"
# calculate the file system block
fsb=$(gawk -v L=$lba -v S=$startsector -v B=$blocksize 'BEGIN {printf ("%d",int((L-S)*512/B))}')
echo "file system block: $fsb"
if [ "$fix" == "true" ]
then
if [ "$serial" != "$diskserial" ]
then
color_msg $red "you need to provide the serial number of $l_disk to perform fix operations"
else
fixBad $l_disk $lba $range
fi
fi
fi
fi
done
}
#
# try Fixing bad sectors
#
function tryFix() {
local l_disk="$1"
badsect=$($sudo smartctl -l selective ${l_disk} | gawk '/# 1 Selective offline Completed: read failure/ {print $10}')
[ "$badsect" = "-" ] || [ -z "$badsect" ] && exit 0
echo Attempting to fix sector $badsect on $l_disk
# only prints the repair command; it is not executed here
echo hdparm --repair-sector ${badsect} --yes-i-know-what-i-am-doing $l_disk
}
#
# start a check loop on the given drive
#
function checkLoop() {
local baddrive="$1"
badsect=1
while true; do
color_msg $blue "Testing $baddrive from LBA $badsect"
$sudo smartctl -t select,${badsect}-max ${baddrive} >> /dev/null 2>&1
waitForResult $baddrive selective 5
tryFix $baddrive
color_msg $blue "running next test"
done
}
# make sure the needed software is available
checkSoftware
# commandline option
while [ "$1" != "" ]
do
option=$1
shift
case $option in
-h|--help)
usage
;;
-i|--info)
getInfo $disk
;;
-m|--mode)
if [ $# -lt 1 ]
then
usage
else
mode=$1
shift
fi
;;
-c|--check)
checkDisk $disk $mode $serial
;;
-d|--dry)
dry=true
;;
-l|--loop)
checkLoop $disk
;;
-f|--fix)
fix=true
;;
-r|--range)
if [ $# -lt 1 ]
then
usage
else
range=$1
shift
fi
;;
-s|--serial)
if [ $# -lt 1 ]
then
getSerialNumber $disk
exit 1
else
serial=$1
shift
fi
;;
-t|--test)
runTest $disk $mode
;;
-v|--verbose)
verbose=true
;;
-w|--wait)
if [ $# -lt 1 ]
then
usage
else
type=$1
shift
waitForResult $disk $type 5
fi
;;
-x)
lbaCheck $disk $serial;;
*)
disk=$option
;;
esac
done
Personally, I wasn't able to get any meaningful results (i.e. actually "repair" a disk) with this toolkit. Still, the script and its parts are helpful for analyzing a disk and attempting fixes. Beware of using the script in the hope of "full automation": you might lose your data instead of fixing the disk.
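The LBA to file system block conversion in lbaCheck boils down to: subtract the partition's start sector, convert sectors to bytes, and divide by the file system block size. The sketch below does this with integer division, which drops the remainder. The start sector 2048 and block size 4096 in the example are assumed values, not taken from any disk discussed here, and 512-byte logical sectors are assumed as in the script; fs_block is a made-up name.

```shell
# Hypothetical helper: fs_block LBA START_SECTOR BLOCK_SIZE
# Prints the file system block that contains the given LBA,
# assuming 512-byte logical sectors.
fs_block() {
    lba=$1
    start=$2
    bsize=$3
    echo $(( (lba - start) * 512 / bsize ))
}

fs_block 48059863 2048 4096   # prints 6007226
```

The resulting block number is what a tool such as debugfs would need in order to find which file, if any, occupies the damaged block.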