SATA HDD errors

In my experience, the errors you're seeing are hardware errors surfacing in software. The 'lost page write due to I/O error' message is one I've seen with bad hard drives, and such drives behave the way you describe when you attempt to fsck them. This is almost certainly a true hardware fault.

You should check the output of smartctl to see what it reports as the problem.

smartctl --attributes /dev/sdb

It'll give you output similar to this:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED     RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   212   186   021    Pre-fail  Always       -       4358
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       97
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   066   066   000    Old_age   Always       -       25420
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       86
194 Temperature_Celsius     0x0022   104   001   000    Old_age   Always       -       46
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

The output can be arcane, but the attribute I'd pay closest attention to is Reallocated_Sector_Ct, since its raw value is the number of bad sectors the drive has found and remapped. The command 'smartctl -a' gives a lot more data. On the bad HD I had a while back, the 'SMART Error Log' at the bottom of that output had a few entries.
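If you want to check those counters without reading the table by eye, something like the following could work. It is only a sketch: it assumes the column layout shown in the output above (raw value in the tenth column), and the function names and the choice of watched attributes are mine, not part of smartctl.

```python
# Attributes whose raw value counts bad or suspect sectors.
# This selection is an assumption; adjust it for your drive model.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl --attributes` text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID;
        # this skips the banner and the column header line.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]  # RAW_VALUE column
            if raw.isdigit():
                attrs[fields[1]] = int(raw)
    return attrs


def suspicious(attrs):
    """Return only the watched attributes with a non-zero raw value."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


# A few rows copied from the example output above.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED     RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
"""

print(suspicious(parse_smart_attributes(sample)))
```

For the healthy drive in the example this prints an empty dict. On a real system you would feed it the text captured from 'smartctl --attributes /dev/sdb' (run as root) instead of the pasted sample.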


You had an uncorrectable read error.

Error: UNC at LBA = 0x03800922 = 58722594

The data that was on that block is now lost.
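To relate that LBA to something on disk, the arithmetic can be sketched as below. It assumes 512-byte logical sectors, which you should confirm for your drive, and the helper name is just for illustration. Note that the LBA in the SMART error log counts from the start of the whole disk, not from the start of a partition.

```python
# Assumption: 512-byte logical sectors. Check the real value for your drive
# (for example in the 'Sector Size' line of 'smartctl -i') before relying on this.
SECTOR_SIZE = 512


def lba_to_byte_offset(lba_hex, sector_size=SECTOR_SIZE):
    """Convert a hex LBA from the SMART error log to (decimal LBA, byte offset)."""
    lba = int(lba_hex, 16)
    return lba, lba * sector_size


lba, offset = lba_to_byte_offset("0x03800922")
print(lba)     # decimal LBA, matching the figure in the error message
print(offset)  # byte offset from the start of the disk, under the 512-byte assumption
```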

You should:

  • be using a mirror in the first place. Enterprise disks are designed to sit behind a mirror: they return a read error quickly rather than retrying at length to recover the data, on the assumption that the other half of the mirror can supply it.
  • recover the lost data from backups

You have NO EXCUSE to not be using RAID (especially if you host websites for clients!) - the OS is not that large, so you don't need a dedicated disk for it on a 2-disk system.