How reliable is HDD SMART data?
In my experience (20 years of operating servers, during which I must have handled about 5,000 disks), SMART is useful but no panacea.
If you get SMART errors, replace the disk ASAP. Chances are very high that within 4-8 weeks the disk will have serious issues. (The Google study frequently mentioned in this regard correlates very nicely with my personal experience.)
Typically you have a week or 2 before the disk becomes really problematic.
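If you want to automate that check, a minimal sketch along these lines is enough to catch a failed overall-health verdict from a cron job. It uses smartctl from smartmontools; the Python wrapper and the /dev/sda device path are just assumptions for illustration, adjust for your own setup.

```python
#!/usr/bin/env python3
"""Rough sketch: flag a disk whose SMART overall-health check fails.

Assumes smartmontools is installed and the script runs with enough
privileges to query the drive (usually root). /dev/sda is an example.
"""
import subprocess
import sys

DEVICE = "/dev/sda"  # example device, adjust to your system

def smart_health(device: str) -> str:
    """Return the overall-health verdict reported by `smartctl -H`."""
    out = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True
    ).stdout
    for line in out.splitlines():
        # ATA drives report "...overall-health self-assessment test result: PASSED"
        # SCSI/SAS drives report "SMART Health Status: OK"
        if "overall-health" in line or "Health Status" in line:
            return line.split(":", 1)[1].strip()
    return "UNKNOWN"

if __name__ == "__main__":
    verdict = smart_health(DEVICE)
    print(f"{DEVICE}: {verdict}")
    # Anything other than PASSED/OK means: order a replacement now.
    if verdict not in ("PASSED", "OK"):
        sys.exit(1)
```

(smartd from the same package can of course do this monitoring for you; the sketch just shows how little it takes to wire the check into your own tooling.)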
If you don't get SMART errors at all, the disk can still fail without any warning whatsoever, although that is quite rare in servers. I see maybe 3 or 4 such cases per year, whereas we replace drives because of SMART errors at a rate of about 25 per month.
This may have to do with the fact that server disks are usually part of a RAID array and see a continuous read/write pattern all over the disk. This gets every part of the disk "exercised" (and checked) on a regular basis.
The biggest chance of a disk failing (without prior warning) is at startup, when a server has been switched off for some time after having run continuously for months or years.
In consumer equipment (non-server, laptop/desktop drives) I have seen plenty of disks with read errors that somehow didn't end up in the SMART data, even though Windows logged those errors in the Event log. (SMART only logged them after a full chkdsk run from Windows.)
This leads me to believe that, in many consumer drives, the SMART reporting thresholds are set quite loosely. This might be (big if) intentional, to keep RMA numbers low in this cut-throat business.
Many consumers will not notice the occasional bad block anyway until it is too late. (How many consumers even know where to find the Event log? That's the only place where you can see disk errors in Windows.)
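For anyone who would rather not click through the Event Viewer by hand, here is a rough sketch of pulling those disk errors out of the System log with the built-in wevtutil tool. The Python wrapper, the provider filter and the event count are assumptions for the example, not a complete monitoring solution; run it from an elevated prompt.

```python
#!/usr/bin/env python3
"""Rough sketch: list recent disk errors from the Windows System event log.

Uses the built-in wevtutil tool; filtering on the 'disk' provider catches
the classic "device has a bad block" events. Run as Administrator.
"""
import subprocess

# XPath filter: events whose source/provider is the 'disk' driver
QUERY = "*[System[Provider[@Name='disk']]]"

def recent_disk_errors(count: int = 20) -> str:
    """Return the newest `count` disk events as plain text."""
    result = subprocess.run(
        ["wevtutil", "qe", "System",
         f"/q:{QUERY}",   # filter by provider
         "/f:text",       # human-readable output
         f"/c:{count}",   # limit number of events
         "/rd:true"],     # newest first
        capture_output=True, text=True
    )
    return result.stdout or result.stderr

if __name__ == "__main__":
    print(recent_disk_errors())
```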
In my experience, if a consumer disk has issues (SMART or otherwise), copy your data off it and replace it immediately. By the time it shows those errors, it is already past saving.
It's fairly reliable, but it doesn't cover all of the failure modes a drive can have.
Using some form of RAID will help protect you in a greater number of scenarios.
I'd say that across my servers, only about 20% of disk failures are flagged by S.M.A.R.T. data first.
The way HP servers present this information makes it seem that there is a variety of metrics used to determine disk health and failure status.
There is an article about a study done on HDDs used at Google, available from http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/disk_failures.pdf, which essentially concludes that:
Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever.
There have been other studies that also conclude that SMART data is useful but has its limitations in predicting drive failure.
My personal experience is that the overall SMART status waits for far too many errors before it flags the drive as bad. The individual parameters are useful, and some are dire warnings of imminent failure at their first error, but not all of them are conclusive proof that a drive is bad or going bad soon. Read http://en.wikipedia.org/wiki/S.M.A.R.T. for a description of what the most common attributes mean.
That said, if the overall SMART status is bad, then the disk is definitely bad.
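As a rough illustration, here is a sketch that watches the individual error counters the Google study singles out (reallocations, pending/"probational" sectors, uncorrectables). The attribute ID list, the /dev/sda device path, and the use of smartmontools 7.0+ for JSON output are assumptions for the example, and the IDs only apply to ATA drives.

```python
#!/usr/bin/env python3
"""Rough sketch: flag the SMART attributes worth acting on at first error.

Assumes an ATA drive, smartmontools 7.0+ (for `smartctl -j` JSON output)
and root privileges. /dev/sda is an example device.
"""
import json
import subprocess

DEVICE = "/dev/sda"  # example device

# Non-zero raw values here are, in practice, a reason to plan a swap.
WATCH = {
    5:   "Reallocated_Sector_Ct",
    187: "Reported_Uncorrect",
    196: "Reallocated_Event_Count",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

def worrying_attributes(device: str) -> dict:
    """Return {attribute name: raw value} for watched attributes that are non-zero."""
    out = subprocess.run(
        ["smartctl", "-A", "-j", device],
        capture_output=True, text=True
    ).stdout
    table = json.loads(out).get("ata_smart_attributes", {}).get("table", [])
    return {
        attr["name"]: attr["raw"]["value"]
        for attr in table
        if attr["id"] in WATCH and attr["raw"]["value"] > 0
    }

if __name__ == "__main__":
    bad = worrying_attributes(DEVICE)
    if bad:
        print(f"{DEVICE}: plan a replacement, non-zero error counters: {bad}")
    else:
        print(f"{DEVICE}: nothing alarming in the watched attributes")
```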
In a RAID with redundant disks, if I perform two consistency check-and-fix passes in a row and errors are still found on the second pass, then I will not use that HDD anymore and will try for a warranty replacement. My reasoning is that the first pass finds and fixes any inconsistencies, so the second pass should come back clean. If it does not, it means that the HDD is unable to hold data reliably for even those few hours.
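On Linux software RAID (md) that two-pass procedure can be scripted roughly as below; the /dev/md0 array name and the repair-then-check order are assumptions for the sketch, and hardware RAID controllers expose the same idea through their own management tools.

```python
#!/usr/bin/env python3
"""Rough sketch: two-pass consistency check on a Linux md RAID array.

Pass 1 ("repair") fixes any inconsistencies; pass 2 ("check") should then
find zero mismatches. If it doesn't, a member disk cannot hold data even
for the few hours between passes. Assumes /dev/md0 and root privileges.
"""
import time
from pathlib import Path

MD = Path("/sys/block/md0/md")  # example array

def run_pass(action: str) -> int:
    """Start a sync action ('repair' or 'check'), wait for it, return mismatch_cnt."""
    (MD / "sync_action").write_text(action)
    while (MD / "sync_action").read_text().strip() != "idle":
        time.sleep(60)  # scrubbing a large array takes hours
    return int((MD / "mismatch_cnt").read_text())

if __name__ == "__main__":
    first = run_pass("repair")   # pass 1: find and fix inconsistencies
    second = run_pass("check")   # pass 2: should come back clean
    print(f"pass 1 (repair): {first} mismatched sectors, pass 2 (check): {second}")
    if second > 0:
        print("Second pass not clean: pull the suspect disk and RMA it.")
```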