Why do damaged hard drives freeze the entire system?
Why does a hard drive that is known to have bad blocks (verified in HDTune and HDDScan) freeze my entire system?
It is not the OS drive; it is attached to another SATA port, and I'm trying to copy files from it to another healthy drive.
I have experienced this issue with almost every damaged hard drive and every Windows PC.
I would expect to see freezing only in the program I'm using to copy the files (Windows Explorer, etc.), but instead my entire PC gets jerky, and I cannot browse the web or watch movies while copying files from the damaged drive.
The long story:
I live in a rural area where there are problems with electricity (brownouts, etc.). I myself use a UPS, and my own hard drives are perfectly fine. But my neighbors often ask for help with their PC issues, and I often find that their hard drives are damaged, most probably because of the electricity issues. Of course, after replacing a damaged drive I suggest that my neighbors buy a UPS.
I have always wondered why my PC freezes entirely while retrieving data from damaged drives. Is it a hardware issue? Is it caused by the way the OS reads data? Is it something Windows-specific that I won't experience on *nix?
Anyway, from now on I will use dedicated software (such as Roadkil's Unstoppable Copier) instead of Windows Explorer, although I'm not sure whether it will behave any differently and avoid freezing the entire PC.
This is not a request for help; it is more for educational purposes, so that I know why things work this way.
This is one of those areas where SATA is suboptimal. The problem is at the storage device interconnect protocol level, and thus not related to what software you are running. Using another file copier or another operating system won't magically make things better, except that it might try to set different timeout values to reduce the impact of the problem (which may or may not be possible depending on the hardware and firmware; see below).
There are a few important points here:
- With SATA, if the drive stops responding, this can tie up the whole storage system, not just the one drive that is having problems. It certainly has the potential to tie up the whole controller, and since most consumer systems have only a single disk controller (the one integrated on the motherboard), this means all storage. It's even worse if the drive fails in some non-standard and/or unexpected way, which can certainly happen if the drive is marginal. You may be interested in "How can a single disk in a hardware SATA RAID-10 array bring the entire array to a screeching halt?" on Server Fault.
- Most consumer SATA drives have long default timeout periods (on the order of minutes), and many lack configurable error recovery control (ERC). So-called "NAS" drives often have configurable ERC, and high-end drives virtually always do; such drives may also have shorter default timeouts (7 seconds being a common value); a sketch of checking and adjusting ERC with smartctl follows below. Long timeout periods are advantageous if the drive holds the only copy of the data, which unfortunately is common on consumer systems; they are a disadvantage in a redundant configuration, or when you simply want to get as much as possible off the drive before it deteriorates further.
- A drive will keep trying to read a bad sector until it reaches its timeout threshold or until an abort is signalled by the host. Since the SATA bus can be tied up by the wait for the read to finish, it might not be possible for the OS to signal a storage-level command abort, and in extreme cases, drives might not even respond well to a SATA bus reset in such a situation.
Point #1 is one of the main selling points for SAS on servers; SAS has significantly better error handling than SATA. Point #2 is a drive-firmware limitation, and #3 really only becomes a problem because of #2.
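As a concrete illustration of point #2, the SCT Error Recovery Control timeouts can be inspected, and on drives that support the feature shortened, with smartmontools. Below is a minimal sketch of doing that from Python, assuming a Linux host with smartctl installed, root privileges, and a hypothetical device name /dev/sdb; drives without configurable ERC will simply refuse the "set" command.

```python
import subprocess

DEVICE = "/dev/sdb"  # hypothetical device name; point this at the failing (non-OS) drive

def show_erc(device: str) -> None:
    """Report the drive's current SCT Error Recovery Control timeouts, if supported."""
    subprocess.run(["smartctl", "-l", "scterc", device], check=True)

def set_erc(device: str, deciseconds: int = 70) -> None:
    """Ask the drive to give up on a failing read/write after the given time.

    smartctl takes the value in tenths of a second, so 70 means 7.0 seconds.
    Consumer drives without configurable ERC will reject this command, and the
    setting is generally lost again when the drive is power cycled.
    """
    subprocess.run(
        ["smartctl", "-l", f"scterc,{deciseconds},{deciseconds}", device],
        check=True,
    )

if __name__ == "__main__":
    show_erc(DEVICE)
    set_erc(DEVICE, 70)
    show_erc(DEVICE)
```

Shortening ERC only limits how long the drive retries internally; it does not repair anything, and on a drive that holds the only copy of the data a shorter timeout trades hang time for a higher chance of an unrecovered read.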
So what happens is that the OS issues a "read sectors" command to the disk, and the particular sectors are somehow damaged. Thus, the disk goes into retry mode to try to get the data off the platters, trying the read again and again until it gets good enough data that the disk's own error correction (FEC) is able to correct for the remaining errors. If you are unlucky, this might be never, but the drive will keep trying for some fairly long period of time before deciding that this read isn't going to succeed.
Because the operating system is waiting for the read, this will at the very least slow down the copying process to a crawl, and depending on the exact OS architecture can cause the OS to become jerky or even freeze for the duration. The disk, at this point, is busy with the original read and won't respond to further read commands until the one that is currently executing ends (successfully or unsuccessfully), and other software generally won't do better than the operating system it is running on.
Hence, anything that triggers a read (ideally only on the damaged drive, but because of SATA's less-than-optimal handling of nonresponsive drives, potentially anywhere on the same controller) is going to have to wait in line until the damaged drive either successfully reads the sector in question or determines that it cannot be read. This means it is not only the drive you are copying from that can have its I/O delayed; other software can very easily become slow or unresponsive as well, as it waits for a different I/O request to finish, even if the operating system itself is able to cope.
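This is also essentially why dedicated recovery tools behave differently from a plain Explorer copy: they read in fixed-size chunks, skip the chunks that return I/O errors, and keep whatever was readable instead of aborting the whole copy. Below is a rough, hypothetical sketch of that idea in Python (Unix-style paths and the chunk size are made up for illustration). Note that it cannot shorten the drive's internal retries, so every bad chunk can still stall for the full drive-side timeout.

```python
import os

CHUNK = 64 * 1024  # read granularity; smaller chunks lose less data around bad spots

def salvage_copy(src_path: str, dst_path: str) -> int:
    """Copy src to dst chunk by chunk, writing zeros where a chunk cannot be read.

    Returns the number of unreadable chunks. Each failed chunk may still block
    for the drive's full internal retry period, because the I/O error is only
    reported to us after the drive itself gives up.
    """
    bad = 0
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        offset = 0
        while True:
            try:
                data = os.pread(src, CHUNK, offset)
            except OSError:      # unreadable region: note it, fill with zeros, move on
                bad += 1
                os.pwrite(dst, b"\x00" * CHUNK, offset)
                offset += CHUNK
                continue
            if not data:         # end of the source
                break
            os.pwrite(dst, data, offset)
            offset += len(data)
    finally:
        os.close(src)
        os.close(dst)
    return bad

# Hypothetical usage: image a failing partition onto a healthy drive.
# salvage_copy("/dev/sdb1", "/mnt/healthy/sdb1.img")
```

Tools such as GNU ddrescue refine this approach considerably (keeping a map of what has been read and saving the hardest areas for last), but the core difference from a normal copy is simply not letting one unreadable region abort the whole job.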
It's also important to note here that disk I/O can happen even though you aren't explicitly accessing any files on disk. The two main causes for this would be load-on-demand executable code, and swap. Since swap is sometimes used even when the system is not under memory pressure, and load-on-demand executable code is common on modern systems and with modern executable file formats, unintended disk read activity during normal use is a very real possibility.
As pointed out in a comment to the question by Matteo Italia, one mitigation strategy is to use a different storage interconnect, which is a complicated way of saying "put the disk in a USB enclosure". By abstracting through the USB mass storage protocol, this isolates the problematic SATA portion from the rest of your system, which means that in theory, only I/O on that specific disk should be affected by I/O problems on that disk.
As a bit of an aside, this is pretty much why SATA (particularly SATA without drive-level ERC) is often discouraged for RAID, especially RAID levels with redundancy (which, among the standard levels, is everything except RAID 0): the long timeout periods and poor error handling can easily cause a whole drive to be thrown out of the array over a single bad sector, something the RAID controller could handle just fine if redundancy exists and the storage controller simply knows that this is the problem. SAS was designed for large storage arrays, and thus with the expectation that there will occasionally be problems on various drives, which led to it being designed to handle a single problematic drive or I/O request gracefully even when the drive itself does not. Problematic disks are not very common in consumer systems simply because those tend not to have many disks installed, and the ones that are installed virtually never have redundancy; since SATA aimed to replace PATA/IDE, not SCSI (the latter being the niche SAS aimed for), it is likely that its error handling features and guarantees were considered adequate for its intended use case.
As was stated above, the system freezes caused by a bad hard drive are primarily the result of the drive's long attempts to recover unreadable data from bad sectors. One of the selling points of enterprise drives is a very short read timeout for failed sectors. Using an enterprise drive can mitigate your issues to some degree, but will not solve them.
The best answer, moving forward, is to maintain proper backups so that recovery isn't required. Changing recovery software will not make a difference as this is a firmware timeout issue.
Why do damaged hard drives freeze the entire system?
They don't have to (in general). How a disk failure is dealt with really depends on the particular file system.
Consider ZFS, which is designed from the ground up with fault tolerance in mind. Here's a demo video (and one with more explanation) in which running drives are placed on an anvil and hit with a sledgehammer, and another drive is drilled through, all while ZFS keeps running.
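If you want to see that behaviour for yourself without a sledgehammer, a two-disk mirror is the simplest demonstration: with redundancy, ZFS keeps the pool available and merely marks it as degraded when one device disappears. Below is a minimal sketch, assuming a Linux/OpenZFS host, the standard zpool tool, and two spare disks with hypothetical device names whose contents you can afford to lose.

```python
import subprocess

# Hypothetical device names; only use disks whose contents you can afford to lose.
DISK_A = "/dev/sdb"
DISK_B = "/dev/sdc"

def run(*cmd: str) -> None:
    """Echo and execute a command, failing loudly if it returns an error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create a two-way mirror: either disk can drop out without losing the pool.
run("zpool", "create", "demo", "mirror", DISK_A, DISK_B)

# ... write some data to /demo, then take one disk away ...
run("zpool", "offline", "demo", DISK_A)

# The pool reports DEGRADED but keeps serving reads and writes from the survivor.
run("zpool", "status", "demo")
```

Without redundancy (a single-disk pool), ZFS can still detect corruption through its checksums, but it cannot read around it, so the degraded-but-running behaviour shown in the videos requires a mirror or raidz layout.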