HDD is acting up, but S.M.A.R.T says everything is fine

Solution 1:

You most likely have a disk problem. The disk is failing, and one fairly common failure mode is higher latency caused by an increased number of retries on certain problematic areas of the disk. When one of these areas is hit, other IOs end up waiting on it in a chain reaction. If several IOs target the affected area, you will see exactly this kind of problem, because multiple IOs will be blocking for more than 10 seconds.

I recommend testing the disk with diskscan, which shows a latency graph across the disk. It can work in read-only mode, so it is not destructive at all. You can also ask it to fix areas that are readable but very slow, but first test the disk to see how it behaves.
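To illustrate what a read-only latency scan does, here is a minimal Python sketch. It is not how diskscan works internally. The function name, chunk size, and threshold are my own assumptions. It also reads through the page cache, so recently read areas will look fast, and a hard read error surfaces as an OSError rather than a latency figure.

```python
import os
import time


def scan_read_latency(path, chunk_size=1024 * 1024, slow_ms=100.0):
    """Read `path` sequentially and report chunks that were slow to read.

    The target is opened read-only, so nothing on it is modified.
    Returns (bytes_read, [(offset, latency_ms), ...]).
    """
    slow_chunks = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            start = time.perf_counter()
            data = os.read(fd, chunk_size)
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            if not data:
                break  # end of device or file
            if elapsed_ms >= slow_ms:
                slow_chunks.append((offset, elapsed_ms))
            offset += len(data)
    finally:
        os.close(fd)
    return offset, slow_chunks


# Example use (the device path is a placeholder; reading a raw device needs root):
# total, slow = scan_read_latency("/dev/sdX")
# for offset, latency_ms in slow:
#     print(f"offset {offset}: {latency_ms:.1f} ms")
```

Offsets that show up repeatedly across runs are the areas worth a closer look with a proper tool.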

It is possible that the problem is intermittent and so will not be noticed by diskscan. You can run iosnoop to collect a history of all IOs and their latencies. The script adds some overhead but works very nicely. If the problem happens only infrequently, you may need some scripting around it for a longer logging session.
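For a long logging session, it helps to keep only the slow IOs. Here is a rough Python sketch of such a filter. It assumes the latency in milliseconds is the last whitespace-separated column of each trace line, which is how I remember iosnoop's default output, so check it against your version. The function name and threshold are hypothetical.

```python
def slow_ios(lines, threshold_ms=1000.0):
    """Yield (latency_ms, line) for trace lines at or above threshold_ms.

    Assumption: the last column of each data line is the latency in ms.
    Header and status lines are skipped because their last column is not a number.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            latency_ms = float(fields[-1])
        except ValueError:
            continue  # header or status line
        if latency_ms >= threshold_ms:
            yield latency_ms, line.rstrip("\n")


# Example use on a saved trace (the file name is hypothetical):
# with open("iosnoop.log") as trace:
#     for latency_ms, line in slow_ios(trace):
#         print(line)
```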

You can increase the SCSI subsystem logging level to try to get more information out of the kernel. If you use an LSI SAS HBA to access the disks, you can also increase the logging level of the mpt2sas driver to get more information out of it. Both can help you see whether there are timeouts and aborts in the kernel. Also check whether the kernel log already contains messages pertaining to timeouts and aborts; they may serve as another clue.
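To check the kernel log for such messages, something like the following Python sketch can do a first pass. The keyword pattern is my own guess at what is worth flagging, not an exhaustive list of what the SCSI layer or the HBA driver prints, and dmesg may need root on some systems.

```python
import re
import subprocess

# Assumed keywords: "timeout", "timed out", "timing out", "abort", "aborted", ...
PATTERN = re.compile(r"tim(e|ed|ing)[ -]?out|abort", re.IGNORECASE)


def suspicious_lines(log_lines):
    """Return the log lines that mention a timeout or an abort."""
    return [line.rstrip("\n") for line in log_lines if PATTERN.search(line)]


def read_kernel_log():
    """Return the kernel ring buffer as a list of lines, using dmesg."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


# Example use:
# for line in suspicious_lines(read_kernel_log()):
#     print(line)
```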

Edit 1:

To enable SCSI debug logging, you can use the command: echo 9411 > /proc/sys/dev/scsi/logging_level (on your system the file may be in a different location).
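If you want to know what a logging level value actually turns on, it can be decoded. The sketch below assumes the layout I recall from the kernel's scsi_logging.h: ten facilities, 3 bits each, starting from the lowest bits. Verify this against your kernel source before relying on it. The function names are my own.

```python
# Facility order, lowest bits first, 3 bits per facility.
# Assumption: this matches the shift values in the kernel's scsi_logging.h.
FACILITIES = [
    "ERROR", "TIMEOUT", "SCAN", "MLQUEUE", "MLCOMPLETE",
    "LLQUEUE", "LLCOMPLETE", "HLQUEUE", "HLCOMPLETE", "IOCTL",
]


def decode_logging_level(value):
    """Split a logging_level value into a per-facility level (0-7)."""
    return {
        name: (value >> (3 * index)) & 0x7
        for index, name in enumerate(FACILITIES)
    }


def encode_logging_level(levels):
    """Build a logging_level value from a {facility: level} mapping."""
    value = 0
    for index, name in enumerate(FACILITIES):
        level = levels.get(name, 0)
        if not 0 <= level <= 7:
            raise ValueError(f"{name} level must be between 0 and 7")
        value |= level << (3 * index)
    return value


# Example use:
# print(decode_logging_level(9411))
# print(encode_logging_level({"ERROR": 3, "TIMEOUT": 3}))
```

Under that assumed layout, 9411 decodes to ERROR=3, SCAN=3, MLQUEUE=2 and MLCOMPLETE=2, with the other facilities at 0. If you specifically want timeout messages, you may want to build a value with a non-zero TIMEOUT level instead.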

Also try running smartctl with the -x option; it will show the last few errors, if there are any.