Strange Recurrent Excessive I/O Wait

(I assume that your disks are directly attached to the server and not accessed over NFS, for example.)

What matters is that your svctm (in the iostat output) is extremely high, which suggests a hardware problem with the RAID or the disks. svctm for normal disks should be around 4 ms. It may be lower, but not much higher.
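If you want to watch for this automatically, here is a minimal sketch that scans saved `iostat -x` output and lists devices whose svctm is above a limit. The function name, the 10 ms default threshold and the sample numbers are my own assumptions for illustration, not measurements from your system. It looks up the svctm column by its header name, because the column layout differs between sysstat versions (and newer versions no longer print svctm at all).

```python
def find_slow_devices(iostat_text, threshold_ms=10.0):
    """Return {device: svctm_ms} for devices whose svctm exceeds threshold_ms.

    threshold_ms is an assumed cut-off, chosen well above the ~4 ms that
    is typical for a healthy disk.
    """
    slow = {}
    svctm_idx = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            # A blank line ends the current device table.
            svctm_idx = None
            continue
        if fields[0].startswith("Device"):
            # Header row: find the svctm column by name, if present.
            svctm_idx = fields.index("svctm") if "svctm" in fields else None
            continue
        if svctm_idx is None or len(fields) <= svctm_idx:
            continue
        try:
            # Some locales print decimal commas.
            svctm = float(fields[svctm_idx].replace(",", "."))
        except ValueError:
            continue
        if svctm > threshold_ms:
            slow[fields[0]] = svctm
    return slow


# Made-up sample in the old `iostat -x` layout, for illustration only.
SAMPLE = """\
Device:  rrqm/s wrqm/s  r/s   w/s  rsec/s wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00   5.00 1.00 40.00    8.00 360.00     8.98    12.40 302.10  24.30  99.60
sdb        0.00   1.00 2.00 10.00   16.00  88.00     8.67     0.05   4.10   3.20   3.84
"""
```

Calling `find_slow_devices(SAMPLE)` on the sample above flags only `sda`; in practice you would feed it the text captured from `iostat -x` during one of the slow periods.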

Unfortunately, the smartctl output is not informative in your case. It shows corrected errors, but this can be normal. The long test seems to have completed OK, but that is inconclusive as well. The ST3500620SS seems to be a good old server/RAID-class disk, which should respond quickly on read errors (unlike desktop/non-RAID disks), so this could be a more complicated hardware problem than just bad sectors. Try to find something unusual (such as high error rates) in the RAID statistics: http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS

My suggestion is that replacing the disks should be the next step.


Update:

svctm is the more important factor, as the high %util is just a consequence of svctm being abnormally high.
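To see why, here is a rough sketch of how the two numbers are tied together. As far as I know, the older sysstat versions compute svctm from the device's busy time and the request rate, so %util is approximately (r/s + w/s) × svctm / 10 when svctm is in milliseconds. The function name and the example numbers are made up for illustration.

```python
def util_percent(reads_per_s, writes_per_s, svctm_ms):
    """Approximate %util from the request rate and svctm (in ms).

    Assumes the old iostat relationship: the device is busy for
    (requests per second * svctm) milliseconds out of every 1000 ms.
    """
    iops = reads_per_s + writes_per_s
    busy_ms_per_second = iops * svctm_ms
    # 1000 ms per second and 100 percent: divide by 10. Cap at 100%.
    return min(100.0, busy_ms_per_second / 10.0)
```

For example, 100 requests per second at a healthy 4 ms each keep the disk about 40% busy, while the same 100 requests per second at 25 ms each would need 2500 ms of disk time per second, so the device sits pinned at 100% and requests queue up.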

I saw a similar problem when desktop disks were installed in a Promise RAID. Desktop disks are designed to try to repair read errors with many long retries, which adds latency (these read errors could be caused by some other factor, such as vibration, which is much stronger in a server room than on a desktop). In contrast, disks designed for use in RAID quickly report any errors to the RAID controller, which can correct them using RAID redundancy. In addition, server disks may be designed to be more mechanically resistant to constant strong vibration. There is a common myth that server disks are the same as desktop disks, just more expensive. That is wrong; they are actually different.

Q: Ah, what I wanted to ask: if it's a hardware problem, don't you think that the problem should be continually visible and not disappear for some time? Do you happen to have any explanation for that effect?

A:

  1. The problem may always be there but become noticeable only under high load.
  2. Vibration levels could differ at different times of day (depending, for example, on what nearby servers are doing). If your problem is disks being affected by vibration, it could definitely disappear and reappear. I saw similar behavior when I had my 'desktop disks' problem. (Of course, your disks are server ones and recommended for RAID, so it's not exactly the same problem. But it could be similar.)