Getting a RAID controller to surface scan on a sane schedule
The controller I'm presently working with is quite old, the HP Smart Array P400; in part I want to know how to deal with that controller, but I'm also interested in the general perspective -- if there are other/newer controllers that handle this better, how do they handle it? I'm looking ideally for OS-neutral solutions, but if that doesn't work, it's running VMware ESXi.
There are basically two settings for surface scan on this controller: high, or idle with a configurable delay in seconds.
For years it's been on idle with a 3 second delay. (Not sure why, this was probably the default.) However, I recently got concerned that this means it basically never runs the surface scan, since even during periods of very little actual use, ESXi sends "heartbeat" I/O more frequently than that, and most of the guest OSes also send little blips of one kind or another during idle time.
Figuring it's a bad idea to effectively have the controller never do a surface scan, I picked the only other option, "high".
There might be some kind of performance penalty here, but this array's workload is just system disks for the VMs, not data disks (I use ZFS on a plain HBA for that), so nobody's noticed thus far.
My concern is that now the drives won't stop, period. I've had this setting for several days, and over those days there have been plenty of idle periods, so I figure the controller could have completed a full scan by now. I can do a ZFS scrub on a pool 7 times larger and on lower RPM drives in less time. I've peeked at the server a number of times during idle periods and not once have I seen it without the disk lights dancing around like a music video.
It seems like it has the scan on an infinite loop, without any kind of delay in between scans. Am I correct here?
This seems kind of ridiculous to me. I would have hoped that once the controller managed to get through a scan, it would stop for a few days at least before starting the next one. I really doubt sectors degrade quickly enough to justify constant scanning.
I'm worried that this is going to kill off drives way faster. These are 2.5" 10k SAS disks, 300GB and 600GB, in RAID 1+0. Is this a valid concern? I'm guessing this setting has increased total daily disk activity by at least ten times.
Now, disks constantly spin regardless of access, heads don't actually touch platters, and the actuator is moved by a contactless electromagnetic system. So I think the only big difference in wear-out would be on the actuator axis bearing, when the disk seeks. In principle that sounds pretty minor, but in practice it does seem that lots of seeks wear drives out faster.
I imagine this scan is accessing sectors sequentially, which, in and of itself, wouldn't involve tons of actuator movement. However, if the scan is being frequently interrupted by little idle accesses that need the heads to be somewhere else, that could at worst amplify that back-and-forth significantly.
(I should perhaps look at migrating to SSDs, but in any case I don't want to kill off the magnetic disks already installed.)
To summarize, my questions are:
- Is it actually going to scan continually?
- Is there some way to make this scanning periodic instead of continuous? (If not on this controller, even on any different ones?)
- Should I actually be worried about this wearing out the disks?
Geez... That's a lot of effort.
Disks are consumable. If one fails, let it fail.
The HP SmartArray will tell you and you can replace the drive as intended.
Replacement disks are cheap for that era of server (2007-2009), so you shouldn't overthink how these background processes work.
I would not use the high setting for extended periods because it can impact IO performance.
From HP Smart Array manual:
SurfaceScanMode
This parameter specifies the Surface Scan Mode with the following values: High—The surface scan enters a mode guaranteed to make progress despite the level of controller I/O.
In other words, the controller will not prioritize real I/O over scan/scrub I/O. I suggest leaving the default medium setting: if the disks are constantly accessed by your application, it probably needs that performance.
If bit rot worries you, surface scan can occasionally be set to high (e.g. during one weekend each month) but, as suggested by others, I would not bother changing the default setting.
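If you did want to automate the "one weekend each month" idea, the scheduling decision itself is easy to script. Here is a rough Python sketch that only decides which mode should be active at a given time. The function name and the "high"/"default" labels are my own placeholders, not anything from the HP tools, and "first weekend" is assumed to mean the first Saturday of the month plus the Sunday right after it. Actually applying the mode is left out on purpose, since the exact command depends on which HP CLI version you have.

```python
from datetime import datetime


def desired_scan_mode(now: datetime) -> str:
    """Return "high" during the first weekend of the month, else "default".

    Hypothetical helper: the returned labels are placeholders, not
    controller settings taken from HP documentation.
    """
    weekday = now.weekday()  # Monday=0 ... Saturday=5, Sunday=6
    if weekday == 5:
        # The first Saturday always falls on day 1..7.
        return "high" if now.day <= 7 else "default"
    if weekday == 6:
        # The Sunday after the first Saturday falls on day 2..8.
        return "high" if 2 <= now.day <= 8 else "default"
    return "default"


if __name__ == "__main__":
    # A scheduled job could run this hourly and change the controller
    # setting only when the desired mode differs from the current one.
    print(desired_scan_mode(datetime.now()))
```

Note that a Sunday falling on the 1st is deliberately not counted, so the window is always two consecutive days.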