Do RAID controllers commonly have SATA drive brand compatibility issues?

We've struggled with the RAID controller in our database server, a Lenovo ThinkServer RD120. It is a rebranded Adaptec that Lenovo / IBM dubs the ServeRAID 8k.

We have patched this ServeRAID 8k up to the very latest and greatest:

  • RAID BIOS version
  • RAID backplane BIOS version
  • Windows Server 2008 driver

This RAID controller has had multiple critical BIOS updates even in the short four months we've owned it, and the change history is just... well, scary.

We've tried both write-back and write-through strategies on the logical RAID drives. We still get intermittent I/O errors under heavy disk activity. They are not common, but serious when they happen, as they cause SQL Server 2008 I/O timeouts and sometimes failure of SQL connection pools.

We were at the end of our rope troubleshooting this problem. Short of hardcore stuff like replacing the entire server, or replacing the RAID hardware, we were getting desperate.

When I first got the server, I had a problem where drive bay #6 wasn't recognized. Switching out hard drives to a different brand, strangely, fixed this -- and updating the RAID BIOS (for the first of many times) fixed it permanently, so I was able to use the original "incompatible" drive in bay 6. On a hunch, I began to suspect that the Western Digital SATA hard drives I chose were somehow incompatible with the ServeRAID 8k controller.

Buying 6 new hard drives was one of the cheaper options on the table, so I went for 6 Hitachi (aka IBM, aka Lenovo) hard drives under the theory that an IBM/Lenovo RAID controller is more likely to work with the drives it's typically sold with.

Looks like that hunch paid off -- we've been through three of our heaviest load days (Mon, Tue, Wed) without a single I/O error of any kind. Prior to this we regularly had at least one I/O "event" in this time frame. It sure looks like switching brands of hard drive has fixed our intermittent RAID I/O problems!

While I understand that IBM/Lenovo probably tests their RAID controller exclusively with their own brand of hard drives, I'm disturbed that a RAID controller would have such subtle I/O problems with particular brands of hard drives.

So my question is, is this sort of SATA drive incompatibility common with RAID controllers? Are there some brands of drives that work better than others, or that are "validated" against particular RAID controllers? I had sort of assumed that all commodity SATA hard drives were alike and would work reasonably well in any given RAID controller (of sufficient quality).


Solution 1:

Yes, I have encountered this with low-end cards and buggy drivers. However, no, not on an up-to-date Adaptec rebranded card. Wow is all I can say. One thing to consider: it may be more a bug with the drive than with the RAID controller.

I don't have a good answer, but since you seem to have exhausted most of your options other than replacing the card (and replacing the drives did the trick), here are a few ideas you can consider for your troubleshooting:

  • The WD drives were RE (RAID Edition) drives, right? Time-limited error recovery is important, so if you don't have that and the drive is attempting to recover the sector, you are going to get a looooong pause from that drive. If the RAID controller is being patient and not dropping the drive, you'll have a big problem on your hands.

  • Check the SMART data on the drives you removed and see if there is anything interesting.
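As a rough sketch of the SMART check suggested above, assuming the removed drives are attached directly to a machine that has smartmontools installed: the snippet parses the attribute table printed by `smartctl -A` and flags non-zero raw values. The function names and the `WATCHED` list are illustrative choices, not an official set of attributes.

```python
import subprocess

# Attributes that often hint at media or cabling trouble.
# This selection is a judgment call, not an official list.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_smart_attributes(text):
    """Map attribute name -> raw value from `smartctl -A` table output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():  # skip raw values that are not plain integers
                attrs[fields[1]] = int(raw)
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


def read_drive(device):
    """Run smartctl on a device path such as /dev/sda (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_smart_attributes(result.stdout)
```

On the machine holding the old drives, `suspicious(read_drive("/dev/sda"))` would list anything worth a closer look (the device path is a placeholder). For the error recovery question, `smartctl -l scterc <device>` reports whether the drive supports a configurable recovery timeout and what it is set to.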

Another comment about the importance of the time-limited error recovery (TLER) feature, from NAS / RAID vendor support:

As I mention before, we always suggest customers to use enterprise level drives if they use the drives in RAID settings. Enterprise level drives have more consistent responding time so that the RAID will be safer.

Solution 2:

Even for non-RAID, plain-old desktop hard drives, buying drives from the vendor (at the expected ridiculous markup) can often make a difference. For example, Apple is careful to only ship drives that are actually capable of honoring Mac OS X's F_FULLFSYNC fcntl() flag, which goes a long way towards making sure things like Time Machine backups work reliably.

Again, this is plain vanilla desktop use with no RAID involved. Anything more complex than that and you definitely want to buy, if not the vendor's own over-priced drives, then at least drive models that you know for sure are on the vendor's "approved" list.
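For anyone curious what that flush request looks like from application code, here is a minimal sketch in Python. On macOS the flag is exposed as `fcntl.F_FULLFSYNC`; the `durable_write` helper is a made-up name for illustration, and the fallback to a plain `fsync` on other platforms is an assumption of this sketch, not something Apple prescribes.

```python
import fcntl
import os


def durable_write(path, data):
    """Write bytes, then ask the OS to push them to stable storage.

    Returns the name of the sync method that was used.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # drain Python's userspace buffer first
        full_sync = getattr(fcntl, "F_FULLFSYNC", None)  # only defined on macOS
        if full_sync is not None:
            # Ask the drive itself to flush its write cache to the platters.
            fcntl.fcntl(f.fileno(), full_sync)
            return "F_FULLFSYNC"
        # Other platforms: fall back to an ordinary fsync.
        os.fsync(f.fileno())
        return "fsync"
```

Whether the data really reaches the platters still depends on the drive honoring the request, which is exactly the point above about buying drives the vendor has tested.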

So, to answer your question, is it common? I'd say, yes, more common than you might think, even beyond the enterprise realm.

Solution 3:

I don't think it's common per se. However, as soon as you start using enterprise storage controllers, whether that be SANs or standalone RAID controllers, you'll generally want to adhere to their compatibility list rather closely.

You may be able to save some bucks on the sticker price by buying a cheap range of disks, but that's probably one of the last areas I'd want to save money on - given the importance of data in most scenarios.

In other words, explicit incompatibility is very uncommon, but sticking to the explicit compatibility list is recommended.

Solution 4:

I wouldn't dream of using SATA disks for a server - none of them have the expected duty cycle of a server-quality drive, and they don't have the rich command set that SCSI/SAS has for monitoring drive performance and health. Lenovo servers are cheap and great if you have lots of servers with none of them really that important, but there's a reason that HP's 300-series servers account for 40% of the market - they work. In particular, their 'SmartArray' disk controllers are matchless in reliability and performance, and their pre-failure guarantee is a welcome addition. Not the cheapest, but how much is your time worth? I've been buying their (well, Compaq first tbh) servers for twenty years now and have no issue whatsoever buying the 500-800 new ones a year that I do. Seriously, check them out.

Solution 5:

The answer as always is "it depends".

For certain enterprise storage (say EMC), the vendor will specifically qualify drives and even go to the extent of loading custom firmware.

As Mark says, I find it best to follow a vendor's approved list if there is one. The initial cost savings are outweighed by the time spent trying to hunt down gremlins.