Sparing level on HP EVA 4000

One of the disks of our EVA4000 died today. This diskgroup (all volumes vraid5 with sparing level 1 and almost no space left for more volumes, 1TiB drives) is being rebuilt with "spare space" right now, and it will take at least 15 hours to do the leveling/rebuilding.

We can't get a new disk until Friday. So the question is: what would happen if another disk dies before the leveling completes? Would we lose data? And after that, how many additional disks could die before we lose data, one or two?

In "usual" RAID we would be vulnerable to data loss while the rebuild takes place, but in this case the space reserved for sparing is twice the size of the largest disk, so at the very least the effect should be the same as having two spares.

Thanks in advance.

Update: I have found some interesting threads about this question but still can't answer it, so I'm starting a bounty.

http://blog.thestoragearchitect.com/2008/10/27/understanding-eva/

http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&url=http%3A%2F%2Fwww.experts-exchange.com%2FStorage%2FStorage_Technology%2FQ_25548177.html (Experts Exchange question, via Google).

Short version

Leveling is the process that follows the rebuild. If your array is leveling, you are just as safe as you were before the disk failed.

Long version

When you lose a disk, the EVA will automatically try to use any of the space on the remaining healthy disks to create a redundant copy of the data that used to be on that disk. If you had one volume group with one big virtual disk with Vraid5 parity and you lost a single disk, the EVA will regenerate the data that used to be on the failed disk in the free space on the first disk. If there isn't enough space it will use 2, 3 or more disks, but you will get a redundant copy of your data in the shortest time possible. How long that takes, I cannot tell you. But you will be back to the "you can lose a disk and not lose your data" state in a very short time. That is, of course, if you have enough free space on your disks.

You mentioned sparing. I am not familiar with this term, but I assume you mean "failure protection level", which is the space that the EVA reserves for an emergency like the one you are describing. Single protection level means that it reserves the size of two of your largest disks; double reserves the size of four. The EVA will not report this space as free. So if you have single protection level and are using 95% with 16 1TB disks, you have 2TB reserved and are using 95% of the remaining 14TB. That is 13.3TB used and 2.7TB unused (0.7TB free plus the 2TB reserve). Taking Vraid5 into account, the 13.3TB splits into 10.64TB of usable space and 2.66TB spent on parity.
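The arithmetic above can be checked with a short script. This is only a sketch of the numbers in this answer, not of how the EVA or Command View computes capacity: the function name `eva_capacity` and its parameters are made up for illustration, and the 4 data + 1 parity split is the assumption used for Vraid5 here.

```python
def eva_capacity(disks, disk_tb, reserved_disks, used_fraction,
                 data_per_stripe=4, parity_per_stripe=1):
    """Reproduce the capacity figures from the example (hypothetical helper).

    reserved_disks: disks' worth of space held back for sparing
                    (2 for single protection, 4 for double, per the text).
    used_fraction:  share of the non-reserved space that is allocated.
    The 4+1 stripe layout is an assumption about Vraid5.
    """
    raw_tb = disks * disk_tb
    reserved_tb = reserved_disks * disk_tb
    allocatable_tb = raw_tb - reserved_tb
    used_tb = allocatable_tb * used_fraction
    stripe_width = data_per_stripe + parity_per_stripe
    return {
        "raw_tb": raw_tb,
        "reserved_tb": reserved_tb,
        "allocatable_tb": allocatable_tb,
        "used_tb": used_tb,
        # Everything not holding data or parity, including the reserve.
        "unused_tb": raw_tb - used_tb,
        "data_tb": used_tb * data_per_stripe / stripe_width,
        "parity_tb": used_tb * parity_per_stripe / stripe_width,
    }


if __name__ == "__main__":
    # 16 x 1TB disks, single protection level (2 disks reserved), 95% used.
    for name, value in eva_capacity(16, 1.0, 2, 0.95).items():
        print(f"{name}: {value:.2f}")
```

Passing `reserved_disks=4` gives the corresponding figures for double protection.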

Once the EVA has made a redundant copy on as few disks as possible, it will start leveling (I personally prefer to call it "balancing") the data. This process moves the data around so that all your disks end up with approximately the same amount of data. It takes an awfully long time, especially if your usage is quite high, but you are safe if you have another failure during it.

Go into Command View and check the status of the volume group. If it says that it is leveling, you are just as safe as you were before the failure.

You are now down to 15TB of raw disk space and you are using 13.3TB. The EVA wants to maintain a single protection level but it cannot reserve 2TB (you only have 1.7TB unused), so it is probably reporting the requested protection level as single and the actual protection level as none. It may also be reporting your usage as over 100%, since you are using 13.3TB and to satisfy the single protection requirement you should be under 13TB (15TB total - 2TB reserved for single protection).
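The same reasoning as a small check, again only a sketch: `protection_after_failure` is a hypothetical helper that mirrors the arithmetic in this answer, and the way it derives the "over 100%" usage figure is an assumption, not Command View's documented formula.

```python
def protection_after_failure(disks, disk_tb, used_tb, failed_disks,
                             requested_reserved_disks):
    """Check whether the requested sparing reserve still fits (hypothetical).

    requested_reserved_disks: 2 for single protection, 4 for double,
    following the definition used in this answer.
    """
    raw_tb = (disks - failed_disks) * disk_tb
    unused_tb = raw_tb - used_tb
    needed_tb = requested_reserved_disks * disk_tb
    # Space that may be allocated while still honouring the reserve.
    limit_tb = raw_tb - needed_tb
    return {
        "raw_tb": raw_tb,
        "unused_tb": unused_tb,
        "limit_tb": limit_tb,
        "reserve_met": unused_tb >= needed_tb,
        # Assumed definition: usage measured against the limit above.
        "usage_pct": used_tb / limit_tb * 100,
    }


if __name__ == "__main__":
    # 16 x 1TB disks, one failed, 13.3TB in use, single protection requested.
    state = protection_after_failure(16, 1.0, 13.3, 1, 2)
    print(f"raw: {state['raw_tb']:.1f}TB, unused: {state['unused_tb']:.1f}TB")
    print(f"reserve met: {state['reserve_met']}, "
          f"usage: {state['usage_pct']:.1f}%")
```

With these inputs the reserve no longer fits and the usage figure comes out slightly above 100%, which matches the description above.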

This means that you can lose another disk and still have healthy storage. You can lose a second disk, and it will be the Vraid5 redundancy that protects your data (though you may see a degradation in performance). And of course, if you are lucky you may survive a third and a fourth disk failure, as long as they are not in the same Vraid stripe (EVA's Vraid5 is more like RAID5+0, with stripes spanning 5 disks).

Update: Unrelated to your question, but the latest FATA firmware update has a "Fix for self-initiated resets that may occur under rare circumstances". Believe me, it does not feel nice to see disks get thrown out of a volume group for no reason.

Update 2: Updated because single protection level means the space for two disks.

I had a similar experience with my MSA 4400. We kept it running at 95% capacity, but it started having some 9 drive failures a month, so I'm somewhat familiar with the ragged edge of data loss disaster.

You have several levels of scratch space that can prevent you from losing data, and it's hard to tell which one you're currently in. Spare space is a big one, obviously. The level of vraid you use will also play a part. And even when you swap that drive, it'll have to rebuild again.

The main thing you need to watch for is the failure protection level on your pool. You can set a requested level (like double) and then compare that to the actual level (like single or none). That said, even if you go from double to none in a single drive failure (one of the things I hate most about this box is that it allows that), you still have several ways the array can prevent you from losing data using parity from vraid or other black magic.

For HP EVA:
Level 1 = the capacity of two of the largest drives configured are reserved for sparing

Which means that if you lose 2 of your disks, you are left without spares and rely only on RAID5 parity. In your current situation, you can lose 1 more disk without array degradation, and 2 more without data loss but with degraded performance. In our organization we ALWAYS keep 2 spare disks outside the enclosure, at the same temperature (so no acclimatization is needed before insertion).
