Linux Software RAID1: How to boot after (physically) removing /dev/sda? (LVM, mdadm, Grub2)

I have a server running Debian 6.0 (squeeze). During the installation, I configured the two 500 GB SATA disks (/dev/sda and /dev/sdb) as a RAID1 array managed with mdadm. The array holds a 500 GB LVM volume group (vg0), which contains a single logical volume (lv0). vg0-lv0 is formatted as ext3 and mounted as the root filesystem; there is no dedicated /boot partition. The system boots using GRUB2.

In normal use, the system boots fine.

Also, when I removed the second SATA drive (/dev/sdb) after a shutdown, the system came up without problems, and after reconnecting the drive I was able to --re-add /dev/sdb1 to the RAID array.

But: After removing the first SATA drive (/dev/sda), the system won't boot any more! A GRUB welcome message shows up for a second, then the system reboots.

I tried to install GRUB2 manually on /dev/sdb ("grub-install /dev/sdb"), but that doesn't help.

Apparently squeeze fails to set up GRUB2 to boot from the second disk when the first disk is removed. That seems like an essential feature when running this kind of software RAID1, doesn't it?

At the moment, I can't tell whether this is a problem with GRUB2, with LVM, or with the RAID setup. Any hints?


Solution 1:

You need to install GRUB to the MBR of both drives, and you need to do it in a way that GRUB considers each disk to be the first disk in the system.

GRUB uses its own enumeration for disks, which is abstracted from what the Linux kernel presents. You can change which device it thinks is the first disk (hd0) by using a "device" line in the grub shell (these are GRUB Legacy shell commands; GRUB2 has no "setup" command), like so:

device (hd0) /dev/sdb

This tells GRUB to treat /dev/sdb as the disk hd0 for all subsequent commands. From here you can complete the installation manually:

device (hd0) /dev/sdb
root (hd0,0)
setup (hd0)

This tells GRUB that its files live on the first partition of the disk it considers to be hd0 (which you've just set as /dev/sdb), and then installs the boot loader into the MBR of that disk.

I do the same for both /dev/sda and /dev/sdb, just to be sure.
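To avoid typing the sequence twice, the commands can be generated per disk. This is only a sketch: grub_legacy_cmds is a made-up helper name, and it assumes the GRUB Legacy shell and that GRUB's files sit on the first partition of each disk. As written it only prints the commands; the commented-out line shows how they might be piped into the grub shell.

```shell
#!/bin/sh
# Print the GRUB Legacy shell commands that install the boot loader on one
# disk while presenting that disk to GRUB as (hd0).
# grub_legacy_cmds is an illustrative helper, not part of GRUB.
grub_legacy_cmds() {
    disk="$1"
    printf 'device (hd0) %s\n' "$disk"
    printf 'root (hd0,0)\n'
    printf 'setup (hd0)\n'
    printf 'quit\n'
}

# Show the sequence for both RAID1 members from the question.
for disk in /dev/sda /dev/sdb; do
    echo "# commands for $disk"
    grub_legacy_cmds "$disk"
    # To apply instead of print (assumes a GRUB Legacy "grub" binary, run as root):
    # grub_legacy_cmds "$disk" | grub --batch
done
```

Printing first makes it easy to check that each disk really is mapped to (hd0) before anything is written to an MBR.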

Edited to add: I always found the Gentoo Wiki handy, until I did this often enough to commit it to memory.

Solution 2:

Have you considered installing a third drive to serve as just the boot drive? I have also seen problems with RAID1 + LVM setups (on CentOS) not being able to boot from the second drive. I think the problem stems from GRUB not being able to handle native LVM partitions, although I'm not entirely sure.

Anyway, that's my answer: install a third small drive solely for the purpose of booting the system. Heck, I bet you could even get clever and do that with some sort of little flash or ssd device.

Solution 3:

GRUB should be able to recognize RAID1 setups and install to all member disks when told to install to the MD device.
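If that does not happen on your version, one fallback is to work out the member disks yourself and run grub-install once per disk. A rough sketch, assuming the array is called md0 and the disks use sdX-style names (both are assumptions, and md_member_disks is just an illustrative helper name). It only prints the grub-install commands and uses a sample mdstat line so it can be tried safely.

```shell
#!/bin/sh
# Print the whole-disk devices backing one md array, given /proc/mdstat
# content on stdin. Assumes sdX-style names, so "sda1[0]" becomes /dev/sda.
# md_member_disks is an illustrative helper, not part of mdadm or GRUB.
md_member_disks() {
    awk -v md="$1" '
        $1 == md && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\[[0-9]+\]/) {       # member entries look like sda1[0]
                    dev = $i
                    sub(/\[.*$/, "", dev)      # drop the "[0]" role suffix
                    sub(/[0-9]+$/, "", dev)    # drop the partition number
                    print "/dev/" dev
                }
            }
        }' | sort -u
}

# Sample line in /proc/mdstat format; on a real system use:
#   md_member_disks md0 < /proc/mdstat
sample='md0 : active raid1 sdb1[1] sda1[0]'

printf '%s\n' "$sample" | md_member_disks md0 | while read -r disk; do
    echo "grub-install $disk"
done
```

Once the printed list looks right, the echo can be dropped so that grub-install actually runs (as root) for each disk.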