ZFS grub-probe error failed to get canonical path of /dev/DISK_NAME

Background:

  • Ubuntu Xenial
  • ZFS installed for system disk (so, you know: rpool/ROOT)
  • System runs fine, but when kernel updates, grub-probe barks error mentioned above
  • I would rather not reboot right now

There's a discussion here about grub-probe and how it should "just be better", but this helps until that comes along. I got the idea from that discussion.

More detail: a complete instance of the error (for my system) looks like:

/usr/sbin/grub-probe: error: failed to get canonical path of `/dev/ata-ADATA_SP550_2G1520009135-part1'.

This is buried in a slew of detail spouted forth from an apt command to install graphics drivers (but that's not important).

This disk corresponds to one of my ZIL partitions. I added ZIL and cache after the install completed, so I suppose that's why I didn't see the problem before. I haven't yet rebooted, and that's why I'm seeing the problem at all. Yes, you can reboot to fix all this, but assuming you don't want to do that, read on:

If I look in /dev, I see links to all my ZFS disks that look like:

lrwxrwxrwx  1 root     root           4 Sep 16 23:31 ata-WDC_WD10EARS-00Y5B1_WD-WMAV51436394-part1 -> sdc1
lrwxrwxrwx  1 root     root           4 Sep 16 23:31 ata-WDC_WD20EZRX-00D8PB0_WD-WCC4MK86SWX7-part1 -> sdd1
lrwxrwxrwx  1 root     root           4 Sep 16 23:31 ata-WDC_WD20EZRX-00D8PB0_WD-WCC4N1085683-part1 -> sde1
lrwxrwxrwx  1 root     root           4 Sep 16 23:31 ata-WDC_WD2500JS-22MHB0_WD-WCANK4053187-part1 -> sda1

... but notably none for the ZIL partitions.

I can test the situation by running:

$ sudo grub-probe /
grub-probe: error: failed to get canonical path of `/dev/ata-ADATA_SP550_2G1520009135-part1'.

So the question is: how to fix this problem so grub-probe behaves?


There is an environment variable that fixes this. From my reading, the issue seems to be that Grub likes the idea of 'supporting' zfs but not the idea of fixing zfs-related issues in Grub, specifically its poor error handling when it fails to find things.

For instance, the grub tools that ship with Ubuntu 16.x will fail to find /boot on a ZFS volume without some user intervention, and then happily write some (but not all) of the files that the utility you're running needs to produce into the /boot folder that it just told you it couldn't find.

In any case...

http://list.zfsonlinux.org/pipermail/zfs-discuss/2016-June/025765.html

To check whether you have the commit (you should see full paths):

ZPOOL_VDEV_NAME_PATH=1 zpool status

If so you can do:

ZPOOL_VDEV_NAME_PATH=1 grub-whatevs ....
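
If you'd rather script the check than eyeball it, here is a minimal sketch. The function name has_full_paths is made up for this post, and it assumes that vdev names are the first field on their line in the zpool status output, which is how the output looks on my system:

```shell
# has_full_paths (hypothetical helper): read `zpool status` output on stdin
# and succeed if the first field of any line is an absolute path.
has_full_paths() {
    awk '$1 ~ /^\// { found = 1 } END { exit !found }'
}

# Intended use (needs a machine with ZFS installed):
#   ZPOOL_VDEV_NAME_PATH=1 zpool status | has_full_paths && echo "full paths"
```

If the function fails, your zpool is probably ignoring the variable and this trick won't help.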

You can pass the variable as input to the necessary grub utilities, or you can specify it as a shell variable in root's .bashrc or .profile or some such with...

export ZPOOL_VDEV_NAME_PATH=YES

The variable causes zpool to report full device paths rather than bare device names, which are taken to be relative to /dev and may or may not resolve properly. The Grub utilities read the output of zpool status to find the disks that make up a zfs pool, so changing the output of zpool status fixes grub.
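
To make that concrete, here is a simplified illustration of the name-to-path step. This is my assumption about the behaviour, inferred from the error message in the question, and not Grub's actual code; vdev_to_path is a made-up name:

```shell
# vdev_to_path (hypothetical): turn a vdev name as printed by zpool status
# into the path a tool would then try to open.
vdev_to_path() {
    case $1 in
        /*) printf '%s\n' "$1" ;;        # already a full path: use it as-is
        *)  printf '/dev/%s\n' "$1" ;;   # bare name: assume it lives in /dev
    esac
}

vdev_to_path ata-ADATA_SP550_2G1520009135-part1
vdev_to_path /dev/disk/by-id/ata-ADATA_SP550_2G1520009135-part1
```

The first call yields the /dev/ata-ADATA... path from the error in the question, which does not exist; the second yields a by-id path, which normally does.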

I agree that users shouldn't have to deal with this, in reference to femulator's comment. The real solution? Same as every other open source project that languishes in bugs that never get fixed. Fork it, fix it yourself, and stop using the source project/library/whatever. The FOSS way of "firing" someone, in other words ;). Apparently Debian was aware of this particular bug seven years ago.

This was the only thing stopping me from successfully migrating a FreeBSD RaidZ boot pool to Ubuntu. If anyone else attempts something similar, the process is relatively simple, as long as you understand ZFS well enough to ignore the parts of the documentation from Grub and zfsonlinux that are wrong (such as setting your root dataset to not automount, eh...? How exactly is it going to boot then?). It's somewhat ironic that Ubuntu points out in their docs that the boot loader is Linux's most insecure 'feature', which is true I suppose, but in this case it's also Ubuntu's glaring flaw. It would have taken me an hour or two to migrate a BSD ZFS pool to another OS if I could have done it using the Sun/Solaris utilities that actually work. The problem is I had to use Linux utilities (like Grub) that don't (or barely) work at some point, so there lies the fault for the other two days I spent fixing this. Ubuntu would be a whole lot better if it didn't need grub to boot...


Assuming you don't want to reboot (see below), the answer to this turns out to be creating similar links for the disks that are missing. For me and my system, this meant adding these links:

$ sudo ln -sf /dev/sdf1 /dev/disk/by-id/ata-ADATA_SP550_2G1520009135-part1
$ sudo ln -sf /dev/sdf3 /dev/disk/by-id/ata-ADATA_SP550_2G1520009135-part3
$ sudo ln -sf /dev/sdg1 /dev/disk/by-id/ata-SPCC_Solid_State_Disk_EB84076413B201101308-part1
$ sudo ln -sf /dev/sdg3 /dev/disk/by-id/ata-SPCC_Solid_State_Disk_EB84076413B201101308-part3

(partition 2 on both disks is mirrored to be the ZIL for home, but grub-probe doesn't care about that)
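
One thing worth checking after making links by hand is that they actually resolve, since a relative link target is looked up relative to the directory the link lives in, not the directory you ran ln from. A minimal sketch (dangling_links is a name I made up):

```shell
# dangling_links <dir> (hypothetical helper): print every symlink in <dir>
# whose target does not exist.
dangling_links() {
    for link in "$1"/*; do
        # -L: the entry is a symlink; ! -e: following it leads nowhere
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            printf '%s\n' "$link"
        fi
    done
}

# Intended use: dangling_links /dev/disk/by-id
```

No output means every link in the directory points at something that exists.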

The formula for this is to determine which disks grub-probe needs, then create symbolic links for these according to the pattern:

$ sudo ln -sf /dev/{sdname}{partN} /dev/disk/by-id/{diskid}-part{partN}
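
If you have several links to make, the pattern can be wrapped in a small function. This is only a sketch: make_part_link and its optional fourth argument (the directory to put the link in, handy for a dry run in a scratch directory) are my own invention, and it uses an absolute /dev/... target so the link resolves wherever it lives:

```shell
# make_part_link <sdname> <partN> <diskid> [linkdir]   (hypothetical helper)
# Creates <linkdir>/<diskid>-part<partN> pointing at /dev/<sdname><partN>.
make_part_link() {
    sdname=$1; partn=$2; diskid=$3
    linkdir=${4:-/dev/disk/by-id}   # default to the real by-id directory
    ln -sf "/dev/${sdname}${partn}" "${linkdir}/${diskid}-part${partn}"
}

# Intended use, from a root shell:
#   make_part_link sdf 3 ata-ADATA_SP550_2G1520009135
```

Because it is a shell function it can't be run through sudo directly; define it in a root shell, or pass a scratch directory as the fourth argument to try it out first.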

You can determine the disks required by repeating the command $ sudo grub-probe / and creating links until it's happy and ultimately reports:

$ sudo grub-probe /
zfs

Until it is happy, you'll see complaints that look like:

$ sudo grub-probe /
grub-probe: error: failed to get canonical path of `/dev/disk/by-id/ata-ADATA_SP550_2G1520009135-part3'.

Note that it's complaining specifically about -part3, and that the disk name is ata-ADATA_SP550...

To create the link, first find the disk corresponding to ata-ADATA_SP.... by running:

$ ls -l /dev/disk/by-id | grep ata-ADATA_SP
lrwxrwxrwx 1 root root  9 Sep 17 13:45 ata-ADATA_SP550_2G1520009135 -> ../../sdf
lrwxrwxrwx 1 root root 10 Sep 17 13:49 ata-ADATA_SP550_2G1520009135-part1 -> ../../sdf1
lrwxrwxrwx 1 root root 10 Sep 17 13:51 ata-ADATA_SP550_2G1520009135-part2 -> ../../sdf2
lrwxrwxrwx 1 root root 10 Sep 17 13:50 ata-ADATA_SP550_2G1520009135-part3 -> ../../sdf3

Note this is sdf, so your link command becomes:

$ sudo ln -sf /dev/sdf3 /dev/disk/by-id/ata-ADATA_SP550_2G1520009135-part3

Rinse and repeat until the grub-probe command succeeds. If you know what you're doing and you know which partitions are used by any disks participating in the ROOT pool, by all means go directly to the end and link those instead of using grub-probe to tell you what to do.
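
If you get tired of reading the disk id and partition number out of the complaint by eye, they can be pulled apart mechanically. A minimal sketch, assuming the error line has exactly the backtick-and-quote format shown above (missing_path is a made-up name):

```shell
# missing_path <error line> (hypothetical helper): print the path that sits
# between the backtick and the closing quote, or nothing if the line doesn't match.
missing_path() {
    printf '%s\n' "$1" |
        sed -n "s/.*failed to get canonical path of \`\(.*\)'.*/\1/p"
}

err="grub-probe: error: failed to get canonical path of \`/dev/disk/by-id/ata-ADATA_SP550_2G1520009135-part3'."
path=$(missing_path "$err")
name=${path##*/}                      # strip the directory part
echo "disk id:   ${name%-part*}"      # everything before -partN
echo "partition: ${name##*-part}"     # the N in -partN
```

The disk id and partition number are the {diskid} and {partN} pieces of the link pattern; you still need ls -l to find the matching sd name.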

You may think that this would subvert the reason you used those /dev/disk/by-id/* names in the first place. If the /dev/sd* paths change, you're hosed and have to make the links again, right? Turns out that an alternative to all this is to reboot the host: it creates the links on reboot.


What is the ZFS bug/fix for this? Users shouldn't have to deal with this -> https://github.com/zfsonlinux/grub/issues/5