Converting LVM/EXT4 to ZFS without losing data

I have a home media server with 2 x 3TB drives in it. It's currently set up using mdraid (RAID 1), LVM and EXT4. The setup was done using the ncurses Ubuntu Server installer.

Goal

Convert the setup to use ZFS (RAIDZ) and add a 3rd 3TB drive. I want to enable on-the-fly compression and deduplication. The conversion should not require a reinstall of Ubuntu and all the packages. There should be no data loss (unless a disk crashes during the process of course).

How do I do this?

Bonus question: is it better to do this using btrfs? As I understand it, with btrfs I can initialize the array with one disk, copy the data over and then add the second disk, but not with ZFS.

my /proc/mdstat:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 sdb2[1] sda2[0]
      2930070069 blocks super 1.2 [2/2] [UU]

unused devices: <none>

my pvs -v

    Scanning for physical volume names
  PV         VG   Fmt  Attr PSize PFree DevSize PV UUID
  /dev/md4   Data lvm2 a-   2,73t    0    2,73t MlZlTJ-UWGx-lNes-FJap-eEJh-MNIP-XekvvS

my lvs -a -o +devices

  LV     VG   Attr   LSize  Origin Snap%  Move Log Copy%  Convert Devices
  Data   Data -wi-ao  2,68t                                       /dev/md4(13850)
  Swap   Data -wi-ao  7,54g                                       /dev/md4(11920)
  Ubuntu Data -wi-ao 46,56g                                       /dev/md4(0)

Short:

  • I think there is no in-place conversion of ext4 to ZFS.

  • For media servers I'd recommend SnapRAID instead.

Edit:

Before I go deeper: remember to use ECC RAM with ZFS. SnapRAID is not that demanding, as it runs in userspace on top of existing filesystems, so bad RAM should only affect the parity drives but leave other existing data alone.

OTOH most of my ZFS machines do not have ECC RAM.

  • In that case enable Linux kernel RAM checks! (See the sketch after this list.)
    • Debian: GRUB_CMDLINE_LINUX_DEFAULT="memtest=17" or similar in /etc/default/grub
  • Until now I have had no bad experience, even though non-ECC RAM does fail sometimes.
  • ZFS with non-ECC RAM is extremely dangerous: since ZFS is CoW, if the wrong RAM zone gets hit it might kill the ZFS top-level structure including its redundancy, unnoticed, and then all data on ZFS is immediately corrupted.
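
For illustration, a minimal sketch of that RAM check on Debian/Ubuntu, assuming your kernel supports the memtest= boot parameter:

# /etc/default/grub: append memtest=17 (the kernel tests free RAM with 17 patterns at boot)
GRUB_CMDLINE_LINUX_DEFAULT="memtest=17"    # keep any options you already have in here

# regenerate the GRUB configuration, then reboot
sudo update-grub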

Longer:

ZFS uses a very unique and very special internal structure which does not map well onto the structure of ext4. (Read: perhaps somebody could create some artificial ZFS structure on top of ext4, but I doubt such a conversion would be quick, reliable and easy to use.)

SnapRAID, however, does not require you to convert anything. Just use it on top of (nearly) any existing filesystem to create redundancy for the files there, so that you can check and recover files in case of drive failure or (silent) corruption.
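
To make this concrete, a hypothetical minimal snapraid.conf for two existing data drives and one parity drive could look like this (the mount points and drive names are made up for illustration, not your actual setup):

# /etc/snapraid.conf (hypothetical layout)
parity  /mnt/parity1/snapraid.parity    # the parity file on the dedicated parity drive
content /var/snapraid.content           # content list on the system drive
content /mnt/disk1/.snapraid.content    # plus a copy on each data drive
content /mnt/disk2/.snapraid.content
data d1 /mnt/disk1                      # existing filesystems, used as-is
data d2 /mnt/disk2

After that, a first snapraid sync builds the parity without touching the data.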

Pros/Cons:

  • SnapRAID is inefficient if it must create redundancy for many small files, as each file creates a certain overhead (Padding) in the parity.

  • SnapRAID does not offer compression itself.
    On a Media Server you usually do not need compression, as media usually is compressed already (MP4, JPG, PDF).
    If you happen to use some filesystem which allows compression, you can still use it, but only on the device level, not on the complete pool (like ZFS does).

  • SnapRAID does not offer deduplication on a block level.
    On a media server the snapraid dup feature usually is enough, as media files normally do not share a lot of duplicate blocks. (Exception: youtube-dl. If you download a video twice at the same quality, the copies often differ in a few bytes only; not always, but quite often. Just keep the YouTube video ID in the filename to identify two similar files.)

  • ZFS dedup needs a lot of memory. Plan 1 GiB RAM per 1 TiB data, better more!
    If there is not enough RAM you need to add some hyper-fast SSD cache device. ZFS needs to look up one random sector per dedup block written, so with "only" 40 kIOPS on the SSD you limit the effective write speed to roughly 100 MB/s. (Usually ZFS is capable of utilizing the parallel bandwidth of all devices, so you can easily reach 1 GB/s and more write speed on consumer hardware these days, but not if you enable dedup and do not have enormous amounts of RAM.)

  • Note that I never had trouble where SnapRAID was needed to recover data, so I cannot swear that SnapRAID is really able to recover data.
    Edit: Meanwhile there has been enough trouble on my side, and SnapRAID always worked as expected. For example, some time ago a drive went dead and I was able to recover the data. AFAICS the recovery was complete (from the latest sync taken). But such a recovery process can take very long (weeks), and it looks to me that it is not as straightforward as with ZFS, especially if the recovery process must be interrupted and restarted later (with SnapRAID you should know exactly what you are doing).

  • On ZFS you must plan ahead. You need to know and plan every aspect of the whole lifecycle of your ZFS drive in advance, before you start with ZFS.
    If you can do this, there is no better solution than ZFS, trust me!
    But if you forget about something which then happens unplanned in the future, you are doomed. Then you need to restart from scratch with ZFS:
    Create a second, fresh and independent ZFS pool and transfer all data there.
    ZFS supports you in doing so, but you cannot avoid duplicating the data temporarily, just like when you first introduce ZFS.

  • Administering ZFS is a breeze. The only thing you need to do regularly is:
    zpool scrub
    That's all. Then zpool status tells you how to fix your trouble. (See the routine-maintenance sketch after this list.)
    (More than 10 years of ZFS now, on Linux. Simply put: ZFS is a lifesaver.)

  • OTOH with SnapRAID you do not need any planning. Just go for it, and change your structure as you go, when the need arises.
    So you do not need to copy your data to start with SnapRAID. Just add a parity drive, configure it, and there you go.

  • But SnapRAID is far more difficult to administer in case you are in trouble.
    You must learn how to use snapraid sync, snapraid scrub, snapraid check and snapraid fix (see the sketch after this list). snapraid status helps most of the time, but often you are left puzzling over the correct way to fix something, as there is no obvious single best way (SnapRAID is like a Swiss army knife: you need to know yourself how to handle it properly).

  • Note that, on Linux, you have two different choices for ZFS:

    • ZFSonLinux, which is a kernel extension.
      Newer kernels, like the ones you will see on Ubuntu 20.04, probably will be incompatible at first.
      Edit: You can upgrade from an older ZFSonLinux or even ZFS-FUSE to a current ZFSonLinux without problems. However, once you switch to new pool features, there is no way back to the older version.
    • ZFS-FUSE, which usually is a bit slower, is independent of the kernel.
    • Both have pros and cons; this is beyond the scope of this answer.
    • If ZFS is not available (perhaps you need to repair something), all your data is inaccessible.
    • If a device is failing, depending on the redundancy used, either all data is fully accessible or all data is lost completely.
  • Edit:

    • Today (2021) ZFS-FUSE apparently has aged and has become a bit unstable; it looks like some incompatibility of the userspace code with the newest kernels (perhaps at the FUSE level). It did not crash nor corrupt existing data, but IO suddenly stopped, so the FS became unresponsive until killed and restarted (which leaves the old mount points in some half-closed state in case something still accesses the dead mounts). Switching to ZFSonLinux fixed this issue for me.
    • But ZFS-FUSE still has its use, as it is a full userland process. You can kill and restart it, so it is very easy to control without perhaps having to reboot the kernel.
    • So my recommendation is: do your ZFS experiments with ZFS-FUSE, and when you are comfortable with it, switch to ZFSonLinux.
  • SnapRAID is GPLv3 and entirely a userspace add-on.

    • If SnapRAID is not available, all your data is still kept intact and accessible.
    • If a device is failing, all data on the other devices is still intact and accessible.
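
To illustrate the routine maintenance mentioned above for zpool scrub/status and the snapraid commands, a rough sketch (the pool name tank is a placeholder):

# ZFS: scrub regularly (e.g. monthly from cron), then read the report
zpool scrub tank
zpool status -v tank        # shows progress, errors and affected files

# SnapRAID: update parity after files changed, then verify a portion of the array
snapraid sync
snapraid scrub
snapraid status             # overview; only run "snapraid fix" after reading the manual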

Generally a media server has the property of keeping old data for a long time and of ever-growing. This is exactly what SnapRAID was designed for.

  • SnapRAID allows you to add new drives or even new parities later on.

  • You can mix different filesystems across the drives. SnapRAID just adds the redundancy.

  • SnapRAID is not meant as a backup.
    On media archives you quite often do not need a backup at all.

  • ZFS RAIDZ is not meant as a backup either.
    However, zfs send in combination with zfs snapshot offers a very easy to use 24/7 on-the-fly backup and restore facility (see the sketch after this list).

  • ZFS is meant for filesystems where it is crucial that they never have downtime. Nearly everything can be fixed on the fly without any downtime.
    Downtime only happens in case the redundancy/self-healing of ZFS is no longer capable of repairing the damage. But even then ZFS is more than helpful and lists all your lost data. Usually.

  • OTOH SnapRAID can recover data, but this is done in an offline fashion.
    So until recovered, the data is not available.
    It also helps to find out which data is lost, but this is more difficult than with ZFS.
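
As a sketch of the snapshot/send based backup mentioned above (pool, dataset and host names are made up for illustration):

# take a read-only snapshot, then replicate it to another pool/machine
zfs snapshot tank/media@2021-06-01
zfs send tank/media@2021-06-01 | ssh backuphost zfs receive backup/media

# later: send only the changes since the previous snapshot
zfs snapshot tank/media@2021-07-01
zfs send -i tank/media@2021-06-01 tank/media@2021-07-01 | ssh backuphost zfs receive backup/media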

Best practice recommendation with SnapRAID (ZFS is beyond this answer):

  • Stay with LVM! (A command sketch follows after this list.)
    • Make each drive a full PV. No partition table.
      If you want to encrypt the drive, put the PV inside the LUKS container.
    • Put each such PV into its own VG. This way a failing drive does not harm other VGs.
      You can aggregate several smaller drives into the same VG to have every LV (the device for SnapRAID) at a similar size.
    • Create a bunch of similar-sized data LVs, one on each VG.
    • Leave enough room (100GB) on the VGs for creating snapshots and small adjustments of the filesystems.
    • Create a bunch of parity LVs which are bigger (ca. 10%) than the data LVs.
    • At the end of each PV there should be some free room (see "enough room" above).
      This is for modern filesystems which create superblock copies at the end. If you fill the PV completely, such a filesystem (e.g. ZFS) might be detected on the whole drive or the PV instead of on the LV.
      (This does not happen if you use encrypted drives.)
  • Create the FS of your choice for your data drives on those LVs.
  • Perhaps use ZFS for parity drives on the LVs.
    • (This is currently still experimental on my side.)
    • Each SnapRAID parity drive should be its own pool.
    • No compression/dedup needed here.
    • Note that this is not perfect, as ZFS by default creates its filesystems directly under /.
    • To zvol or not to zvol, that is an unsolved question on my side.
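
A rough command sketch of the layout above, with made-up device and volume names and without the optional LUKS layer (the sizes are only placeholders for the idea of leaving room and oversizing the parity):

# one whole-disk PV per drive, each in its own VG
pvcreate /dev/sdc
vgcreate vg_d1 /dev/sdc
pvcreate /dev/sdd
vgcreate vg_par1 /dev/sdd

# data LV, leaving ~100 GB free in the VG for snapshots and small adjustments
lvcreate -n lv_data -L 2.6T vg_d1
mkfs.ext4 /dev/vg_d1/lv_data            # or the filesystem of your choice

# parity LV, roughly 10% bigger than the data LVs
lvcreate -n lv_parity -L 2.9T vg_par1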

How to configure and administer SnapRAID is beyond the scope of this answer.

Why ZFS for parity:

Well, when a data drive goes bad (unreadable sectors) you can copy off the readable files. Unreadable files are found easily this way. You can then recover them.

Copying over still readable data dramatically speeds up the recovery process.

However, SnapRAID parity is just one big file. If you copy this file, you want to be sure it has no silent corruption. ZFS ensures this independently of SnapRAID. In case there is corruption, ZFS tells you so, such that you know you must check the parity file.

Checking the complete parity file in case of just a few defective sectors takes ages, as all data on all drives must be read in completely.

There is very likely a way to check only the parts of a parity file which were corrupt. However, I am not sure, as I have not needed that yet.
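
If you want to experiment with ZFS for the parity as described above, a single-device pool per parity LV could look roughly like this (pool and LV names are placeholders; as said, this is still experimental on my side):

# one pool per SnapRAID parity LV, no compression/dedup needed here
zpool create -O compression=off -O dedup=off -m /mnt/parity1 parity1 /dev/vg_par1/lv_parity

# before trusting or copying the parity file, let ZFS verify it
zpool scrub parity1
zpool status -v parity1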

Why not BTRFS?

  • ZFS runs completely stable, silent and flawless for years already.
  • OTOH it's 2021 and BTRFS still has trainloads of serious problems.
    • Edit: For me it is serious if basic features are not completely reliable, that is, they must not increase the trouble when things go bad. Maybe I am just lacking the experience, but reading the docs I have not the slightest idea how to reach this goal with BTRFS yet.
  • For example, BTRFS reacts unpredictably and becomes unstable if you fill it completely.
    In contrast no such problems are known with ZFS.
  • It's likely that you sometimes hit a full parity drive if you are not careful with SnapRAID, as you want your parity drive to use 95%+ of all available space.
    (SnapRAID is a bit inefficient if it has to put many small files into parity.)
    This is no problem for ZFS (it gets a bit slower on nearly filled pools, but that is OK here).

Edit: Why encryption?

See also my way using LUKS

Encryption is just a way to be able to replace drives without worrying about the data on them. If an encrypted drive is removed from the system, all data on it is automatically rendered unreadable (as long as the encryption cannot be broken), as the key is unique to the computer.

All devices on the same computer get the same key, which resides on a USB thumb drive or some other old scratch medium (something you can destroy completely). A backup exists, printed on paper.

Nowadays with SSD I usually have a single dedicated "small" (128 GB to 256 GB) system SSD which is unencrypted and carries the (unencrypted) key as well as the (unencrypted) full base system.

With EVO-type micro SD cards (like the Samsung Endurance) you can even boot the entire system from micro SD, like on a Raspberry Pi (this works as long as you do not need high swapping throughput).

The system SSD is meant to be destroyed if it is ever removed from the system. No replacement, no warranty.

/etc/crypttab usually only looks like this:

# <target name> <source device>         <key file>      <options>
swap /dev/vg_system/lv_swap /dev/urandom swap,cipher=aes-xts-plain,size=256

All the encrypted drives (LUKS) are activated by scripts (a sketch follows below). This has the advantage that ZFS (or something else) does not activate the data drives on boot. So everything (except the system) stays under full (possibly manual) control.
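
A sketch of what such an activation script could look like, assuming the key file lives on the unencrypted system SSD and the device names are made up (the real scripts are of course site-specific):

#!/bin/sh
# open every data drive with the shared key, then activate the LVM VGs inside
KEY=/root/luks.key                      # unencrypted key on the system SSD
for dev in /dev/sdc /dev/sdd /dev/sde; do
    name="crypt_$(basename "$dev")"
    cryptsetup luksOpen --key-file "$KEY" "$dev" "$name" || exit 1
done
vgchange -ay                            # activate the VGs on the opened containers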

If something should come up automatically on boot, this is done with some (carefully written) scripts which are executed by an @reboot rule of cron.

These scripts first test the health status of all drives (more than 40 on my side) before they start to make them productive, such that after a power loss or reboot everything only comes up if everything is really OK, while naturally allowing for manual intervention in all other situations (which are the most common case, because usually servers only crash or reboot when there was some really bad trouble, right?).
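
For illustration only, such a reboot rule and health gate could look roughly like this; smartctl comes from smartmontools, and what counts as "healthy" is of course your own decision:

# crontab entry: run the bring-up script once after every (re)boot
@reboot /usr/local/sbin/bringup-storage.sh

# bringup-storage.sh (sketch): refuse to continue unless every drive reports healthy
#!/bin/sh
for dev in /dev/sd?; do
    smartctl -H "$dev" | grep -q PASSED || exit 1   # stop and wait for manual intervention
done
# ...open the LUKS containers, import the pools, start the services here...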