Clone only space in use from hard disk

Can I use dd, rsync, Clonezilla or any other tool to clone only the space in use on my hard disk in Linux? I need to back up a 1 TB HD (with only 2 GB of space in use) onto a 500 GB HD.


Solution 1:

You can, but you should prepare your disk first. The trick is to use a sparse file or compression. This method is time-consuming and generates a lot of I/O. In your case (2 GB in use on a 1 TB HDD) a file copy (as suggested in sawdust's comment) will probably be a far better solution. If, on the other hand, you had e.g. 850 GB in use out of 1 TB, with many small files, and you wanted to back up the MBR, partition table and metadata all at once, then my method would be a reasonable way to save at least 150 GB on the image file (which still wouldn't fit on a 500 GB HDD unless the data compressed well enough).

I'm writing this for users with higher disk usage. Also note that the source drive should be healthy and you must be able to overwrite its empty space. I'm giving this solution mainly for backup, not for recovery or forensics. The time and I/O cost will be paid not only during image creation but also when (if) the image is written back to disk. Think twice about whether the method is right for you.

Let's say you need to clone /dev/sdb and it holds several partitions: /dev/sdb1, /dev/sdb2, etc.

Preparation

To take full advantage of sparse files or compression you should overwrite the empty space with zeros. With a Windows partition there may be some trouble due to Windows hibernation; read this.

## Most commands need sudo.
mount -o rw /dev/sdb1 /mnt
dd if=/dev/zero of=/mnt/zero_file bs=32M
## Long wait here. Expect the following outcome: (which means that all empty space was zeroed)
### dd: error writing '/mnt/zero_file': No space left on device
sync
rm /mnt/zero_file
umount /dev/sdb1
## Repeat this with /dev/sdb2, /dev/sdb3 etc.
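
Why the zero fill matters can be checked on scratch files, without touching a real disk. The sketch below is only an illustration: the file names are made up, random bytes stand in for leftovers of deleted files in "free" space, and it assumes GNU coreutils and gzip are available.

```shell
#!/bin/sh
# Illustration only: compare how "dirty" and zeroed free space compress.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# 4 MiB of random bytes stand in for leftovers of deleted files (hypothetical).
dd if=/dev/urandom of="$work/dirty_free_space" bs=1M count=4 2>/dev/null
# 4 MiB of zeros stand in for free space after the zero_file step.
dd if=/dev/zero of="$work/zeroed_free_space" bs=1M count=4 2>/dev/null

# Size of each file after gzip compression, in bytes.
dirty_gz=$(gzip -c "$work/dirty_free_space" | wc -c)
zeroed_gz=$(gzip -c "$work/zeroed_free_space" | wc -c)

echo "dirty free space compresses to:  $dirty_gz bytes"
echo "zeroed free space compresses to: $zeroed_gz bytes"
```

The zeroed file should shrink to a tiny fraction of its size while the random one barely shrinks at all, which is the whole point of the preparation step.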

If there are major gaps in the partition layout then you should also fill them with zeros. Swap partitions (if any) need special treatment in order to make the resulting image as small as possible. Windows files such as hiberfil.sys, pagefile.sys and swapfile.sys may be removed before creating zero_file. I won't cover these cases in detail here.

Sparse file method

This method may be used if the target filesystem (where the image file will be saved) supports sparse files. To generate a sparse image file, invoke:

## dd probably needs sudo here.
dd if=/dev/sdb of=/foo/bar/my_image.dd bs=512 conv=sparse

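(EDIT: originally there was bs=32M, but that is not a good choice with conv=sparse, because dd only skips output blocks that consist entirely of zeros, so a large block containing any data is written in full. Compare this question.)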

To write the image back:

## dd probably needs sudo here.
dd if=/foo/bar/my_image.dd of=/dev/sdb bs=32M

Advantages:

  • The image may be mounted (mount -o offset=… or use kpartx) to access the files within.

Disadvantages:

  • Target filesystem must support sparse files.
  • You should remember to keep it sparse while copying (cp --sparse=always).
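
To see what conv=sparse does without touching a real disk, the following sketch runs the same dd invocation on a small scratch file instead of /dev/sdb. The paths and sizes are made up for illustration, and it assumes GNU coreutils (dd, stat, cmp, cp) and a filesystem that supports sparse files.

```shell
#!/bin/sh
# Illustration only: a mostly-zero scratch file stands in for the disk.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
src="$work/fake_disk.img"   # hypothetical stand-in for /dev/sdb
img="$work/my_image.dd"

# 8 MiB of zeros with 4 KiB of random data placed at the 1 MiB offset.
dd if=/dev/zero of="$src" bs=1M count=8 2>/dev/null
dd if=/dev/urandom of="$src" bs=4k count=1 seek=256 conv=notrunc 2>/dev/null

# Image it the same way as the real command above.
dd if="$src" of="$img" bs=512 conv=sparse 2>/dev/null

# Apparent size in bytes versus blocks actually allocated on disk.
src_size=$(stat -c %s "$src")
img_size=$(stat -c %s "$img")
img_blocks=$(stat -c %b "$img")
echo "apparent size: $img_size bytes, allocated: $img_blocks blocks"

# The content must be identical even though the image contains holes.
cmp "$src" "$img" && echo "image matches source"

# Copy the image while keeping it sparse.
cp --sparse=always "$img" "$work/copy.dd"
```

On a filesystem with sparse file support the allocated block count should be far smaller than the apparent size suggests; the same stat comparison can be used to check a real image.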

Compressed file method

To generate the image:

## dd probably needs sudo here.
dd if=/dev/sdb bs=32M | gzip -c > /foo/bar/my_image.dd.gz

To write the image back:

## dd probably needs sudo here.
gzip -cd < /foo/bar/my_image.dd.gz | dd of=/dev/sdb bs=32M

These commands could be written without dd, using gzip alone. I used dd to ensure a 32 MiB buffer.

Advantages:

  • The resulting file is non-sparse, so it needs no special treatment.
  • The image size will be reduced even more if the files on your source disk compress well.

Disadvantages:

  • It is hard to access the files within the compressed image without full decompression (some FUSE filesystem may be useful, although I'm not sure, I never tried it; consider a squashfs approach).
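
Before trusting a compressed image it is worth confirming that it restores bit for bit. The sketch below runs the same two pipelines on a scratch file rather than /dev/sdb; the paths and sizes are made up for illustration, and it assumes GNU coreutils and gzip.

```shell
#!/bin/sh
# Illustration only: round-trip a scratch "disk" through gzip and verify it.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
src="$work/fake_disk.img"      # hypothetical stand-in for /dev/sdb
restored="$work/restored.img"  # hypothetical stand-in for the restore target

# 8 MiB of zeros with 4 KiB of random data placed at the 1 MiB offset.
dd if=/dev/zero of="$src" bs=1M count=8 2>/dev/null
dd if=/dev/urandom of="$src" bs=4k count=1 seek=256 conv=notrunc 2>/dev/null

# Create the compressed image, then write it back to a second file.
dd if="$src" bs=1M 2>/dev/null | gzip -c > "$work/my_image.dd.gz"
gzip -cd < "$work/my_image.dd.gz" | dd of="$restored" bs=1M 2>/dev/null

# Compare sizes and checksums of the original and the restored copy.
src_size=$(stat -c %s "$src")
gz_size=$(stat -c %s "$work/my_image.dd.gz")
src_sum=$(sha256sum < "$src")
restored_sum=$(sha256sum < "$restored")

echo "source: $src_size bytes, compressed image: $gz_size bytes"
[ "$src_sum" = "$restored_sum" ] && echo "restored copy matches source"
```

For a real image, gzip -t on the .gz file checks its integrity without writing anything back.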

Hints

  • Long after I wrote the first version of this answer I learnt there is a virt-sparsify tool. It looks useful.

  • To compress fast use gzip --fast, to compress best use gzip --best. Refer to man gzip for more options.

  • Use pigz instead of gzip if you can. This should speed things up, because pigz can utilize more than one processor core. You can use another compressor if you like.

  • To monitor the progress, invoke dd with the status=progress operand. If dd is already running without it (e.g. your dd doesn't support status=progress or you forgot to use it), send the USR1 signal to the tool (this doesn't kill the running dd command):

      kill -s USR1 $(pidof dd)
    

    and repeat as needed.

  • As an alternative to dd you may use pv to read. Examples:

     pv -B 32m /dev/sdb | dd of=/foo/bar/my_image.dd bs=512 conv=sparse
     pv -B 32m /dev/sdb | gzip -c > /foo/bar/my_image.dd.gz
    

Solution 2:

If the target disk is already formatted, plugged into the same machine as the source disk and mounted, and you're running Linux or macOS:

rsync -avP --exclude=/media/disk2 / /media/disk2

If the target disk is already formatted and mounted on another PC, and you're running Linux or macOS:

rsync -avP / user@ip_of_disk2_host:/media/disk2

This assumes you just want a backup of the files, without regard to the underlying drive. This is a per-file backup and will run rather quickly on only 2 GB of data.
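
The exclude matters in the local case because the target is mounted inside the tree being copied. The sketch below shows the effect on scratch directories, which stand in for / and /media/disk2; the directory and file names are made up for illustration, and it assumes rsync is installed.

```shell
#!/bin/sh
# Illustration only: scratch directories stand in for / and /media/disk2.
command -v rsync >/dev/null || { echo "rsync is not installed"; exit 0; }

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
src="$work/root"          # hypothetical stand-in for /
dst="$src/media/disk2"    # the target sits inside the source tree

mkdir -p "$src/home/user" "$dst"
echo "hello" > "$src/home/user/file.txt"

# The leading slash anchors the pattern at the top of the transfer ($src/),
# so only $src/media/disk2 itself is skipped.
rsync -a --exclude=/media/disk2 "$src/" "$dst/"

# List what ended up on the target.
find "$dst" | sort
```

The target ends up with home/user/file.txt and an empty media directory, but no nested copy of itself. When the source really is /, pseudo-filesystems such as /proc and /sys are usually kept out as well, for example with rsync's -x (--one-file-system) option; that goes beyond the command above.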