Copying data over with rsync causes size discrepancies

Solution 1:

The discrepancy is most likely caused by files being more sparsely populated on the old disk.

Anyway, let's first check that the file and inode counts are the same:

  • issue find <path> | wc -l on both mountpoints. Is the number of files/directories the same?
  • issue df -i. Is the number of used inodes the same?
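The two checks above can be wrapped in a small script so that both mountpoints are measured the same way. This is only a sketch: the function names and the /mnt/old and /mnt/new paths are made up for illustration, and the column position assumes GNU df.

```shell
# Hypothetical helpers; names and example mountpoints are assumptions.

# Count every entry (files, directories, symlinks) below a path,
# staying on a single filesystem (-xdev).
count_entries() {
    find "$1" -xdev | wc -l | tr -d ' '
}

# Print the number of used inodes on the filesystem holding a path.
# -P keeps each filesystem on one line; column 3 of "df -i" is IUsed (GNU df).
used_inodes() {
    df -Pi "$1" | awk 'NR == 2 { print $3 }'
}

# Print one summary line per mountpoint given on the command line.
compare_mounts() {
    for m in "$@"; do
        printf '%s: %s entries, %s inodes in use\n' \
            "$m" "$(count_entries "$m")" "$(used_inodes "$m")"
    done
}
```

Calling compare_mounts /mnt/old /mnt/new prints one line per mountpoint. Note that df -i counts inodes for the whole filesystem, so the numbers are only comparable if each disk holds nothing but the copied data.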

If the answer to both questions is yes, then the difference can be explained by files being less sparse on the new disk. But what are sparse files? In short, sparse files are normal files that take up less disk space than their size suggests. This is possible thanks to a feature of (relatively) modern filesystems which, instead of writing all the zeroes to disk, simply record that "this file (or part of it) is full of zeroes" and allocate no blocks for that region.

By default, du reports the real space taken by the file, not its apparent size. To show the apparent size, use du --apparent-size (for other options, see the du man page).

For a practical example, you can create a sparse file using the command truncate test.img -s 1G. As reported by ls -l, the newly created file is 1 GB in size, but if you try du -hs test.img, you'll see a very, very small file size (possibly even zero!). How is this possible? As stated above, modern filesystems sometimes "lie" to applications, reporting a file size that is not backed by allocated blocks. On the other hand, du -hs --apparent-size test.img will print the same size as ls.

As you start writing into a sparse file, the filesystem will dynamically allocate the required space. For example, issuing dd if=/etc/services of=test.img conv=notrunc,nocreat will write some data into the previously all-sparse test.img file. Now, running du -hs test.img will report the space allocated for that data (about 600 KB here; the exact figure depends on the size of /etc/services on your system).
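If you want to see the numbers side by side, something like the following should work. It is a sketch that assumes GNU coreutils (stat -c, truncate, dd); the sizes helper is my own naming, and I use a 1 MiB file with 64 KiB of random data instead of /etc/services so the result does not depend on your system.

```shell
# Print "<apparent bytes> <allocated bytes>" for a file (GNU stat assumed).
# %s = apparent size, %b = allocated blocks, %B = size in bytes of each block.
sizes() {
    stat -c '%s %b %B' "$1" | awk '{ print $1, $2 * $3 }'
}

dir=$(mktemp -d)
img="$dir/test.img"

# Create a 1 MiB file consisting of a single hole; no data is written.
truncate -s 1M "$img"
before=$(sizes "$img")

# Overwrite the first 64 KiB with non-zero data, keeping the file length.
# conv=fsync flushes the data so the new allocation is visible right away.
dd if=/dev/urandom of="$img" bs=64K count=1 iflag=fullblock \
   conv=notrunc,nocreat,fsync 2>/dev/null
after=$(sizes "$img")

echo "before write: $before"
echo "after write:  $after"
```

The apparent size should stay at 1048576 bytes in both lines, while the allocated size grows by roughly the amount of data written. The exact allocated figures depend on the filesystem.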

An obvious but very important implication is that sparse file support can only optimize zero-filled files (or zero-filled parts of files). The very moment you write to a file, its allocated space begins to grow. This is true even if you write more zeroes to the file, unless the application knows how to handle sparse files (in that case, the application tells the filesystem that a region is all zeroes, for example by seeking past it or punching a hole, and the filesystem optimizes accordingly).

What if you really want to preallocate some space? Then you can use fallocate test.img -l 1G. If you execute ls -l test.img; du -hs test.img; du -hs --apparent-size test.img, you'll see that all tools report the very same size, because the file was really fully allocated by the fallocate call.

In short, it is possible that, during the copy, some files were recreated in a less sparse manner, with sparse sections replaced by "real" zeroes. To preserve sparse files with rsync you have to use the -S (--sparse) option.
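To see the effect without touching your real data, you can reproduce it locally. The sketch below uses GNU cp --sparse=never and --sparse=always as stand-ins for a copy that ignores holes and one that recreates them; it is not rsync itself, and the allocated helper and file names are only for illustration.

```shell
# Allocated bytes of a file (GNU stat assumed): block count * block unit.
allocated() {
    stat -c '%b %B' "$1" | awk '{ print $1 * $2 }'
}

dir=$(mktemp -d)
truncate -s 1M "$dir/src.img"        # fully sparse source file

# Stand-in for a copy that ignores holes (like rsync without -S).
cp --sparse=never "$dir/src.img" "$dir/dense.img"
# Stand-in for a copy that recreates holes (like rsync -S / --sparse).
cp --sparse=always "$dir/src.img" "$dir/sparse.img"

for f in src dense sparse; do
    printf '%-7s %s bytes allocated\n' "$f" "$(allocated "$dir/$f.img")"
done

# The rsync equivalent would be along these lines (paths are placeholders):
#   rsync -aS /mnt/old/ /mnt/new/
```

All three files have identical content and the same apparent size; only the allocated size should differ, with dense.img typically taking the full 1 MiB on filesystems that support holes.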

Solution 2:

When I've seen differences like this in the past, it was usually due to a difference in the block size of the filesystems. This is especially likely if the original drive is older. You can verify this with the following (for ext2/3/4 filesystems):

tune2fs -l /dev/sdXX | grep -i 'block size'
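tune2fs only understands ext2/3/4. If one of the disks uses a different filesystem, a rough alternative is to ask stat for the block size of whatever is mounted at a path. This is a sketch assuming GNU stat; the function name and the mountpoints in the comment are placeholders.

```shell
# Fundamental block size, in bytes, of the filesystem holding a path
# (GNU stat assumed; -f queries the filesystem rather than the file).
fs_block_size() {
    stat -f -c '%S' "$1"
}

# Example with placeholder mountpoints:
#   fs_block_size /mnt/old
#   fs_block_size /mnt/new
```

A larger block size on the new disk means every small file rounds up to a bigger minimum allocation, which adds up quickly on trees with many tiny files.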

Solution 3:

Your rsync options won't preserve hard links; try adding -H.

-H, --hard-links This tells rsync to look for hard-linked files in the transfer and link together the corresponding files on the receiving side. Without this option, hard-linked files in the transfer are treated as though they were separate files. When you are updating a non-empty destination, this option only ensures that files that are hard-linked together on the source are hard-linked together on the destination. It does NOT currently endeavor to break already existing hard links on the destination that do not exist between the source files. Note, however, that if one or more extra-linked files have content changes, they will become unlinked when updated (assuming you are not using the --inplace option).
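Before re-running the copy, it may be worth checking whether the source actually contains hard-linked files. A minimal sketch, assuming GNU find (for -printf); the function name is my own:

```shell
# List regular files that have more than one hard link, prefixed by their
# inode number so that names sharing an inode sort next to each other.
list_hardlinked() {
    find "$1" -xdev -type f -links +1 -printf '%i %p\n' | sort -n
}

# Example with a placeholder path:
#   list_hardlinked /mnt/old
```

If this prints nothing for the source tree, hard links are not the cause of the difference. If it prints many entries, each group of names with the same inode number is stored once on the source but once per name on a destination copied without -H.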

Sparse files, such as VM images, may also be inflating usage if their holes are replaced with real blocks during the copy. Try using the --sparse option with rsync.

You could also try using diff to compare the directory trees. See https://stackoverflow.com/questions/4997693/given-two-directory-trees-how-can-i-find-out-which-files-differ
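A minimal version of that comparison could look like this. The wrapper name is hypothetical; diff -r and -q are standard options.

```shell
# Report files whose content differs or that exist on one side only.
# LC_ALL=C keeps the messages in English. diff exits with 1 when it finds
# differences, which is not an error here, so only status 2+ is passed on.
compare_trees() {
    LC_ALL=C diff -rq "$1" "$2" || [ $? -eq 1 ]
}

# Example with placeholder paths:
#   compare_trees /mnt/old /mnt/new
```

Keep in mind that diff compares content, not disk usage: a sparse file and its fully written copy are reported as identical, so an empty result here points back to sparse files, block size or hard links as the explanation.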