ddrescue, “size on disk” lower than total size, with possible impact on performance when writing to NTFS

This answer investigates the behavior of ddrescue to address the main question. If you're not interested in the testing procedure, you may skip to my conclusions and interpretation near the end.

Testbed

$ uname -a
Linux foo 4.2.0-27-generic #32~14.04.1-Ubuntu SMP Fri Jan 22 15:32:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/issue
Ubuntu 14.04.5 LTS \n \l

$ ddrescue -V
GNU ddrescue 1.17
…

The filesystem is btrfs; it shouldn't matter though as long as it supports sparse files.

Testing

First I generated 8 MiB of random data:

dd if=/dev/urandom of=random.chunk bs=1M count=8

Then I attached it as a loop device and saved its name:

loopdev=`sudo losetup -f --show random.chunk`

Next I created another (device-mapper) device consisting of:

  • chunk 0: unreadable, 1 MiB
  • chunk 1: zeros, 2 MiB
  • chunk 2: unreadable, 4 MiB
  • chunk 3: data from random.chunk, 8 MiB
  • chunk 4: unreadable, 16 MiB

The command (using here-document syntax; offsets and lengths are in 512-byte sectors):

sudo dmsetup create mydevice << EOF
    0  2048 error
 2048  4096 zero
 6144  8192 error
14336 16384 linear $loopdev 0
30720 32768 error
EOF

I confirmed with gdisk -l /dev/mapper/mydevice that the total size is 31 MiB as it should be.
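The same check can be done with a little shell arithmetic, since dmsetup tables count in 512-byte sectors:

```shell
# 2048 sectors of 512 bytes each make exactly 1 MiB.
echo $(( 2048 * 512 / 1024 / 1024 ))             # prints 1
# The last chunk starts at sector 30720 and is 32768 sectors long,
# so the device ends at sector 63488, i.e. 31 MiB in total.
echo $(( (30720 + 32768) * 512 / 1024 / 1024 ))  # prints 31
```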

Actual reading is done with:

ddrescue     /dev/mapper/mydevice  normal.raw   normal.log
ddrescue -R  /dev/mapper/mydevice  normalR.raw  normalR.log
ddrescue -S  /dev/mapper/mydevice  sparse.raw   sparse.log
ddrescue -RS /dev/mapper/mydevice  sparseR.raw  sparseR.log

And the results of ls -hls *.raw are

 10M -rw-rw-r-- 1 kamil kamil 15M Sep 10 00:37 normal.raw
 10M -rw-rw-r-- 1 kamil kamil 15M Sep 10 00:37 normalR.raw
8.0M -rw-rw-r-- 1 kamil kamil 15M Sep 10 00:37 sparse.raw
8.0M -rw-rw-r-- 1 kamil kamil 15M Sep 10 00:37 sparseR.raw

To be sure, I confirmed with cmp that all four files are identical when read. All four log files contained the same map of erroneous and healthy sectors.

Notice that

  • the 15 MiB apparent size means the last chunk is missing (31 − 16 = 15);
  • the 10 MiB size on disk corresponds to chunk 1 plus chunk 3 (2 + 8 MiB);
  • the 8 MiB size on disk corresponds to chunk 3 only.

Cleaning

sudo dmsetup remove mydevice
sudo losetup -d $loopdev
unset loopdev
rm random.chunk normal.raw normal.log normalR.raw normalR.log sparse.raw sparse.log sparseR.raw sparseR.log

Conclusions

  • When it comes to file size, it doesn't matter whether you read in reverse (-R) or not.
  • An unreadable chunk at the very end of the input doesn't contribute to the overall size of the output file.
  • Unreadable chunks that do contribute to the overall file size are always sparse (provided the target filesystem supports sparse files, of course).
  • The -S option only affects blocks of zeros that were actually read from the input file.

Interpretation

The sections above presented facts; what follows is closer to my opinion.

It appears ddrescue tries to save you disk space whenever it can do so without additional work. With -S, the tool has to do some computation to check whether a given data block is all zeros. On a read error there is nothing to compute: ddrescue can simply seek past the fragment, making it sparse in the output file at no cost.
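This "free" sparseness can be reproduced outside ddrescue with plain GNU dd (a sketch; the file name is just an example). Writing around a gap and never touching the bytes in between leaves a hole, much like ddrescue seeking past a bad area:

```shell
# Write 1 MiB of data, skip 2 MiB, then write another 1 MiB.
# The seek creates a hole: no data is written there and,
# on a sparse-capable filesystem, no blocks are allocated.
dd if=/dev/urandom of=demo.raw bs=1M count=1 status=none
dd if=/dev/urandom of=demo.raw bs=1M count=1 seek=3 conv=notrunc status=none
ls -hls demo.raw   # apparent size 4 MiB; size on disk roughly 2 MiB
rm demo.raw
```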

Solution

You wrote:

using the -R switch (“reverse”) at the beginning, figuring that it would allocate the whole size of the input HDD right away

We just saw that this is a false assumption. In fact you described what -p does: ddrescue -p preallocates disk space for the output file. When I did this during my tests, the output file was 31 MiB and not sparse (even with -S).


I performed a different test of my own.

– I created a simple template ddrescue log/map file containing this:

0x00000000  0x100000  ?
0x100000  0x3FE00000  +
0x3FF00000  0x100000  ?

(Which means: within 1 GB of data in total, the first and last MB haven't been tried; the rest is considered as “rescued”.)
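The hexadecimal sizes in the map are easy to verify (shell arithmetic accepts 0x literals):

```shell
echo $(( 0x100000 ))                          # 1048576 bytes = 1 MiB
echo $(( 0x100000 + 0x3FE00000 + 0x100000 ))  # 1073741824 bytes = 1 GiB total
```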

– I ran ddrescue with that log/map file, using this command (with the rescued image from the recovery of that 1 TB HDD as input, limiting the output to 1 GB):

ddrescue -s 1073741824 [rescued_image_file] [test1GB] [test1GB.log]

The resulting [test1GB] file has a total size of 1 GB as expected, but a “size on disk” of 2 MB, meaning that only the data which was actually copied (the first and last MB) has been allocated.
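Such “size on disk” measurements can be made with standard GNU tools (a sketch; the file name is just a placeholder). A freshly truncated file is one big hole, so its apparent size and allocated size diverge completely:

```shell
truncate -s 1G placeholder.img          # apparent size 1 GiB, nothing allocated
du -h --apparent-size placeholder.img   # shows 1.0G (total size)
du -h placeholder.img                   # shows 0    (size on disk)
stat -c 'apparent=%s bytes, allocated=%b blocks of %B bytes' placeholder.img
rm placeholder.img
```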

– Then I ran ddrescue with that 1 GB file as input, with no template this time, first without and then with the -S switch (“sparse writes”).

ddrescue [test1GB] [test1GB-NS] [test1GB-NS.log]
ddrescue -S [test1GB] [test1GB-S] [test1GB-S.log]

And it appears that:

  • [test1GB-NS] (non-sparse) has a “size on disk” of 1 GB -- so the whole file has been allocated and copied, even the empty sectors; whereas...
  • [test1GB-S] (sparse) has a “size on disk” of only 1.2 MB (1114112 bytes) -- meaning that the empty sectors have not been allocated, even those contained in the first and last MB.

I thought that “sparseness” was an all-or-nothing concept, just like file compression, yet apparently there is such a thing as a “partially sparse” file, and indeed ddrescue appears to save space that way. This is not necessarily an advantage (and might indeed have an impact on performance); there should be a switch to make it allocate the full size of the output file on the fly (as opposed to preallocating, which can take very long if the input is large), just as happens naturally when writing directly to a device or partition.
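As a possible workaround (untested with ddrescue itself, so treat this as a sketch), the output file could be fully allocated with fallocate before the rescue starts. On filesystems that support it, fallocate reserves extents almost instantly, unlike preallocation schemes that fall back to writing zeros:

```shell
# Reserve the full size up front (fast, extent-based allocation).
# 8 MiB here as a demo; for the real case, use the input device's size.
fallocate -l 8M output.img
ls -hls output.img   # size on disk already matches the apparent size
# ddrescue would then overwrite this file in place, e.g.:
# ddrescue /dev/sdX output.img output.log
rm output.img
```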