quick way to find large stretches of zeros in files

I recently partially recovered a faulty 2.5TB disk. ddrescue created an image, which I can mount in loopback mode. 2.1TB are recovered and 450GB are missing, unfortunately spread all over the disk.

To see which files are affected, I could use filefrag -v and look at the map file generated by ddrescue.

BUT that would take ages. Since it’s only video files I’m recovering, large stretches of zeros are not to be expected in intact data; where they are present, ddrescue didn’t read data from the disk.

So I need a command that scans a file for an (arbitrarily) large patch of all zeros. In reality, these would be always a multiple of 512 bytes, and always begin at a 512 byte address. Is there a command that can scan a file for such a binary byte sequence (i.e. 512× '\0')?

I've modified xenoid's answer to look specifically for null bytes, based on this other question's answer about how to grep for null bytes:

grep -Pal '\x00{512}' the_files
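
To run this over a whole directory tree, such as the loop-mounted image, the same pattern can be combined with find. This is only a minimal sketch, assuming GNU grep built with PCRE support (-P); the function name list_zero_files and the mount point in the usage line are made up for illustration:

```shell
# list_zero_files DIR
# Print every regular file under DIR that contains at least
# 512 consecutive NUL bytes. Assumes GNU grep with -P support.
list_zero_files() {
    # LC_ALL=C makes grep treat the data as raw bytes, not UTF-8 text.
    LC_ALL=C find "$1" -type f -exec grep -Pal '\x00{512}' {} +
}

# Hypothetical usage:
# list_zero_files /mnt/recovered
```

Note that this does not check the 512-byte alignment; it only narrows down the list of files worth a closer look.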

Making grep look explicitly for null characters eludes me. However, making it look for 512 consecutive identical characters (which are about as unlikely) is somewhat simpler:

grep -Eal '(.)\1{511}' the_files

lists the files where a sequence of 512 identical characters has been found. The -a parameter is necessary to make it match null characters (otherwise they are considered as end-of-line characters and ignored).

xenoid's answer will probably find affected files for you quickly. To confirm and analyze further you may run:

<"file" tr '\000-\377' 'oL' | fold -w 512 | grep -vn 'L' | cut -f 1 -d ':'

It works as follows:

"file" is opened and streamed to the first command.
tr converts every null character to o, every non-null character to L.
fold inserts a newline after every 512 characters. At this moment the stream can be treated as pure text.
grep takes lines that do not contain L and prints them with their numbers.
cut isolates these numbers (purges ooo…).

This way you get ordinal numbers of 512-byte chunks filled with zeros. The numbering starts with 1. Pass the output to wc -l to see how many chunks are affected in a given file.
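
Chunk numbers are easier to compare with other tools once they are converted to byte offsets. The following is a minimal sketch that wraps the pipeline above; the function name zero_chunk_offsets is made up, and it assumes a tr that pads the second set with its last character, as GNU tr does:

```shell
# zero_chunk_offsets FILE
# Print the byte offset (decimal) of every 512-byte chunk of FILE
# that consists only of NUL bytes.
zero_chunk_offsets() {
    LC_ALL=C tr '\000-\377' 'oL' < "$1" |   # NUL -> o, anything else -> L
        fold -w 512 |                       # one line per 512-byte chunk
        grep -vn 'L' |                      # keep all-zero chunks, with line numbers
        cut -f 1 -d ':' |                   # keep only the chunk number (1-based)
        awk '{ printf "%.0f\n", ($1 - 1) * 512 }'   # chunk number -> byte offset
}

# Hypothetical usage:
# zero_chunk_offsets some_video.mkv | wc -l
```

printf with %.0f is used instead of a plain print so that offsets beyond 2 GiB are not shown in exponent notation by some awk implementations.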

Different approach, therefore another answer from me.

You can use ddrescue itself to search for zeros. Use --generate-mode.

When ddrescue is invoked with the --generate-mode option it operates in "generate mode", which is different from the default "rescue mode". That is, if you use the --generate-mode option, ddrescue does not rescue anything. It only tries to generate a mapfile for later use.

[…]

ddrescue can in some cases generate an approximate mapfile, from infile and the (partial) copy in outfile, that is almost as good as an exact mapfile. It makes this by simply assuming that sectors containing all zeros were not rescued.

[…]
ddrescue --generate-mode infile outfile mapfile

(Source: GNU ddrescue manual, section on generate mode.)

Let's pretend your file is the outfile from a previous ddrescue run. We cannot use it as infile (because ddrescue refuses to work when infile and outfile are the same file), so we need a dummy one; /dev/zero will do. To find every zero you need -b 1. This is the command (mapfile must not exist):

ddrescue -b 1 --generate-mode /dev/zero file mapfile

Every entry with ? in the list of data blocks inside the mapfile means a block of zeros (with -b 1 one zero is also a block). See mapfile structure for ddrescue. You can then retrieve information from the mapfile.

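For example, the following command will give you the length (hexadecimal, in bytes) of the largest block of zeros (empty output means there was none). Note that sort compares the lengths as strings, so the result is reliable only as long as all of them are printed with the same number of hex digits: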

grep '0x.*0x.*[?]' mapfile | awk -F ' ' '{print $2}' | sort -ru | head -n 1
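
The hexadecimal positions and lengths can also be converted to decimal, which makes numeric sorting straightforward. This is a minimal sketch, assuming the data-block lines have the three columns pos, size and status described in the mapfile documentation; the function name zero_runs is made up:

```shell
# zero_runs MAPFILE
# Print "offset length" in decimal for every data block marked "?".
zero_runs() {
    # Data-block lines have three fields: pos, size, status.
    # Comment lines and the current-status line should not have "?"
    # as their third field, so they are skipped.
    awk '$3 == "?" { print $1, $2 }' "$1" |
    while read -r pos size; do
        # printf accepts the 0x prefix and prints the value in decimal.
        printf '%d %d\n' "$pos" "$size"
    done
}

# Hypothetical usage: the longest run of zeros
# zero_runs mapfile | sort -k2,2n | tail -n 1
```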

To speed things up you may want to use a larger block size (-b), but then blocks of zeros that start within one block and end within the next may go unnoticed even if they are slightly longer than the chosen block size; their offset becomes important.

To not miss any stretch of zeros of length N bytes or more, you need a block size of at most M=$(((N+1)/2)) bytes (e.g. at most 5 for N=10, 6 for N=11). The command

ddrescue -b "$M" --generate-mode /dev/zero file mapfile

will generate a mapfile where every line with ? in the list of data blocks means at least M zeros (at the right offset), and every stretch of N zeros (regardless of its offset) is guaranteed to generate such a line. Since two blocks of M bytes are at least N bytes, the following reasoning applies:

Taking lines with ? from the list of data blocks,

if the length (second column in the mapfile, which ddrescue always records in bytes) is 2×M or greater then you do have N or more zeros at this position;
if the length is less than 2×M then you should investigate further if there are at least N zeros around this position;
if there is no such line then there is no stretch of N zeros in the file for sure.
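
The bound on the block size can be checked with a little shell arithmetic. This is only an illustrative sketch; both function names are made up. The first computes M from N as above, and the second tries every possible offset of an N-byte stretch to confirm that it always contains one whole aligned block of M bytes:

```shell
# max_block_size N
# Largest block size M for which any run of N zero bytes
# is guaranteed to contain one whole aligned block.
max_block_size() {
    echo $(( ($1 + 1) / 2 ))
}

# covers_full_block N M
# Brute-force check: succeed only if a run of N bytes starting at
# every offset 0..M-1 fully contains an M-byte block aligned to M.
covers_full_block() {
    n=$1 m=$2 off=0
    while [ "$off" -lt "$m" ]; do
        # Start of the first aligned block at or after the run's first byte.
        start=$(( (off + m - 1) / m * m ))
        # That block must end no later than the run does.
        [ $(( start + m )) -le $(( off + n )) ] || return 1
        off=$(( off + 1 ))
    done
    return 0
}

# max_block_size 10        prints 5
# covers_full_block 10 5   succeeds
# covers_full_block 10 6   fails: a run starting at offset 1 is missed
```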

"In reality, these would be always a multiple of 512 bytes, and always begin at a 512 byte address"

In this case

ddrescue -b 512 --generate-mode /dev/zero file mapfile

will find and map them all.
