Quick way to find large stretches of zeros in files
I recently partially recovered a faulty 2.5TB disk. ddrescue created an image, which I can mount in loopback mode. 2.1TB was recovered; 450GB is missing, unfortunately spread all over the disk.
To see which files are affected, I could use filefrag -v and look at the map file generated by ddrescue.
BUT that would take ages. Since I'm only recovering video files, large stretches of zeros are not to be expected, yet they are present wherever ddrescue didn't read data from the disk.
So I need a command that scans a file to check whether it contains an (arbitrarily) large patch of all zeros. In reality, these would be always a multiple of 512 bytes, and always begin at a 512 byte address. Is there a command that can scan a file for such a binary byte sequence (i.e. 512× '\0')?
I've modified xenoid's answer to look specifically for null bytes, based on this other question's answer about how to grep for null bytes:
grep -Pal '\x00{512}' the_files
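To apply this to every file on the mounted image in one go, the same grep can be driven by find. This is only a sketch: the function name and the mount point are placeholders, it assumes a GNU grep built with -P (PCRE) support, and the LC_ALL=C prefix is an addition meant to keep the match byte-oriented.

```shell
# list_zeroed_files DIR
# Print every regular file under DIR that contains at least 512
# consecutive null bytes. The function name is made up for this example.
list_zeroed_files() {
    # LC_ALL=C makes grep treat the data as raw bytes rather than text in
    # the current locale; -a searches binary files as if they were text,
    # -l prints only the names of matching files.
    find "$1" -type f -exec env LC_ALL=C grep -Pal '\x00{512}' {} +
}

# Example call (the mount point is a placeholder):
# list_zeroed_files /mnt/recovered
```

Like the original command, this does not care about alignment: any 512 consecutive null bytes count, wherever they start.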
Making grep look explicitly for null characters eludes me. However, making it look for 512 consecutive identical characters (which are about as unlikely) is somewhat simpler:
grep -Eal '(.)\1{511}' the_files
lists the files where a sequence of 512 identical characters has been found. The -a parameter is necessary to make it match null characters (otherwise they are considered as end-of-line characters and ignored).
xenoid's answer will probably find affected files for you quickly. To confirm and analyze further you may run:
<"file" tr '\000-\377' 'oL' | fold -w 512 | grep -vn 'L' | cut -f 1 -d ':'
It works as follows:
- "file" is opened and streamed to the first command.
- tr converts every null character to o, every non-null character to L.
- fold inserts a newline after every 512 characters. At this moment the stream can be treated as pure text.
- grep takes lines that do not contain L and prints them with their numbers.
- cut isolates these numbers (purges ooo…).
This way you get ordinal numbers of 512-byte chunks filled with zeros. The numbering starts with 1. Pass the output to wc -l to see how many chunks are affected in a given file.
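Chunk numbers are easier to compare with ddrescue's map file once they are turned into byte offsets. The sketch below wraps the pipeline above in a function and merges consecutive chunk numbers into ranges. The function names are invented for this example, and the LC_ALL=C prefix is an addition so that tr works on raw bytes.

```shell
# zero_chunks FILE
# Print the ordinal numbers (starting at 1) of the 512-byte chunks of
# FILE that consist only of null bytes.
zero_chunks() {
    LC_ALL=C tr '\000-\377' 'oL' < "$1" | fold -w 512 | grep -vn 'L' | cut -f 1 -d ':'
}

# chunk_ranges
# Read ascending chunk numbers from stdin and print one line per run of
# consecutive chunks: "<byte offset> <length in bytes>".
chunk_ranges() {
    awk -v bs=512 '
        function emit() { printf "%.0f %.0f\n", (first - 1) * bs, (last - first + 1) * bs }
        NR == 1        { first = last = $1; next }
        $1 == last + 1 { last = $1; next }
                       { emit(); first = last = $1 }
        END            { if (NR > 0) emit() }
    '
}

# Example (the file name is a placeholder):
# zero_chunks some_video.mkv | chunk_ranges
```

Note that a final chunk shorter than 512 bytes is reported as well if it holds nothing but zeros, and its length is then counted as a full 512 bytes.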
Different approach, therefore another answer from me.
You can use ddrescue itself to search for zeros. Use --generate-mode.
When ddrescue is invoked with the --generate-mode option it operates in "generate mode", which is different from the default "rescue mode". That is, if you use the --generate-mode option, ddrescue does not rescue anything. It only tries to generate a mapfile for later use. […]
ddrescue can in some cases generate an approximate mapfile, from infile and the (partial) copy in outfile, that is almost as good as an exact mapfile. It makes this by simply assuming that sectors containing all zeros were not rescued. […]
ddrescue --generate-mode infile outfile mapfile
(source)
Let's pretend your file is outfile from a previous ddrescue run. We cannot use it as infile (because ddrescue refuses to work when infile and outfile are the same file), so we need a dummy one; /dev/zero will do. To find every zero you need -b 1. This is the command (mapfile must not exist):
ddrescue -b 1 --generate-mode /dev/zero file mapfile
Every entry with ? in the list of data blocks inside the mapfile means a block of zeros (with -b 1 even a single zero is a block). See mapfile structure for ddrescue. You can then retrieve information from the mapfile.
For example, the following command will give you the length (hexadecimal, in bytes because of -b 1) of the largest block of zeros (empty output means there was none):
grep '0x.*0x.*[?]' mapfile | awk -F ' ' '{print $2}' | sort -ru | head -n 1
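The sort in the command above compares the hexadecimal strings as text, which gives the right answer only while all lengths are printed with the same number of digits. A variant that converts the values to decimal first and sorts numerically could look like the sketch below. The function names are invented, and the mapfile layout (two 0x-prefixed columns followed by a status character) is assumed to match the mapfile structure referred to above.

```shell
# zero_blocks MAPFILE
# Print "<position> <length>" in decimal for every data-block line whose
# status is '?'.
zero_blocks() {
    grep -E '^0x[0-9A-Fa-f]+[[:space:]]+0x[0-9A-Fa-f]+[[:space:]]+\?' "$1" |
    while read -r pos len status; do
        # printf accepts 0x-prefixed arguments for %d and converts them
        printf '%d %d\n' "$pos" "$len"
    done
}

# largest_zero_block MAPFILE
# Print position and length of the longest '?' block (nothing if none).
largest_zero_block() {
    zero_blocks "$1" | sort -k 2,2n | tail -n 1
}

# Example: largest_zero_block mapfile
```

Besides the length, this also prints where the largest block starts, which helps when you want to inspect that region of the file.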
To speed things up you may want to use a larger block size (-b), but then blocks of zeros that start within one block and end within the next may go unnoticed even if they are slightly longer than the chosen block size; their offset becomes important.
To not miss any stretch of zeros of length N bytes or more, you need a block size of at most M=$(((N+1)/2)) bytes (e.g. at most 5 for N=10, 6 for N=11). The command
ddrescue -b "$M" --generate-mode /dev/zero file mapfile
will generate a mapfile where every line with ? in the list of data blocks means at least M zeros (at the right offset), and every stretch of N zeros (regardless of its offset) is guaranteed to generate such a line. Since two blocks of M bytes together are at least N bytes, the following reasoning applies:
Taking lines with ? from the list of data blocks,
- if the length (second column in the mapfile, remember the unit is M) is 0x2 or greater then you do have N or more zeros at this position;
- if the length is 0x1 then you should investigate further if there are at least N zeros around this position;
- if there is no such line then there is no stretch of N zeros in the file for sure.
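The bound on M can be checked with a little shell arithmetic. The sketch below computes the largest safe block size for a given N and then brute-forces every possible starting offset to confirm that a run of N zeros always covers at least one complete, aligned block of M bytes. Both function names are made up for this illustration.

```shell
# max_block_size N
# Largest block size M that cannot miss a run of N zero bytes.
max_block_size() {
    echo $(( ($1 + 1) / 2 ))
}

# run_covers_block N M
# Succeed if a run of N bytes contains a whole M-aligned block of M bytes
# no matter where the run starts. Only start offsets 0..M-1 need to be
# checked, because the pattern repeats every M bytes.
run_covers_block() {
    n=$1 m=$2 start=0
    while [ "$start" -lt "$m" ]; do
        # first block boundary at or after the start of the run
        boundary=$(( (start + m - 1) / m * m ))
        # the block [boundary, boundary + m) must end inside the run
        [ $(( boundary + m )) -le $(( start + n )) ] || return 1
        start=$(( start + 1 ))
    done
    return 0
}

# Examples: max_block_size 10 prints 5, max_block_size 11 prints 6.
# run_covers_block 10 5 succeeds, run_covers_block 10 6 fails.
```

The failing case shows why the limit matters: with N=10 and a block size of 6, a run that starts one byte after a block boundary ends before the next full block does.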
In reality, these would be always a multiple of 512 bytes, and always begin at a 512 byte address
In this case
ddrescue -b 512 --generate-mode /dev/zero file mapfile
will find and map them all.
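With many files to check, the command can be run in a loop, one mapfile per file. This is a minimal sketch, assuming ddrescue is installed; the function name, the separate mapfile directory and the .map suffix are arbitrary choices for this example. Files whose mapfile already exists are skipped, because the mapfile must not exist beforehand.

```shell
# map_zero_sectors MAPDIR FILE...
# Run ddrescue in generate mode on each FILE and store the result as
# MAPDIR/<basename of FILE>.map (files with equal basenames would clash).
map_zero_sectors() {
    mapdir=$1
    shift
    mkdir -p -- "$mapdir" || return 1
    for f in "$@"; do
        map="$mapdir/$(basename -- "$f").map"
        if [ -e "$map" ]; then
            echo "skipping $f: $map already exists" >&2
            continue
        fi
        ddrescue -b 512 --generate-mode /dev/zero "$f" "$map"
    done
}

# Example (paths are placeholders):
# map_zero_sectors ~/zeromaps /mnt/recovered/*.mkv
# Files with zeroed sectors can then be listed with
# grep -l '0x.*0x.*[?]' ~/zeromaps/*.map
```

Keeping the mapfiles in their own directory avoids writing to the mounted image, which may well be read-only.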