Trying to find files that contain only NULs, but getting some others

The files I am trying to find/list:

  • Can be any size (0 bytes accepted)
  • Consist only of ASCII NUL characters (0x00)
  • Must not be listed if they contain any character other than 0x00

The command I have now is:

grep -RLP '[^\x00]' .

This works, but it also lists a file that consists only of two bytes: 0xFF, 0xFE. I don't know why.

Is there any better command to find such files?


Solution 1:

In short, grep interprets your file according to the character encoding of your locale, which is most likely UTF-8. The sequence 0xFF, 0xFE is the byte order mark (BOM) for little-endian UTF-16, but neither byte can ever occur in valid UTF-8, so grep does not treat them as characters and '[^\x00]' does not match.

(In my testing, other sequences involving two 0xFF's or two 0xFE's etc. also failed to match the '[^\x00]' regex, since these bytes are invalid in UTF-8 and are not considered characters.)
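
To see why these two bytes are special, here is a small Python sketch. It is not part of grep, and the helper name is_valid_utf8 is made up for illustration. It shows that 0xFF 0xFE cannot be decoded as UTF-8, whereas a UTF-16 decoder consumes it as a byte order mark and yields no characters at all:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_utf8(b"\x00\x00"))  # NUL bytes are valid UTF-8
    print(is_valid_utf8(b"\xff\xfe"))  # 0xFF and 0xFE never occur in UTF-8
    # The "utf-16" codec reads the leading BOM and strips it from the result.
    print(repr(b"\xff\xfe".decode("utf-16")))
```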

Using a locale that doesn't use Unicode for character types should fix this, which you can accomplish by setting the LC_CTYPE environment variable (or LC_ALL, which overrides LC_CTYPE if it is set in your environment). Use the C locale so that every byte is treated as a single character, with no Unicode decoding:

LC_CTYPE=C grep -RLP '[^\x00]' .

UPDATE: As pointed out by @steeldriver, grep still works line by line, and the newline terminators are not part of the text that is matched. Files consisting only of NUL bytes and newlines will therefore still be listed.

@DavidFoerster's solution using grep's -z option solves this problem: using NUL bytes as the line separators does the trick.
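
A rough illustration of that pitfall, written as a Python emulation rather than anything grep actually runs (the function name is hypothetical): the data is split on newlines first and only the pieces in between are checked, so the newline bytes themselves are never seen by the pattern.

```python
def line_based_has_non_nul(data: bytes) -> bool:
    """Emulate a per-line search for a byte other than NUL (illustrative only).

    Newlines act as line terminators, so they are removed before matching.
    """
    return any(byte != 0 for line in data.split(b"\n") for byte in line)


if __name__ == "__main__":
    # Contains newlines, yet no "line" holds a non-NUL byte,
    # so a line-based check wrongly treats the file as all-NUL.
    print(line_based_has_non_nul(b"\x00\n\x00\n"))
    # A real non-NUL byte inside a line is detected as expected.
    print(line_based_has_non_nul(b"\x00a\n"))
```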

Alternatively, I came up with a short Python 3 script (allzeroes.py) to check whether the file's contents are all zeroes:

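#!/usr/bin/python3
# Exit status 0 if the file consists only of NUL bytes (or is empty), 1 otherwise.
import sys
assert len(sys.argv) == 2  # expect exactly one argument: the file to check
with open(sys.argv[1], 'rb') as f:
    # Read in 4 KiB blocks until read() returns b'' at end of file.
    for block in iter(lambda: f.read(4096), b''):
        # any() is true as soon as the block contains a nonzero byte.
        if any(block):
            sys.exit(1)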

You can use it with find to locate all matches recursively (this assumes allzeroes.py is executable and on your PATH; otherwise give its full path):

$ find . -type f -exec allzeroes.py {} \; -print
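
If you would rather not start one Python process per file, the same check can be folded into a single script that walks the tree itself. This is only a sketch: the function names is_all_nul and find_all_nul_files are my own, it uses the same block-wise read as above, and it does not handle unreadable files (open() will raise an exception for those).

```python
import os


def is_all_nul(path, block_size=4096):
    """Return True if the file at path is empty or contains only NUL bytes."""
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            if any(block):  # true as soon as one nonzero byte is found
                return False
    return True


def find_all_nul_files(root):
    """Yield paths of regular files under root that contain only NUL bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Roughly like find's -type f: skip symlinks and special files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if is_all_nul(path):
                yield path
```

Looping over find_all_nul_files('.') and printing each path should give roughly the same list as the find command above.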

I hope that helps.

Solution 2:

You can abuse grep’s alternative null-terminated line mode and thus search for files that contain only empty lines:

grep -L -z -e . ...

Replace ... with the file set that you want to scan (here: -R .).

Explanation

  • -z, --null-data – Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline.¹
  • -e . – Use . as the search pattern, i.e. match any character.
  • -L, --files-without-match – Suppress normal output; instead print the name of each input file from which no output would normally have been printed. The scanning will stop on the first match.¹
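
To make the logic concrete, here is a byte-level Python emulation of what this command decides for each file. It is a sketch, not grep's implementation: the function name is hypothetical, and it assumes that every non-NUL byte counts as a character that . matches (what grep actually treats as a character depends on the locale).

```python
def listed_by_grep_L_z_dot(data: bytes) -> bool:
    """Emulate the per-file decision of 'grep -L -z -e .' (illustrative only)."""
    # -z: records ("lines") are separated by NUL bytes instead of newlines.
    records = data.split(b"\0")
    # -e . matches any record that holds at least one character;
    # -L lists the file only if no record matches at all.
    return all(len(record) == 0 for record in records)


if __name__ == "__main__":
    print(listed_by_grep_L_z_dot(b""))            # empty file
    print(listed_by_grep_L_z_dot(b"\0" * 100))    # 100 NUL bytes
    print(listed_by_grep_L_z_dot(b"foo\0bar\0"))  # NUL-terminated strings
```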

Test case

Set-up:

: > empty
truncate -s 100 zero
printf '%s\0' foo bar > foobar

Run test:

$ grep -L -z -e . empty zero foobar
empty
zero

¹ From the grep(1) manual page.