Linux shell command to filter a text file by line length

I have a 30 GB disk image of a borked partition (think dd if=/dev/sda1 of=diskimage) that I need to recover some text files from. Data carving tools like foremost only work on files with well-defined headers, i.e. not plain text files, so I've fallen back on my good friend strings.

strings diskimage > diskstrings.txt produced a 3 GB text file containing a bunch of strings, mostly useless stuff, mixed in with the text that I actually want.

Most of the cruft tends to be really long, unbroken strings of gibberish. The stuff I'm interested in is guaranteed to be less than 16 KB, so I'm going to filter the file by line length. Here's the Python script I'm using to do so:

# Copy only the lines shorter than 16384 characters.
# Note: len(line) includes the trailing newline.
with open("infile.txt", "r") as infile, open("outfile.txt", "w") as outfile:
    for line in infile:
        if len(line) < 16384:
            outfile.write(line)

This works, but for future reference: Are there any magical one-line incantations (think awk, sed) that would filter a file by line length?

awk '{ if (length($0) < 16384) print }' yourfile >your_output_file.txt

would print lines shorter than 16 kilobytes, as in your own example.

Or if you fancy Perl:

perl -nle 'if (length($_) < 16384) { print }' yourfile >your_output_file.txt
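
One detail to watch when comparing these with the Python script in the question: awk's length($0) and Perl's length($_) under -l measure the line without its terminator, while len(line) in the Python loop counts the trailing newline, so the two disagree by one character at the boundary. Below is a minimal sketch of a Python version that matches the awk behaviour, assuming the data is read as bytes. The function name filter_short_lines and the tiny limit of 5 are made up for illustration.

```python
import io

def filter_short_lines(src, dst, limit=16384):
    """Copy lines whose length, not counting the newline, is below limit.

    src and dst are binary file-like objects. Returns the number of lines kept.
    """
    kept = 0
    for raw in src:
        body = raw.rstrip(b"\n")  # measure the line without its terminator
        if len(body) < limit:
            dst.write(raw)        # write the original line back unchanged
            kept += 1
    return kept

# In-memory demonstration with a small limit standing in for 16384.
sample = io.BytesIO(b"abc\n" + b"x" * 10 + b"\n" + b"abcd\n")
out = io.BytesIO()
kept = filter_short_lines(sample, out, limit=5)
print(kept, out.getvalue())
```

With limit=5 the four-character line "abcd" is kept here, whereas a test on len(line) would see five bytes (including the newline) and drop it. For real files, the same function can be called with open("infile.txt", "rb") and open("outfile.txt", "wb").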

This is similar to Ansgar's answer, but slightly faster in my tests:

awk 'length($0) < 16384' infile >outfile

It's the same speed as the other awk answers. It relies on the implicit print of a true expression, but doesn't need to take the time to split the line as Ansgar's does.

Note that AWK gives you an if for free. The command above is equivalent to:

awk 'length($0) < 16384 {print}' infile >outfile

There's no explicit if (or its surrounding set of curly braces) as in some of the other answers.
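
To make the "if for free" point concrete, here is a loose Python model of a single awk rule, a pattern with an optional action. It is only an illustration of the semantics described above, not how awk is implemented; run_rule and its arguments are hypothetical names.

```python
def run_rule(lines, pattern, action=None):
    """Apply one awk-style rule to each line and collect what gets 'printed'."""
    emitted = []
    for line in lines:
        if pattern(line):              # the implicit "if" awk gives you
            if action is None:
                emitted.append(line)   # default action: print the line
            else:
                action(line, emitted)  # explicit action, e.g. {print}
    return emitted

lines = ["ab", "abcdefgh", "abc"]
is_short = lambda line: len(line) < 5  # stand-in for length($0) < 16384

implicit = run_rule(lines, is_short)
explicit = run_rule(lines, is_short, lambda line, out: out.append(line))
print(implicit == explicit, implicit)
```

Both calls produce the same list, mirroring how 'length($0) < 16384' and 'length($0) < 16384 {print}' behave identically.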

Here is a way to do it in sed:

sed '/.\{16384\}/d' infile >outfile

or:

sed -r '/.{16384}/d' infile >outfile

Both delete any line that contains 16384 or more characters. The -r flag is GNU sed's spelling; -E selects extended regular expressions on both GNU and BSD sed.

For completeness, here's how you'd use sed to keep only the long lines instead, that is, those of 16384 characters or more:

sed '/^.\{0,16383\}$/d' infile >outfile
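
The ^ and $ anchors in that last command are doing real work: a bounded repetition without them can match a short stretch inside any line, however long the line is. Here is a small sketch of the difference using Python's re module, which writes {m,n} where sed's basic syntax writes \{m,n\}. The limit of 5 and the function names are stand-ins chosen for illustration.

```python
import re

LIMIT = 5  # small stand-in for 16384, purely for illustration

# Matches between 0 and LIMIT-1 characters.
SHORT = re.compile(r".{0,%d}" % (LIMIT - 1))

def matches_unanchored(line):
    """True if a stretch of 0..LIMIT-1 characters occurs anywhere in the line."""
    return SHORT.search(line) is not None

def matches_anchored(line):
    """True only if the whole line is 0..LIMIT-1 characters long."""
    return SHORT.fullmatch(line) is not None

for line in ["abc", "abcdefghij"]:
    print(repr(line), matches_unanchored(line), matches_anchored(line))
```

The unanchored test is true for both lines, since even the empty string satisfies it, while the anchored test is true only for the short one. The same reasoning applies to any regex-based length filter in sed or grep.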

Not really different from the answers already given, but shorter still:

awk -F '' 'NF < 16384' infile >outfile

You can use awk, for example:

$ awk '{ if (length($0) < 16384) { print } }' /path/to/text/file

This will print the lines shorter than 16K characters (16 * 1024).

You can use grep also:

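$ grep -x ".\{0,16384\}" /path/to/text/file

This will print the lines of at most 16384 characters. The -x flag makes the pattern match the whole line; without it, the pattern matches a short stretch of any line, so every line would be printed.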
