Linux shell command to filter a text file by line length
I have a 30gb disk image of a borked partition (think dd if=/dev/sda1 of=diskimage) that I need to recover some text files from. Data carving tools like foremost only work on files with well-defined headers, i.e. not plain text files, so I've fallen back on my good friend strings.
strings diskimage > diskstrings.txt
produced a 3gb text file containing a bunch of strings, mostly useless stuff, mixed in with the text that I actually want.
Most of the cruft tends to be really long, unbroken strings of gibberish. The stuff I'm interested in is guaranteed to be less than 16kb, so I'm going to filter the file by line length. Here's the Python script I'm using to do so:
infile = open("infile.txt", "r")
outfile = open("outfile.txt", "w")
for line in infile:
    # len(line) includes the trailing newline
    if len(line) < 16384:
        outfile.write(line)
infile.close()
outfile.close()
This works, but for future reference: are there any magical one-line incantations (think awk, sed) that would filter a file by line length?
awk '{ if (length($0) < 16384) print }' yourfile >your_output_file.txt
would print lines shorter than 16 kilobytes, as in your own example.
Or if you fancy Perl:
perl -nle 'if (length($_) < 16384) { print }' yourfile >your_output_file.txt
This is similar to Ansgar's answer, but slightly faster in my tests:
awk 'length($0) < 16384' infile >outfile
It's the same speed as the other awk answers. It relies on the implicit print of a true expression, but doesn't need to take the time to split the line as Ansgar's does.
Note that AWK gives you an if for free. The command above is equivalent to:
awk 'length($0) < 16384 {print}' infile >outfile
There's no explicit if (or its surrounding set of curly braces) as in some of the other answers.
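For readers more at home in Python than AWK, the pattern-action idea can be sketched roughly as below. This is only an illustration of the model (a condition selects lines, and the default action prints them), not how awk is implemented; the helper name awk_like and the small limit of 5 are made up for the example.

```python
import io

def awk_like(lines, pattern, action=None):
    """Yield results for lines where pattern(line) is true.

    Mimics awk's 'pattern { action }' rule: when no action is given,
    the line is passed through unchanged (awk's implicit print).
    """
    for raw in lines:
        line = raw.rstrip("\n")  # awk removes the record separator before matching
        if pattern(line):
            yield line if action is None else action(line)

# Hypothetical sample input; a limit of 5 stands in for 16384.
sample = io.StringIO("abc\nabcdefgh\n\nxy\n")
short_lines = list(awk_like(sample, lambda line: len(line) < 5))
print(short_lines)
```

Passing only a condition corresponds to awk 'length($0) < 16384'; passing an explicit action corresponds to spelling out the { ... } part.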
Here is a way to do it in sed:
sed '/.\{16384\}/d' infile >outfile
or:
sed -r '/.{16384}/d' infile >outfile
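Both delete any line that contains 16384 or more characters. The -r option is GNU sed's spelling for extended regular expressions; -E is the equivalent accepted by current GNU and BSD sed.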
For completeness, here's how you'd use sed to keep only the lines at or over your threshold (16384 characters or more):
sed '/^.\{0,16383\}$/d' infile >outfile
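The two kinds of sed pattern can be sanity-checked with Python's re module using a small threshold. This is a minimal sketch, assuming a limit of 4 in place of 16384; the helper names drop_long and drop_short are hypothetical. re.search stands in for sed's address match, which succeeds if the pattern matches anywhere in the line.

```python
import re

LIMIT = 4  # stand-in for 16384, kept small so the behaviour is easy to see

# Like sed '/.\{16384\}/d': unanchored, so it matches any line
# containing LIMIT or more characters.
too_long = re.compile(r".{%d}" % LIMIT)

# Like sed '/^.\{0,16383\}$/d': anchored at both ends, so it matches
# only lines with fewer than LIMIT characters.
short_enough = re.compile(r"^.{0,%d}$" % (LIMIT - 1))

def drop_long(lines):
    """Keep lines shorter than LIMIT (delete the ones matching too_long)."""
    return [l for l in lines if not too_long.search(l)]

def drop_short(lines):
    """Keep lines of LIMIT or more characters (delete the short ones)."""
    return [l for l in lines if not short_enough.search(l)]

lines = ["", "abc", "abcd", "abcdefgh"]
print(drop_long(lines))
print(drop_short(lines))
```

Every line ends up in exactly one of the two results, which is a quick way to confirm the boundary (a line of exactly LIMIT characters counts as long).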
Not really different from the answers already given, but shorter still:
awk -F '' 'NF < 16384' infile >outfile
You can use awk, for example:
$ awk '{ if (length($0) < 16384) { print } }' /path/to/text/file
This will print the lines shorter than 16K characters (16 * 1024).
You can also use grep:
$ grep -x ".\{0,16384\}" /path/to/text/file
This will print the lines of at most 16K characters. The -x option requires the pattern to match the whole line.
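One thing to watch with an interval whose lower bound is zero: grep prints a line if the pattern matches anywhere in it, so the upper bound only limits line length when the match is forced to cover the whole line (with ^...$ or the -x option). Here is a small Python sketch of that difference, with re.search standing in for an unanchored match and re.fullmatch for a whole-line match; the limit of 4 is an arbitrary stand-in for 16384.

```python
import re

LIMIT = 4  # arbitrary small stand-in for 16384
pattern = re.compile(r".{0,%d}" % LIMIT)

lines = ["ab", "abcd", "abcdefgh"]

# Unanchored: matching zero characters is acceptable, so every line matches.
unanchored = [l for l in lines if pattern.search(l)]

# Whole-line: the entire line must fit within LIMIT characters.
whole_line = [l for l in lines if pattern.fullmatch(l)]

print(unanchored)
print(whole_line)
```

Only the whole-line form actually filters out the 8-character line.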