How do I remove all lines in a file that are less than 6 characters?
Solution 1:
There are many ways to do this.
Using grep:
grep -E '^.{6,}$' file.txt >out.txt
Now out.txt will contain the lines that have six or more characters.
Reverse way:
grep -vE '^.{,5}$' file.txt >out.txt
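To see that the two grep forms select the same lines, here is a minimal sketch using Python's re module rather than grep itself. The sample lines and the function names are made up for illustration, and the pattern is written as {0,5}, which is the portable spelling of the GNU shorthand {,5}.

```python
import re

# Hypothetical sample lines (not taken from the question), newline already stripped.
lines = ["a", "bb", "ccc", "dddd", "eeeee", "ffffff", "ggggggg", ""]

keep = re.compile(r"^.{6,}$")    # same pattern as: grep -E '^.{6,}$'
short = re.compile(r"^.{0,5}$")  # lines of at most five characters

def filter_keep(items):
    """Keep the lines that match the 'six or more' pattern."""
    return [s for s in items if keep.search(s)]

def filter_drop_short(items):
    """Drop the lines that match the 'five or fewer' pattern, like grep -v."""
    return [s for s in items if not short.search(s)]

print(filter_keep(lines))
print(filter_drop_short(lines))
```

Both functions should return the same list, since every line either has at most five characters or at least six.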
Using sed, removing lines of length 5 or less:
sed -r '/^.{,5}$/d' file.txt
Reverse way, printing lines of length six or more:
sed -nr '/^.{6,}$/p' file.txt
You can save the output to a different file using the > operator, as with grep, or edit the file in place using sed's -i option:
sed -ri.bak '/^.{,5}$/d' file.txt
The original file will be backed up as file.txt.bak and the modified file will be file.txt.
If you do not want to keep a backup:
sed -ri '/^.{,5}$/d' file.txt
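For comparison, the same "edit in place, optionally keeping a backup" idea can be sketched in Python. This is only an illustration under my own assumptions, not how sed is implemented: the function name, the min_len parameter and the .bak default are hypothetical, and the file is rewritten through a temporary file in the same directory that is then renamed over the original.

```python
import os
import shutil
import tempfile

def remove_short_lines_in_place(path, min_len=6, backup_suffix=".bak"):
    """Rewrite `path`, keeping only lines with at least `min_len` characters.

    If `backup_suffix` is non-empty, the original is first copied to
    path + backup_suffix (similar in spirit to sed -i.bak).
    """
    if backup_suffix:
        shutil.copy2(path, path + backup_suffix)

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as out, open(path) as src:
            for line in src:
                # Measure the line without its trailing newline.
                if len(line.rstrip("\n")) >= min_len:
                    out.write(line)
        shutil.copymode(path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, path)       # rename the temp file over the original
    except BaseException:
        os.unlink(tmp_path)              # clean up the temp file on failure
        raise
```

Calling remove_short_lines_in_place('file.txt') would leave a file.txt.bak next to the filtered file; passing backup_suffix='' skips the backup.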
Using the shell (slower; don't do this, it is shown only for the sake of another method):
while IFS= read -r line; do [ "${#line}" -ge 6 ] && echo "$line"; done <file.txt
Using Python, which is even slower than grep and sed:
#!/usr/bin/env python2
with open('file.txt') as f:
    for line in f:
        if len(line.rstrip('\n')) >= 6:
            print line.rstrip('\n')
Better to use a list comprehension, which is more Pythonic:
#!/usr/bin/env python2
with open('file.txt') as f:
    strip = str.rstrip
    print '\n'.join([line for line in f if len(strip(line, '\n')) >= 6]).rstrip('\n')
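The two scripts above are written for Python 2. A rough Python 3 equivalent, assuming the same file.txt name, could use a generator so that the whole file is never held in memory. The names long_lines and main are mine, chosen only for this sketch.

```python
import sys

def long_lines(stream, min_len=6):
    """Yield the lines of `stream` that have at least `min_len` characters,
    not counting the trailing newline."""
    for line in stream:
        if len(line.rstrip("\n")) >= min_len:
            yield line

def main(path="file.txt"):
    """Print the qualifying lines of `path` to standard output."""
    with open(path) as f:
        sys.stdout.writelines(long_lines(f))

# Example call (assumes file.txt exists in the current directory):
# main("file.txt")
```

Because long_lines accepts any iterable of lines, it can also be fed sys.stdin or a plain list of strings.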
Solution 2:
It's very simple:
grep ...... inputfile > resultfile #There are 6 dots
This is extremely efficient, as grep will not try to parse more than it needs, nor interpret the characters in any way: it simply sends the whole line to stdout (which the shell then redirects to resultfile) as soon as it has seen 6 characters on that line (. in a regexp context matches any single character).
So grep will only output lines having 6 (or more) characters; the others are not output by grep, so they don't make it to resultfile.
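The reason six unanchored dots are enough can be checked with a small sketch in Python's re module (an analogy only, since grep uses its own regex engine): an unanchored search for six arbitrary characters succeeds exactly when the line holds at least six characters. The sample strings and the helper name are invented for the example.

```python
import re

six_any = re.compile(r"......")  # six dots, unanchored, as in: grep ......

def has_six_chars(line):
    """Return True if six consecutive characters can be found anywhere in `line`."""
    return six_any.search(line) is not None

# Hypothetical sample lines, newline already stripped.
samples = ["", "a", "eeeee", "ffffff", "ggggggg", "a b c d", "12345"]

for s in samples:
    # The regex test and the plain length test should always agree.
    print(repr(s), has_six_chars(s), len(s) >= 6)
```

No ^ or $ anchors are needed because the search may start at any position in the line, and a line shorter than six characters has no position where six dots can match.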
Solution 3:
Solution #1: using C
Fastest way: compile and run this C program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_BUFFER_SIZE 1000000
int main(int argc, char *argv[]) {
    int length;
    if(argc == 3)
        length = atoi(argv[2]);
    else
        return 1;
    FILE *file = fopen(argv[1], "r");
    if(file != NULL) {
        char line[MAX_BUFFER_SIZE];
        while(fgets(line, sizeof line, file) != NULL) {
            char *pos;
            if((pos = strchr(line, '\n')) != NULL)
                *pos = '\0';
            if(strlen(line) >= length)
                printf("%s\n", line);
        }
        fclose(file);
    }
    else {
        perror(argv[1]);
        return 1;
    }
    return 0;
}
Compile with gcc program.c -o program, run with ./program file line_length (where file = path to the file and line_length = minimum line length, in your case 6; the maximum line length is limited to 1000000 characters per line, because fgets reads a longer line in chunks and each chunk is then treated as a separate line; you can change this by changing the value of MAX_BUFFER_SIZE).
(Trick to substitute \n with \0 found here.)
Comparison with all the other solutions proposed to this question except the shell solution (test run on a ~91MB file with 10M lines with an average length of 8 characters):
time ./foo file 6
real 0m1.592s
user 0m0.712s
sys 0m0.160s
time grep ...... file
real 0m1.945s
user 0m0.912s
sys 0m0.176s
time grep -E '^.{6,}$' file
real 0m2.178s
user 0m1.124s
sys 0m0.152s
time awk 'length>=6' file
real 0m2.261s
user 0m1.228s
sys 0m0.160s
time perl -lne 'length>=6&&print' file
real 0m4.252s
user 0m3.220s
sys 0m0.164s
time sed -r '/^.{,5}$/d' file >out
real 0m7.947s
user 0m7.064s
sys 0m0.120s
time ./script.py >out
real 0m8.154s
user 0m7.184s
sys 0m0.164s
Solution #2: using AWK:
awk 'length>=6' file
- length>=6: if length>=6 returns TRUE, prints the current record.
Solution #3: using Perl:
perl -lne 'length>=6&&print' file
- If length>=6 returns TRUE, prints the current record.
% cat file
a
bb
ccc
dddd
eeeee
ffffff
ggggggg
% ./foo file 6
ffffff
ggggggg
% awk 'length>=6' file
ffffff
ggggggg
% perl -lne 'length>=6&&print' file
ffffff
ggggggg