Awk command to print all the lines except the last three lines

I want to print all the lines except the last three lines from the input through awk only. Please note that my file contains n number of lines.

For example,

file.txt contains,

foo
bar
foobar
barfoo
last
line

I want the output to be,

foo
bar
foobar

I know it is possible with a combination of tac and sed, or tac and awk:

$ tac file | sed '1,3d' | tac
foo
bar
foobar

$ tac file | awk 'NR==1{next}NR==2{next}NR==3{next}1' | tac
foo
bar
foobar

But I want the output through awk only.

Solution 1:

It's clunky, but you can add every line to an array and, at the end, when you know the length, output everything but the last 3 lines.

... | awk '{l[NR] = $0} END {for (i=1; i<=NR-3; i++) print l[i]}'

Another (more efficient here) approach is manually stacking in three variables:

... | awk 'NR>3 {print a} {a=b; b=c; c=$0}'

A line only reaches a after moving from c to b and then into a, so every line is printed three records late and the last three are never printed. The NR>3 guard skips the first three records, while a is still unset; testing if (a) instead would also swallow any line that is empty or 0. The immediate upsides are that it doesn't store all the content in memory and it shouldn't cause buffering issues (call fflush() after printing if it does). The downside is that it's not simple to scale up: if you want to skip the last 100 lines, you need 100 variables and 100 variable juggles.

If awk had push and pop operators for arrays, it would be easier.
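awk has no push or pop, but an array indexed by NR combined with delete behaves much like a FIFO queue, which is one way to generalise the three-variable idea to any n. A minimal sketch, assuming a POSIX awk; the function name drop_last and the array name buf are made up for illustration:

```shell
# drop_last N: copy stdin to stdout, omitting the final N lines.
# (drop_last and buf are illustrative names, not awk built-ins.)
drop_last() {
    awk -v n="$1" '
        # Once more than n lines have been read, the oldest queued
        # line can no longer be one of the last n: emit and drop it.
        NR > n { print buf[NR - n]; delete buf[NR - n] }
        # Queue the current line under its own line number.
        { buf[NR] = $0 }
    '
}

printf '%s\n' foo bar foobar barfoo last line | drop_last 3
```

On the sample input this should print foo, bar and foobar, and the array never holds more than n lines at a time.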

Or we could pre-calculate the number of lines and how far we actually want to go with $(($(wc -l < file) - 3)). This is relatively useless for streamed content but on a file, works pretty well:

awk -v n=$(($(wc -l < file) - 3)) 'NR<=n' file

Typically you'd just use head though (the negative line count is a GNU extension):

$ seq 6 | head -n-3
1
2
3

Using terdon's benchmark we can actually see how these compare. I thought I'd offer a full comparison though:

head: 0.018s (me)
awk + wc: 0.169s (me)
awk 3 variables: 0.178s (me)
awk double-file: 0.322s (terdon)
awk circular buffer: 0.355s (Scrutinizer)
awk for-loop: 0.693s (me)

The fastest solution is to let a C-optimised utility like head or wc handle the heavy lifting, but in pure awk, the manually rotating stack is king for now.

Solution 2:

For minimal memory usage, you could use a circular buffer:

awk 'NR>n{print A[NR%n]} {A[NR%n]=$0}' n=3 file

By using the mod operator on line numbers we have at most n array entries.

Taking the example of n=3:

On line 1, NR%n equals 1; line 2 produces 2, line 3 produces 0, and line 4 evaluates to 1 again.

Line 1 -> A[1]
Line 2 -> A[2]
Line 3 -> A[0]
Line 4 -> A[1]
Line 5 -> A[2]
...

When we get to line 4, A[NR%n] contains the content of line 1. So that gets printed and A[NR%n] gets the content of line 4. On the next line (line 5), the original content of line 2 gets printed, and so on until we reach the end. What remains unprinted is the content of the buffer, which holds the last 3 lines.
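To watch the buffer at work, here is a small tracing sketch of the same one-liner. It labels each printed line and, in an END block, dumps whatever is still sitting in the buffer. The function name trace_buffer and the label strings are invented for this illustration, and the END loop assumes the input has at least n lines:

```shell
# trace_buffer N: run the circular-buffer filter on stdin, labelling
# what is printed and what is left behind in the buffer.
# (trace_buffer and the labels are illustrative only.)
trace_buffer() {
    awk -v n="$1" '
        NR > n { print "print: " A[NR % n] }   # slot still holds line NR-n
        { A[NR % n] = $0 }                      # overwrite it with line NR
        END {
            # Walk the slots oldest-first, starting after the last write.
            for (i = 1; i <= n; i++)
                print "left in buffer: " A[(NR + i) % n]
        }
    '
}

printf '%s\n' foo bar foobar barfoo last line | trace_buffer 3
```

For the six-line sample this should report foo, bar and foobar as printed, and barfoo, last and line as left in the buffer.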

Solution 3:

You can also process the file twice to avoid keeping anything in memory:

awk '{if(NR==FNR){c++}else if(FNR<=c-3){print}}' file file

The trick here is the NR==FNR test. NR is the line number counted across all input and FNR is the line number within the current file. If more than one file is passed as input, FNR will be equal to NR only while the first file is being processed. This way, we quickly get the number of lines in the first file and save it as c. Since the "two" files are actually the same one, we now know how many lines we want, so on the second pass we only print a line if it is one of them.
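If the NR/FNR distinction is unfamiliar, this short sketch prints both counters while reading the same file twice. The function name show_counters and the temporary file are assumptions made for the demonstration:

```shell
# show_counters FILE: read FILE twice and print NR and FNR for each record.
# (show_counters is an illustrative name.)
show_counters() {
    awk '{ print "NR=" NR, "FNR=" FNR, (NR == FNR ? "first pass" : "second pass") }' "$1" "$1"
}

# Throwaway two-line file for the demonstration.
demo=$(mktemp)
printf '%s\n' foo bar > "$demo"
show_counters "$demo"
rm -f "$demo"
```

With a two-line file, NR should run from 1 to 4 while FNR goes 1, 2, 1, 2, so NR == FNR holds only during the first pass.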

While you might think reading the file twice would make this slower than storing every line in an array, it is actually slightly faster since there is next to no processing going on. Everything is done using internal awk tools (NR and FNR) apart from a single arithmetic comparison. I tested on a 50MB file with one million lines created with this command:

for i in {500000..1000000}; do 
    echo "The quick brown fox jumped over the lazy dog $i" >> file; 
done

As you can see, the times are almost identical, but the approach I provided here is marginally faster than Oli's first suggestion (though slower than the others):

$ for i in {1..10}; do ( 
    time awk '{if(NR==FNR){c++}else if(FNR<=c-3){print}}' file file > /dev/null ) 2>&1 | 
       grep -oP 'real.*?m\K[\d\.]+'; 
  done | awk '{k+=$1}END{print k/10" seconds"}'; 
0.4757 seconds

$  for i in {1..10}; do ( 
    time awk '{l[NR] = $0} END {for (i=1; i<=NR-3; i++) print l[i]}' file > /dev/null ) 2>&1 | 
        grep -oP 'real.*?m\K[\d\.]+'; 
   done | awk '{k+=$1}END{print k/10" seconds"}'; 
0.5347 seconds
