Which is faster at deleting the first line of a file: sed or tail?
In this answer (How can I remove the first line of a file with sed?) there are two ways to delete the first record in a file:
sed '1d' $file >> headerless.txt
---------------- OR ----------------
tail -n +2 $file >> headerless.txt
Personally, I think the tail option is cosmetically more pleasing and more readable, but that's probably because I'm sed-challenged.
Which method is fastest?
Performance of sed vs. tail to remove the first line of a file
TL;DR
sed is very powerful and versatile, but this is what makes it slow, especially for large files with many lines. tail does just one simple thing, but it does that one thing well and fast, even for bigger files with many lines.
For small and medium-sized files, sed and tail perform similarly fast (or slow, depending on your expectations). However, for larger input files (multiple MBs), the performance difference grows significantly (an order of magnitude for files in the range of hundreds of MBs), with tail clearly outperforming sed.
Experiment
General Preparations:
Our commands to analyze are:
sed '1d' testfile > /dev/null
tail -n +2 testfile > /dev/null
Note that I'm redirecting the output to /dev/null each time to eliminate terminal output or file writes as a performance bottleneck.
Let's set up a RAM disk to eliminate disk I/O as a potential bottleneck. I personally have a tmpfs mounted at /tmp, so I simply placed my testfile there for this experiment.
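If you don't already have a tmpfs mount available, a minimal sketch for setting one up (the mount point and size here are my own assumptions, not part of the original setup):
sudo mkdir -p /mnt/ramdisk                         # hypothetical mount point
sudo mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk  # 1 GB RAM-backed filesystem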
Then I create a random test file containing a specified number of lines, $numoflines, with random line lengths and random data, using this command (note that it's definitely not optimal; it becomes really slow above roughly 2M lines, but who cares, it's not the thing we're analyzing):
cat /dev/urandom | base64 -w0 | tr 'n' '\n' | head -n "$numoflines" > testfile
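The tr 'n' '\n' part is what produces the random line lengths: the base64 stream contains the letter n at random positions, and each occurrence gets replaced by a newline. A toy illustration of the effect (my own example string, not from the original answer):
$ echo 'Banana' | tr 'n' '\n'
Ba
a
a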
Oh, and by the way: my test laptop is running Ubuntu 16.04, 64-bit, on an Intel i5-6200U CPU. Just for comparison.
Timing big files:
Setting up a huge testfile:
Running the command above with numoflines=10000000 produced a random file containing 10M lines, occupying a bit over 600 MB. It's quite huge, but let's start with it, because we can:
$ wc -l testfile
10000000 testfile
$ du -h testfile
611M testfile
$ head -n 3 testfile
qOWrzWppWJxx0e59o2uuvkrfjQbzos8Z0RWcCQPMGFPueRKqoy1mpgjHcSgtsRXLrZ8S4CU8w6O6pxkKa3JbJD7QNyiHb4o95TSKkdTBYs8uUOCRKPu6BbvG
NklpTCRzUgZK
O/lcQwmJXl1CGr5vQAbpM7TRNkx6XusYrO
Perform the timed run with our huge testfile:
Now let's do just a single timed run with both commands first, to estimate what magnitudes we're working with.
$ time sed '1d' testfile > /dev/null
real 0m2.104s
user 0m1.944s
sys 0m0.156s
$ time tail -n +2 testfile > /dev/null
real 0m0.181s
user 0m0.044s
sys 0m0.132s
We already see a really clear result for big files: tail is an order of magnitude faster than sed. But just for fun, and to be sure there are no random side effects making a big difference, let's do it 100 times:
$ time for i in {1..100}; do sed '1d' testfile > /dev/null; done
real 3m36.756s
user 3m19.756s
sys 0m15.792s
$ time for i in {1..100}; do tail -n +2 testfile > /dev/null; done
real 0m14.573s
user 0m1.876s
sys 0m12.420s
The conclusion stays the same: sed is inefficient at removing the first line of a big file; tail should be used there.
And yes, I know Bash's loop constructs are slow, but we're only doing relatively few iterations here, and the time a plain loop takes is insignificant compared to the sed/tail runtimes anyway.
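If you want to convince yourself of that, you can time the loop construct on its own (a sanity check I'm adding here, not part of the original measurements):
time for i in {1..100}; do :; done   # ':' is the shell's no-op builtin, so this measures pure loop overhead
This should finish in a small fraction of a second, far below the sed/tail runtimes above.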
Timing small files:
Setting up a small testfile:
Now for completeness, let's look at the more common case that you have a small input file in the kB range. Let's create a random input file with numoflines=100, looking like this:
$ wc -l testfile
100 testfile
$ du -h testfile
8,0K testfile
$ head -n 3 testfile
tYMWxhi7GqV0DjWd
pemd0y3NgfBK4G4ho/
aItY/8crld2tZvsU5ly
Perform the timed run with our small testfile:
As we can expect from experience that the timings for such small files will be in the range of a few milliseconds, let's just do 1000 iterations right away:
$ time for i in {1..1000}; do sed '1d' testfile > /dev/null; done
real 0m7.811s
user 0m0.412s
sys 0m7.020s
$ time for i in {1..1000}; do tail -n +2 testfile > /dev/null; done
real 0m7.485s
user 0m0.292s
sys 0m6.020s
As you can see, the timings are quite similar; there's not much to interpret or wonder about. For small files, both tools are equally well suited.
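A plausible explanation (my own aside, not from the original answer) is that for such small files the per-iteration cost is dominated by process startup rather than actual line processing. You can gauge that startup floor by timing 1000 launches of a program that does nothing:
time for i in {1..1000}; do /bin/true; done   # forks and execs an external no-op 1000 times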
Here's another alternative, using just bash builtins and cat:
{ read ; cat > headerless.txt; } < $file
$file is redirected into the { } command grouping. The read simply reads and discards the first line. The rest of the stream is then consumed by cat, which writes it to the destination file.
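A quick toy run to show the effect (the sample file name and contents are made up for illustration):
$ printf 'header\nline 1\nline 2\n' > sample.txt
$ { read ; cat > headerless.txt; } < sample.txt
$ cat headerless.txt
line 1
line 2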
On my Ubuntu 16.04, this and the tail solution perform very similarly. I created a largish test file with seq:
$ seq 100000000 > 100M.txt
$ ls -l 100M.txt
-rw-rw-r-- 1 ubuntu ubuntu 888888898 Dec 20 17:04 100M.txt
$
tail solution:
$ time tail -n +2 100M.txt > headerless.txt
real 0m1.469s
user 0m0.052s
sys 0m0.784s
$
cat/brace solution:
$ time { read ; cat > headerless.txt; } < 100M.txt
real 0m1.877s
user 0m0.000s
sys 0m0.736s
$
I only have an Ubuntu VM handy right now, though, and I saw significant variation in the timings of both, though they're all in the same ballpark.
Trying it on my system, and prefixing each command with time, I got the following results:
sed:
real 0m0.129s
user 0m0.012s
sys 0m0.000s
and tail:
real 0m0.003s
user 0m0.000s
sys 0m0.000s
which suggests that, on my system at least (an AMD FX 8250 running Ubuntu 16.04), tail is significantly faster. The test file had 10,000 lines with a size of 540 kB. The file was read from an HDD.
There is no objective way to say which is better, because sed and tail aren't the only things running on a system during program execution. A lot of factors, such as disk I/O, network I/O, and CPU interrupts for higher-priority processes, influence how fast your program will run.
Both of them are written in C, so this is not a language issue but more of an environmental one. For example, I have an SSD, and on my system this will take microseconds, but the same file on a hard drive will take longer, because HDDs are significantly slower. So hardware plays a role in this, too.
There are a few things that you may want to keep in mind when considering which command to choose:
- What is your purpose? sed is a stream editor for transforming text. tail is for outputting specific lines of text. If you want to deal with lines and only print them out, use tail. If you want to edit the text, use sed. (See the short sketch after this list.)
- tail has a far simpler syntax than sed, so use what you can read yourself and what others can read.
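To make that distinction concrete, here's a minimal sketch (the file name and the pattern are made up for illustration):
tail -n +2 log.txt                # selection: print everything except the first line
sed 's/ERROR/WARNING/' log.txt    # transformation: rewrite the text as it streams through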
Another important factor is the amount of data you're processing. Small files won't give you any performance difference. The picture gets interesting when you're dealing with big files. With a 2 GB BIGFILE.txt, we can see that sed makes far more system calls than tail (about twice as many read calls, i.e. it reads the file in smaller blocks) and runs considerably slower.
bash-4.3$ du -sh BIGFILE.txt
2.0G BIGFILE.txt
bash-4.3$ strace -c sed '1d' ./BIGFILE.txt > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
59.38 0.079781 0 517051 read
40.62 0.054570 0 517042 write
0.00 0.000000 0 10 1 open
0.00 0.000000 0 11 close
0.00 0.000000 0 10 fstat
0.00 0.000000 0 19 mmap
0.00 0.000000 0 12 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 3 brk
0.00 0.000000 0 2 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 1 1 ioctl
0.00 0.000000 0 7 7 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 2 2 statfs
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 0.134351 1034177 11 total
bash-4.3$ strace -c tail -n +2 ./BIGFILE.txt > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
62.30 0.148821 0 517042 write
37.70 0.090044 0 258525 read
0.00 0.000000 0 9 3 open
0.00 0.000000 0 8 close
0.00 0.000000 0 7 fstat
0.00 0.000000 0 10 mmap
0.00 0.000000 0 4 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 3 brk
0.00 0.000000 0 1 1 ioctl
0.00 0.000000 0 3 3 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00 0.238865 775615 7 total