Bash tool to get nth line from a file
Solution 1:
head
and pipe with tail
will be slow for a huge file. I would suggest sed
like this:
sed 'NUMq;d' file
Where NUM
is the number of the line you want to print; so, for example, sed '10q;d' file
to print the 10th line of file
.
Explanation:
NUMq
will quit immediately when the line number is NUM
.
d
will delete the line instead of printing it; this is inhibited on the last line because the q
causes the rest of the script to be skipped when quitting.
If you have NUM
in a variable, you will want to use double quotes instead of single:
sed "${NUM}q;d" file
Solution 2:
sed -n '2p' < file.txt
will print 2nd line
sed -n '2011p' < file.txt
2011th line
sed -n '10,33p' < file.txt
line 10 up to line 33
sed -n '1p;3p' < file.txt
1st and 3th line
and so on...
For adding lines with sed, you can check this:
sed: insert a line in a certain position
Solution 3:
I have a unique situation where I can benchmark the solutions proposed on this page, and so I'm writing this answer as a consolidation of the proposed solutions with included run times for each.
Set Up
I have a 3.261 gigabyte ASCII text data file with one key-value pair per row. The file contains 3,339,550,320 rows in total and defies opening in any editor I have tried, including my go-to Vim. I need to subset this file in order to investigate some of the values that I've discovered only start around row ~500,000,000.
Because the file has so many rows:
- I need to extract only a subset of the rows to do anything useful with the data.
- Reading through every row leading up to the values I care about is going to take a long time.
- If the solution reads past the rows I care about and continues reading the rest of the file it will waste time reading almost 3 billion irrelevant rows and take 6x longer than necessary.
My best-case-scenario is a solution that extracts only a single line from the file without reading any of the other rows in the file, but I can't think of how I would accomplish this in Bash.
For the purposes of my sanity I'm not going to be trying to read the full 500,000,000 lines I'd need for my own problem. Instead I'll be trying to extract row 50,000,000 out of 3,339,550,320 (which means reading the full file will take 60x longer than necessary).
I will be using the time
built-in to benchmark each command.
Baseline
First let's see how the head
tail
solution:
$ time head -50000000 myfile.ascii | tail -1
pgm_icnt = 0
real 1m15.321s
The baseline for row 50 million is 00:01:15.321, if I'd gone straight for row 500 million it'd probably be ~12.5 minutes.
cut
I'm dubious of this one, but it's worth a shot:
$ time cut -f50000000 -d$'\n' myfile.ascii
pgm_icnt = 0
real 5m12.156s
This one took 00:05:12.156 to run, which is much slower than the baseline! I'm not sure whether it read through the entire file or just up to line 50 million before stopping, but regardless this doesn't seem like a viable solution to the problem.
AWK
I only ran the solution with the exit
because I wasn't going to wait for the full file to run:
$ time awk 'NR == 50000000 {print; exit}' myfile.ascii
pgm_icnt = 0
real 1m16.583s
This code ran in 00:01:16.583, which is only ~1 second slower, but still not an improvement on the baseline. At this rate if the exit command had been excluded it would have probably taken around ~76 minutes to read the entire file!
Perl
I ran the existing Perl solution as well:
$ time perl -wnl -e '$.== 50000000 && print && exit;' myfile.ascii
pgm_icnt = 0
real 1m13.146s
This code ran in 00:01:13.146, which is ~2 seconds faster than the baseline. If I'd run it on the full 500,000,000 it would probably take ~12 minutes.
sed
The top answer on the board, here's my result:
$ time sed "50000000q;d" myfile.ascii
pgm_icnt = 0
real 1m12.705s
This code ran in 00:01:12.705, which is 3 seconds faster than the baseline, and ~0.4 seconds faster than Perl. If I'd run it on the full 500,000,000 rows it would have probably taken ~12 minutes.
mapfile
I have bash 3.1 and therefore cannot test the mapfile solution.
Conclusion
It looks like, for the most part, it's difficult to improve upon the head
tail
solution. At best the sed
solution provides a ~3% increase in efficiency.
(percentages calculated with the formula % = (runtime/baseline - 1) * 100
)
Row 50,000,000
- 00:01:12.705 (-00:00:02.616 = -3.47%)
sed
- 00:01:13.146 (-00:00:02.175 = -2.89%)
perl
- 00:01:15.321 (+00:00:00.000 = +0.00%)
head|tail
- 00:01:16.583 (+00:00:01.262 = +1.68%)
awk
- 00:05:12.156 (+00:03:56.835 = +314.43%)
cut
Row 500,000,000
- 00:12:07.050 (-00:00:26.160)
sed
- 00:12:11.460 (-00:00:21.750)
perl
- 00:12:33.210 (+00:00:00.000)
head|tail
- 00:12:45.830 (+00:00:12.620)
awk
- 00:52:01.560 (+00:40:31.650)
cut
Row 3,338,559,320
- 01:20:54.599 (-00:03:05.327)
sed
- 01:21:24.045 (-00:02:25.227)
perl
- 01:23:49.273 (+00:00:00.000)
head|tail
- 01:25:13.548 (+00:02:35.735)
awk
- 05:47:23.026 (+04:24:26.246)
cut
Solution 4:
With awk
it is pretty fast:
awk 'NR == num_line' file
When this is true, the default behaviour of awk
is performed: {print $0}
.
Alternative versions
If your file happens to be huge, you'd better exit
after reading the required line. This way you save CPU time See time comparison at the end of the answer.
awk 'NR == num_line {print; exit}' file
If you want to give the line number from a bash variable you can use:
awk 'NR == n' n=$num file
awk -v n=$num 'NR == n' file # equivalent
See how much time is saved by using exit
, specially if the line happens to be in the first part of the file:
# Let's create a 10M lines file
for ((i=0; i<100000; i++)); do echo "bla bla"; done > 100Klines
for ((i=0; i<100; i++)); do cat 100Klines; done > 10Mlines
$ time awk 'NR == 1234567 {print}' 10Mlines
bla bla
real 0m1.303s
user 0m1.246s
sys 0m0.042s
$ time awk 'NR == 1234567 {print; exit}' 10Mlines
bla bla
real 0m0.198s
user 0m0.178s
sys 0m0.013s
So the difference is 0.198s vs 1.303s, around 6x times faster.