Extract multiple lines from .txt file with sed or awk using a second .txt file with the line numbers to be extracted
I am trying to extract multiple lines from a raw_data.txt file with roughly 200x10E6 (about 200 million) lines. The line numbers to be extracted (more than 5000 of them) are listed in a second file named lines.txt (one line number per row).
Based on what I found out here, I tried two approaches with awk and sed:
awk 'NR == FNR {nums[$1]; next} FNR in nums' lines.txt raw_data.txt > selected_data.txt
and
sed 's/$/p/' lines.txt | sed -n -f - raw_data.txt > selected_data.txt
In both cases, selected_data.txt came out empty. I assume the large number of lines to be selected and the very large raw_data.txt are preventing correct execution, since both commands work when I select only a few lines (<5).
Any idea to solve this problem? Thanks.
Suppose you have these two files:
cat lines
1
5
6
12
cat file.txt
line 1
line 2
line 3
...
line 23
line 24
line 25
You can read lines
first, then use that to decide which lines to print from file.txt,
like so:
awk 'FNR==NR{line[$1]; next}
FNR in line' lines file.txt
line 1
line 5
line 6
line 12
The most likely reason this is not working on your machine is that the line endings are not what awk
expects.
Try this:
awk '{printf("%s: %s\n", FNR, $1)}' lines
1: 1
2: 5
3: 6
4: 12
You can also use the Unix file
utility which will show one of:
file file.txt
file.txt: ASCII text
Or:
file file.txt
file.txt: ASCII text, with CRLF line terminators
If your lines.txt has \r\n
endings but awk splits records on \n
alone, the trailing \r stays attached to the first field, so the stored keys never match FNR. That may be your issue.
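You can reproduce the empty-output symptom with a tiny example (the filenames here are just placeholders). The portable fix is to strip a trailing \r inside awk before using the field as an array key:

```shell
# Reproduce: a lines file saved with DOS (CRLF) endings. The array keys
# become "1\r" and "3\r", so 'FNR in line' never matches and nothing prints.
printf '1\r\n3\r\n' > lines.txt
printf 'alpha\nbravo\ncharlie\ndelta\n' > raw_data.txt

awk 'NR==FNR{line[$1]; next} FNR in line' lines.txt raw_data.txt
# (no output -- the same empty result as in the question)

# Portable fix: delete a trailing carriage return before storing the key.
awk 'NR==FNR{sub(/\r$/,""); line[$1]; next} FNR in line' lines.txt raw_data.txt
# prints: alpha charlie (lines 1 and 3)
```

The sub() call works in any POSIX awk, so this does not depend on GNU extensions.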
Use dos2unix
or unix2dos
to fix that. Or set the appropriate RS=<whatever your line endings are>
in awk. If you have GNU awk, you can do RS="\r?\n"
and it will work with both DOS and Unix line endings.
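Putting it together, and assuming the problem really is CRLF endings in lines.txt, either of these sketches of the original commands should produce a non-empty selected_data.txt:

```shell
# GNU awk: a regex record separator accepts both Unix and DOS endings.
awk 'BEGIN{RS="\r?\n"} NR==FNR{nums[$1]; next} FNR in nums' lines.txt raw_data.txt > selected_data.txt

# The sed variant from the question, with carriage returns deleted first
# so the generated script is "1p", "5p", ... rather than "1\rp".
tr -d '\r' < lines.txt | sed 's/$/p/' | sed -n -f - raw_data.txt > selected_data.txt
```

Note that sed -f - (reading the script from stdin) is a GNU sed feature; on other systems, write the generated script to a temporary file instead.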