Extract multiple lines from .txt file with sed or awk using a second .txt file with the line numbers to be extracted
I am trying to extract multiple lines from a raw_data.txt file with roughly 200x10E6 (about 200 million) lines. The line numbers to be extracted (more than 5000 of them) are listed in a second file named lines.txt (one line number per row).
Based on what I found out here, I tried two approaches with awk and sed:
awk 'NR == FNR {nums[$1]; next} FNR in nums' lines.txt raw_data.txt > selected_data.txt
and
sed 's/$/p/' lines.txt | sed -n -f - raw_data.txt > selected_data.txt
In both cases, selected_data.txt came out empty. I assume the large number of lines to be selected and the very large raw_data.txt are preventing correct execution, since both commands work when I select only a few lines (<5).
Any idea to solve this problem? Thanks.
Suppose you have these two files:
cat lines
1
5
6
12
cat file.txt
line 1
line 2
line 3
...
line 23
line 24
line 25
You can read lines
first, then use that to decide which lines to print from file.txt,
like so:
awk 'FNR==NR{line[$1]; next}
FNR in line' lines file.txt
line 1
line 5
line 6
line 12
The most likely reason this is not working on your machine is that the line endings are not what awk
expects.
Try this:
awk '{printf("%s: %s\n", FNR, $1)}' lines
1: 1
2: 5
3: 6
4: 12
You can also use the Unix file
utility which will show one of:
file file.txt
file.txt: ASCII text
Or:
file file.txt
file.txt: ASCII text, with CRLF line terminators
If your lines.txt has \r\n
endings but awk splits records on \n
alone, the trailing \r stays attached to the first field, so the stored keys never match FNR. That may be your issue.
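You can reproduce the empty-output symptom with a tiny example (the filenames here are just placeholders). The portable fix is to strip a trailing \r inside awk before using the field as an array key:

```shell
# Reproduce: a lines file saved with DOS (CRLF) endings. The array keys
# become "1\r" and "3\r", so 'FNR in line' never matches and nothing prints.
printf '1\r\n3\r\n' > lines.txt
printf 'alpha\nbravo\ncharlie\ndelta\n' > raw_data.txt

awk 'NR==FNR{line[$1]; next} FNR in line' lines.txt raw_data.txt
# (no output -- the same empty result as in the question)

# Portable fix: delete a trailing carriage return before storing the key.
awk 'NR==FNR{sub(/\r$/,""); line[$1]; next} FNR in line' lines.txt raw_data.txt
# prints: alpha charlie (lines 1 and 3)
```

The sub() call works in any POSIX awk, so this does not depend on GNU extensions.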
Use dos2unix
or unix2dos
to fix that. Or set the appropriate RS=<whatever your line endings are>
in awk. If you have GNU awk, you can do RS="\r?\n"
and it will work with both DOS and Unix line endings.
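Putting it together, and assuming the problem really is CRLF endings in lines.txt, either of these sketches of the original commands should produce a non-empty selected_data.txt:

```shell
# GNU awk: a regex record separator accepts both Unix and DOS endings.
awk 'BEGIN{RS="\r?\n"} NR==FNR{nums[$1]; next} FNR in nums' lines.txt raw_data.txt > selected_data.txt

# The sed variant from the question, with carriage returns deleted first
# so the generated script is "1p", "5p", ... rather than "1\rp".
tr -d '\r' < lines.txt | sed 's/$/p/' | sed -n -f - raw_data.txt > selected_data.txt
```

Note that sed -f - (reading the script from stdin) is a GNU sed feature; on other systems, write the generated script to a temporary file instead.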