How can I extract records from a file if they include a specific set of strings?
I am analyzing a file, xyz.txt, which contains hyphen-separated records. I want to extract the records that contain the strings FADED: 100, AM:FF and GG, and then write them to a new file, faded100.txt. The source file includes more than 40 thousand records, which look like this:
--- -------- --- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --- --- -- --
rtuyss A/A go go go go go go go go go go go go go go IRE AP QQ Z
ORDER xxxxxxx1
country: 201 NVDS TEMPROR EXTREME
BUS TIME: TRASS: 12 AIDED: 12 FADED: 100
U U U u U A U O O O O O O O
GG Y Y Y Y Y O Y O O O O O O O POU
ATM UNITED # AM:FF Y Y Y Y Y O Y O O O O O O O POU POU POU POU
--- -------- --- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --- --- -- --
rtuyss A/A go go go go go go go go go go go go go go IRE AP QQ Z
ORDER xxxxxxx1
country: 201 NVDS TEMPROR EXTREME
BUS TIME: TRASS: 12 AIDED: 12 FADED: 200
U U U u U A U O O O O O O O
ZZ Y Y Y Y Y O Y O O O O O O O POU
ATM UNITED # AM:FF Y Y Y Y Y O Y O O O O O O O POU POU POU POU
--- -------- --- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --- --- -- --
rtuyss A/A go go go go go go go go go go go go go go IRE AP QQ Z
ORDER xxxxxxx1
country: 201 NVDS TEMPROR EXTREME
BUS TIME: TRASS: 12 AIDED: 12 FADED: 100
U U U u U A U O O O O O O O
IP Y Y Y Y Y O Y O O O O O O O POU
ATM UNITED # AM:FF Y Y Y Y Y O Y O O O O O O O POU POU POU POU
--- -------- --- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --- --- -- --
Solution 1:
Extracting records from a text file if a set of strings can be found in the record
The script below extracts the records from your file that meet the conditions described in your question. If the order of the strings matters, see the section "If the order is important" below.
Since the script reads the file line by line and keeps only the current record in memory, it should be pretty fast on big files.
The script
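#!/usr/bin/env python3
import sys

#--- set the strings that must all be found in a record below
checks = ["FADED: 100", "AM:FF", "GG"]
#---

f = sys.argv[1]; out = sys.argv[2]; rec = []; test = []

# "a+" appends to the output file if it already exists
with open(f) as src, open(out, "a+") as targ:
    for l in src:
        rec.append(l)
        if l.startswith("---"):
            # a separator line closes the record: write it if all strings were found
            if len(test) == 3:
                for line in rec:
                    targ.write(line)
            rec = []; test = []
        else:
            # note the index of each string found in the line, once per record
            for i, s in enumerate(checks):
                if s in l and i not in test:
                    test.append(i)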
What the script does exactly
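- The script reads the file line by line and collects the lines of the current record. While doing so, it keeps track of whether any of the strings
["FADED: 100", "AM:FF", "GG"]
occurs in a line.
- When the separator line (starting with ---) closes the record, the record is written to the output file if all three strings were found. If not, the record is dropped from the "cache". Either way, the next record is loaded, and so on.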
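The steps above can be sketched as a small generator, which makes the logic easy to try on a few lines in memory. This is only an illustration, not a replacement for the script: the name matching_records is made up here, the sample lines are shortened stand-ins for real records, and the sketch tests the joined text of each record instead of tracking indices line by line.

```python
def matching_records(lines, checks):
    """Yield each record (a list of lines) that contains every string in checks.

    Assumption: a record ends at a line starting with "---", as in the question.
    """
    rec = []
    for line in lines:
        rec.append(line)
        if line.startswith("---"):
            text = "".join(rec)
            if all(s in text for s in checks):
                yield rec
            rec = []  # drop the cached record and start the next one


# Shortened, hypothetical sample: two records between separator lines
sample = [
    "--- --\n",
    "FADED: 100\n",
    "GG Y Y\n",
    "# AM:FF Y\n",
    "--- --\n",
    "FADED: 200\n",
    "ZZ Y Y\n",
    "# AM:FF Y\n",
    "--- --\n",
]
found = list(matching_records(sample, ["FADED: 100", "AM:FF", "GG"]))
print(len(found))  # 1
```

Because the function accepts any iterable of lines, an open file object can be passed in as well, so the file is still read one line at a time.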
If the order is important
If the order in which (the lines containing) the strings appear inside your record is important, you can replace the line:
if len(test) == 3:
by:
if test == [0, 2, 1]:
where the numbers refer to the indices of the strings in the list checks = ["FADED: 100", "AM:FF", "GG"] (0 is the first string). Here, [0, 2, 1] means FADED: 100 first, then GG, then AM:FF, which is the order in your example.
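To see where a list such as [0, 2, 1] comes from, the sketch below computes the order of first appearance for the three relevant lines of the first example record. The helper name first_seen_order is hypothetical and used only for illustration. Note that if two strings occur on the same line, this sketch orders them by their position in checks, not by their position in the line.

```python
def first_seen_order(record_lines, checks):
    """Return the indices of the strings in checks, in order of first appearance."""
    order = []
    for line in record_lines:
        for i, s in enumerate(checks):
            if s in line and i not in order:
                order.append(i)
    return order


checks = ["FADED: 100", "AM:FF", "GG"]
record = [
    "BUS TIME: TRASS: 12 AIDED: 12 FADED: 100\n",
    "GG Y Y Y Y Y O Y O O O O O O O POU\n",
    "ATM UNITED # AM:FF Y Y Y Y Y O Y O O O O O O O POU POU POU POU\n",
]
print(first_seen_order(record, checks))  # [0, 2, 1]
```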
How to use
- Copy the script into an empty file and save it as filter_records.py
- Run the script with the source (the file with your current records) and the output file as arguments, e.g.:
python3 /path/to/filter_records.py /path/to/inputfile.txt /path/to/outputfile.txt
The result (on your small example)
rtuyss A/A go go go go go go go go go go go go go go IRE AP QQ Z
ORDER xxxxxxx1
country: 201 NVDS TEMPROR EXTREME
BUS TIME: TRASS: 12 AIDED: 12 FADED: 100
U U U u U A U O O O O O O O
GG Y Y Y Y Y O Y O O O O O O O POU
ATM UNITED # AM:FF Y Y Y Y Y O Y O O O O O O O POU POU POU POU
--- -------- --- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --- --- -- --
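On the full file, a quick sanity check is to count how many records ended up in the output. Since each record is written together with its closing separator line, counting the lines that start with --- gives the number of extracted records. The helper below is a hypothetical sketch, and the path in the comment is a placeholder.

```python
def count_records(lines):
    """Count records, assuming every record ends with a line starting with '---'."""
    return sum(1 for line in lines if line.startswith("---"))


# Usage on a real file (placeholder path):
# with open("/path/to/outputfile.txt") as f:
#     print(count_records(f))

print(count_records(["ORDER xxxxxxx1\n", "--- -- --\n"]))  # 1
```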