How can I extract records from a file if they include a specific set of strings?

I am analyzing a file, xyz.txt, which contains hyphen-separated records. I want to extract the records that contain the strings FADED: 100, AM:FF and GG, and then write them to a new file, faded100.txt. The source file includes more than 40 thousand records, which look like the ones below.

--- --------  --- -- -- -- -- -- -- -- -- -- -- -- -- -- --  --- --- -- --

    rtuyss  A/A go go go go go go go go go go go go go go  IRE AP  QQ Z
ORDER xxxxxxx1

country: 201  NVDS        TEMPROR   EXTREME

BUS TIME:       TRASS: 12       AIDED: 12        FADED: 100

                      U  U  U  u  U  A  U  O  O  O  O  O  O  O
                 GG   Y  Y  Y  Y  Y  O  Y  O  O  O  O  O  O  O   POU
ATM UNITED #  AM:FF   Y  Y  Y  Y  Y  O  Y  O  O  O  O  O  O  O   POU POU POU POU
--- --------  --- -- -- -- -- -- -- -- -- -- -- -- -- -- --  --- --- -- --

rtuyss  A/A go go go go go go go go go go go go go go  IRE AP  QQ Z
ORDER xxxxxxx1

country: 201  NVDS        TEMPROR   EXTREME

BUS TIME:       TRASS: 12       AIDED: 12        FADED: 200

                      U  U  U  u  U  A  U  O  O  O  O  O  O  O
                  ZZ  Y  Y  Y  Y  Y  O  Y  O  O  O  O  O  O  O   POU
ATM UNITED #   AM:FF  Y  Y  Y  Y  Y  O  Y  O  O  O  O  O  O  O   POU POU POU POU

--- --------  --- -- -- -- -- -- -- -- -- -- -- -- -- -- --  --- --- -- --

rtuyss  A/A go go go go go go go go go go go go go go  IRE AP  QQ Z
ORDER xxxxxxx1

country: 201  NVDS        TEMPROR   EXTREME

BUS TIME:       TRASS: 12       AIDED: 12        FADED: 100

                     U  U  U  u  U  A  U  O  O  O  O  O  O  O
                  IP Y  Y  Y  Y  Y  O  Y  O  O  O  O  O  O  O   POU
ATM UNITED #   AM:FF Y  Y  Y  Y  Y  O  Y  O  O  O  O  O  O  O   POU POU POU POU

--- --------  --- -- -- -- -- -- -- -- -- -- -- -- -- -- --  --- --- -- --

Solution 1:

Extracting records from a text file if a set of strings can be found in the record

The script below extracts the records from your file that meet the conditions you described in your question. If the order in which the strings appear matters, see the notes in the section "If the order is important".

Since the script reads the file line by line and only keeps the current record in memory, it should be pretty fast on big files.

The script

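#!/usr/bin/env python3
import sys

# --- set the strings that must all be found in a record below
checks = ["FADED: 100", "AM:FF", "GG"]
# ---

f = sys.argv[1]    # source file
out = sys.argv[2]  # output file (opened in append mode)
rec = []           # lines of the current record
test = []          # indices of the strings found so far, in order of appearance

with open(f) as src, open(out, "a+") as targ:
    for l in src:
        rec.append(l)
        if l.startswith("---"):
            # separator line: the record is complete
            if len(test) == 3:
                targ.writelines(rec)
            rec = []; test = []
        else:
            for i, s in enumerate(checks):
                # note each string only once, so a repeated string
                # cannot stand in for a missing one
                if s in l and i not in test:
                    test.append(i)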

What the script does exactly

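  • The script reads the file line by line, collects the lines of the current record and, at the same time, notes which of the strings ["FADED: 100", "AM:FF", "GG"] occur in those lines.
  • At the separator line (starting with ---), the record is written to the output file if all three strings were found.
  • If not all three strings occur in the record, the record is dropped from the "cache", the next record is loaded, and so on.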
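
The steps above can also be sketched as a small generator function. This is only a minimal sketch, not the script itself: the name filter_records, the sep parameter and the shortened sample lines are made up for illustration, and a set is used to note which strings were found, so a repeated string is only counted once.

```python
def filter_records(lines, checks, sep="---"):
    """Yield each record (a list of lines) that contains every string in checks."""
    needed = set(checks)
    record, found = [], set()
    for line in lines:
        record.append(line)
        if line.startswith(sep):
            # End of the record: keep it only if all strings were seen.
            if found == needed:
                yield record
            record, found = [], set()
        else:
            found.update(s for s in needed if s in line)
    # A last record without a closing separator line is checked as well.
    if record and found == needed:
        yield record


# Shortened, made-up sample lines in the style of the question.
sample = [
    "BUS TIME:  TRASS: 12  AIDED: 12  FADED: 100\n",
    "   GG   Y  Y  Y\n",
    "ATM UNITED #  AM:FF   Y  Y\n",
    "--- --------  ---\n",
    "BUS TIME:  TRASS: 12  AIDED: 12  FADED: 200\n",
    "   ZZ   Y  Y  Y\n",
    "ATM UNITED #  AM:FF   Y  Y\n",
    "--- --------  ---\n",
]
matches = list(filter_records(sample, ["FADED: 100", "AM:FF", "GG"]))
print(len(matches))  # 1
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, so the file is still read line by line.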

If the order is important

If the order in which (the lines containing) the strings appear inside your record is important, you can replace the line:

if len(test) == 3:

by:

if test == [0, 2, 1]:

where the numbers refer to the indices of the strings in the list checks = ["FADED: 100", "AM:FF", "GG"] (where 0 is the first string). In your example record, FADED: 100 appears first, then GG, then AM:FF, which gives [0, 2, 1].
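
To see which list of indices a given record produces, something like the following could be used. It is a rough sketch under assumptions: first_seen_order is a hypothetical helper, not part of the script, and the record lines are shortened versions of the example. Note that strings occurring on the same line are listed in the order of checks, not by their position in the line.

```python
def first_seen_order(record, checks):
    """Return the indices of checks in the order their strings first appear in record."""
    order = []
    for line in record:
        for i, s in enumerate(checks):
            if s in line and i not in order:
                order.append(i)
    return order


checks = ["FADED: 100", "AM:FF", "GG"]
# Shortened, made-up lines in the style of the first example record.
record = [
    "BUS TIME:  TRASS: 12  AIDED: 12  FADED: 100\n",
    "   GG   Y  Y  Y\n",
    "ATM UNITED #  AM:FF   Y  Y\n",
]
print(first_seen_order(record, checks))  # [0, 2, 1]
```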

How to use

  1. Copy the script into an empty file, save it as filter_records.py
  2. Run the script with the source (file with your current records) and the output file as arguments, e.g.:

    python3 /path/to/filter_records.py /path/to/inputfile.txt /path/to/outputfile.txt

The result (on your small example)

    rtuyss  A/A go go go go go go go go go go go go go go  IRE AP  QQ Z
ORDER xxxxxxx1

country: 201  NVDS        TEMPROR   EXTREME

BUS TIME:       TRASS: 12       AIDED: 12        FADED: 100

                      U  U  U  u  U  A  U  O  O  O  O  O  O  O
                 GG   Y  Y  Y  Y  Y  O  Y  O  O  O  O  O  O  O   POU
ATM UNITED #  AM:FF   Y  Y  Y  Y  Y  O  Y  O  O  O  O  O  O  O   POU POU POU POU
--- --------  --- -- -- -- -- -- -- -- -- -- -- -- -- -- --  --- --- -- --