How can I delete text that is NOT in quotes or parentheses?

Input:

19. "foo foo" (bar bar) (19) raboof
"foo foo" raboof

Expected output:

"foo foo" (bar bar) (19)
"foo foo"

As you can see, I would like to keep the double quotes and parentheses.

Everything that is not between double quotes or parentheses should be removed.


Using Python:

#!/usr/bin/env python2
import re, sys
with open(sys.argv[1]) as f:
    for line in f:
        parts = line.split()
        for i in parts:
            if re.search(r'^[("].*[)"]$', i):
                print i,
        print '\n'.lstrip()

Output:

"foo" (bar) (19) 
"foo"

  • Every line is read and split on whitespace; the resulting parts are saved in a list called parts.

  • Then, using the re module's search function, we keep the parts that begin with either " or ( and end with either " or ).

How to run:

Save the script as e.g. script.py. Now you can run it in two ways:

  • Make it executable with chmod u+x /path/to/script.py and run it as /path/to/script.py /path/to/file.txt, i.e. pass the file file.txt as the first argument. If both the script and the file are in the same directory, run ./script.py file.txt from that directory.

  • Or, without making it executable, run it as python2 script.py file.txt.

Answer to the edited question:

#!/usr/bin/env python2
import re, sys
with open(sys.argv[1]) as f:
    for line in f:
        print ''.join(re.findall(r'(?:(?<=\s)["(].*[")](?=\s|$)|(?<=^)["(].*[")](?=\s|$))', line))

Output:

"foo foo" (bar bar) (19)
"foo foo"

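Note that the greedy .* in the pattern above matches from the first opening character to the last closing one on the line, so unquoted text sitting between two groups would be kept. A minimal Python 3 sketch that matches each group separately, assuming groups are not nested and every opening character has its closing partner on the same line (keep_groups is a name made up for this illustration):

```python
import re

# One alternative per kind of group: "..." or (...).
# Assumes no nesting and no escaped quotes inside a group.
GROUP = re.compile(r'"[^"]*"|\([^)]*\)')


def keep_groups(line):
    """Return only the quoted and parenthesized parts of line, joined by spaces."""
    return " ".join(GROUP.findall(line))


if __name__ == "__main__":
    for sample in ['19. "foo foo" (bar bar) (19) raboof', '"foo foo" raboof']:
        print(keep_groups(sample))
```

For the two sample lines this prints the same result as the output shown above.
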
New version (spaces allowed between () or ""):

Try the Perl command below (credit: @steeldriver):

perl -ne 'printf "%s\n", join(" " , $_ =~ /["(].*?[)"]/g)'

Initial version (no spaces between () or ""):

You can try the following Perl one-liner:

$ perl -ne '@a=split(/\s+/, $_); for (@a) {print "$_ " if /[("].*?[)"]/ };print"\n"'  file

If you (or someone else with a similar problem who reads this) don't need to preserve the newlines, the following would work:

grep -Eo '"[^"]*"|\([^)]*\)'

For input

19. "foo foo" (bar bar) (19) raboof
"foo foo" raboof

it yields output

"foo foo"
(bar bar)
(19)
"foo foo"

If you need newlines, you can use some tricks, e.g. this:

sed 's/$/\$/' \
| grep -Eo '"[^"]*"|\([^)]*\)|\$$' \
| tr '\n$' ' \n' \
| sed 's/^ //'

The first sed appends a $ to every line. (Any character that does not otherwise occur in the input would do.) The grep is almost the same as above, but it now also matches a $ at the end of a line, so it produces one match for every end of line. The tr turns newlines into spaces and dollar signs into newlines. Because the input to tr has each $ followed by a newline, its output has each newline followed by a space. The final sed removes those leading spaces.
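To see what each stage of that pipeline produces, here is a rough Python 3 simulation of the four steps. It is only a sketch for tracing the intermediate text, not a replacement for the shell commands; trace is a made-up name, and it assumes the marker character $ does not occur in the input itself:

```python
import re

# Same three alternatives as the grep pattern: "...", (...), or the marker at end of line.
PATTERN = re.compile(r'"[^"]*"|\([^)]*\)|\$$')


def trace(text):
    """Return the text as it would look after each of the four pipeline stages."""
    # Stage 1, sed 's/$/\$/': append the marker to every line.
    marked = "".join(line + "$\n" for line in text.splitlines())
    # Stage 2, grep -Eo: every match goes on its own output line.
    matched = "".join(m + "\n"
                      for line in marked.splitlines()
                      for m in PATTERN.findall(line))
    # Stage 3, tr '\n$' ' \n': newline becomes space, marker becomes newline.
    swapped = matched.translate(str.maketrans("\n$", " \n"))
    # Stage 4, sed 's/^ //': remove the space at the start of each line.
    cleaned = re.sub(r"^ ", "", swapped, flags=re.MULTILINE)
    return marked, matched, swapped, cleaned


if __name__ == "__main__":
    sample = '19. "foo foo" (bar bar) (19) raboof\n"foo foo" raboof\n'
    for stage in trace(sample):
        print(repr(stage))
```

Printing with repr makes the spaces and newlines visible. In this sketch each finished line keeps one trailing space, left over from the newline that preceded the marker.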


Another Python option:

#!/usr/bin/env python3
import sys
match = lambda ch1, ch2, w: all([w.startswith(ch1), w.endswith(ch2)])

for l in open(sys.argv[1]).read().splitlines():
    matches = [w for w in l.split() if any([match("(", ")", w), match('"', '"', w)])]
    print((" ").join(matches))

  • Copy the script into an empty file and save it as filter.py

  • Run it with the command:

    python3 /path/to/filter.py <file>
    

On the edited version of the question:

If we assume that every opening character, ( or ", has a matching closing character (a fair assumption, since otherwise either the file would be malformed or the question would have to state more complex rules, e.g. for "nested" parentheses or quotes), the code below should do the job as well:

#!/usr/bin/env python3
import sys
chunks = lambda l: [l[i:i + 2] for i in range(0, len(l), 2)]

for l in open(sys.argv[1]).read().splitlines():
    words = chunks([i for i in range(len(l)) if l[i] in ['(', ')', '"']])
    print((" ").join([l[w[0]:w[1]+1] for w in words]))

It finds the positions of all characters that occur in the list ['(', ')', '"'], groups those positions into chunks of two, and prints the text in the range of each pair. For example, the input:

19. "foo" (bar bar) (blub blub blub blub) (19) raboof
"foo" raboof

will then output:

"foo" (bar bar) (blub blub blub blub) (19)
"foo"

Usage is exactly the same as for the first script.

More or other "triggers" can easily be added by adding both sides (the start and end characters of the section to keep) to the list:

['(', ')', '"']

in the line:

words = chunks([i for i in range(len(l)) if l[i] in ['(', ')', '"']])
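
To make the pairing step and the trigger list concrete, here is a minimal sketch of the same idea wrapped in a function. The names trigger_positions, pairs and keep_sections are made up for this illustration, and the square brackets are only a hypothetical extra trigger. Unlike the script above, which raises an IndexError on an unpaired character, this version silently drops an unpaired trailing character; that is an assumption about the desired behaviour, and nesting is still not handled:

```python
def trigger_positions(line, triggers):
    """Indices of every trigger character in line, left to right."""
    return [i for i, ch in enumerate(line) if ch in triggers]


def pairs(positions):
    """Group positions two by two; an unpaired last position is dropped."""
    return [(positions[i], positions[i + 1])
            for i in range(0, len(positions) - 1, 2)]


def keep_sections(line, triggers=('(', ')', '"')):
    """Keep only the text between consecutive pairs of trigger characters."""
    found = pairs(trigger_positions(line, triggers))
    return " ".join(line[start:end + 1] for start, end in found)


if __name__ == "__main__":
    line = '19. "foo" (bar bar) [baz baz] raboof'
    # Positions of the default triggers in the sample line.
    print(trigger_positions(line, ('(', ')', '"')))
    # Default triggers: the bracketed section is dropped.
    print(keep_sections(line))
    # Hypothetical extra triggers: square brackets, both sides added.
    print(keep_sections(line, ('(', ')', '"', '[', ']')))
```

With the default triggers the bracketed section is removed along with the rest of the unquoted text; passing the longer tuple keeps it.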