How can I delete text that is NOT in quotes or parentheses?
Input:
19. "foo foo" (bar bar) (19) raboof
"foo foo" raboof
Expected output:
"foo foo" (bar bar) (19)
"foo foo"
As you can see, I would like to keep the double quotes and parentheses.
Everything that is not between double quotes or parentheses should be removed.
Using Python:
#!/usr/bin/env python2
import re, sys
with open(sys.argv[1]) as f:
    for line in f:
        parts = line.split()
        for i in parts:
            if re.search(r'^[("].*[)"]$', i):
                print i,
        print '\n'.lstrip()
Output:
"foo" (bar) (19)
"foo"
Every line is read, and its parts, separated by whitespace, are saved into a list called parts. Then, using the re module's search function, we find the parts that begin with either " or ( and end with either " or ).
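The word-by-word filter described above can be sketched in Python 3 as a small function. This is only an illustration: the name keep_wrapped_words and the sample lines are made up here, and, like the script above, it assumes that a kept item never contains a space.

```python
import re

# Same pattern as in the script above: a word must start with ( or "
# and end with ) or ".
WRAPPED = re.compile(r'^[("].*[)"]$')


def keep_wrapped_words(line):
    """Return the whitespace-separated words of line that match WRAPPED."""
    return " ".join(word for word in line.split() if WRAPPED.search(word))


if __name__ == "__main__":
    print(keep_wrapped_words('19. "foo" (bar) (19) raboof'))
    # prints: "foo" (bar) (19)

    # A quoted phrase containing a space is split into two words,
    # neither of which matches, so only (19) survives:
    print(keep_wrapped_words('19. "foo foo" (bar bar) (19) raboof'))
    # prints: (19)
```

The second call shows why this approach only works when there are no spaces inside the quotes or parentheses.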
How to run:
Save the script as e.g. script.py. Now you can run it in two ways:
Make it executable with chmod u+x /path/to/script.py and run it as /path/to/script.py /path/to/file.txt, i.e. pass the file file.txt as the first argument. If both the script and the file are in the same directory, run ./script.py file.txt from that directory.
Alternatively, run it without making it executable: python2 script.py file.txt
Answer to the edited question:
#!/usr/bin/env python2
import re, sys
with open(sys.argv[1]) as f:
    for line in f:
        print ''.join(re.findall(r'(?:(?<=\s)["(].*[")](?=\s|$)|(?<=^)["(].*[")](?=\s|$))', line))
Output:
"foo foo" (bar bar) (19)
"foo foo"
New version (spaces allowed between () or ""):
Try the Perl command below (credits: @steeldriver):
perl -ne 'printf "%s\n", join(" " , $_ =~ /["(].*?[)"]/g)'
Initial version (no spaces between () or ""):
You can try the following Perl one-liner:
$ perl -ne '@a=split(/\s+/, $_); for (@a) {print "$_ " if /[("].*?[)"]/ };print"\n"' file
If you (or someone else with a similar problem who reads this) don't need to preserve the newlines, the following would work:
grep -Eo '"[^"]*"|\([^)]*\)'
For input
19. "foo foo" (bar bar) (19) raboof
"foo foo" raboof
it yields output
"foo foo"
(bar bar)
(19)
"foo foo"
If you need newlines, you can use some tricks, e.g. this:
sed 's/$/\$/' \
| grep -Eo '"[^"]*"|\([^)]*\)|\$$' \
| tr '\n$' ' \n' \
| sed 's/^ //'
The first sed adds a $ to the end of every line. (You could use any character for this.) The second command is almost the same grep as above, but it now also matches $ at the end of a line, so it matches every end of line. The tr turns newlines into spaces and dollars into newlines. But since the output before that tr had $ followed by a newline, the output after it will have a newline followed by a space. The final sed gets rid of those spaces.
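For comparison, the same idea (keep only the "..." and (...) pieces, but keep one output line per input line) can be sketched in Python 3 without an end-of-line marker. The function name keep_delimited is invented for this sketch; the regular expression is the one from the grep command above.

```python
import re

# Same alternation as the grep pattern: a "..." run or a (...) run.
KEEP = re.compile(r'"[^"]*"|\([^)]*\)')


def keep_delimited(text):
    """For each line of text, keep only the quoted or parenthesised pieces."""
    return "\n".join(" ".join(KEEP.findall(line)) for line in text.splitlines())


if __name__ == "__main__":
    sample = '19. "foo foo" (bar bar) (19) raboof\n"foo foo" raboof'
    print(keep_delimited(sample))
    # prints:
    # "foo foo" (bar bar) (19)
    # "foo foo"
```

Because the matching is done line by line, the line structure is preserved without any extra bookkeeping.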
Another python option:
#!/usr/bin/env python3
import sys
match = lambda ch1, ch2, w: all([w.startswith(ch1), w.endswith(ch2)])
for l in open(sys.argv[1]).read().splitlines():
    matches = [w for w in l.split() if any([match("(", ")", w), match('"', '"', w)])]
    print((" ").join(matches))
Copy the script into an empty file and save it as filter.py. Run it with the command:
python3 /path/to/filter.py <file>
On the edited version of the question:
If we assume that every opening character, ( or ", has a matching closing character (we should assume that, since otherwise either the file would be incorrect or the question would have to specify a more complex set of rules for "nested" parentheses or quotes), the code below should do the job as well:
#!/usr/bin/env python3
import sys
chunks = lambda l: [l[i:i + 2] for i in range(0, len(l), 2)]
for l in open(sys.argv[1]).read().splitlines():
    words = chunks([i for i in range(len(l)) if l[i] in ['(', ')', '"']])
    print((" ").join([l[w[0]:w[1]+1] for w in words]))
It lists the positions of the characters that are in the list ['(', ')', '"'], makes chunks of two out of the found positions, and prints what is in the range of each pair. For example, the input
19. "foo" (bar bar) (blub blub blub blub) (19) raboof
"foo" raboof
will then output:
"foo" (bar bar) (blub blub blub blub) (19)
"foo"
Usage is exactly the same as for the first script.
More or other "triggers" can easily be added by adding both sides (the start and end character of the string or section to keep) to the list ['(', ')', '"'] in the line:
words = chunks([i for i in range(len(l)) if l[i] in ['(', ')', '"']])
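As a rough illustration of adding triggers, the same pairing idea can be wrapped in a function that takes the trigger characters as a parameter. The name keep_between, the triggers parameter, and the square-bracket example are assumptions made for this sketch and are not part of the script above. Unlike that script, this sketch silently ignores a trailing unpaired trigger instead of failing on it.

```python
def keep_between(line, triggers=("(", ")", '"')):
    """Pair up trigger characters in order of appearance and keep each pair's span."""
    positions = [i for i, ch in enumerate(line) if ch in triggers]
    # Walk the positions two at a time; a trailing unpaired position is ignored.
    pairs = zip(positions[0::2], positions[1::2])
    return " ".join(line[start:end + 1] for start, end in pairs)


if __name__ == "__main__":
    print(keep_between('19. "foo" (bar bar) (19) raboof'))
    # prints: "foo" (bar bar) (19)

    # Adding [ and ] as an extra pair of triggers:
    print(keep_between('x [a b] "c" y', triggers=("(", ")", '"', "[", "]")))
    # prints: [a b] "c"
```

As with the original script, this only pairs positions in order; it does not check that an opening character is closed by its own counterpart.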