Series of sed commands work on command line, but not in a script
Using cat -v to turn CR characters into literal ^M sequences seems fundamentally ugly to me. If you need to remove DOS line endings, use dos2unix, tr, or sed 's/\r$//'.
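To make the effect of sed 's/\r$//' concrete, here is a minimal Python sketch of the same idea: drop a carriage return only when it sits at the very end of a line. The function name strip_cr and the shortened sample text are made up for this illustration.

```python
def strip_cr(text):
    """Remove a carriage return that sits directly at the end of each line."""
    lines = text.split("\n")
    # Only a CR at the end of a line is dropped; a CR elsewhere is left alone.
    return "\n".join(line[:-1] if line.endswith("\r") else line for line in lines)


# Shortened, hypothetical sample with DOS (CRLF) line endings.
sample = '""id"": 281952,\r\n""title"": ""Flash""\r\n'
print(repr(strip_cr(sample)))
```

Printing the repr makes it easy to confirm that no \r is left before each \n.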
If you insist on using sed, then I suggest you print the bits you do want, rather than trying to delete all the random bits you don't. For example:
$ sed -rn -e 's/\"//g' -e 's/(.*): (.*)\r/\2/p' QueryR | paste -d '' - -
281952,Flash 11.2 No Longer Supported by Google Play
281993,Netbeans won't open in Ubuntu
You could get fancy and roll the quote removal into the key-value extraction by matching zero or more quotes at each end of the value sequence
$ sed -rn 's/(.*): \"*([^"]*)\"*\r/\2/p' QueryR | paste -d '' - -
281952,Flash 11.2 No Longer Supported by Google Play
281993,Netbeans won't open in Ubuntu
You could get really fancy and emulate the paste in sed by first joining pairs of lines on the ,\r$ ending and then matching the key-value pairs multiple times (the g flag), using negated character classes so that the matches are not greedy:
$ sed -rn '/,\r$/ {N; s/([^:]*): \"*([^:"]*)\"*\r\n?/\2/gp}' QueryR
281952,Flash 11.2 No Longer Supported by Google Play
281993,Netbeans won't open in Ubuntu
(Personally I'd favor the KISS approach and use the first one).
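For readers who find the sed expressions hard to follow, the first (KISS) variant can be sketched step by step in Python. This is only an illustration of the logic, not a replacement for the sed command; the names SAMPLE, extract_values and paste_pairs are made up here, and the sample text is modeled on the question's QueryR file with CRLF line endings.

```python
import re

# Sample modeled on the question's input, with DOS (CRLF) line endings.
SAMPLE = (
    '"{\r\n'
    '  ""id"": 281952,\r\n'
    '  ""title"": ""Flash 11.2 No Longer Supported by Google Play""\r\n'
    '}"\r\n'
    '"{\r\n'
    '  ""id"": 281993,\r\n'
    '  ""title"": ""Netbeans won\'t open in Ubuntu""\r\n'
    '}"\r\n'
)


def extract_values(text):
    """Return the value part of every 'key: value' line, with quotes removed."""
    values = []
    for line in text.split("\n"):
        line = line.replace('"', "")              # like s/\"//g
        match = re.match(r"(.*): (.*)\r", line)   # like s/(.*): (.*)\r/\2/p
        if match:
            values.append(match.group(2))
    return values


def paste_pairs(values):
    """Join consecutive pairs of values, like paste -d '' - -."""
    return [first + second for first, second in zip(values[0::2], values[1::2])]


for row in paste_pairs(extract_values(SAMPLE)):
    print(row)
```

Lines without a "key: value" shape (the braces) simply fail to match and are skipped, which is the same "print only what you want" idea as the -n flag combined with p.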
FWIW, since your input appears to be over-quoted JSON, I'd suggest installing a proper JSON parser such as jq
sudo apt-get install jq
You can then do something like
$ sed -e 's/["]["]/"/g' -e 's/"{/{/' -e 's/}"/}/' QueryR | jq '.id, .title' | paste -d, - -
281952,"Flash 11.2 No Longer Supported by Google Play"
281993,"Netbeans won't open in Ubuntu"
which removes the superfluous quotes and then uses jq to extract the fields of interest. Note that jq seems to handle the DOS-style line endings, so there's no need to take special steps to remove those.
Change to jq '.[]' to dump all the attribute-value pairs.
Credit for inspiration and basic jq syntax taken from Overcoming newlines with grep -o
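If jq is not available, the same "use a real parser" idea can be sketched in Python. The sketch rests on an assumption that is not confirmed by the question: that the over-quoting follows CSV rules (each JSON object is one quoted CSV field with its inner quotes doubled). Under that assumption the csv module undoes the quoting and the json module does the rest. SAMPLE and parse_records are names invented for this example.

```python
import csv
import io
import json

# Sample modeled on the question's input, with DOS (CRLF) line endings.
SAMPLE = (
    '"{\r\n'
    '  ""id"": 281952,\r\n'
    '  ""title"": ""Flash 11.2 No Longer Supported by Google Play""\r\n'
    '}"\r\n'
    '"{\r\n'
    '  ""id"": 281993,\r\n'
    '  ""title"": ""Netbeans won\'t open in Ubuntu""\r\n'
    '}"\r\n'
)


def parse_records(text):
    """Yield one dict per CSV-quoted JSON object found in text."""
    # ASSUMPTION: each object is a single quoted CSV field with doubled quotes.
    # newline="" leaves the embedded CR/LF characters for the csv module to handle.
    reader = csv.reader(io.StringIO(text, newline=""))
    for row in reader:
        if row:  # skip blank lines
            yield json.loads(row[0])


for record in parse_records(SAMPLE):
    print("{},{}".format(record["id"], record["title"]))
```

As with jq, the line endings need no special treatment here, because JSON treats carriage returns between tokens as ordinary whitespace.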
I fixed it thanks to steeldriver & further tinkering. Unrefined but works.
sed '{
s/"{//
s/}"//
s/^"//
/,\r/{N;/\n.*title.*:\s/{s/,\r\n.*title.*:\s/,/}}
s/""//g
s/^\s\+//
/^\s*$/d
s/^id:\ //
s/\\//g
}' QueryR* | tee "$1"
Translation:
s/"{//
Remove "{
s/}"//
Remove }"
s/^"//
Remove " from start of line
/,\r/{N;/\n.*title.*:\s/{s/,\r\n.*title.*:\s/,\ /}}
Match ,\r on one line and [whatever]title[whatever]: on the next line, replace all that with ,
s/""//g
Remove all the remaining double double quotes
s/^\s\+//
Remove whitespace from start of lines
/^\s*$/d
Remove empty lines
s/^id:\ //
Remove id: and the space after it
s/\\//g
Remove backslashes (escape chars for " added to some title fields)
tee "$1"
Specify an outfile when running the script, for example ./queryclean newquery.csv
While the question asks for sed, one could work around sed's issues with Python:
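from __future__ import print_function
import sys

# Usage: python parse_questions.py questions.txt
with open(sys.argv[1]) as f:
    for line in f:
        if '""id""' in line:
            # Print the id (with its trailing comma) without a newline.
            print(line.strip().split(':')[1], end="")
        if '""title""' in line:
            # Rejoin everything after the first ':' (colons in the title become spaces).
            title = " ".join(line.strip().split(':')[1:])
            # Replace the doubled quotes around the title with spaces.
            print(title.replace('""', " "))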
This code is compatible with both Python 2 and Python 3, so either will work.
Sample run:
bash-4.3$ cat questions.txt
"{
""id"": 281952,
""title"": ""Flash 11.2 No Longer Supported by Google Play""
}"
"{
""id"": 281993,
""title"": ""Netbeans won't open in Ubuntu""
}"
bash-4.3$ python3 parse_questions.py questions.txt
281952, Flash 11.2 No Longer Supported by Google Play
281993, Netbeans won't open in Ubuntu
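One caveat with the script above: it rejoins the pieces after the first colon with spaces, so a title that itself contains a colon would come out altered. A possible variation, sketched here under the assumption that each key sits on its own line as in the sample, splits only at the first colon with str.partition. The names parse_line and rows, and the title containing colons, are hypothetical and used only for illustration.

```python
def parse_line(line):
    """Return (key, value) for a '""key"": value' line, or None for other lines."""
    key, sep, value = line.strip().partition(":")   # split at the first colon only
    if not sep:
        return None
    key = key.strip().strip('"')
    # Drop the trailing comma (if any) and the surrounding doubled quotes.
    value = value.strip().rstrip(",").strip('"')
    return key, value


def rows(lines):
    """Yield 'id,title' strings, pairing each title with the last id seen."""
    current_id = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        key, value = parsed
        if key == "id":
            current_id = value
        elif key == "title":
            yield "{},{}".format(current_id, value)


# Hypothetical input; the title with colons is invented for this example.
LINES = [
    '"{\r\n',
    '  ""id"": 281952,\r\n',
    '  ""title"": ""Re: Flash 11.2: no longer supported""\r\n',
    '}"\r\n',
]
for row in rows(LINES):
    print(row)
```

Because strip('"') removes every leading and trailing quote character, a title that legitimately ends in a quote would lose it; for data like that, a real parser is the safer route.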
Three more approaches:
- awk
$ awk -F'": ' '/\"id\"/{id=$NF;} /\"title\"/{ t=$NF; sub(/^""/,"",t); sub(/""$/,"",t); print id,t }' OFS="" file
281952,Flash 11.2 No Longer Supported by Google Play
281993,Netbeans won't open in Ubuntu
- Perl
$ perl -lne '$id=$1 if /id"":\s*(\d+)/; if(/title"":\s*""(.*)""/){print "$id,$1"}' file
281952,Flash 11.2 No Longer Supported by Google Play
281993,Netbeans won't open in Ubuntu
- GNU grep with Perl-compatible regexes and simple Perl:
$ grep -oP '(id"":\s*\K.*)|(title"":\s*""\K.*(?=""))' file | perl -pe 'chomp if $.%2'
281952,Flash 11.2 No Longer Supported by Google Play
281993,Netbeans won't open in Ubuntu
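The Perl one-liner remembers the most recent id and prints it when a title line turns up. For comparison, the same logic might look like this in Python, reusing the two regular expressions from the Perl command; ID_RE, TITLE_RE and id_title_rows are names chosen for this sketch.

```python
import re

# The same patterns as in the Perl one-liner.
ID_RE = re.compile(r'id"":\s*(\d+)')
TITLE_RE = re.compile(r'title"":\s*""(.*)""')


def id_title_rows(lines):
    """Return 'id,title' strings, pairing each title with the last id seen."""
    current_id = None
    result = []
    for line in lines:
        id_match = ID_RE.search(line)
        if id_match:
            current_id = id_match.group(1)
        title_match = TITLE_RE.search(line)
        if title_match:
            result.append("{},{}".format(current_id, title_match.group(1)))
    return result


LINES = [
    '"{',
    '""id"": 281952,',
    '""title"": ""Flash 11.2 No Longer Supported by Google Play""',
    '}"',
]
for row in id_title_rows(LINES):
    print(row)
```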
This doesn't exactly answer your question or solve your issue, but to get rid of the unwanted characters you can use tr:
cat QueryR | tr -d '}{:"'
and you'll get: