Series of sed commands work on command line, but not in a script

Using cat -v to turn CR characters into literal ^M sequences seems fundamentally ugly to me. If you need to remove DOS line endings, use dos2unix, tr, or sed 's/\r$//'.
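For comparison, here is a rough Python counterpart to the sed 's/\r$//' suggestion. It is only a sketch: the function name strip_cr is made up for illustration, and it assumes the whole file fits in memory as one string.

```python
def strip_cr(text):
    """Drop a carriage return that sits at the very end of each line."""
    lines = text.split("\n")
    # Only a CR directly before the line break is removed; a CR in the
    # middle of a line is left alone, just as with sed 's/\r$//'.
    return "\n".join(line[:-1] if line.endswith("\r") else line for line in lines)


sample = '  ""id"": 281952,\r\n  ""title"": ""Flash""\r\n'
print(repr(strip_cr(sample)))
```

Printing the repr makes any leftover \r visible, which is roughly what cat -v is being used for.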

If you insist on using sed, then I suggest you print the bits you do want, rather than trying to delete all the random bits you don't. For example:

$ sed -rn -e 's/\"//g' -e 's/(.*): (.*)\r/\2/p' QueryR | paste -d '' - -
281952,Flash 11.2 No Longer Supported by Google Play
281993,Netbeans won't open in Ubuntu

You could get fancy and roll the quote removal into the key-value extraction by matching zero or more quotes at each end of the value sequence

$ sed -rn 's/(.*): \"*([^"]*)\"*\r/\2/p' QueryR | paste -d '' - -
281952,Flash 11.2 No Longer Supported by Google Play
281993,Netbeans won't open in Ubuntu

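You could get really fancy and emulate the paste in sed by first joining pairs of lines on the ,\r$ ending and then matching the key-value pairs repeatedly (g), using negated character classes such as [^:]* so that each match stops at the end of its own pair instead of greedily swallowing the next one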

$ sed -rn '/,\r$/ {N; s/([^:]*): \"*([^:"]*)\"*\r\n?/\2/gp}' QueryR
281952,Flash 11.2 No Longer Supported by Google Play
281993,Netbeans won't open in Ubuntu

(Personally I'd favor the KISS approach and use the first one).
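The same "print only the bits you want" idea can be sketched in Python. This is an illustration rather than a drop-in replacement: the names VALUE and extract_pairs are invented here, and the sketch assumes the input alternates id and title lines exactly as in the sample.

```python
import re

# Optional quotes around the value and an optional CR before end of line,
# mirroring the second sed expression above.
VALUE = re.compile(r': "*([^"\r\n]*)"*\r?$')


def extract_pairs(text):
    """Return 'id,title' strings built from consecutive key-value lines."""
    values = []
    for line in text.split("\n"):
        match = VALUE.search(line)
        if match:
            values.append(match.group(1))
    # Glue values together two at a time, as `paste -d '' - -` does.
    return [first + second for first, second in zip(values[0::2], values[1::2])]


sample = (
    '"{\r\n'
    '  ""id"": 281952,\r\n'
    '  ""title"": ""Flash 11.2 No Longer Supported by Google Play""\r\n'
    '}"\r\n'
)
print("\n".join(extract_pairs(sample)))
```

As with the sed version, the comma comes from the end of the id line, so nothing has to be inserted between the two values.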


FWIW, since your input appears to be over-quoted JSON, I'd suggest installing a proper JSON parser such as jq

sudo apt-get install jq

You can then do something like

$ sed -e 's/["]["]/"/g' -e 's/"{/{/' -e 's/}"/}/' QueryR | jq '.id, .title' | paste -d, - -
281952,"Flash 11.2 No Longer Supported by Google Play"
281993,"Netbeans won't open in Ubuntu"

which removes the superfluous quotes and then uses jq to extract the fields of interest. Note that jq seems to handle the DOS-style line endings, so there's no need to take special steps to remove those.

Change to jq '.[]' to dump all the attribute-value pairs.

Credit for inspiration and basic jq syntax taken from Overcoming newlines with grep -o
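If jq is not available, the "remove the extra quotes, then parse as JSON" idea can also be sketched with Python's standard library. This rests on an assumption that the original post does not confirm: that the over-quoting is ordinary CSV quoting (one quoted field per record, inner quotes doubled). The function name rows_from_csv_quoted_json is made up for illustration.

```python
import csv
import io
import json


def rows_from_csv_quoted_json(text):
    """Yield 'id,title' strings from CSV-quoted JSON records (assumed layout)."""
    # newline="" keeps embedded line breaks intact for the csv module, which
    # reads a quoted field across several lines and un-doubles the quotes.
    reader = csv.reader(io.StringIO(text, newline=""))
    for record in reader:
        if not record:  # skip blank lines between records
            continue
        obj = json.loads(record[0])  # CR and LF count as JSON whitespace
        yield "{},{}".format(obj["id"], obj["title"])


sample = (
    '"{\r\n  ""id"": 281952,\r\n'
    '  ""title"": ""Flash 11.2 No Longer Supported by Google Play""\r\n}"\r\n'
    '"{\r\n  ""id"": 281993,\r\n'
    '  ""title"": ""Netbeans won\'t open in Ubuntu""\r\n}"\r\n'
)
for row in rows_from_csv_quoted_json(sample):
    print(row)
```

Because a real JSON parser does the work, titles that contain colons, commas or escaped quotes should come through unchanged, which is the main reason to prefer this over pattern matching.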


I fixed it thanks to steeldriver & further tinkering. Unrefined but works.

sed  '{
       s/"{//
       s/}"//
       s/^"//
       /,\r/{N;/\n.*title.*:\s/{s/,\r\n.*title.*:\s/,/}}
       s/""//g
       s/^\s\+//
       /^\s*$/d
       s/^id:\ //
       s/\\//g
}' QueryR* | tee "$1"

translation:
s/"{// Remove "{
s/}"// Remove }"
s/^"// Remove " from start of line
/,\r/{N;/\n.*title.*:\s/{s/,\r\n.*title.*:\s/,/}} Match ,\r on one line and [whatever]title[whatever]: on the next line, then replace all of that with a single ,
s/""//g Remove all the remaining double double quotes
s/^\s\+// Remove whitespace from start of lines
/^\s*$/d Remove empty lines
s/^id:\ // Remove id: and space after it
s/\\//g Remove backslashes (escape chars for " added to some title fields)
tee "$1" specify an outfile when running the script, for example ./queryclean newquery.csv


While the question asks for sed, one could work around sed's issues with Python:

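from __future__ import print_function
import sys

with open(sys.argv[1]) as f:
    for line in f:
        if '""id""' in line:
            # Print the id (with its trailing comma) and stay on the same line.
            print(line.strip().split(':')[1], end="")
        if '""title""' in line:
            # Rejoin on ':' so that colons inside a title are preserved.
            title = ":".join(line.strip().split(':')[1:])
            print(title.replace('""', " "))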

This code is compatible with both Python 2 and Python 3, so either will work.

Sample run:

bash-4.3$ cat questions.txt 
"{
  ""id"": 281952,
  ""title"": ""Flash 11.2 No Longer Supported by Google Play""
}"
"{
  ""id"": 281993,
  ""title"": ""Netbeans won't open in Ubuntu""
}"
bash-4.3$ python3 parse_questions.py questions.txt 
 281952,  Flash 11.2 No Longer Supported by Google Play 
 281993,  Netbeans won't open in Ubuntu 

Three more approaches:

  1. awk

    $ awk -F'": ' '/\"id\"/{id=$NF;} 
                  /\"title\"/{
                    t=$NF; 
                    sub(/^""/,"",t); 
                    sub(/""$/,"",t); 
                    print id,t
                  }' OFS="" file 
    281952,Flash 11.2 No Longer Supported by Google Play
    281993,Netbeans won't open in Ubuntu
    
  2. Perl

    $ perl -lne '$id=$1 if /id"":\s*(\d+)/; 
                 if(/title"":\s*""(.*)""/){print "$id,$1"}' file 
    281952,Flash 11.2 No Longer Supported by Google Play
    281993,Netbeans won't open in Ubuntu
    
  3. GNU grep with perl compatible regexes and simple perl:

    $ grep -oP '(id"":\s*\K.*)|(title"":\s*""\K.*(?=""))' file | 
        perl -pe 'chomp if $.%2'
    281952,Flash 11.2 No Longer Supported by Google Play
    281993,Netbeans won't open in Ubuntu
    

This is not exactly answering your question or solving your issue, but to get rid of the unwanted characters you can use tr:

cat QueryR | tr -d '}{:"' 

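This deletes every }, {, : and " character from the input and leaves everything else on each line (including the commas and any CR characters) untouched.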
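For reference, the same character deletion can be sketched in Python with str.translate. The names DELETE and strip_chars are invented for this example.

```python
# Translation table that maps each listed character to "delete",
# the equivalent of tr -d '}{:"'.
DELETE = str.maketrans("", "", '}{:"')


def strip_chars(text):
    """Remove every }, {, : and double-quote character from text."""
    return text.translate(DELETE)


print(strip_chars('  ""id"": 281952,'))
```

Like the tr command, this is blunt: a colon or quote inside a title is deleted too.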