Transforming CSV file using sed

Solution 1:

( tr , ';' | tr -d '"' ) < input.csv > output.csv

I'd use Perl

perl -pe 'tr/,"/;/d' input.csv > output.csv

-- but this specific task isn't beyond sed. You cannot merge the two expressions.

Solution 2:

Which you prefer (perl, sed, awk) is up to you; they'll all get the job done. Since you asked for sed, and the others are posted, here ya go. This is a simpler form of your regex's and works with your example line:

$ sed -e 's/"//g; s/,/;/g' infile.csv > outfile.csv

Note you can join the two expressions with a semicolon after each substitution. Tested with GNU sed v4.1.5.

Here's your original expressions joined:

$ sed -e 's/","/;/g; s/"//g' infile.csv > outfile.csv

I'm reasonably sure it's possible to merge the two substitutions. Not sure what it would be offhand, and I'm pretty sure the result will be much less readable than the script at the top. If I come up with something (or someone else weighs in in the comments) I'll add it here.

Solution 3:

Since you're dealing with records, awk makes more sense. That said, it's not really good at CSV, since the field delimiters are somewhat variable. But if you're certain that all fields are surrounded by doublequotes, this will work:

awk -F'","' 'BEGIN {OFS=";"} { gsub(/(^")|("$)/, ""); $1=$1; print }'

This sets awk's input field separator to "","" (including the inner set of doublequotes). This almost works, except you have to deal with the leading and trailing doublequotes, which are stripped with the gsub function. The $1=$1 forces it to recompile the record with the new output field separator, which was defined as ; in the BEGIN block. Then print prints out the whole record.

This is a little tidier:

awk -F '(^")|(",")|("$)' 'BEGIN {OFS=";"} { $1=$1; print }'

It sets the input field separator to a regular expression that includes the doublequotes at the beginning and end of the record, but it also causes it to print out an empty beginning and trailing field. You can easily get rid of the trailing field:

awk -F '(^")|(",")|("$)' 'BEGIN {OFS=";"} { NF=NF-1; $1=$1; print }'

NF is the number of fields, and reducing it by one lops off the last field. But I can't think of a way to chop off the first field.

If you know that the input always has five fields, though, you could do this:

awk -F '(^")|(",")|("$)' 'BEGIN {OFS=";"} { print $2,$3,$4,$5,$6 }'

Notice this gets rid of the $1=$1 construct, which we only need if we're printing the (implied) $0.

All that said, I'd probably end up using perl and one of the many available CSV modules on CPAN.