What's the most robust way to efficiently parse CSV using awk?

Solution 1:

If your CSV cannot contain newlines or escaped double quotes then all you need is (with GNU awk for FPAT):

$ echo 'foo,"field,with,commas",bar' |
    awk -v FPAT='[^,]*|"[^"]+"' '{for (i=1; i<=NF;i++) print i, "<" $i ">"}'
1 <foo>
2 <"field,with,commas">
3 <bar>

If all you actually want to do is convert your CSV to individual lines by, say, replacing newlines with blanks and commas with semi-colons inside quoted fields then all you need is this, again using GNU awk for multi-char RS and RT:

$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n/," ",RT); gsub(/,/,";",RT); print $0 RT}' file
"rec1; fld1",,"rec1"";""fld3.1 ""; fld3.2","rec1 fld4"
"rec2; fld1.1  fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4

Otherwise, though, the general, robust, portable solution to identify the fields that will work with any modern awk is:

$ cat decsv.awk
function buildRec(      i,orig,fpat,done) {
    $0 = PrevSeg $0
    if ( gsub(/"/,"&") % 2 ) {
        PrevSeg = $0 RS
        done = 0
    }
    else {
        PrevSeg = ""
        gsub(/@/,"@A"); gsub(/""/,"@B")            # <"x@foo""bar"> -> <"x@Afoo@Bbar">
        orig = $0; $0 = ""                         # Save $0 and empty it
        fpat = "([^" FS "]*)|(\"[^\"]+\")"         # Mimic GNU awk FPAT meaning
        while ( (orig!="") && match(orig,fpat) ) { # Find the next string matching fpat
            $(++i) = substr(orig,RSTART,RLENGTH)   # Create a field in new $0
            gsub(/@B/,"\"",$i); gsub(/@A/,"@",$i)  # <"x@Afoo@Bbar"> -> <"x@foo"bar">
            gsub(/^"|"$/,"",$i)                    # <"x@foo"bar">   -> <x@foo"bar>
            orig = substr(orig,RSTART+RLENGTH+1)   # Move past fpat+sep in orig $0
        }
        done = 1
    }
    return done
}

BEGIN { FS=OFS="," }
!buildRec() { next }
{
    printf "Record %d:\n", ++recNr
    for (i=1;i<=NF;i++) {
        # To replace newlines with blanks add gsub(/\n/," ",$i) here
        printf "    $%d=<%s>\n", i, $i
    }
    print "----"
}

.

$ awk -f decsv.awk file.csv
Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----

The above assumes UNIX line endings of \n. With Windows \r\n line endings it's much simpler as the "newlines" within each field will actually just be line feeds (i.e. \ns) and so you can set RS="\r\n" (using GNU awk for multi-char RS) and then the \ns within fields will not be treated as line endings.

It works by simply counting how many "s are present so far in the current record whenever it encounters the RS - if it's an odd number then the RS (presumably \n but doesn't have to be) is mid-field and so we keep building the current record but if it's even then it's the end of the current record and so we can continue with the rest of the script processing the now complete record.

The gsub(/@/,"@A"); gsub(/""/,"@B") converts every pair of double quotes axcross the whole record (bear in mind these "" pairs can only apply within quoted fields) to a string @B that does not contain a double quote so that when we split the record into fields the match() doesn't get tripped up by quotes appearing inside fields. The gsub(/@B/,"\"",$i); gsub(/@A/,"@",$i) restores the quotes inside each field individually and also converts the ""s to the "s they really represent.

Also see How do I use awk under cygwin to print fields from an excel spreadsheet? for how to generate CSVs from Excel spreadsheets.

Solution 2:

An improvement upon @EdMorton's FPAT solution, which should be able to handle double-quotes(") escaped by doubling ("" -- as allowed by the CSV standard).

gawk -v FPAT='[^,]*|("[^"]*")+' ...

This STILL

  1. isn't able to handle newlines inside quoted fields, which are perfectly legit in standard CSV files.

  2. assumes GNU awk (gawk), a standard awk won't do.

Example:

$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v OFS='|' -v FPAT='[^,]*|("[^"]*")+' '{$1=$1}1'
a||""|"y""ck"|"""x,y,z"|" "|12

$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v FPAT='[^,]*|("[^"]*")+' '{
  for(i=1; i<=NF;i++){
    if($i~/"/){ $i = substr($i, 2, length($i)-2); gsub(/""/,"\"", $i) }
    print "<"$i">"
  }
}'
<a>
<>
<>
<y"ck>
<"x,y,z>
< >
<12>