How to make awk ignore the field delimiter inside double quotes? [duplicate]

From the GNU awk manual (http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content):

$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4}' file
"[email protected],www.example.com",field4
"[email protected]",field4

See What's the most robust way to efficiently parse CSV using awk? for parsing CSVs more generally, including fields that contain newlines, etc.
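As a quick sketch, here is the command above run against the sample input from the question (note that FPAT is a GNU awk feature; other awks such as mawk will silently treat it as an ordinary variable):

```shell
# Recreate the sample input from the question.
cat > file <<'EOF'
"[email protected],www.example.com",field2,field3,field4
"[email protected]",field2,field3,field4
EOF

# FPAT (gawk-specific) defines what a *field* looks like instead of what a
# separator looks like: here, either a run of non-commas or a quoted string.
awk -v FPAT='([^,]*)|("[^"]+")' -v OFS=, '{print $1,$4}' file
```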


This is not a bash/awk solution, but I recommend CSVKit, which can be installed with pip install csvkit. It provides a collection of command-line tools for working specifically with CSV, including csvcut, which does exactly what you ask for:

csvcut --columns=1,4 <<EOF
"[email protected],www.example.com",field2,field3,field4
"[email protected]",field2,field3,field4
EOF

Output:

"[email protected],www.example.com",field4
[email protected],field4

It strips the unnecessary quotes, which I suppose shouldn't be a problem.

Read the CSVKit docs on Read the Docs. ThoughtBot has a nice little blog post introducing this tool, which is where I learnt about CSVKit.


In your sample input file, it is the first field, and only the first field, that is quoted. If this is true in general, then consider the following as a method for deleting the second and third columns:

$ awk -F, '{for (i=1;i<=NF;i++){printf "%s%s",(i>1)?",":"",$i; if ($i ~ /"$/)i=i+2};print""}' file
"[email protected],www.example.com",field4
"[email protected]",field4

As mentioned in the comments, awk does not natively understand quoted separators. This solution works around that by looking for the first field that ends with a quote. It then skips the two fields that follow.

The Details

  • for (i=1;i<=NF;i++)

    This starts a loop over each field, using i as the field counter.

  • printf "%s%s",(i>1)?",":"",$i

    This prints field i. If it is not the first field, the field is preceded by a comma.

  • if ($i ~ /"$/)i=i+2

    If the current field ends with a double-quote, this then increments the field counter by 2. This is how we skip over fields 2 and 3.

  • print""

    After we are done with the for loop, this prints a newline.
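Putting the steps above together, the one-liner can be expanded into a multi-line script (same logic, just formatted for readability; plain POSIX awk is enough here):

```shell
# Sample input from the question: only the first field is quoted.
cat > file <<'EOF'
"[email protected],www.example.com",field2,field3,field4
"[email protected]",field2,field3,field4
EOF

awk -F, '{
    for (i = 1; i <= NF; i++) {
        printf "%s%s", (i > 1) ? "," : "", $i   # comma before every field but the first
        if ($i ~ /"$/) i = i + 2                # field ends the quoted group: skip the next 2
    }
    print ""                                    # terminate the record with a newline
}' file
```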


This awk command should work regardless of where the quoted field is, and it handles escaped quotes as well. Note that it relies on the three-argument form of match(), which is a GNU awk extension.

awk '{while(match($0,/"[^"]+",|([^,]+(,|$))/,a)){
      $0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
      print b[1] b[4];x=0}' file

Input

"[email protected],www.example.com",field2,field3,field4  
"[email protected]",field2,field3,field4  
field1,"[email protected],www.example.com",field3,field4  

Output

"[email protected],www.example.com",field4
"[email protected]",field4
field1,field4

It even works on

field1,"field,2","but this field has ""escaped"\" quotes",field4

which the mighty FPAT variable fails on!


Explanation

 while(match($0,/"[^"]+",|([^,]+(,|$))/,a))

Starts a while loop that continues as long as the match succeeds (i.e. there is another field).
Each iteration matches the first occurrence of the regex, which corresponds to the next field, and stores the match in array a.

 $0=substr($0,RSTART+RLENGTH);b[++x]=a[0]

Sets $0 to begin at the end of the matched field and appends the matched field to the next position in array b.

  print b[1] b[4];x=0}

Prints the fields you want from b and sets x back to zero for the next line.


Flaws

Will fail if a field contains both escaped quotes and a comma.


Edit

Updated to support empty fields:

awk '{while(match($0,/("[^"]+",|[^,]*,|([^,]+$))/,a)){
     $0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
     print b[1] b[4];x=0}' file
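A quick check of the updated script on input that includes an empty field (the second row has nothing between field1 and field3); as above, the three-argument match() requires gawk:

```shell
# Input: one row with a quoted first field, one row with an empty second field.
cat > file <<'EOF'
"[email protected],www.example.com",field2,field3,field4
field1,,field3,field4
EOF

# The updated regex's middle alternative, [^,]*, , also matches an empty
# field followed by its comma, so empty fields land in b like any other.
gawk '{while(match($0,/("[^"]+",|[^,]*,|([^,]+$))/,a)){
     $0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
     print b[1] b[4];x=0}' file
```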