Regex to split a CSV
I know this (or similar) has been asked many times but having tried out numerous possibilities I've not been able to find a a regex that works 100%.
I've got a CSV file and I'm trying to split it into an array, but encountering two problems: quoted commas and empty elements.
The CSV looks like:
123,2.99,AMO024,Title,"Description, more info",,123987564
The regex I've tried to use is:
thisLine.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)
The only problem is that in my output array the 5th element comes out as 123987564 and not an empty string.
Solution 1:
Description
Instead of using a split, I think it would be easier to simply execute a match and process all the found matches.
This expression will:
- divide your sample text on the comma delimits
- will process empty values
- will ignore double quoted commas, providing double quotes are not nested
- trims the delimiting comma from the returned value
- trims surrounding quotes from the returned value
Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)
Example
Sample Text
123,2.99,AMO024,Title,"Description, more info",,123987564
ASP example using the non-java expression
Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.MultiLine = True
sourcestring = "your source string"
regEx.Pattern = "(?:^|,)(?=[^""]|("")?)""?((?(1)[^""]*|[^,""]*))""?(?=,|$)"
Set Matches = regEx.Execute(sourcestring)
For z = 0 to Matches.Count-1
results = results & "Matches(" & z & ") = " & chr(34) & Server.HTMLEncode(Matches(z)) & chr(34) & chr(13)
For zz = 0 to Matches(z).SubMatches.Count-1
results = results & "Matches(" & z & ").SubMatches(" & zz & ") = " & chr(34) & Server.HTMLEncode(Matches(z).SubMatches(zz)) & chr(34) & chr(13)
next
results=Left(results,Len(results)-1) & chr(13)
next
Response.Write "<pre>" & results
Matches using the non-java expression
Group 0 gets the entire substring which includes the comma
Group 1 gets the quote if it's used
Group 2 gets the value not including the comma
[0][0] = 123
[0][1] =
[0][2] = 123
[1][0] = ,2.99
[1][1] =
[1][2] = 2.99
[2][0] = ,AMO024
[2][1] =
[2][2] = AMO024
[3][0] = ,Title
[3][1] =
[3][2] = Title
[4][0] = ,"Description, more info"
[4][1] = "
[4][2] = Description, more info
[5][0] = ,
[5][1] =
[5][2] =
[6][0] = ,123987564
[6][1] =
[6][2] = 123987564
Solution 2:
Worked on this for a bit and came up with this solution:
(?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))
Try it out here!
This solution handles "nice" CSV data like
"a","b",c,"d",e,f,,"g"
0: "a"
1: "b"
2: c
3: "d"
4: e
5: f
6:
7: "g"
and uglier things like
"""test"" one",test' two,"""test"" 'three'","""test 'four'"""
0: """test"" one"
1: test' two
2: """test"" 'three'"
3: """test 'four'"""
Here's an explanation of how it works:
(?:,|\n|^) # all values must start at the beginning of the file,
# the end of the previous line, or at a comma
( # single capture group for ease of use; CSV can be either...
" # ...(A) a double quoted string, beginning with a double quote (")
(?: # character, containing any number (0+) of
(?:"")* # escaped double quotes (""), or
[^"]* # non-double quote characters
)* # in any order and any number of times
" # and ending with a double quote character
| # ...or (B) a non-quoted value
[^",\n]* # containing any number of characters which are not
# double quotes ("), commas (,), or newlines (\n)
| # ...or (C) a single newline or end-of-file character,
# used to capture empty values at the end of
(?:\n|$) # the file or at the ends of lines
)
Solution 3:
I created this a few months ago for a project.
".+?"|[^"]+?(?=,)|(?<=,)[^"]+
It works in C# and the Debuggex was happy when I selected Python and PCRE. Javascript doesn't recognize this form of Proceeded By ?<=....
For your values, it will create matches on
123
,2.99
,AMO024
,Title
"Description, more info"
,
,123987564
Note that anything in quotes doesn't have a leading comma, but attempting to match with a leading comma was required for the empty value use case. Once done, trim values as necessary.
I use RegexHero.Net to test my Regex.