Parse CSV with double quote in some cases
You could use Matcher.find with the following regular expression:
\s*("[^"]*"|[^,]*)\s*
Here's a more complete example:
String s = "a1, a2, a3, \"a4,a5\", a6";
Pattern pattern = Pattern.compile("\\s*(\"[^\"]*\"|[^,]*)\\s*");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
See it working online: ideone
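One caveat with the loop above: every part of that pattern is optional, so find() also reports an empty match at each comma and at the end of the input, and blank lines show up between the values. Here is a minimal sketch of one way around that, assuming it is acceptable for the pattern to consume the separator as well. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class RegexCsvFields {
    // A field (quoted, or a lazy run of non-commas) followed by a comma or end of input.
    private static final Pattern FIELD =
            Pattern.compile("\\s*(\"[^\"]*\"|[^,]*?)\\s*(,|$)");

    static List<String> fields(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(s);
        while (m.find()) {
            out.add(m.group(1));
            // An empty group 2 means the match ended at end of input, not at a comma.
            if (m.group(2).isEmpty()) {
                break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [a1, a2, a3, "a4,a5", a6]
        System.out.println(fields("a1, a2, a3, \"a4,a5\", a6"));
    }
}
```

Quoted fields keep their surrounding quotes, as in the original loop, and a genuinely empty field (as in "a,,b") still comes back as an empty string. Doubled quotes inside a quoted field are not handled by this pattern.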
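I came across this same problem (but in Python). One way I found to solve it without regexes: when you get the line, split it on the quote character. The odd-indexed pieces of the resulting array are the full quoted values; split the even-indexed pieces on commas to get the unquoted values. A line without quotes gives a single piece, so no separate check is needed.
I'm no Java coder, so treat this as a sketch. It assumes quotes are balanced and never escaped as "".
// needs java.util.ArrayList and java.util.List
static List<String> splitRow(String row) {
    List<String> line = new ArrayList<>();
    String[] vals = row.split("\"", -1);
    for (int i = 0; i < vals.length; i++) {
        if (i % 2 == 1) {
            line.add(vals[i]);   // inside quotes: keep the value whole
            continue;
        }
        String[] parts = vals[i].split(",", -1);
        // Skip the fragment right after a closing quote and the one right
        // before an opening quote; they belong to the quoted neighbours.
        int from = (i > 0) ? 1 : 0;
        int to = (i < vals.length - 1) ? parts.length - 1 : parts.length;
        for (int k = from; k < to; k++) {
            line.add(parts[k].trim());   // unquoted values are trimmed
        }
    }
    return line;
}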
Alternatively, use a regex.
Here is some code for you. I hope using code from here doesn't count as using open source, which it is.
package bestsss.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SplitCSVLine {
    public static String[] splitCSV(BufferedReader reader) throws IOException {
        return splitCSV(reader, null, ',', '"');
    }

    /**
     * Reads one CSV record, which may span several lines if a quoted field contains newlines.
     *
     * @param reader a line-based reader (we are lazy and rely on readLine())
     * @param expectedColumns optional int[1]: the expected column count goes in, the actual count comes back
     * @param separator the C(omma) in CSV (or an alternative such as a semicolon)
     * @param quote double quote char ('"') or an alternative
     * @return String[] containing the fields
     * @throws IOException
     */
    public static String[] splitCSV(BufferedReader reader, int[] expectedColumns, char separator, char quote) throws IOException {
        final List<String> tokens = new ArrayList<String>(expectedColumns == null ? 8 : expectedColumns[0]);
        final StringBuilder sb = new StringBuilder(24);

        for (boolean quoted = false; ; sb.append('\n')) { // lazy, we do not preserve the original new line, but meh
            final String line = reader.readLine();
            if (line == null)
                break;
            for (int i = 0, len = line.length(); i < len; i++) {
                final char c = line.charAt(i);
                if (c == quote) {
                    if (quoted && i < len - 1 && line.charAt(i + 1) == quote) { // doubled quote inside a quoted field
                        sb.append(c);
                        i++; // skip it
                    } else {
                        if (quoted) {
                            // next symbol must be either separator or eol according to RFC 4180
                            if (i == len - 1 || line.charAt(i + 1) == separator) {
                                quoted = false;
                                continue;
                            }
                        } else { // not quoted
                            if (sb.length() == 0) { // at the very start
                                quoted = true;
                                continue;
                            }
                        }
                        // if we fall here, bogus, just add the quote and move on; or throw an exception if you like
                        /*
                        5. Each field may or may not be enclosed in double quotes (however
                           some programs, such as Microsoft Excel, do not use double quotes
                           at all). If fields are not enclosed with double quotes, then
                           double quotes may not appear inside the fields.
                        */
                        sb.append(c);
                    }
                } else if (c == separator && !quoted) {
                    tokens.add(sb.toString());
                    sb.setLength(0);
                } else {
                    sb.append(c);
                }
            }
            if (!quoted)
                break;
        }
        tokens.add(sb.toString()); // add last
        if (expectedColumns != null)
            expectedColumns[0] = tokens.size();
        return tokens.toArray(new String[tokens.size()]);
    }

    public static void main(String[] args) throws Throwable {
        java.io.StringReader r = new java.io.StringReader("222,\"\"\"zzzz\", abc\"\" , 111 ,\"1\n2\n3\n\"");
        System.out.println(java.util.Arrays.toString(splitCSV(new BufferedReader(r))));
    }
}
The code below seems to work well and can handle quotes within quotes.
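import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A quoted field at the start of the input, followed by a comma; "" inside the field is an escaped quote.
static final Pattern quote = Pattern.compile("^\\s*\"((?:[^\"]|(?:\"\"))*?)\"\\s*,");

public static List<String> parseCsv(String line) throws Exception
{
    List<String> list = new ArrayList<String>();
    line += ","; // sentinel so that every field, including the last, ends with a comma
    for (int x = 0; x < line.length(); x++)
    {
        String s = line.substring(x);
        if (s.trim().startsWith("\""))
        {
            // quoted field: take everything up to the closing quote and un-double the quotes
            Matcher m = quote.matcher(s);
            if (!m.find())
                throw new Exception("CSV is malformed");
            list.add(m.group(1).replace("\"\"", "\""));
            x += m.end() - 1; // the loop's x++ then steps past the comma
        }
        else
        {
            // unquoted field: everything up to the next comma
            int y = s.indexOf(",");
            if (y == -1)
                throw new Exception("CSV is malformed");
            list.add(s.substring(0, y));
            x += y;
        }
    }
    return list;
}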
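To see what that quote pattern captures on its own, here is a small sketch that applies the same expression to a single string. The class and method names are invented for the example, and returning null for "no quoted field at the start" is just a convenience for the demo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, only meant to illustrate the pattern.
final class QuotePatternDemo {
    // Same expression as the quote pattern in parseCsv.
    static final Pattern QUOTE =
            Pattern.compile("^\\s*\"((?:[^\"]|(?:\"\"))*?)\"\\s*,");

    // Returns the unescaped leading quoted field, or null if the input
    // does not start with a quoted field followed by a comma.
    static String firstQuotedField(String s) {
        Matcher m = QUOTE.matcher(s);
        return m.find() ? m.group(1).replace("\"\"", "\"") : null;
    }

    public static void main(String[] args) {
        // prints: say "hi", ok
        System.out.println(firstQuotedField(" \"say \"\"hi\"\", ok\" ,next,"));
    }
}
```

The lazy group can only stop at a quote that is followed by optional whitespace and a comma, which is why the comma inside the quoted text does not end the field. An input such as "abc"x, gives no match, which is the case parseCsv reports as malformed.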