True difference between Excel CSV and Standard CSV

What is the true difference between an Excel CSV and a standard CSV?

For example, when handling columns with line breaks inside one cell, how do they encode it differently?


Solution 1:

It absolutely depends on what you define as "standard" CSV. As far as I'm concerned, Excel follows the rules outlined in RFC 4180, the "Common Format and MIME Type for CSV Files".

Consider a table where the first cell in the first row has two line breaks. In Excel, it would look like the following:

+---+---------------+------------+------------+
|   | A             | B          | C          |
+---+---------------+------------+------------+
| 1 | col1, line1a  |            |            |
|   | col1, line1b  |            |            |
|   | col1, line1c  | col2, row1 | col3, row1 |
| 2 | col1, row2    | col2, row2 | col3, row2 |
+---+---------------+------------+------------+

Now, how would Excel export this? Let's see – a text editor would display this:

"col1, line1a
col1, line1b
col1, line1c","col2, row1","col3, row1"
"col1, row2","col2, row2","col3, row2"

Not very sophisticated. It inserts a carriage return (hex 0D) where the line break was in our cell. Every cell is surrounded by double quotes. Also, the actual rows are separated with a carriage return.

In order to parse this correctly, a CSV parser would need to

  • ignore that carriage return when it appears within double quotes (i.e. a cell)
  • not ignore that carriage return when it appears outside of double quotes

If it didn't do that, you'd end up with something garbled like this – note that there are now four lines instead of two, because it failed to ignore the line breaks.

+---+---------------+------------+------------+
|   | A             | B          | C          |
+---+---------------+------------+------------+
| 1 | "col1, line1a |            |            |
| 2 | col1, line1b  |            |            |
| 3 | col1, line1c" | col2, row1 | col3, row1 |
| 4 | col1, row2    | col2, row2 | col3, row2 |
+---+---------------+------------+------------+

But, let's see what the RFC says, maybe Excel did it right?

Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

Neat, that's exactly what Excel did. So summarizing, Excel seems to follow the recommendations of a "standard" CSV file. Given a proper CSV parser, it should be able to read Excel CSV files as well.