Excel save behaviour of CSV file with UTF8 encoding vs UTF8-Bom encoding
My original csv files are encoded with UTF8-BOM, this is so Excel can recognise the Unicode charcters (If its just encoded with UTF8 then Excel does not recognise the unicode characters).
My issue is that with a UTF8 encoded CSV file, when you click save in Excel it gives you a popup option to keep the original formatting. However if the file is UTF8-BOM encoded then clicking the save is instant and there is no option to keep the formatting. The CSV file then has it commas replaced with tabs.
Solution 1:
What happens is the following.
- Excel looks at your systems Regional settings to find what list-separator is configured, in your case, a comma. Here it's important to point out that CSV is not a single, well-defined format. Any separator could be used...(Edit: Well, there is actually a guideline that pins the separator down to being a comma, but this is not a binding standard, and the guideline is not strictly followed in practice.)
- Then, excel transforms the data, in-memory, into an XLSX file. You do your excel-ly things. Keep in mind that this is effectively a file-type conversion, and Excel does not remember what encoding your source file was in, or what separator was being used.
- You make canges, and click save. At that moment, Excel does no longer have a CSV file, but a XLSX file, and needs to convert it back to something "CSV-like". Because it knows the encoding needs to be UTF-8 with BOM, it decides to Save As "Unicode text" (it does this without explicitly mentioning this). In this format, they use UCS-2 LE with BOM as encoding, and tab as the separator. This is not the CSV standard you had in mind, or the standard your original file was formatted as. It's weird and counterintuitive, but strictly speaking not wrong.
So, what can you do?
In theory you should be able to use the Save as CSV option, and through the Tools->"Web options" menu (in the save as dialog) set the encoding to UTF-8. The only problem here is that this doesn't seem to work properly. My guess is that it is a broken feature. More on this on Stack Overflow: Excel to CSV with UTF-8 encoding
So, what can you really do?
Just use UTF-8 BOM, use save, and open the resulting CSV file to replace all tabs with commas.
And what should you really do?
Don't use Excel to load and edit CSV files. Or at least be a lot more careful about it...
Why not?
I've written a (way too long) article about it over here.