Ruby read CSV file as UTF-8 and/or convert ASCII-8Bit encoding to UTF-8
Solution 1:
deceze is right, that is ISO8859-1 (AKA Latin-1) encoded text. Try this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")
And if that doesn't work, you can use Iconv (part of the standard library up to Ruby 1.9, removed in 2.0) to fix up the individual strings with something like this:
require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first
If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:
utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)
With newer Rubies, you can do things like this:
utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')
where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
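As a minimal sketch of the force_encoding approach (the sample strings are made up for illustration), the snippet below re-tags the bytes as ISO-8859-1 and then transcodes them to UTF-8. Note that force_encoding changes the receiver in place, so the sketch works on a copy; the array version is a plain map, standing in for what Iconv.iconv did with multiple arguments.

```ruby
# Hypothetical input: Latin-1 bytes that Ruby has tagged as ASCII-8BIT (binary).
latin1_string = "Non sp\xE9cifi\xE9".b

# force_encoding only re-labels the bytes (and mutates the receiver, hence dup);
# encode then actually converts them to UTF-8.
utf8_string = latin1_string.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts utf8_string                  # Non spécifié
puts utf8_string.encoding         # UTF-8
puts utf8_string.valid_encoding?  # true

# The same conversion applied to a whole array of strings.
latin1_strings = ["caf\xE9".b, "na\xEFve".b]
utf8_strings = latin1_strings.map do |s|
  s.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end
puts utf8_strings.inspect
```

Skipping the force_encoding step and calling encode('UTF-8') directly on an ASCII-8BIT string raises an error for bytes above 0x7F, because Ruby has no conversion table from binary to UTF-8.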
Solution 2:
With Ruby >= 1.9 you can use:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1:utf-8")
The encoding string ISO8859-1:utf-8 means: the CSV file is ISO8859-1 encoded, but its content is converted to UTF-8 as it is read.
If you prefer more verbose code, you can use:
file_contents = CSV.read("csvfile.csv", col_sep: "$",
                         external_encoding: "ISO8859-1",
                         internal_encoding: "utf-8")
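To see the external:internal encoding pair in action, here is a small self-contained sketch. The file name prefix and the sample rows are made up for illustration; it writes a few Latin-1 bytes to a temporary file and reads them back with the same options as above.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample data: "$"-separated rows containing Latin-1 bytes (\xE9 is é).
latin1_bytes = "id$label\n1$Non sp\xE9cifi\xE9\n".b

rows = nil
Tempfile.create(['latin1_sample', '.csv']) do |f|
  f.binmode              # write the bytes exactly as given
  f.write(latin1_bytes)
  f.close
  # The file is ISO8859-1 on disk; fields come back transcoded to UTF-8.
  rows = CSV.read(f.path, col_sep: '$', encoding: 'ISO8859-1:utf-8')
end

label = rows[1][1]
puts label           # Non spécifié
puts label.encoding  # UTF-8
```

This only gives sensible results if the file really is ISO8859-1; for a file in another encoding (Windows-1252, for example) the external part of the pair would need to change accordingly.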
Solution 3:
I have been dealing with this issue for a while, and none of the other solutions worked for me.
What did the trick was to write the problematic string to a file in binary mode, then read the file back normally and feed the resulting string to the CSV module:
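require 'csv'
require 'tempfile'

# Write the raw bytes to a temporary file in binary mode.
tempfile = Tempfile.new("conflictive_string")
tempfile.binmode
tempfile.write(conflictive_string)
tempfile.close
# Read the file back in text mode, then remove it.
cleaned_string = File.read(tempfile.path)
File.delete(tempfile.path)
csv = CSV.new(cleaned_string)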
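A likely explanation for why the round trip helps, offered as an assumption rather than something stated above: File.read without an encoding argument tags the string it returns with Encoding.default_external (commonly UTF-8), so the bytes stay the same and only the encoding label changes. If that is what is happening, the sketch below should have the same effect without touching the disk. The sample data is hypothetical, and UTF-8 is hard-coded in place of Encoding.default_external to keep the example deterministic.

```ruby
require 'csv'

# Hypothetical input: valid UTF-8 bytes that arrived tagged as ASCII-8BIT.
conflictive_string = "id,label\n1,Non sp\xC3\xA9cifi\xC3\xA9\n".b

# Re-label the bytes as UTF-8 on a copy; no bytes are changed.
cleaned_string = conflictive_string.dup.force_encoding(Encoding::UTF_8)

rows = CSV.new(cleaned_string).read
puts cleaned_string.valid_encoding?  # true
puts rows[1][1]                      # Non spécifié
```

This only applies when the bytes really are in the target encoding. If they are Latin-1, as in the original question, re-labelling them as UTF-8 produces an invalid string, and transcoding as in Solution 1 or 2 is needed instead.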