Python UTF-16 CSV reader
At the moment, the Python 2.x csv module does not support UTF-16 input. In Python 3.x, csv expects a text-mode file, so you can simply use the encoding parameter of open to force another encoding:
# Python 3.x only
import csv
# newline='' is what the csv docs recommend, so the reader
# handles line endings inside quoted fields itself
with open('utf16.csv', 'r', encoding='utf16', newline='') as csvf:
    for line in csv.reader(csvf):
        print(line)  # do something with the line
In Python 2.x, you can recode the input:
# Python 2.x only
import codecs
import csv
class Recoder(object):
    """Wrap a binary stream, decoding with one codec and re-encoding
    with another, so csv.reader sees UTF-8 byte strings."""

    def __init__(self, stream, decoder, encoder, eol='\r\n'):
        self._stream = stream
        self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
        self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
        self._buf = ''
        self._eol = eol
        self._reachedEof = False

    def read(self, size=None):
        # size=None means "read everything"; flush the decoder at EOF
        r = self._stream.read() if size is None else self._stream.read(size)
        raw = self._decoder.decode(r, size is None)
        return self._encoder.encode(raw)

    def __iter__(self):
        return self

    def __next__(self):
        if self._reachedEof:
            raise StopIteration()
        while True:
            line, eol, rest = self._buf.partition(self._eol)
            if eol == self._eol:
                # a complete line is buffered; hand it back re-encoded
                self._buf = rest
                return self._encoder.encode(line + eol)
            raw = self._stream.read(1024)
            if raw == '':
                # end of input: flush any pending bytes out of the decoder
                self._buf += self._decoder.decode(b'', True)
                self._reachedEof = True
                return self._encoder.encode(self._buf)
            self._buf += self._decoder.decode(raw)

    next = __next__  # Python 2 iterator protocol

    def close(self):
        return self._stream.close()

with open('test.csv', 'rb') as f:
    sr = Recoder(f, 'utf-16', 'utf-8')
    for row in csv.reader(sr):
        print(row)
Note that open and codecs.open require the file to start with a BOM. If it doesn't (or you're on Python 2.x), you can still convert it in memory, like this:
try:
    from io import BytesIO
except ImportError:  # Python < 2.6
    from StringIO import StringIO as BytesIO
import csv

with open('utf16.csv', 'rb') as binf:
    c = binf.read().decode('utf-16').encode('utf-8')

for line in csv.reader(BytesIO(c)):
    print(line)  # do something with the line
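If you know the byte order in advance, Python 3.x also lets you sidestep the BOM requirement by naming the endianness in the codec. A minimal sketch, assuming a little-endian file without a BOM (the filename is hypothetical):

# Python 3.x only; assumes little-endian UTF-16 with no BOM
import csv
with open('utf16_no_bom.csv', 'r', encoding='utf-16-le', newline='') as csvf:
    for line in csv.reader(csvf):
        print(line)  # do something with the line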
The example in the Python 2.x csv module documentation shows how to handle other encodings.
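The gist of that documentation recipe is to re-encode everything to UTF-8 before it reaches csv.reader, then decode each cell back to unicode afterwards. A condensed sketch of the same idea (not the documentation's exact code; unicode_csv_rows is a name of my own):

# Python 2.6+ only: re-encode each decoded line to UTF-8, parse it
# with csv.reader, then decode the resulting cells back to unicode
import codecs
import csv

def unicode_csv_rows(path, encoding):
    with codecs.open(path, 'r', encoding=encoding) as f:
        utf8_lines = (line.encode('utf-8') for line in f)
        for row in csv.reader(utf8_lines):
            yield [cell.decode('utf-8') for cell in row]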
I would strongly suggest that you recode your file(s) to UTF-8. Under the very likely condition that you don't have any Unicode characters outside the BMP, UTF-16 is effectively a fixed-length (two bytes per character) encoding, so you can read fixed-length blocks from your input file without worrying about a character straddling a block boundary.
Step 1: Determine what encoding you actually have. Examine the first few bytes of your file:
print repr(open('thefile.csv', 'rb').read(100))

Four possible ways of encoding u'abc':

\xfe\xff\x00a\x00b\x00c -> utf_16 (big-endian, BOM present)
\xff\xfea\x00b\x00c\x00 -> utf_16 (little-endian, BOM present)
\x00a\x00b\x00c         -> utf_16_be (no BOM)
a\x00b\x00c\x00         -> utf_16_le (no BOM)
If you have any trouble with this step, edit your question to include the results of the above print repr().
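If you'd rather automate the check, here's a small helper that applies the table above. It's only a sketch, and it assumes that the first character of a BOM-less file is an ASCII character:

# Python 2.x sketch: guess the UTF-16 variant from the first few bytes,
# using the byte patterns in the table above
def guess_utf16_variant(prefix):
    if prefix[:2] in ('\xfe\xff', '\xff\xfe'):
        return 'utf_16'      # BOM present; the codec picks the byte order
    if prefix[:1] == '\x00':
        return 'utf_16_be'   # NUL first => big-endian ASCII character
    if prefix[1:2] == '\x00':
        return 'utf_16_le'   # NUL second => little-endian ASCII character
    return None              # probably not UTF-16 at all

print guess_utf16_variant(open('thefile.csv', 'rb').read(4))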
Step 2: Here's a Python 2.x recode-UTF-16*-to-UTF-8 script:

import sys

infname, outfname, enc = sys.argv[1:4]
fi = open(infname, 'rb')
fo = open(outfname, 'wb')
BUFSIZ = 64 * 1024 * 1024  # even number of bytes, so no UTF-16 unit is split
first = True
while 1:
    buf = fi.read(BUFSIZ)
    if not buf:
        break
    if first and enc == 'utf_16':
        # strip the BOM and pin down the byte order from it
        bom = buf[:2]
        buf = buf[2:]
        enc = {'\xfe\xff': 'utf_16_be', '\xff\xfe': 'utf_16_le'}[bom]
        # KeyError means the file doesn't start with a valid BOM
    first = False
    fo.write(buf.decode(enc).encode('utf8'))
fi.close()
fo.close()
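Invocation would look something like this (the script name is hypothetical; the encoding argument is whichever name Step 1 gave you):

python recode.py thefile.csv thefile_utf8.csv utf_16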
Other matters:

You say that your files are too big to read whole, recode, and rewrite, yet you can open them in vi. Please explain.
The <85> being treated as end of record is a bit of a worry. It looks like 0x85 is being recognised as NEL (a C1 control code, NEWLINE). There is a strong possibility that the data was originally encoded in some legacy single-byte encoding where 0x85 has a meaning, but was transcoded to UTF-16 under the false assumption that the original encoding was ISO-8859-1 aka latin1. Where did the file originate? An IBM mainframe? Windows/Unix/classic Mac? What country, locale, language? You obviously think that the <85> is not meant to be a newline; what do you think it means?
Please feel free to send a copy of a cut-down file (that includes some of the <85> stuff) to sjmachin at lexicon dot net
Update based on the 1-line sample data provided.
This confirms my suspicions. Read this. Here's a quote from it:
... the C1 control characters ... are rarely used directly, except on specific platforms such as OpenVMS. When they turn up in documents, Web pages, e-mail messages, etc., which are ostensibly in an ISO-8859-n encoding, their code positions generally refer instead to the characters at that position in a proprietary, system-specific encoding such as Windows-1252 or the Apple Macintosh ("MacRoman") character set that use the codes provided for representation of the C1 set with a single 8-bit byte to instead provide additional graphic characters
This code:
s1 = '\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
s2 = s1.decode('utf16')
print 's2 repr:', repr(s2)
from unicodedata import name
from collections import Counter
non_ascii = Counter(c for c in s2 if c >= u'\x80')
print 'non_ascii:', non_ascii
for c in non_ascii:
    print "from: U+%04X %s" % (ord(c), name(c, "<no name>"))
    c2 = c.encode('latin1').decode('cp1252')
    print "to: U+%04X %s" % (ord(c2), name(c2, "<no name>"))

s3 = u''.join(
    # remap only the C1 range (U+0080..U+009F) through cp1252
    c.encode('latin1').decode('cp1252') if u'\x80' <= c < u'\xA0' else c
    for c in s2
)
print 's3 repr:', repr(s3)
print 's3:', s3
produces the following (Python 2.7.2 IDLE, Windows 7):
s2 repr: u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'
non_ascii: Counter({u'\x85': 1, u'\xfc': 1, u'\x96': 1})
from: U+0085 <no name>
to: U+2026 HORIZONTAL ELLIPSIS
from: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
to: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
from: U+0096 <no name>
to: U+2013 EN DASH
s3 repr: u'1,2,G,S,H f\xfcr e \u2013 m \u2026,,I\r\n'
s3: 1,2,G,S,H für e – m …,,I
Which do you think is a more reasonable interpretation of \x96: SPA, i.e. Start of Protected Area (used by block-oriented terminals), or EN DASH?
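If the cp1252 interpretation wins out, the s3 remapping above extends naturally to the whole decoded file. A sketch under that assumption (filenames hypothetical; note that five positions in the C1 range are undefined in cp1252 and would raise UnicodeDecodeError):

# Python 2.x sketch: decode UTF-16, remap the C1 range (U+0080..U+009F)
# through cp1252, and write the result back out as UTF-8
text = open('thefile.csv', 'rb').read().decode('utf-16')
fixed = u''.join(
    c.encode('latin1').decode('cp1252') if u'\x80' <= c < u'\xA0' else c
    for c in text
)
open('thefile_fixed.csv', 'wb').write(fixed.encode('utf-8'))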
Looks like a thorough analysis of a much larger data sample is warranted. Happy to help.