How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?
Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF have encodings of 3 bytes or fewer in UTF-8. The \uD800-\uDFFF range is reserved for UTF-16 surrogates, which pair up to represent characters beyond \uFFFF. I do not know Python, but you should be able to set up a regular expression that matches anything outside those ranges.
pattern = re.compile(u"[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile(u"[^\u0000-\uFFFF]", re.UNICODE)
Edit: adding Python from Denilson Sá's script in the question body:
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
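For readers on Python 3, where every `str` is already Unicode and the `u''` prefix is optional, the same substitution might look like the sketch below. The names `_NON_BMP` and `replace_non_bmp` are made up for illustration and do not come from the question; the character class is the one from the snippet above.

```python
import re

# Matches anything outside the two ranges that fit in 3 UTF-8 bytes:
# supplementary characters (4 bytes in UTF-8) and lone surrogates.
_NON_BMP = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

def replace_non_bmp(text, replacement='\uFFFD'):
    """Replace every character that would not fit in 3 UTF-8 bytes."""
    return _NON_BMP.sub(replacement, text)

# U+1D41F needs 4 bytes in UTF-8, so it is swapped for U+FFFD.
print(ascii(replace_non_bmp('a\U0001D41Fb')))  # 'a\ufffdb'
```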
You may skip the decoding and encoding steps and directly detect the value of the first byte (8-bit string) of each character. According to UTF-8:
#1-byte characters have the following format: 0xxxxxxx
#2-byte characters have the following format: 110xxxxx 10xxxxxx
#3-byte characters have the following format: 1110xxxx 10xxxxxx 10xxxxxx
#4-byte characters have the following format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
According to that, you need to check only the first byte of each character to filter out 4-byte characters:
def filter_4byte_chars(s):
    i = 0
    j = len(s)
    # you need to convert
    # the immutable string
    # to a mutable list first
    s = list(s)
    while i < j:
        # get the value of this byte
        k = ord(s[i])
        # this is a 1-byte character, skip to the next byte
        if k <= 127:
            i += 1
        # this is a 2-byte character, skip ahead by 2 bytes
        elif k < 224:
            i += 2
        # this is a 3-byte character, skip ahead by 3 bytes
        elif k < 240:
            i += 3
        # this is a 4-byte character, remove it and update
        # the length of the string we need to check
        else:
            s[i:i+4] = []
            j -= 4
    return ''.join(s)
Skipping the decoding and encoding parts will save you some time, and for smaller strings that mostly have 1-byte characters this could even be faster than the regular expression filtering.
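The same lead-byte idea could be sketched for Python 3, where indexing a `bytes` object yields integers directly. This is only a sketch under the assumption that the input is well-formed UTF-8 (a stray continuation byte in lead position would be misclassified, just as in the function above). The names `utf8_sequence_length` and `drop_4byte_sequences` are hypothetical.

```python
def utf8_sequence_length(lead_byte):
    """Length of the UTF-8 sequence that starts with lead_byte (an int 0-255)."""
    if lead_byte < 0x80:   # 0xxxxxxx
        return 1
    if lead_byte < 0xE0:   # 110xxxxx
        return 2
    if lead_byte < 0xF0:   # 1110xxxx
        return 3
    return 4               # 11110xxx

def drop_4byte_sequences(data):
    """Return a copy of UTF-8 encoded bytes without any 4-byte sequences."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n < 4:
            # keep 1-, 2- and 3-byte sequences unchanged
            out += data[i:i + n]
        i += n
    return bytes(out)
```

Copying the kept sequences into a new `bytearray` avoids repeatedly deleting slices from the middle of a list, which gets slow when a long input contains many 4-byte characters.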
And just for the fun of it, an itertools monstrosity :)
import functools as ft, itertools as it, operator as op
def max3bytes(unicode_string):
    # sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
    pairs= it.izip(unicode_string, it.repeat(u'\ufffd'))
    # is the argument greater than 65535, i.e. outside the BMP?
    selector= ft.partial(op.lt, 65535)
    # using the character ordinals, return 0 or 1 based on `selector`
    indexer= it.imap(selector, it.imap(ord, unicode_string))
    # now pick the correct item for all pairs
    return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
Encode as UTF-16, then reencode as UTF-8.
>>> import struct
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
Note that you can't encode after joining, since the surrogate pairs may be decoded before reencoding.
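On Python 3 the strict UTF-8 codec refuses to encode surrogates, so a rough equivalent of the session above has to opt in with the `surrogatepass` error handler. The sketch below assumes Python 3; the function name `utf8_with_surrogate_pairs` is invented for illustration.

```python
import struct

def utf8_with_surrogate_pairs(text):
    """Encode text so each supplementary character becomes two 3-byte sequences."""
    utf16 = text.encode('utf-16-le')
    # split the UTF-16 data into its 16-bit code units
    units = struct.unpack('<%dH' % (len(utf16) // 2), utf16)
    # 'surrogatepass' lets the UTF-8 codec emit bytes for surrogate code
    # points, which the default strict handler rejects.
    return b''.join(chr(u).encode('utf-8', 'surrogatepass') for u in units)

# U+1D41F is the surrogate pair D835 DC1F in UTF-16.
print(utf8_with_surrogate_pairs('\U0001D41F'))  # b'\xed\xa0\xb5\xed\xb0\x9f'
```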
MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:
mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)
>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
>>> csr.execute('select * from utf8test')
>>> r = csr.fetchone()
>>> r
>>> print r[0]
According to the MySQL 5.1 documentation: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP." This indicates that there might be a problem with surrogate pairs.
Note that chapter 3 of the Unicode Standard 5.2 actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence. See for example page 93: "Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed." However, as far as I know, this proscription is largely unknown or ignored.
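One way to see the rule in practice is to feed both encodings of the same character to a strict decoder. The sketch below assumes Python 3, whose built-in UTF-8 codec rejects surrogate byte sequences by default; `is_well_formed_utf8` is a hypothetical helper, not a standard function.

```python
def is_well_formed_utf8(data):
    """True if the bytes decode under Python 3's strict UTF-8 codec."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# surrogate pair D835 DC1F written as two 3-byte sequences
pair_as_two_sequences = b'\xed\xa0\xb5\xed\xb0\x9f'
# the same character, U+1D41F, as one 4-byte sequence
single_sequence = '\U0001D41F'.encode('utf-8')

print(is_well_formed_utf8(pair_as_two_sequences))  # False
print(is_well_formed_utf8(single_sequence))        # True
```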
It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code will provide a simple-enough check:
all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)
and this code will replace any "nasties" with u'\ufffd':
u''.join(
    uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
    for uc in unicode_string
    )
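Wrapped up as functions, the check and the replacement might look like the sketch below. It assumes Python 3, where iterating a `str` yields whole code points, so a supplementary character shows up as a single item above '\uffff' rather than as two surrogates. The names `is_bmp_only` and `replace_nasties` are invented for illustration.

```python
def _fits_in_3_bytes(ch):
    # BMP scalar values only: surrogates and supplementary characters fail
    return ch < '\ud800' or '\ue000' <= ch <= '\uffff'

def is_bmp_only(text):
    """True if every character encodes to at most 3 bytes of UTF-8."""
    return all(_fits_in_3_bytes(ch) for ch in text)

def replace_nasties(text, replacement='\ufffd'):
    """Swap every character that fails the check for the replacement."""
    return ''.join(ch if _fits_in_3_bytes(ch) else replacement for ch in text)
```

Running the cheap `is_bmp_only` check first lets a caller skip building a new string when nothing needs replacing.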