Writing UTF-8 String to MySQL with Python

I am trying to push user account data from an Active Directory to our MySQL-Server. This works flawlessly but somehow the strings end up showing an encoded version of umlauts and other special characters.

The Active Directory returns a string using this sample format: M\xc3\xbcller

This actually is the UTF-8 encoding for Müller, but I want to write Müller to my database not M\xc3\xbcller.

I tried converting the string with this line, but it results in the same string in the database: tempEntry[1] = tempEntry[1].decode("utf-8")

If I run print "M\xc3\xbcller".decode("utf-8") in the python console the output is correct.

Is there any way to insert this string the right way? I need this specific format for a web developer who wants to have this exact format, I don't know why he is not able to convert the string using PHP directly.

Additional info: I am using MySQLdb; The table and column encoding is utf8_general_ci


Solution 1:

As @marr75 suggests, make sure you set charset='utf8' on your connections. Setting use_unicode=True is not strictly necessary as it is implied by setting the charset.

Then make sure you are passing unicode objects to your db connection as it will encode it using the charset you passed to the cursor. If you are passing a utf8-encoded string, it will be doubly encoded when it reaches the database.

So, something like:

conn = MySQLdb.connect(host="localhost", user='root', password='', db='', charset='utf8')
data_from_ldap = 'M\xc3\xbcller'
name = data_from_ldap.decode('utf8')
cursor = conn.cursor()
cursor.execute(u"INSERT INTO mytable SET name = %s", (name,))

You may also try forcing the connection to use utf8 by passing the init_command param, though I'm unsure if this is required. 5 mins testing should help you decide.

conn = MySQLdb.connect(charset='utf8', init_command='SET NAMES UTF8')

Also, and this is barely worth mentioning as 4.1 is so old, make sure you are using MySQL >= 4.1

Solution 2:

Assuming you are using MySQLdb you need to pass use_unicode=True and charset="utf8" when creating your connection.

UPDATE: If I run the following against a test table I get -

>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")
>>> c = db.cursor()
>>> c.execute("INSERT INTO last_names VALUES(%s)", (u'M\xfcller', ))
1L
>>> c.execute("SELECT * FROM last_names")
1L
>>> print c.fetchall()
(('M\xc3\xbcller',),)

This is "the right way", the characters are being stored and retrieved correctly, your friend writing the php script just isn't handling the encoding correctly when outputting.

As Rob points out, use_unicode and charset combined is being verbose about the connection, but I have a natural paranoia about even the most useful python libraries outside of the standard library so I try to be explicit to make bugs easy to find if the library changes.