How to make MySQL handle UTF-8 properly
One of the responses to a question I asked yesterday suggested that I should make sure my database can handle UTF-8 characters correctly. How I can do this with MySQL?
Solution 1:
Update:
Short answer - You should almost always be using the utf8mb4
charset and utf8mb4_unicode_ci
collation.
To alter database:
ALTER DATABASE dbname CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
See:
Aaron's comment on this answer How to make MySQL handle UTF-8 properly
What's the difference between utf8_general_ci and utf8_unicode_ci
Conversion guide: https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-conversion.html
Original Answer:
MySQL 4.1 and above has a default character set of UTF-8. You can verify this in your my.cnf
file, remember to set both client and server (default-character-set
and character-set-server
).
If you have existing data that you wish to convert to UTF-8, dump your database, and import it back as UTF-8 making sure:
- use
SET NAMES utf8
before you query/insert into the database - use
DEFAULT CHARSET=utf8
when creating new tables - at this point your MySQL client and server should be in UTF-8 (see
my.cnf
). remember any languages you use (such as PHP) must be UTF-8 as well. Some versions of PHP will use their own MySQL client library, which may not be UTF-8 aware.
If you do want to migrate existing data remember to backup first! Lots of weird choping of data can happen when things don't go as planned!
Some resources:
- complete UTF-8 migration (cdbaby.com)
- article on UTF-8 readiness of php functions (note some of this information is outdated)
Solution 2:
To make this 'permanent', in my.cnf
:
[client]
default-character-set=utf8
[mysqld]
character-set-server = utf8
To check, go to the client and show some variables:
SHOW VARIABLES LIKE 'character_set%';
Verify that they're all utf8
, except ..._filesystem
, which should be binary
and ..._dir
, that points somewhere in the MySQL installation.
Solution 3:
MySQL 4.1 and above has a default character set that it calls utf8
but which is actually only a subset of UTF-8 (allows only three-byte characters and smaller).
Use utf8mb4
as your charset if you want "full" UTF-8.