How to fix "Incorrect string value" errors?

After noticing an application tended to discard random emails due to incorrect string value errors, I went though and switched many text columns to use the utf8 column charset and the default column collate (utf8_general_ci) so that it would accept them. This fixed most of the errors, and made the application stop getting sql errors when it hit non-latin emails, too.

Despite this, some of the emails are still causing the program to hit incorrect string value errrors: (Incorrect string value: '\xE4\xC5\xCC\xC9\xD3\xD8...' for column 'contents' at row 1)

The contents column is a MEDIUMTEXT datatybe which uses the utf8 column charset and the utf8_general_ci column collate. There are no flags that I can toggle in this column.

Keeping in mind that I don't want to touch or even look at the application source code unless absolutely necessary:

  • What is causing that error? (yes, I know the emails are full of random garbage, but I thought utf8 would be pretty permissive)
  • How can I fix it?
  • What are the likely effects of such a fix?

One thing I considered was switching to a utf8 varchar([some large number]) with the binary flag turned on, but I'm rather unfamiliar with MySQL, and have no idea if such a fix makes sense.


Solution 1:

UPDATE to the below answer:

The time the question was asked, "UTF8" in MySQL meant utf8mb3. In the meantime, utf8mb4 was added, but to my knowledge MySQLs "UTF8" was not switched to mean utf8mb4.

That means, you'd need to specifically put "utf8mb4", if you mean it (and you should use utf8mb4)

I'll keep this here instead of just editing the answer, to make clear there is still a difference when saying "UTF8"

Original

I would not suggest Richies answer, because you are screwing up the data inside the database. You would not fix your problem but try to "hide" it and not being able to perform essential database operations with the crapped data.

If you encounter this error either the data you are sending is not UTF-8 encoded, or your connection is not UTF-8. First, verify, that the data source (a file, ...) really is UTF-8.

Then, check your database connection, you should do this after connecting:

SET NAMES 'utf8mb4';
SET CHARACTER SET utf8mb4;

Next, verify that the tables where the data is stored have the utf8mb4 character set:

SELECT
  `tables`.`TABLE_NAME`,
  `collations`.`character_set_name`
FROM
  `information_schema`.`TABLES` AS `tables`,
  `information_schema`.`COLLATION_CHARACTER_SET_APPLICABILITY` AS `collations`
WHERE
  `tables`.`table_schema` = DATABASE()
  AND `collations`.`collation_name` = `tables`.`table_collation`
;

Last, check your database settings:

mysql> show variables like '%colla%';
mysql> show variables like '%charac%';

If source, transport and destination are utf8mb4, your problem is gone;)

Solution 2:

MySQL’s utf-8 types are not actually proper utf-8 – it only uses up to three bytes per character and supports only the Basic Multilingual Plane (i.e. no Emoji, no astral plane, etc.).

If you need to store values from higher Unicode planes, you need the utf8mb4 encodings.