How much UTF-8 text fits in a MySQL "Text" field?
According to MySQL, a text column holds 65,535 bytes.
So if this is a legitimate boundary, will it actually fit only about 32k UTF-8 characters? Or is this one of those "fuzzy" boundaries where the people who wrote the docs can't tell characters from bytes, and it will actually allow ~64k UTF-8 characters if set to something like utf8_general_ci?
Solution 1:
A text column can be up to 65,535 bytes.
A character in MySQL's utf8 character set can take up to 3 bytes.
So in the worst case your actual limit is 21,844 characters.
See the manual for more info: http://dev.mysql.com/doc/refman/5.1/en/string-type-overview.html
A variable-length string. M represents the maximum column length in characters. The range of M is 0 to 65,535. The effective maximum length of a VARCHAR is subject to the maximum row size (65,535 bytes, which is shared among all columns) and the character set used. For example, utf8 characters can require up to three bytes per character, so a VARCHAR column that uses the utf8 character set can be declared to be a maximum of 21,844 characters.
Solution 2:
UTF-8 characters can take up to 4 bytes each, not 2 as you are supposing. UTF-8 is a variable-width encoding, depending on the number of significant bits in the Unicode code point:
- 7 bits and under in the Unicode code point: 1 byte in UTF-8
- 8 to 11 bits: 2 bytes in UTF-8
- 12 to 16 bits: 3 bytes
- 17 to 21 bits: 4 bytes
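The ranges above can be checked with a short Python sketch. The helper name utf8_len is made up for this illustration, and the sample characters are arbitrary picks from each range; the result is compared against Python's own UTF-8 encoder rather than anything MySQL does.

```python
def utf8_len(code_point: int) -> int:
    """Bytes needed to encode one code point in UTF-8 (hypothetical helper).

    Follows the bit-count ranges listed above, capped at U+10FFFF
    as required by RFC 3629.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    bits = code_point.bit_length()
    if bits <= 7:
        return 1
    if bits <= 11:
        return 2
    if bits <= 16:
        return 3
    return 4  # 17 to 21 bits


# One sample character per range: ASCII, Latin-1, BMP, supplementary plane.
for ch in ["A", "é", "€", "😀"]:
    # Cross-check the table against Python's built-in encoder.
    assert utf8_len(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} needs {utf8_len(ord(ch))} byte(s)")
```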
The original UTF-8 spec allows encoding up to 31-bit Unicode values, taking as many as 6 bytes to encode in UTF-8 form. After UTF-8 became popular, the Unicode Consortium declared that they will never use code points beyond 2^21 - 1. This is now standardized as RFC 3629.
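MySQL's utf8 character set (as of version 5.6) only supports the Unicode Basic Multilingual Plane characters, for which UTF-8 needs up to 3 bytes per character. (The separate utf8mb4 character set, added in MySQL 5.5.3, covers the remaining planes at up to 4 bytes per character.) That means the current answer to your question, for a utf8 column, is that your TEXT field can hold at least 21,844 characters.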
Depending on how you look at it, the actual limits are higher or lower than that:
- If you assume, as I do, that the BMP limitation will eventually be lifted in MySQL or one of its forks, you shouldn't count on being able to store more than 16,383 characters in that field if your MySQL client allows arbitrary Unicode text input.
- On the other hand, you may be able to exploit the fact that UTF-8 is a variable width encoding. If you know your text is mostly plain English with just the occasional non-ASCII character, your effective in-practice limit could approach the maximum 64 KB - 1 character limit.
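Since the limit is in bytes, the practical test is to measure the encoded length of the text rather than count characters. Here is a minimal Python sketch of that check; the names TEXT_MAX_BYTES and fits_in_text are invented for this example, and it only counts UTF-8 bytes on the client side without talking to MySQL. It also assumes the column's character set can store every character in the string (4-byte characters would not be storable in a BMP-only character set).

```python
TEXT_MAX_BYTES = 65_535  # TEXT column capacity in bytes, as stated above


def fits_in_text(s: str, max_bytes: int = TEXT_MAX_BYTES) -> bool:
    """Return True if the UTF-8 encoding of s is at most max_bytes long."""
    return len(s.encode("utf-8")) <= max_bytes


# Mostly ASCII with a few 2-byte characters: 62,000 characters, 64,000 bytes.
mostly_english = "a" * 60_000 + "é" * 2_000
print(len(mostly_english), fits_in_text(mostly_english))

# Only 3-byte characters: 30,000 characters already need 90,000 bytes.
all_three_byte = "€" * 30_000
print(len(all_three_byte), fits_in_text(all_three_byte))

# Only 4-byte characters: 16,383 fit (65,532 bytes), one more does not.
print(fits_in_text("😀" * 16_383), fits_in_text("😀" * 16_384))
```

Checking the byte length before the insert is one way to avoid relying on a fixed character count, since the real capacity depends on the mix of characters in each value.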