When must we use NVARCHAR/NCHAR instead of VARCHAR/CHAR in SQL Server?
Is there a rule when we must use the Unicode types?
I have seen that most of the European languages (German, Italian, English, ...) are fine in the same database in VARCHAR columns.
I am looking for something like:
- If you have Chinese --> use NVARCHAR
- If you have German and Arabic --> use NVARCHAR
What about the collation of the server/database?
I don't want to always use NVARCHAR, as suggested here: What are the main performance differences between varchar and nvarchar SQL Server data types?
Solution 1:
The real reasons you want to use NVARCHAR are when you have different languages in the same column, when you need to address the columns in T-SQL without decoding, when you want to be able to see the data "natively" in SSMS, or when you want to standardize on Unicode.
If you treat the database as dumb storage, it is perfectly possible to store wide strings and different (even variable-length) encodings in VARCHAR (for instance UTF-8). The problem comes when you attempt to encode and decode, especially if the code page is different for different rows. It also means that SQL Server will not be able to deal with the data easily for purposes of querying within T-SQL on (potentially variably) encoded columns.
Using NVARCHAR avoids all this.
I would recommend NVARCHAR for any column which will have user-entered data in it which is relatively unconstrained.
I would recommend VARCHAR for any column which is a natural key (like a vehicle license plate, SSN, serial number, service tag, order number, airport callsign, etc.) that is typically defined and constrained by a standard, legislation, or convention. Also use VARCHAR for data that is user-entered but very constrained (like a phone number) or for codes (ACTIVE/CLOSED, Y/N, M/F, M/S/D/W, etc.). There is absolutely no reason to use NVARCHAR for those.
So for a simple rule:
- VARCHAR when guaranteed to be constrained
- NVARCHAR otherwise
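As a hedged sketch of that rule (the table and column names here are purely hypothetical):

CREATE TABLE dbo.ServiceRequest
(
    OrderNumber   varchar(20)     NOT NULL,  -- natural key, constrained by a convention
    Status        varchar(6)      NOT NULL,  -- fixed code list: 'ACTIVE' / 'CLOSED'
    PhoneNumber   varchar(20)     NULL,      -- user-entered but very constrained
    CustomerNote  nvarchar(1000)  NULL       -- relatively unconstrained user-entered text
);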
Solution 2:
The two most upvoted answers are wrong. It should have nothing to do with "store different/multiple languages". You can support Spanish characters like ñ, together with English, with just a common varchar field and the Latin1_General_CI_AS COLLATION.
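As a minimal sketch of that claim (the table variable is just scaffolding for the example):

DECLARE @t TABLE (txt varchar(50) COLLATE Latin1_General_CI_AS);
INSERT INTO @t VALUES ('El niño añora la mañana');
SELECT txt FROM @t;
-- returns: El niño añora la mañana  (ñ is part of the Windows-1252 code page)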
Short Version
You should use NVARCHAR/NCHAR whenever the ENCODING, which is determined by the COLLATION of the field, doesn't support the characters you need.
Also, depending on the SQL Server version, you can use specific COLLATIONs, like Latin1_General_100_CI_AS_SC_UTF8, which is available since SQL Server 2019. Setting this collation on a VARCHAR field (or on an entire table/database) makes it use the UTF-8 ENCODING for storing and handling the data in that field, allowing full support of UNICODE characters, and hence any language covered by it.
To FULLY UNDERSTAND:
To fully understand what I'm about to explain, it's mandatory to have the concepts of UNICODE, ENCODING and COLLATION all extremely clear in your head. If you don't, then first take a look below at my humble and simplified explanation in the "What is UNICODE, ENCODING, COLLATION and UTF-8, and how they are related" section and the supplied documentation links. Also, everything I say here is specific to Microsoft SQL Server and how it stores and handles data in char/nchar and varchar/nvarchar fields.
Let's say we want to store a peculiar text in our MSSQL Server database. It could be an Instagram comment such as "I love stackoverflow! 😍".
The plain English part would be perfectly supported even by ASCII, but since there is also an emoji, which is a character specified in the UNICODE standard, we need an ENCODING that supports this Unicode character.
MSSQL Server uses the COLLATION to determine what ENCODING is used on char/nchar/varchar/nvarchar fields. So, contrary to what a lot of people think, COLLATION is not only about sorting and comparing data, but also about ENCODING, and by consequence, about how our data is stored!
SO, HOW DO WE KNOW WHICH ENCODING IS USED BY OUR COLLATION? With this:
SELECT COLLATIONPROPERTY( 'Latin1_General_CI_AI' , 'CodePage' ) AS [CodePage]
--returns 1252
This simple SQL returns the Windows Code Page for a COLLATION. A Windows Code Page is nothing more than another mapping to ENCODINGs. For the Latin1_General_CI_AI COLLATION, it returns Windows Code Page 1252, which maps to the Windows-1252 ENCODING.
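If you want to check an existing column rather than a collation name, something along these lines works (the table name is a placeholder):

SELECT c.name AS ColumnName,
       c.collation_name,
       COLLATIONPROPERTY(c.collation_name, 'CodePage') AS CodePage
FROM   sys.columns AS c
WHERE  c.object_id = OBJECT_ID('dbo.MyTable')  -- placeholder table name
  AND  c.collation_name IS NOT NULL;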
So, for a varchar column with the Latin1_General_CI_AI COLLATION, the field will handle its data using the Windows-1252 ENCODING, and can only correctly store characters supported by that encoding.
If we check the Windows-1252 ENCODING specification (Character List for Windows-1252), we will find out that this encoding won't support our emoji character. And if we still try it out:
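Here is a hedged sketch of that attempt (the table variable is just scaffolding; the exact replacement characters depend on the conversion, but the emoji cannot survive a Windows-1252 code page):

DECLARE @t TABLE (txt varchar(60) COLLATE Latin1_General_CI_AI);
INSERT INTO @t VALUES (N'I love stackoverflow! 😍');
SELECT txt FROM @t;
-- the emoji is not representable in Windows-1252, so it comes back
-- mangled into question mark(s): I love stackoverflow! ??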
OK, SO HOW CAN WE SOLVE THIS?? Actually, it depends, and that is GOOD!
NCHAR/NVARCHAR
Before SQL Server 2019, all we had were NCHAR and NVARCHAR fields. Some say they are UNICODE fields. THAT IS WRONG! Again, it depends on the field's COLLATION and also on the SQL Server version.
Microsoft's "nchar and nvarchar (Transact-SQL)" documentation specifies perfectly:
Starting with SQL Server 2012 (11.x), when a Supplementary Character (SC) enabled collation is used, these data types store the full range of Unicode character data and use the UTF-16 character encoding. If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.
In other words, if we use a SQL Server older than 2012, like SQL Server 2008 R2 for example, the ENCODING for those fields will be UCS-2, which supports only a subset of UNICODE. But if we use SQL Server 2012 or newer and define a COLLATION that has Supplementary Characters (SC) enabled, then our field will use the UTF-16 ENCODING, which fully supports UNICODE.
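A minimal sketch of that, assuming SQL Server 2012 or newer where the SC-enabled collation Latin1_General_100_CI_AS_SC exists:

DECLARE @t TABLE (txt nvarchar(60) COLLATE Latin1_General_100_CI_AS_SC);
INSERT INTO @t VALUES (N'I love stackoverflow! 😍');
SELECT txt FROM @t;
-- the emoji survives: it is stored in UTF-16 as a surrogate pair (4 bytes for that one character)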
BUT WAIT, THERE IS MORE! WE CAN USE UTF-8 NOW!!
CHAR/VARCHAR
Starting with SQL Server 2019, WE CAN USE CHAR/VARCHAR fields and still fully support UNICODE using the UTF-8 ENCODING!!!
From Microsoft's "char and varchar (Transact-SQL)" documentation:
Starting with SQL Server 2019 (15.x), when a UTF-8 enabled collation is used, these data types store the full range of Unicode character data and use the UTF-8 character encoding. If a non-UTF-8 collation is specified, then these data types store only a subset of characters supported by the corresponding code page of that collation.
Again, in other words, if we use a SQL Server older than 2019, like SQL Server 2008 R2 for example, we need to check the ENCODING using the method explained before. But if we use SQL Server 2019 or newer and define a COLLATION like Latin1_General_100_CI_AS_SC_UTF8, then our field will use the UTF-8 ENCODING, which is by far the most widely used and efficient encoding that supports all UNICODE characters.
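A minimal sketch of that, assuming SQL Server 2019 or newer where this UTF-8 collation is available:

DECLARE @t TABLE (txt varchar(60) COLLATE Latin1_General_100_CI_AS_SC_UTF8);
INSERT INTO @t VALUES (N'I love stackoverflow! 😍');
SELECT txt, DATALENGTH(txt) AS Bytes FROM @t;
-- the emoji survives as 4 UTF-8 bytes, while the plain ASCII part stays at 1 byte per character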
Bonus Information:
Regarding the OP's observation that "I have seen that most of the European languages (German, Italian, English, ...) are fine in the same database in VARCHAR columns", I think it's nice to know why that is:
For the most common COLLATIONs, like the default ones such as Latin1_General_CI_AI or SQL_Latin1_General_CP1_CI_AS, the ENCODING for varchar fields will be Windows-1252. If we take a look at its documentation, we can see that it supports:
English, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish. Plus also German, Finnish and French. And Dutch except the IJ character
But as I said before, it's not about language, it's about what characters you expect to support/store, as shown in the emoji example, or in a sentence like "The electric resistance of a lithium battery is 0.5Ω", where we again have plain English plus the Greek letter "omega" (the symbol for resistance in ohms), which won't be correctly handled by the Windows-1252 ENCODING.
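A hedged sketch of that case (the exact substitute character depends on the code page's best-fit mapping, but the omega is lost either way):

DECLARE @t TABLE (txt varchar(60) COLLATE Latin1_General_CI_AI);
INSERT INTO @t VALUES (N'The electric resistance of a lithium battery is 0.5Ω');
SELECT txt FROM @t;
-- the Ω is not part of Windows-1252, so it comes back as a best-fit substitute or '?'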
Conclusion:
So, there it is! When to use char/nchar or varchar/nvarchar depends on the characters that you want to support, and also on the version of your SQL Server, which determines which COLLATIONs, and hence which ENCODINGs, you have available.
What is UNICODE, ENCODING, COLLATION and UTF-8, and how they are related
Note: all the explanations below are simplifications. Please, refer to the supplied documentation links to know all the details about those concepts.
- UNICODE - Is a standard, a convention, that aims to regulate all the characters in a unified and organized table. In this table, every character has a unique number, commonly called the character's code point. UNICODE IS NOT AN ENCODING!
- ENCODING - Is a mapping between a character and a byte sequence. So an encoding is used to "transform" a character into bytes and also the other way around, from bytes back into a character. Among the most popular ones are UTF-8, ISO-8859-1, Windows-1252 and ASCII. You can think of it as a "conversion table" (I really simplified here).
- COLLATION - This one is important. Even Microsoft's documentation doesn't make it as clear as it should be. A collation specifies how your data will be sorted, compared, AND STORED! Yeah, I bet you were not expecting that last one, right!? The collations on SQL Server also determine the ENCODING used on that particular char/nchar/varchar/nvarchar field.
- ASCII ENCODING - Was one of the first encodings. It is both a character table (like its own tiny version of UNICODE) and its byte mappings. So it doesn't map a byte to UNICODE, it maps a byte to its own character table. Also, it only ever uses 7 bits and supports 128 different characters. That was enough to cover all English letters in upper and lower case, numbers, punctuation and some other limited set of characters. The problem with ASCII is that since it only used 7 bits and almost every computer was 8-bit at the time, there were another 128 possible characters to be "explored", and everybody started to map those "available" bytes to their own table of characters, creating a lot of different ENCODINGs.
- UTF-8 ENCODING - This is another ENCODING, one of the most used (if not the most used) ENCODINGs around. It uses a variable byte width (a character can take from 1 to 4 bytes; the original specification allowed up to 6) and fully supports all UNICODE characters.
- Windows-1252 ENCODING - Also one of the most used ENCODINGs, and widely used on SQL Server. It's fixed-size, so every character is always 1 byte. It supports a lot of accented characters from various languages, but doesn't support all of them, nor does it support UNICODE. That's why your varchar field with a common collation like Latin1_General_CI_AS supports á, é, ñ characters, even though it isn't using a UNICODE-capable ENCODING.
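To make the storage difference between these encodings concrete, here is a small sketch comparing byte sizes for the same character (the UTF-8 column requires SQL Server 2019 or newer, so treat that part as optional):

DECLARE @sizes TABLE (
    win1252 varchar(10)  COLLATE Latin1_General_CI_AS,             -- Windows-1252 encoding
    utf16   nvarchar(10) COLLATE Latin1_General_CI_AS,             -- UCS-2/UTF-16 encoding
    utf8    varchar(10)  COLLATE Latin1_General_100_CI_AS_SC_UTF8  -- UTF-8 encoding (SQL Server 2019+)
);
INSERT INTO @sizes VALUES (N'ñ', N'ñ', N'ñ');
SELECT DATALENGTH(win1252) AS Win1252Bytes,  -- 1
       DATALENGTH(utf16)   AS Utf16Bytes,    -- 2
       DATALENGTH(utf8)    AS Utf8Bytes      -- 2
FROM @sizes;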
Resources:
https://blog.greglow.com/2019/07/25/sql-think-that-varchar-characters-if-so-think-again/
https://medium.com/@apiltamang/unicode-utf-8-and-ascii-encodings-made-easy-5bfbe3a1c45a
https://www.johndcook.com/blog/2019/09/09/how-utf-8-works/
https://www.w3.org/International/questions/qa-what-is-encoding
https://en.wikipedia.org/wiki/List_of_Unicode_characters
https://www.fileformat.info/info/charset/windows-1252/list.htm
https://docs.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver15
https://docs.microsoft.com/en-us/sql/t-sql/data-types/nchar-and-nvarchar-transact-sql?view=sql-server-ver15
https://docs.microsoft.com/en-us/sql/t-sql/statements/windows-collation-name-transact-sql?view=sql-server-ver15
https://docs.microsoft.com/en-us/sql/t-sql/statements/sql-server-collation-name-transact-sql?view=sql-server-ver15
https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver15#SQL-collations
SQL Server default character encoding
https://en.wikipedia.org/wiki/Windows_code_page
Solution 3:
You should use NVARCHAR any time you have to store multiple languages. I believe you have to use it for the Asian languages, but don't quote me on that.
Here's the problem: if you take Russian, for example, and store it in a varchar, you will be fine as long as you define the correct code page. But let's say you're using a default English SQL install; then the Russian characters will not be handled correctly. If you were using NVARCHAR they would be handled properly.
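A minimal sketch of that scenario, assuming the usual "English" default collation SQL_Latin1_General_CP1_CI_AS (code page 1252):

DECLARE @t TABLE (
    bad  varchar(20)  COLLATE SQL_Latin1_General_CP1_CI_AS,
    good nvarchar(20) COLLATE SQL_Latin1_General_CP1_CI_AS
);
INSERT INTO @t VALUES (N'Привет', N'Привет');
SELECT bad, good FROM @t;
-- bad comes back as '??????' because Cyrillic is not in code page 1252; good keeps 'Привет'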
Edit
OK, let me quote MSDN. Maybe I was too specific, but you don't want to store more than one code page in a varchar column; while you can, you shouldn't:
When you deal with text data that is stored in the char, varchar, varchar(max), or text data type, the most important limitation to consider is that only information from a single code page can be validated by the system. (You can store data from multiple code pages, but this is not recommended.) The exact code page used to validate and store the data depends on the collation of the column. If a column-level collation has not been defined, the collation of the database is used. To determine the code page that is used for a given column, you can use the COLLATIONPROPERTY function, as shown in the following code examples:
Here's some more:
This example illustrates the fact that many locales, such as Georgian and Hindi, do not have code pages, as they are Unicode-only collations. Those collations are not appropriate for columns that use the char, varchar, or text data type
So Georgian or Hindi really need to be stored as nvarchar. Arabic is also a problem:
Another problem you might encounter is the inability to store data when not all of the characters you wish to support are contained in the code page. In many cases, Windows considers a particular code page to be a "best fit" code page, which means there is no guarantee that you can rely on the code page to handle all text; it is merely the best one available. An example of this is the Arabic script: it supports a wide array of languages, including Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Pashto, Sindhi, Uighur, Urdu, and more. All of these languages have additional characters beyond those in the Arabic language as defined in Windows code page 1256. If you attempt to store these extra characters in a non-Unicode column that has the Arabic collation, the characters are converted into question marks.
Something to keep in mind when you are using Unicode: although you can store different languages in a single column, you can only sort using a single collation. There are some languages that use Latin characters but do not sort like other Latin languages. Accents are a good example of this; I can't remember the exact case, but there was an Eastern European language whose Y didn't sort like the English Y. Then there is the Spanish "ch", which Spanish users traditionally expect to be treated as its own letter, sorting after the other "c" words.
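A hedged sketch of that "ch" behaviour, forcing the collation in the ORDER BY (the expected orderings in the comments assume the Traditional_Spanish collation treats "ch" as its own letter):

DECLARE @words TABLE (w nvarchar(10));
INSERT INTO @words VALUES (N'caza'), (N'chico'), (N'curva'), (N'dedo');
SELECT w FROM @words ORDER BY w COLLATE Latin1_General_CI_AS;       -- caza, chico, curva, dedo
SELECT w FROM @words ORDER BY w COLLATE Traditional_Spanish_CI_AS;  -- caza, curva, chico, dedo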
All in all, given all the issues you have to deal with when handling internationalization, it is my opinion that it's easier to just use Unicode from the start, avoid the extra conversions, and take the space hit. Hence my statement earlier.