Why shouldn’t I use Unicode characters to simulate typographic styles (such as small caps or script)?

Unicode contains various characters that look like typographically stylised variants of characters of the basic Latin alphabet and that allow one to write texts in the corresponding typographic styles without resorting to mark-up or similar. For example, one can simulate:

  • Small caps:

    ʙᴇʜᴏʟᴅ ᴛʜɪꜱ ꜰᴀɴᴄɪʟy ᴇɴᴄᴏᴅᴇᴅ ᴛᴇxᴛ.

  • Script:

    𝓑𝓮𝓱𝓸𝓵𝓭 𝓽𝓱𝓲𝓼 𝓯𝓪𝓷𝓬𝓲𝓵𝔂 𝓮𝓷𝓬𝓸𝓭𝓮𝓭 𝓽𝓮𝔁𝓽.

  • Blackletter:

    𝕭𝖊𝖍𝖔𝖑𝖉 𝖙𝖍𝖎𝖘 𝖋𝖆𝖓𝖈𝖎𝖑𝖞 𝖊𝖓𝖈𝖔𝖉𝖊𝖉 𝖙𝖊𝖝𝖙.

This met interest on Stack Exchange (e.g., here, here, and here) and criticism of such techniques was made. But what can go wrong when I use them?


General

Those characters are not intended for regular Latin-alphabet text but for phonetics, Cyrillic-alphabet text, for use as mathematical symbols (representing variables), or similar. The only Unicode-compliant way to encode text in the basic Latin alphabet is to use the characters predominantly used for this purpose (i.e., from the Basic Latin Unicode block).

As with many other standards, you should think twice about violating Unicode. Moreover, Unicode comprises so many writing systems, use cases, and stuff that just exists for backwards compatibility with other standards1 that fully understanding all its motivations is a science of its own. Long story short, unless you really really know what you are doing, it is extremely likely that something breaks that you haven’t even remotely thought of.

Specific examples

Accessibility

Encoded text does not only exist to be rendered in some font. It can also be interpreted, e.g., by screen readers. And a screen reader should not need to guess whether

𝓽𝓱𝓮

is meant to be the definite article or the mathematical product2 of the variables 𝓽, 𝓱, and 𝓮 – which is what those characters are made for. The best behaviour will therefore be that it spells out these characters, e.g. saying literally the following:

bold script small t, bold script small h, bold script small e

It should not just say “the” instead because then it would not properly read mathematical texts whose symbols happen to form a pronouncable word.3

Portability

If your text is nicely rendered on your machine, this does not mean that it will also be on the reader’s one. The most obvious example is that the reader does not have any font supporting these characters or the text is rendered by a software not supporting fallback fonts. Admittedly, this is becoming increasingly less common. Keep in mind though that some people like dyslexics need special fonts which are less likely to support these characters.

But even if the reader’s machine only uses a different font, this may make the text considerably less readable. For a first example, this is 𝓉𝒽ℯ rendered with two different fonts:

𝓉𝒽ℯ rendered with FreeSerif and STIX

Free Serif renders the text as you would probably want it to be rendered when using special characters to simulate text, namely simulating handwriting with a continuous stroke. However, these characters are made for use as mathematical symbols, connecting which makes no sense. Hence the rendering by STIX, which is specifically designed for mathematical purposes, is more in line with how these characters are intended to be used.

In a second example, suppose you or the reader italicise “сᴜт мy вᴀʀ” for some reason. With a good font, you will get4:

сᴜт мy вᴀʀ rendered in upright and italics. italics reads as "cum my bar"

The reason for this is that the small caps were (partially) simulated with Cyrillic letters, and Cyrillic italics sometimes look very different from their upright counterparts. So again, this is the proper behaviour.

Searchability

As a first example, consider what you would want a reasonable search to do with the character 𝒲 (mathematical script W). Assume that the search has two modes, the default mode and the exact mode (usually called case-sensitive). This character should be:

  • found when searching for w or W in default mode – for those who do not want to bother entering or copy-pasting the special character into the search field;

  • found when searching for 𝒲 in exact mode – for those who want to search where the corresponding variable is mentioned in a mathematical document³;

  • not found when searching for 𝓌, w or W in exact mode on account of breaking a search similar to the above.

However if you use this character to simulate regular text, it should be found when searching for W or 𝒲 in exact mode, which is in conflict with the above.

As a second example consider that Cyrillic characters should never be found when searching Latin characters and vice versa, as they completely different things. However if use Cyrillic characters to simulate Latin small caps, you need this to happen, if you do not want searchability to be broken. This would lead to people find a lot of useless stuff if they search for a rare Latin-alphabet word that just so happens to correspond the faux small caps of some popular Cyrillic-alphabet word (and vice versa).

An exact search option cannot solve this problem, as this is reserved for other purposes in those alphabets.

In general, it is impossible to build a search (without an insane amount of options) that is not broken by using special characters to simulate styled Latin text.


1 You know that XKCD about the inevitable failure of unifying standards? Well, Unicode succeeded.
2 or whatever the empty operator is in the pertinent convention
3 I am aware that very few mathematical texts nowadays support this encoding or something compatible to it but the point is that some day they hopefully do. Your Unicode-abusing text may still be around and read then.
4 Unless you are localising for Macedonian or Serbian, in which you will get different but still undesirable result.


What can go wrong? Well, I see this:

enter image description here

in Firefox 50.1.0 on Windows 7.

The problem of missing glyphs, in this case on a mobile device, is further illustrated in an image given by user Chris Kent in a comment, which I have cropped and resized from the original:

enter image description here

And user oals kindly contributed another example:

enter image description here


I'm having an XY problem with this.

Y and X appear smaller than the rest of the text

Here, we see that Y and X appear smaller than the rest of the text. At certain zoom levels they appear to be the same size, but this appears to have exposed a problem with these particular glyphs in this particular font.


Using non-Latin characters that kinda-sorta look like Latin ones puts you in the company of spammers, pornmongers, and who-knows-what-they're-up-to dissemblers that want their text unsearchable, unindexable, and repudiatable. ("I never said it was safe!! I said it was sigma-alpha-integral-sign-epislon!!! Can't sue me!!!")

If you're comfy in that club, then go for it.