Unicode characters being replaced by question marks after copy and paste on Windows
I've had this odd issue for years, ever since Windows 8 (I'm now on Windows 10, if I remember right). The problem only seems to affect my work computer; my personal computers don't have it. I didn't think about asking for help at first because I found a workaround, which I'll explain later, but I think enough is enough.
Basically, whenever I copy and paste Unicode text (Japanese, Arabic, etc.), it appears as question marks on paste. Here's an example Japanese text that I'll copy:
何これ?!意味わかない!
And here's what it looks like after pasting:
????!??????!
Interestingly, if I copy the exact same text at least one more time, it pastes properly:
何これ?!意味わかない!
Removing even a single character from the selection before copying causes the issue to "reset".
Copying twice was my workaround. It's not difficult to do, but I tend to forget because my other PCs work fine. It adds more steps and wastes precious seconds.
The problem is system-wide and affects all the programs and apps I use.
Any idea how to fix this permanently? Any help will be highly appreciated.
I've "suffered" from this issue for years and I never knew the fix was so dead simple until Sanny mentioned "locale" in a comment above (thanks, Sanny!). Haha! Anyway, here's how to fix it if you run into the same issue I did:
This applies to Windows 10 (build 15002) but it may be similar to older (or newer) versions of Windows.
- Go to the Region settings in the Control Panel. There are several ways to do this; here are a few of them:
  - In the Search bar (Cortana) on the taskbar, search for "Control Panel". In the Control Panel, click on Change date, time, or number formats under Clock, Language and Region in category view, or on Region in icon list view.
  - Windows 10 only: In the Search bar again, search for "region & language settings". This will open the Region & Language page in the Settings app. Scroll down until you find Additional date, time, & region settings. You may then select Region in the Control Panel window that opens.
- Open the Administrative tab and click on the Change system locale button. Choose a locale that is different to your current locale. I went with Japanese. I think choosing the language you will copy-paste often would be best, though it may be the same regardless. Acknowledge the change with OK.
- The system will ask you to restart, which you need to do for the change to take effect.
- After restarting, test whether copy-paste now works as intended. If it does, you may repeat the steps above and switch back to the locale you actually need to use.
That's it! Enjoy copy-pasting! ;)
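If you want to confirm which legacy code page the system locale currently maps to, before and after the change, a small Python sketch like the one below should do it. It assumes Python 3 is installed on the machine; `ansi_code_page` is a hypothetical helper name, and the only Windows call it uses is `GetACP` from kernel32.

```python
import ctypes
import sys


def ansi_code_page():
    """Return the active ANSI code page number on Windows, or None elsewhere."""
    if sys.platform != "win32":
        return None
    # GetACP reports the ANSI code page, which follows the system locale.
    return ctypes.windll.kernel32.GetACP()


code_page = ansi_code_page()
print("ANSI code page:", code_page)
```

A Japanese system locale normally reports 932 and a Western European one 1252, so the number should change after switching the locale and restarting.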
Modern Windows handles text internally as Unicode (UTF-16), so it doesn't seem to make sense that you have to change your locale to fix the issue.
The ????? indicates that the Unicode text is not being recognized as such and is being converted to a character set that cannot represent it (rather than being misread as a different charset, which would produce garbled characters instead), perhaps somewhere between the program and the clipboard.
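One way to see where question marks like these can come from is to push the sample text through a legacy code page that has no Japanese characters. The sketch below only illustrates that kind of lossy conversion in Python; it assumes a Western European code page (cp1252) as the target and does not claim to be what the Windows clipboard actually does.

```python
text = "何これ?!意味わかない!"

# Encoding to a Western European code page replaces every character
# that the code page cannot represent with "?".
lossy = text.encode("cp1252", errors="replace").decode("cp1252")
print(lossy)

# A Japanese code page can represent the text, so it survives the round trip.
round_trip = text.encode("cp932").decode("cp932")
print(round_trip == text)
```

The ASCII punctuation survives because every code page involved shares it; only the characters outside the target code page are lost.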
But this does look like an actual bug: it seems the OS treated the text as legacy single-byte ("ANSI") text the first time, and then handled it correctly as Unicode on the second try. The Unicode world is complex. Windows works in UTF-16 internally, while files and the web mostly use UTF-8, a variable-length encoding that uses one byte for ASCII and two to four bytes for everything else. Storing everything in UTF-16 or UTF-32 would double or quadruple the size of plain ASCII text, which is one reason UTF-8 is so common in practice. On top of that, legacy functions and documents written for older code pages still have to be converted to and from Unicode.
What I surmise was going on here is that the OS 'guessed' incorrectly that the text was in the legacy code page and used a separate (or default) code path to copy and paste it; once it 'realized' the text was Unicode, it used that instead. Although Unicode is the standard, most languages also have several legacy encodings that an OS has to deal with, and determining which one is in use without knowing it ahead of time is pretty hard.
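To put rough numbers on the storage point, the following Python sketch counts how many bytes the same strings need in UTF-8, UTF-16, and UTF-32. The sample strings and the `sizes` dictionary are only illustrative choices.

```python
samples = {
    "ascii": "Hello, world!",
    "japanese": "何これ?!意味わかない!",
}

sizes = {}
for label, text in samples.items():
    # The "-le" variants are used so no byte order mark is added to the count.
    sizes[label] = {
        enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }
    print(label, sizes[label])
```

Plain ASCII text doubles in UTF-16 and quadruples in UTF-32, while the Japanese sample is actually somewhat smaller in UTF-16 than in UTF-8, so which encoding is "cheaper" depends on the text.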
From a computer's perspective, a character is just a pre-determined pattern of 1s and 0s, and there is no objective way of knowing which character a given byte stands for without knowing the encoding. The byte 01000001 is an 'A' in ASCII, but a byte like 11100000 is an 'à' in Windows-1252 and an א in the Hebrew code page Windows-1255. Unicode changed all of that: each character has a unique code point (U+0041 for 'A', U+05D0 for א), so there is no ambiguity once the text is known to be Unicode.
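The ambiguity of raw bytes can be shown with a few lines of Python. This is a small sketch using the byte 0xE0 (binary 11100000), which two Windows code pages assign to different characters; the variable names are made up for the example.

```python
raw = bytes([0b11100000])  # the single byte 0xE0

# The same byte decodes to different characters depending on the code page.
as_western = raw.decode("cp1252")  # Western European
as_hebrew = raw.decode("cp1255")   # Hebrew
print(as_western, as_hebrew)

# In Unicode, each character has its own code point, whatever the byte encoding.
alef = "\u05d0"  # HEBREW LETTER ALEF
print(hex(ord("A")), hex(ord(alef)))

# UTF-8 then stores that code point as a two-byte sequence.
alef_utf8 = alef.encode("utf-8")
print(alef_utf8)
```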
So the misbehaving copy and paste probably has to do with legacy, non-Unicode (code page based) functionality. Upgrade and it should solve the problem!