What is the idea behind ^= 32, which converts lowercase letters to uppercase and vice versa?
I was solving a problem on Codeforces. Normally I first check whether the character is an uppercase or lowercase English letter, then subtract or add 32
to convert it to the corresponding letter. But I found someone use ^= 32
to do the same thing. Here it is:
#include <iostream>
using namespace std;

int main() {
    char foo = 'a';
    foo ^= 32;
    char bar = 'A';
    bar ^= 32;
    cout << foo << ' ' << bar << '\n'; // foo is A, and bar is a
}
I have searched for an explanation of this and couldn't find one. So why does this work?
Solution 1:
Let's take a look at ASCII code table in binary.
A 1000001 a 1100001
B 1000010 b 1100010
C 1000011 c 1100011
...
Z 1011010 z 1111010
And 32 is 0100000,
which is the only bit that differs between the uppercase and lowercase letters. So toggling that bit toggles the case of a letter.
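To see those bit patterns for yourself, here is a minimal standalone demo (my addition, not part of the original answer; it assumes an ASCII platform):

#include <bitset>
#include <iostream>

int main()
{
    std::cout << std::bitset<8>('A') << '\n';                     // 01000001
    std::cout << std::bitset<8>('a') << '\n';                     // 01100001
    std::cout << std::bitset<8>(32) << '\n';                      // 00100000
    std::cout << char('A' ^ 32) << ' ' << char('a' ^ 32) << '\n'; // a A
}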
Solution 2:
This uses the fact that ASCII values have been chosen by really smart people.
foo ^= 32;
This flips the 6th lowest bit¹ of foo
(the uppercase flag of ASCII, sort of), transforming an ASCII uppercase letter into a lowercase one and vice versa.
+---+------------+------------+
|   | Upper case | Lower case |      32 is 00100000
+---+------------+------------+
| A |  01000001  |  01100001  |
| B |  01000010  |  01100010  |
| . |     ...    |     ...    |
| Z |  01011010  |  01111010  |
+---+------------+------------+
Example
'A' ^ 32
    01000001 'A'
XOR 00100000 32
    --------
    01100001 'a'
And since XOR with the same mask is its own inverse, 'a' ^ 32 == 'A' as well.
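Both directions can be checked at compile time (my addition; plain static_assert needs C++17, and the values hold only on ASCII platforms):

static_assert(('A' ^ 32) == 'a');
static_assert(('a' ^ 32) == 'A');
static_assert((('A' ^ 32) ^ 32) == 'A'); // toggling twice restores the original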
Notice
C++ is not required to use ASCII to represent characters; EBCDIC is another encoding you may encounter. This trick only works on ASCII platforms. A more portable solution is std::tolower
and std::toupper
, with the added bonus of being locale-aware (it does not automagically solve all your problems though, see comments):
#include <cassert>
#include <locale>

bool case_insensitive_equal(char lhs, char rhs)
{
    // std::locale{} optional, enables locale-awareness
    return std::tolower(lhs, std::locale{}) == std::tolower(rhs, std::locale{});
}

int main() { assert(case_insensitive_equal('A', 'a')); }
1) As 32 is 1 << 5 (2 to the power of 5), it flips the 6th bit (counting from 1).
Solution 3:
Allow me to say that this is -- although it seems smart -- a really, really stupid hack. If someone recommends this to you in 2019, hit him. Hit him as hard as you can.
You can, of course, do it in your own software that you and nobody else uses if you know that you will never use any language but English anyway. Otherwise, no go.
The hack was arguably "OK" some 30-35 years ago, when computers didn't really do much besides English in ASCII, and maybe one or two major European languages. But... no longer so.
The hack works because US-Latin uppercase and lowercase letters are exactly 0x20
apart and appear in the same order, so they differ in just one bit, which is exactly the bit this hack toggles.
Now, the people creating code pages for Western Europe, and later the Unicode consortium, were smart enough to keep this scheme for e.g. German umlauts and French accented vowels. Not so for ß, which (until someone convinced the Unicode consortium in 2017, and a large Fake News print magazine wrote about it, actually convincing the Duden -- no comment on that) didn't even exist as a capital; it uppercases to SS. Now a capital ẞ does exist, but the two are 0x1DBF
positions apart, not 0x20
.
The implementors were, however, not considerate enough to keep this going. If you apply the hack to some Eastern European languages, for example (I wouldn't know about Cyrillic), you will get a nasty surprise: all those háček characters are examples where lowercase and uppercase are only one code point apart. The hack thus does not work properly there.
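The distances are easy to verify at compile time (my addition, using char32_t literals for the code points mentioned above):

static_assert(U'\u1E9E' - U'\u00DF' == 0x1DBF); // capital ẞ vs ß
static_assert(U'\u010D' - U'\u010C' == 1);      // č vs Č: 1 apart, not 0x20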
There's much more to consider; for example, some characters do not simply transform from lowercase to uppercase at all (they are replaced with different sequences), or they may change form (requiring different code points).
Do not even think about what this hack will do to stuff like Thai or Chinese (it'll just give you complete nonsense).
Saving a couple of hundred CPU cycles may have been very worthwhile 30 years ago, but nowadays there is really no excuse for not converting a string properly. There are library functions for performing this non-trivial task.
The time taken to convert even several dozen kilobytes of text properly is negligible.
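For instance, here is a minimal sketch using the ICU library (my addition; it assumes icu-uc is installed, e.g. compile with g++ demo.cpp $(pkg-config --cflags --libs icu-uc)):

#include <unicode/unistr.h> // icu::UnicodeString
#include <iostream>
#include <string>

int main()
{
    icu::UnicodeString s(u"straße"); // build from a UTF-16 literal
    s.toUpper();                     // full case mapping, not per-character
    std::string utf8;
    s.toUTF8String(utf8);
    std::cout << utf8 << '\n';       // prints STRASSE: ß expands to SS
}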
Solution 4:
It works because, as it happens, the difference between 'a' and 'A' in ASCII and derived encodings is 32, and 32 is also the place value of the sixth bit (1 << 5). Flipping the 6th bit with an exclusive OR thus converts between uppercase and lowercase.
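If you do rely on this on a known-ASCII platform, it is worth guarding the flip so non-letters pass through untouched. A sketch (toggle_ascii_case is my hypothetical name, not from the answer):

#include <cassert>

// Flips bit 5 (value 32) only for ASCII letters; anything else is returned
// unchanged, since the trick is valid only for 'A'..'Z' and 'a'..'z'.
constexpr char toggle_ascii_case(char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return static_cast<char>(c ^ 32);
    return c;
}

int main()
{
    assert(toggle_ascii_case('q') == 'Q');
    assert(toggle_ascii_case('Q') == 'q');
    assert(toggle_ascii_case('!') == '!'); // not a letter: unchanged
}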