Do I need to cast to unsigned char before calling toupper(), tolower(), et al.?
Solution 1:
Yes, the argument to toupper needs to be converted to unsigned char to avoid the risk of undefined behavior.
The types char, signed char, and unsigned char are three distinct types. char has the same range and representation as either signed char or unsigned char. (Plain char is very commonly signed and able to represent values in the range -128..+127.)
The toupper function takes an int argument and returns an int result. Quoting the C standard, section 7.4 paragraph 1:
In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
(C++ incorporates most of the C standard library, and defers its definition to the C standard.)
The [] indexing operator on std::string returns a reference to char. If plain char is a signed type, and if the value of name[0] happens to be negative, then the expression
toupper(name[0])
has undefined behavior.
The language guarantees that, even if plain char is signed, all members of the basic character set have non-negative values, so given the initialization
string name = "Niels Stroustrup";
the program doesn't risk undefined behavior. But yes, in general a char value passed to toupper (or to any of the functions declared in <cctype> / <ctype.h>) needs to be converted to unsigned char, so that the implicit conversion to int won't yield a negative value and cause undefined behavior.
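As a minimal sketch of that conversion, the wrappers below cast to unsigned char before the call and back to char afterwards. The names to_upper_char and to_upper_copy are hypothetical helpers made up for this illustration, not part of any library.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: uppercases one char without ever passing a
// negative value to std::toupper.
char to_upper_char(char c)
{
    // The cast to unsigned char guarantees the int argument is in 0..UCHAR_MAX.
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Hypothetical helper: returns an uppercased copy of s.
std::string to_upper_copy(std::string s)
{
    for (char& c : s) {
        c = to_upper_char(c);
    }
    return s;
}
```

The same pattern applies to isupper, islower, and the other <cctype> functions.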
The <ctype.h> functions are commonly implemented using a lookup table. Something like:
// assume plain char is signed
char c = -2;
c = toupper(c); // undefined behavior
may index outside the bounds of that table.
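To make the out-of-bounds risk concrete, here is a rough sketch of what a table-driven implementation might look like. It is not taken from any real library: table_toupper, make_upper_table, and kTableSize are invented names, and the sketch assumes that EOF is -1 and that the character set is ASCII.

```cpp
#include <array>
#include <climits>
#include <cstdio>

// Assumption for this sketch only: EOF is -1, so it maps to slot 0.
static_assert(EOF == -1, "this sketch assumes EOF is -1");

// One slot for EOF plus one slot for each value in 0..UCHAR_MAX.
constexpr int kTableSize = UCHAR_MAX + 2;

// Builds the table; assumes ASCII, where 'a'..'z' are contiguous.
std::array<int, kTableSize> make_upper_table()
{
    std::array<int, kTableSize> t{};
    t[0] = EOF;
    for (int v = 0; v <= UCHAR_MAX; ++v) {
        t[v + 1] = (v >= 'a' && v <= 'z') ? v - 'a' + 'A' : v;
    }
    return t;
}

// Hypothetical table-driven toupper. There is deliberately no range check:
// an argument such as -2 would index slot -1, which is outside the table.
int table_toupper(int c)
{
    static const std::array<int, kTableSize> table = make_upper_table();
    return table.data()[c + 1];
}
```

With this layout every value the standard permits (EOF and 0..UCHAR_MAX) lands inside the table, and nothing else does.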
Note that converting to unsigned:
char c = -2;
c = toupper((unsigned)c); // undefined behavior
doesn't avoid the problem. If int is 32 bits, converting the char value -2 to unsigned yields 4294967294. This is then implicitly converted to int (the parameter type), which probably yields -2.
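The difference between the two casts can be inspected with a small sketch. The function names via_unsigned and via_unsigned_char are hypothetical and exist only for this comparison.

```cpp
#include <climits>

// Value produced by the cast that does not help. When plain char is signed,
// a negative char wraps to a value close to UINT_MAX.
unsigned via_unsigned(char c)
{
    return static_cast<unsigned>(c);
}

// Value produced by the recommended cast. The result is always in
// 0..UCHAR_MAX, so it is a valid argument for the <cctype> functions.
int via_unsigned_char(char c)
{
    return static_cast<unsigned char>(c);
}
```

On an implementation where plain char is signed and 8 bits wide, a char holding -2 becomes UINT_MAX - 1 through the first function and 254 through the second.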
toupper can be implemented so it behaves sensibly for negative values (accepting all values from CHAR_MIN to UCHAR_MAX), but it's not required to do so. Furthermore, the functions in <ctype.h> are required to accept an argument with the value EOF, which is typically -1.
The C++ standard makes adjustments to some C standard library functions. For example, strchr and several other functions are replaced by overloaded versions that enforce const correctness. There are no such adjustments for the functions declared in <cctype>.
Solution 2:
The reference is talking about the value being representable as an unsigned char, not about the argument having the type unsigned char. That is, the behavior is undefined if the actual value is not between 0 and UCHAR_MAX (typically 255), unless it is EOF. (Accepting EOF is basically the reason the function takes an int instead of a char.)
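That rule can be written down as a small predicate. This is only a sketch; is_valid_ctype_arg is a hypothetical name, not a standard function.

```cpp
#include <climits>
#include <cstdio>

// Hypothetical check: true if v may legally be passed to toupper, tolower,
// isupper, and the other <cctype> functions.
bool is_valid_ctype_arg(int v)
{
    return v == EOF || (v >= 0 && v <= UCHAR_MAX);
}
```

Any int for which this returns false leads to undefined behavior when passed to those functions.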
Solution 3:
In C, toupper (and many other functions) takes an int even though you'd expect it to take a char. Additionally, char is signed on some platforms and unsigned on others.
The advice to cast to unsigned char before calling toupper is correct for C. I can't find anything specific on whether it's needed in C++.
If you want to sidestep the issue, use the toupper defined in <locale>. It's a template, and takes any acceptable character type. You also have to pass it a std::locale. If you don't have any idea which locale to choose, use std::locale(""), which is supposed to be the user's preferred locale:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
int main()
{
    std::string name("Bjarne Stroustrup");
    std::string uppercase;
    std::locale loc("");
    std::transform(name.begin(), name.end(), std::back_inserter(uppercase),
                   [&loc](char c) { return std::toupper(c, loc); });
    std::cout << name << '\n' << uppercase << '\n';
    return 0;
}
Solution 4:
Sadly, Stroustrup was careless :-(
And yes, the codes of Latin letters should be non-negative (so no cast is required for them)...
Some implementations work correctly without a cast to unsigned char...
In my experience, it can cost several hours to find the cause of a segfault in such a toupper call (even when you know that a segfault is there)...
And the same goes for isupper, islower, etc.