Why are arguments which do not match the conversion specifier in printf undefined behavior?
In both C (n1570 7.21.6.1/10) and C++ (by inclusion of the C standard library) it is undefined behavior to provide an argument to printf whose type does not match its conversion specification. A simple example:
printf("%d", 1.9)
The format string specifies an int, while the argument is a floating point type.
This question is inspired by the question of a user who encountered legacy code with an abundance of conversion mismatches which apparently did no harm, cf. undefined behaviour in theory and in practice.
Declaring a mere format mismatch UB seems drastic at first. It is clear that the output can be wrong, depending on things like the exact mismatch, argument types, endianness, possibly stack layout and other issues. As one commenter there pointed out, this also extends to subsequent (or even previous?) arguments. But that is far from general UB. Personally, I have never encountered anything but the expected wrong output.
To venture a guess, I would exclude alignment issues. What I can imagine is that a format string which makes printf expect large data, combined with small actual arguments, could let printf read beyond the stack, but I lack deeper insight into the varargs mechanism and specific printf implementation details to verify that.
I had a quick look at the printf sources, but they are pretty opaque to the casual reader.
Therefore my question: What are the specific dangers of mismatching conversion specifiers and arguments in printf which make it UB?
printf only works as described by the standard if you use it correctly. If you use it incorrectly, the behaviour is undefined. Why should the standard define what happens when you use it wrong?
Concretely, on some architectures floating-point arguments are passed in different registers from integer arguments, so when printf tries to find an int matching the format specifier, it will find garbage in the corresponding register. Since those details are outside the scope of the standard, there is no way to deal with that kind of misbehaviour except to say it's undefined.
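To make this concrete without actually invoking UB, here is a minimal sketch of the two well-defined ways to write the question's example: either convert the argument to the type the specifier names, or change the specifier to match the argument. The helper names format_as_int and format_as_double are made up for illustration, and snprintf is used only so the result can be inspected.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: the argument is converted to int first,
   so it matches the %d specifier (the fraction is truncated). */
int format_as_int(char *out, size_t size, double value)
{
    assert(out != NULL && size > 0);
    return snprintf(out, size, "%d", (int)value);
}

/* Hypothetical helper: the argument stays a double and the
   specifier is changed to one that expects a double. */
int format_as_double(char *out, size_t size, double value)
{
    assert(out != NULL && size > 0);
    return snprintf(out, size, "%.1f", value);
}
```

With the value 1.9 from the question, the first helper produces "1" and the second "1.9". In both cases the callee fetches exactly the type the caller passed.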
For an example of how badly it could go wrong, using a format specifier of "%p" but passing a floating point type could mean that printf tries to read a pointer from a register or stack location which hasn't been set to a valid value and could contain a trap representation, which would cause the program to abort.
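For comparison, a small sketch of a matching "%p" call. The specifier expects a pointer to void, so the pointer is converted explicitly. The helper name format_address is hypothetical, and the text that "%p" produces is implementation-defined, so no particular output is assumed here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: formats the address held in ptr.
   %p expects a void *, hence the explicit conversion. */
int format_address(char *out, size_t size, int *ptr)
{
    assert(out != NULL && size > 0);
    return snprintf(out, size, "%p", (void *)ptr);
}
```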
Some compilers may implement variable-format arguments in a way that allows the types of arguments to be validated; since having a program trap on incorrect usage may be better than possibly having it output seemingly-valid-but-wrong information, some platforms may choose to do that.
Because the behavior of traps is outside the realm of the C Standard, any action which might plausibly trap is classified as invoking Undefined Behavior.
Note that the possibility of implementations trapping on incorrect formatting means that behavior is considered undefined even in cases where the expected type and the actual passed type have the same representation. The exception is that signed and unsigned types of the same rank are interchangeable if the value held is within the range common to both. For example, if a "long" holds 23, it may be output with "%lX" but not with "%X", even if "int" and "long" are the same size.
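The "long holds 23" case can be sketched as follows. This is only an illustration, with a hypothetical helper name. It relies on the interchangeability described above only in spirit: the explicit cast to unsigned long makes the argument match "%lX" exactly, so the call is well defined either way.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: prints a non-negative long in hexadecimal.
   "%lX" names unsigned long; the cast makes the match exact.
   Using "%X" here instead would be a mismatch, even on platforms
   where int and long happen to have the same size. */
int format_long_hex(char *out, size_t size, long value)
{
    assert(out != NULL && size > 0);
    assert(value >= 0);  /* value must be in the range common to both types */
    return snprintf(out, size, "%lX", (unsigned long)value);
}
```

For the value 23 this yields "17".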
Note also that the C89 committee introduced a rule by fiat, which remains to this day, which states that even if "int" and "long" have the same format, the code:
long foo=23;
int *u = (int *)&foo;
(*u)++;
invokes Undefined Behavior since it causes information which was written as type "long" to be read as type "int" (behavior would also be Undefined if it were read as type "unsigned int"). Since a "%X" format specifier would cause data to be read as type "unsigned int", passing the data as type "long" would almost certainly cause it to be stored somewhere as "long" but subsequently read as "unsigned int"; such behavior would very likely violate the aforementioned rule.
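For contrast, a sketch of a way to look at the bytes of a "long" as an "int" that does not run into that rule: copy the bytes with memcpy instead of reading through a pointer of the wrong type. The helper name is hypothetical, and the sketch assumes that equal size implies equal value representation (no padding bits), which holds on mainstream platforms but is not guaranteed by the standard.

```c
#include <string.h>

/* Hypothetical helper: copies the bytes of a long into an int,
   but only when both types have the same size.
   Returns 1 if the copy was made, 0 if the sizes differ. */
int long_bytes_to_int(long source, int *dest)
{
    if (sizeof(long) != sizeof(int)) {
        return 0;  /* e.g. typical 64-bit Unix: long is wider than int */
    }
    memcpy(dest, &source, sizeof *dest);
    return 1;
}
```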
Just to take your example: suppose that your architecture's procedure call standard says that floating-point arguments are passed in floating-point registers. But printf thinks you are passing an integer, because of the %d format specifier. So it expects an argument on the call stack, which isn't there. Now anything can happen.
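The reason printf has to trust the format string can be seen in a toy variadic function. This is a minimal sketch, not how any real printf is implemented: mini_format and its one-letter type string are invented for illustration. The point is that va_arg must be told which type to fetch, and the callee has no other source of type information.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical mini formatter: 'd' means the next argument is an int,
   'f' means it is a double. Each value is followed by one space. */
int mini_format(char *out, size_t size, const char *types, ...)
{
    va_list ap;
    size_t used = 0;

    assert(out != NULL && size > 0);
    out[0] = '\0';
    va_start(ap, types);
    for (const char *t = types; *t != '\0' && used < size; ++t) {
        int n;
        if (*t == 'd') {
            /* Fetches an int; the caller must really have passed one. */
            n = snprintf(out + used, size - used, "%d ", va_arg(ap, int));
        } else if (*t == 'f') {
            /* Fetches a double; the caller must really have passed one. */
            n = snprintf(out + used, size - used, "%.1f ", va_arg(ap, double));
        } else {
            break;  /* unknown type letter: stop */
        }
        if (n < 0) {
            break;  /* formatting error */
        }
        used += (size_t)n;
    }
    va_end(ap);
    return (int)used;
}
```

Calling mini_format(buf, sizeof buf, "df", 7, 1.9) fills buf with "7 1.9 ". Swapping the letters to "fd" with the same arguments would make va_arg fetch the wrong types, which is exactly the mismatch under discussion.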
Any printf format/argument mismatch will cause erroneous output, so you cannot rely on anything once you do that. It is hard to tell which mismatches will have dire consequences beyond garbage output, because that depends completely on the specifics of the platform you are compiling for and the actual details of the printf implementation.
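Because the consequences are so platform-dependent, it can help to have the compiler catch mismatches before the program runs. GCC and Clang check calls to printf itself when format warnings are enabled (-Wformat, which is part of -Wall), and the same check can be requested for your own wrappers with the format attribute. The sketch below assumes one of those compilers; the macro PRINTF_LIKE and the function checked_format are hypothetical names.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* GCC/Clang extension: ask the compiler to check the arguments against
   the format string. On other compilers the macro expands to nothing. */
#if defined(__GNUC__) || defined(__clang__)
#define PRINTF_LIKE(fmt_index, first_arg) \
    __attribute__((format(printf, fmt_index, first_arg)))
#else
#define PRINTF_LIKE(fmt_index, first_arg)
#endif

/* Hypothetical wrapper: parameter 3 is the format, arguments start at 4. */
int checked_format(char *out, size_t size, const char *fmt, ...)
    PRINTF_LIKE(3, 4);

int checked_format(char *out, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(out != NULL && size > 0);
    va_start(ap, fmt);
    n = vsnprintf(out, size, fmt, ap);  /* forwards the argument list */
    va_end(ap);
    return n;
}
```

With this declaration, a call such as checked_format(buf, sizeof buf, "%d", 1.9) should draw a format warning from those compilers, while a matching call like checked_format(buf, sizeof buf, "%d-%s", 42, "ok") compiles cleanly and yields "42-ok".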
Passing invalid arguments to a printf call that has a %s format can cause invalid pointers to be dereferenced. But invalid arguments for simpler types such as int or double can cause alignment errors with similar consequences.