Array-syntax vs pointer-syntax and code generation?
The quote is just wrong. Pretty tragic that such garbage is still published in this decade. In fact, the C Standard defines x[y]
as *(x+y)
.
The part about lvalues later on the page is completely and utterly wrong too.
IMHO, the best way to use this book is to put it in a recycling bin or burn it.
I've got 2 C files: ex1.c
% cat ex1.c
#include <stdio.h>
int main (void) {
int vector[5] = { 1, 2, 3, 4, 5 };
printf("%d\n", vector[3]);
}
and ex2.c
,
% cat ex2.c
#include <stdio.h>
int main (void) {
int vector[5] = { 1, 2, 3, 4, 5 };
printf("%d\n", *(vector + 3));
}
And I compile both into assembly, and show the difference in generated assembly code
% gcc -S ex1.c; gcc -S ex2.c; diff -u ex1.s ex2.s
--- ex1.s 2018-07-17 08:19:25.425826813 +0300
+++ ex2.s 2018-07-17 08:19:25.441826756 +0300
@@ -1,4 +1,4 @@
- .file "ex1.c"
+ .file "ex2.c"
.text
.section .rodata
.LC0:
Q.E.D.
The C standard very explicitly states (C11 n1570 6.5.2.1p2):
- A postfix expression followed by an expression in square brackets
[]
is a subscripted designation of an element of an array object. The definition of the subscript operator[]
is thatE1[E2]
is identical to(*((E1)+(E2)))
. Because of the conversion rules that apply to the binary+
operator, ifE1
is an array object (equivalently, a pointer to the initial element of an array object) andE2
is an integer,E1[E2]
designates theE2
-th element ofE1
(counting from zero).
Additionally, the as-if rule applies here - if the behaviour of the program is the same, the compiler can generate the same code even though the semantics weren't the same.
The cited passage is quite wrong. The expressions vector[i]
and *(vector+i)
are perfectly identical and can be expected to generate identical code under all circumstances.
The expressions vector[i]
and *(vector+i)
are identical by definition. This is a central and fundamental property of the C programming language. Any competent C programmer understands this. Any author of a book entitled Understand and Using C Pointers must understand this. Any author of a C compiler will understand this. The two fragments will generate identical code not by accident, but because virtually any C compiler will, in effect, translate one form into the other almost immediately, such that by the time it gets to its code generation phase, it won't even know which form had been used initially. (I would be pretty surprised if a C compiler ever generated significantly different code for vector[i]
as opposed to *(vector+i)
.)
And in fact, the cited text contradicts itself. As you noted, the two passages
The notation
vector[i]
generates machine code that starts at locationvector
, movesi
positions from this location, and uses its content.
and
The notation
*(vector+i)
generates machine code that starts at locationvector
, addsi
to the address, and then uses the contents at that address.
say basically the same thing.
His language is eerily similar to that in question 6.2 of the old C FAQ list:
...when the compiler sees the expression
a[3]
, it emits code to start at the location "a
", move three past it, and fetch the character there. When it sees the expressionp[3]
, it emits code to start at the location "p
", fetch the pointer value there, add three to the pointer, and finally fetch the character pointed to.
But of course the key difference here is that a
is an array and p
is a pointer. The FAQ list is talking not about a[3]
versus *(a+3)
, but rather about a[3]
(or *(a+3)
) where a
is an array, versus p[3]
(or *(p+3)
) where p
is a pointer. (Of course those two cases generate different code, because arrays and pointers are different. As the FAQ list explains, fetching an address from a pointer variable is fundamentally different from using the address of an array.)
I think what the original text may be referring to is some optimizations which some compiler may or may not perform.
Example:
for ( int i = 0; i < 5; i++ ) {
vector[i] = something;
}
vs.
for ( int i = 0; i < 5; i++ ) {
*(vector+i) = something;
}
In the first case, an optimizing compiler may detect that the array vector
is iterated over element by element and thus generate something like
void* tempPtr = vector;
for ( int i = 0; i < 5; i++ ) {
*((int*)tempPtr) = something;
tempPtr += sizeof(int); // _move_ the pointer; simple addition of a constant.
}
It might even be able to use the target CPU's pointer post-increment instructions where available.
For the second case, it is "harder" for the compiler to see that the address that's calculated via some "arbitrary" pointer arithmetics expression shows the same property of monotonically advancing a fixed amount in each iteration. It might thus not find the optimization and calculate ((void*)vector+i*sizeof(int))
in each iteration which uses an additional multiplication. In this case there's no (temporary) pointer that gets "moved" but only a temporary address re-calculated.
However, the statement probably does not universally hold for all C compilers in all versions.
Update:
I checked the above example. It appears that without optimizations enabled at least gcc-8.1 x86-64 generates more code (2 extra instructions) for the second (pointer-arithmethics) form than the first (array index).
See: https://godbolt.org/g/7DaPHG
However, with any optimizations turned on (-O
...-O3
) the generated code is the same (length) for both.
The Standard specifies the behavior of arr[i]
when arr
is an array object as being equivalent to decomposing arr
to a pointer, adding i
, and dereferencing the result. Although the behaviors would be equivalent in all Standard-defined cases, there are some cases where compilers process actions usefully even though the Standard does require it, and the handling of arrayLvalue[i]
and *(arrayLvalue+i)
may differ as a consequence.
For example, given
char arr[5][5];
union { unsigned short h[4]; unsigned int w[2]; } u;
int atest1(int i, int j)
{
if (arr[1][i])
arr[0][j]++;
return arr[1][i];
}
int atest2(int i, int j)
{
if (*(arr[1]+i))
*((arr[0])+j)+=1;
return *(arr[1]+i);
}
int utest1(int i, int j)
{
if (u.h[i])
u.w[j]=1;
return u.h[i];
}
int utest2(int i, int j)
{
if (*(u.h+i))
*(u.w+j)=1;
return *(u.h+i);
}
GCC's generated code for test1 will assume that arr[1][i] and arr[0][j] cannot alias, but the generated code for test2 will allow for pointer arithmetic to access the entire array, On the flip side, gcc will recognize that in utest1, lvalue expressions u.h[i] and u.w[j] both access the same union, but it's not sophisticated enough to notice the same about *(u.h+i) and *(u.w+j) in utest2.