In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing)

At first glance, I thought the compiler could generate equivalent assembly for both versions with optimization flags activated. When I checked it, I was surprised to see the result:

Source unoptimized.cpp

note: this code is not meant to be executed.

struct bitmap_t
{
    long long width;
} bitmap;

int main(int argc, char** argv)
{
    for (unsigned x = 0 ; x < static_cast<unsigned>(bitmap.width) ; ++x)
    {
        argv[x][0] = '\0';
    }
    return 0;
}

Source optimized.cpp

note: this code is not meant to be executed.

struct bitmap_t
{
    long long width;
} bitmap;

int main(int argc, char** argv)
{
    const unsigned width = static_cast<unsigned>(bitmap.width);
    for (unsigned x = 0 ; x < width ; ++x)
    {
        argv[x][0] = '\0';
    }
    return 0;
}

Compilation

  • $ g++ -s -O3 unoptimized.cpp
  • $ g++ -s -O3 optimized.cpp

Assembly (unoptimized.s)

    .file   "unoptimized.cpp"
    .text
    .p2align 4,,15
.globl main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    .cfi_personality 0x3,__gxx_personality_v0
    movl    bitmap(%rip), %eax
    testl   %eax, %eax
    je  .L2
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L3:
    mov %eax, %edx
    addl    $1, %eax
    movq    (%rsi,%rdx,8), %rdx
    movb    $0, (%rdx)
    cmpl    bitmap(%rip), %eax
    jb  .L3
.L2:
    xorl    %eax, %eax
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
.globl bitmap
    .bss
    .align 8
    .type   bitmap, @object
    .size   bitmap, 8
bitmap:
    .zero   8
    .ident  "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-16)"
    .section    .note.GNU-stack,"",@progbits

Assembly (optimized.s)

    .file   "optimized.cpp"
    .text
    .p2align 4,,15
.globl main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    .cfi_personality 0x3,__gxx_personality_v0
    movl    bitmap(%rip), %eax
    testl   %eax, %eax
    je  .L2
    subl    $1, %eax
    leaq    8(,%rax,8), %rcx
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L3:
    movq    (%rsi,%rax), %rdx
    addq    $8, %rax
    cmpq    %rcx, %rax
    movb    $0, (%rdx)
    jne .L3
.L2:
    xorl    %eax, %eax
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
.globl bitmap
    .bss
    .align 8
    .type   bitmap, @object
    .size   bitmap, 8
bitmap:
    .zero   8
    .ident  "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-16)"
    .section    .note.GNU-stack,"",@progbits

diff

$ diff -uN unoptimized.s optimized.s
--- unoptimized.s   2015-11-24 16:11:55.837922223 +0000
+++ optimized.s 2015-11-24 16:12:02.628922941 +0000
@@ -1,4 +1,4 @@
-   .file   "unoptimized.cpp"
+   .file   "optimized.cpp"
    .text
    .p2align 4,,15
 .globl main
@@ -10,16 +10,17 @@
    movl    bitmap(%rip), %eax
    testl   %eax, %eax
    je  .L2
+   subl    $1, %eax
+   leaq    8(,%rax,8), %rcx
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
 .L3:
-   mov %eax, %edx
-   addl    $1, %eax
-   movq    (%rsi,%rdx,8), %rdx
+   movq    (%rsi,%rax), %rdx
+   addq    $8, %rax
+   cmpq    %rcx, %rax
    movb    $0, (%rdx)
-   cmpl    bitmap(%rip), %eax
-   jb  .L3
+   jne .L3
 .L2:
    xorl    %eax, %eax
    ret

The generated assembly for the optimized version does actually load (lea) the width constant unlike the unoptimized version which computes the width offset at each iteration (movq).

When I'll get time, I eventually post some benchmark on that. Good question.


There is actually insufficient information from your code snippet to be able to tell, and the one thing that I can think of is aliasing. From our point of view, it's pretty clear that you don't want p and bitmap to point to the same location in memory, but the compiler doesn't know that and (because p is of type char*) the compiler has to make this code work even if p and bitmap overlap.

This means in this case that if the loop changes bitmap->width through the pointer p then that has to be seen when re-reading bitmap->width later on, which in turn means that storing it in a local variable would be illegal.

That being said, I believe some compilers will actually sometimes generate two versions of the same code (I have seen circumstantial evidence of this, but never directly sought out information on what the compiler is doing in this case), and quickly check if the pointers alias and run the faster code if it determines it's okay to.

That being said, I stand by my comment about simply measuring the performance of the two versions, my money is on not seeing any consistent performance difference between the two versions of the code.

In my opinion, questions like these are okay if your purpose is to learn about compiler optimization theories and techniques, but is a waste of time (a useless micro-optimization) if your end goal here is to make the program run faster.


Ok, guys, so I've measured, with GCC -O3 (using GCC 4.9 on Linux x64).

Turns out, the second version runs 54% faster!

So, I guess aliasing is the thing, I hadn't thought about it.

[Edit]

I've tried again the first version with all pointers defined with __restrict__, and the results are the same. Weird.. Either aliasing is not the problem, or, for some reason, the compiler doesn't optimize it well even with __restrict__.

[Edit 2]

Ok, I think I was pretty much able to prove that aliasing is the problem. I repeated my original test, this time using an array rather than a pointer:

const std::size_t n = 0x80000000ull;
bitmap->width = n;
static unsigned char d[n*3];
std::size_t i=0;
for (unsigned x = 0;  x < static_cast<unsigned>(bitmap->width);  ++x)
{
    d[i++] = 0xAA;
    d[i++] = 0xBB;
    d[i++] = 0xCC;
}

And measured (had to use "-mcmodel=large" to link it). Then I tried:

const std::size_t n = 0x80000000ull;
bitmap->width = n;
static unsigned char d[n*3];
std::size_t i=0;
unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0;  x < width;  ++x)
{
    d[i++] = 0xAA;
    d[i++] = 0xBB;
    d[i++] = 0xCC;
}

The measure results were the same - Seems like the compiler was able to optimize it by itself.

Then I tried the original codes (with a pointer p), this time when p is of type std::uint16_t*. Again, the results were the same - due to strict aliasing. Then I tried building with "-fno-strict-aliasing", and again saw a difference in time.


Other answers have pointed out that hoisting the pointer operation out of the loop may change defined behaviour due to aliasing rules that allow char to alias anything and hence is not an allowable optimisation for a compiler even though in most cases it is obviously correct to a human programmer.

They have also pointed out that hoisting the operation out of the loop is usually but not always an improvement from a performance point of view and is often a negative from a readability point of view.

I would like to point out that there is often a "third way". Rather than counting up to the number of iterations you want you can count down to zero. This means that the number of iterations is only needed once at the start of the loop, it doesn't have to be stored after that. Better still at the assembler level it often eliminates the need for an explicit comparison as the decrement operation will usually set flags that indicate whether the counter was zero both before (carry flag) and after (zero flag) the decrement.

for (unsigned x = static_cast<unsigned>(bitmap->width);x > 0;  x--)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

Note that this version of the loop gives x values in the range 1..width rather than the range 0..(width-1). That doesn't matter in your case because you aren't actually using x for anything but it's something to be aware of. If you want a count down loop with x values in the range 0..(width-1) you can do.

for (unsigned x = static_cast<unsigned>(bitmap->width); x-- > 0;)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

You can also get rid of the casts in the above examples if you want without worrying about it's impact on comparison rules since all you are doing with bitmap->width is assigning it directly to a variable.