Why does adding inline assembly comments cause such a radical change in GCC's generated code?

Solution 1:

The interactions with optimisations are explained about halfway down the "Assembler Instructions with C Expression Operands" page in the documentation.

GCC doesn't try to understand any of the actual assembly inside the asm; the only thing it knows about the content is what you (optionally) tell it in the output and input operand specification and the register clobber list.

In particular, note:

An asm instruction without any output operands will be treated identically to a volatile asm instruction.

and

The volatile keyword indicates that the instruction has important side-effects [...]

So the presence of the asm inside your loop has inhibited a vectorisation optimisation, because GCC assumes it has side effects.
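A minimal sketch of the situation, assuming GCC or Clang (the function names here are made up for illustration and are not from the question). The two functions differ only by an assembler comment in the loop body. Compiling with something like `gcc -O3 -S` and comparing the two loops should show the difference, though the exact output depends on compiler version and target.

```c
#include <stddef.h>

/* Plain loop: nothing stops the compiler from vectorising this. */
void add_plain(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Same loop with an assembler comment in the body. The asm has no
   output operands, so GCC treats it as volatile: it assumes the
   statement has side effects and keeps it once per iteration. */
void add_marked(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        __asm__("/* loop body marker */");
        dst[i] += src[i];
    }
}
```

Both functions compute the same result; only the generated code is expected to differ.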

Solution 2:

Note that gcc vectorized the code, splitting the loop body into two parts, the first processing 16 items at a time, and the second doing the remainder later.

As Ira commented, the compiler doesn't parse the asm block, so it does not know that it's just a comment. Even if it did, it has no way of knowing what you intended. The optimized loops have the body duplicated: should it put your asm in each copy? Would you mind if it were no longer executed exactly 1000 times? It doesn't know, so it goes the safe route and falls back to the simple single loop.

Solution 3:

I don't agree with the claim that "gcc doesn't understand what is in the asm() block". For example, gcc can deal quite well with optimising parameters, and even with re-arranging asm() blocks so that they intermingle with the code generated from C. This is why, if you look at inline assembler in, for example, the Linux kernel, it is nearly always prefixed with __volatile__ to ensure that the compiler "doesn't move the code around". I have had gcc move my "rdtsc" around, which threw off my measurements of the time it took to do certain things.
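As a rough sketch of the rdtsc case, assuming an x86 target and GCC-style extended asm (`read_tsc` and `ticks_for` are hypothetical names used only for illustration):

```c
#include <stdint.h>

/* Read the x86 time-stamp counter. rdtsc returns the low half in EAX
   and the high half in EDX, so both are declared as outputs. Without
   __volatile__ the compiler may treat two reads as identical (the asm
   has no inputs) and merge or drop one of them. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: TSC ticks spent inside work(). */
uint64_t ticks_for(void (*work)(void))
{
    uint64_t start = read_tsc();
    work();
    uint64_t end = read_tsc();
    return end - start;
}
```

Note that the GCC documentation says even a volatile asm can still be moved relative to other code, so careful timing code often adds a "memory" clobber or a serialising instruction as well. This sketch leaves those out.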

As documented, gcc treats certain types of asm() blocks as "special", and thus doesn't optimise the code either side of the block.

That's not to say that gcc won't, sometimes, get confused by inline assembler blocks, or simply decide to give up on some particular optimisation because it can't follow the consequences of the assembler code. More importantly, it can often be misled by missing clobber tags. If you use an instruction like cpuid, which changes the values of EAX through EDX, but you wrote the operand list as if only EAX were used, the compiler may keep things in EBX, ECX and EDX, and then your code acts very strangely when those registers are overwritten. If you are lucky, it crashes immediately, and then it's easy to figure out what is going on. But if you are unlucky, it crashes way down the line. Another tricky one is the divide instruction, which gives a second result in EDX. If you don't care about the modulo, it's easy to forget that EDX was changed.
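To illustrate the divide case, here is a sketch assuming an x86 target and the default AT&T assembler syntax (`quotient_only` is a made-up name). Even though the caller only wants the quotient, EDX is listed as a read-write operand so the compiler knows its value changes:

```c
#include <stdint.h>

/* Unsigned 32-bit division via the x86 `div` instruction.
   div divides EDX:EAX by the operand, leaving the quotient in EAX and
   the remainder in EDX. The remainder is unused here, but EDX must
   still appear in the operand list ("+d"), otherwise the compiler may
   keep a live value in it across the asm.
   Precondition: divisor != 0 (div raises a fault otherwise). */
uint32_t quotient_only(uint32_t dividend, uint32_t divisor)
{
    uint32_t quot = dividend; /* EAX: low half of dividend in, quotient out */
    uint32_t rem = 0;         /* EDX: high half of dividend in, remainder out */
    __asm__("divl %[d]"
            : "+a"(quot), "+d"(rem)
            : [d] "r"(divisor)
            : "cc");
    (void)rem;                /* deliberately ignored */
    return quot;
}
```

The same idea applies to cpuid: every register the instruction writes should show up either as an output operand or in the clobber list.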