Do compilers produce better code for do-while loops versus other types of loops?
There's a comment in the zlib compression library (which is used in the Chromium project among many others) which implies that a do-while loop in C generates "better" code on most compilers. Here is the snippet of code where it appears.
do {
} while (*(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
*(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
*(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
*(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
scan < strend);
/* The funny "do {}" generates better code on most compilers */
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/zlib/deflate.c&l=1225
Is there any evidence that most (or any) compilers would generate better (e.g. more efficient) code?
Update: Mark Adler, one of the original authors, gave a bit of context in the comments.
First of all:
A do-while loop is not the same as a while-loop or a for-loop.
- while and for loops may not run the loop body at all.
- A do-while loop always runs the loop body at least once - it skips the initial condition check.
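To make the logical difference concrete, here is a minimal sketch. The function names are made up for illustration; each one counts how many times its loop body runs for a given starting value.

```c
/* Hypothetical helpers: count how often the loop body executes. */

/* while: the condition is checked before the first iteration. */
int while_body_runs(int remaining) {
    int runs = 0;
    while (remaining > 0) {
        runs++;
        remaining--;
    }
    return runs;
}

/* do-while: the body runs once before the condition is ever checked. */
int do_while_body_runs(int remaining) {
    int runs = 0;
    do {
        runs++;
        remaining--;
    } while (remaining > 0);
    return runs;
}
```

With a starting value of 0 the while version never enters the body, whereas the do-while version runs it once. For any positive starting value the two agree.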
So that's the logical difference. That said, not everyone strictly adheres to this. It is quite common for while or for loops to be used even when it is guaranteed that it will always loop at least once. (Especially in languages with foreach loops.)
So to avoid comparing apples and oranges, I'll proceed assuming that the loop will always run at least once. Furthermore, I won't mention for loops again since they are essentially while loops with a bit of syntax sugar for a loop counter.
So I'll be answering the question:
If a while loop is guaranteed to loop at least once, is there any performance gain from using a do-while loop instead?
A do-while skips the first condition check. So there is one fewer branch and one fewer condition to evaluate.
If the condition is expensive to check, and you know you're guaranteed to loop at least once, then a do-while loop could be faster.
And while this is considered a micro-optimization at best, it is one that the compiler can't always do: specifically, when the compiler is unable to prove that the loop will always be entered at least once.
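The "one fewer condition" claim can be checked by instrumenting the condition. This is only a sketch: `checks` and `keep_going` are hypothetical names introduced here, and the do-while variant assumes the caller guarantees n >= 1.

```c
/* Hypothetical instrumentation: counts evaluations of the loop condition. */
static int checks;

static int keep_going(int i, int n) {
    checks++;           /* record one evaluation of the condition */
    return i < n;
}

/* Runs n iterations with a while loop; returns the number of checks. */
int checks_with_while(int n) {
    int i = 0;
    checks = 0;
    while (keep_going(i, n)) {
        i++;
    }
    return checks;
}

/* Runs n iterations with a do-while loop (assumes n >= 1). */
int checks_with_do_while(int n) {
    int i = 0;
    checks = 0;
    do {
        i++;
    } while (keep_going(i, n));
    return checks;
}
```

For n iterations the while version evaluates the condition n + 1 times and the do-while version n times. Whether that single evaluation matters depends entirely on how expensive the condition is.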
In other words, a while-loop:
while (condition){
    body
}
Is effectively the same as this:
if (condition){
    do{
        body
    }while (condition);
}
If you know that you will always loop at least once, that if-statement is extraneous.
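As a rough illustration of that equivalence, the following sketch sums an array three ways. All function names are invented for this example, and the last variant is only valid when the caller guarantees n >= 1.

```c
/* Plain while loop: handles n == 0 correctly. */
long sum_while(const int *a, int n) {
    long total = 0;
    int i = 0;
    while (i < n) {
        total += a[i];
        i++;
    }
    return total;
}

/* The same loop written as an if-guarded do-while. */
long sum_guarded_do_while(const int *a, int n) {
    long total = 0;
    int i = 0;
    if (i < n) {
        do {
            total += a[i];
            i++;
        } while (i < n);
    }
    return total;
}

/* Guard dropped: only correct if the caller guarantees n >= 1. */
long sum_unguarded_do_while(const int *a, int n) {
    long total = 0;
    int i = 0;
    do {
        total += a[i];
        i++;
    } while (i < n);
    return total;
}
```

The first two behave identically for every n >= 0. The third saves the guard, but calling it with n == 0 would read a[0] even though the array is logically empty.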
Likewise, at the assembly level, this is roughly what the different loops compile to:
do-while loop:
start:
body
test
conditional jump to start
while-loop:
test
conditional jump to end
start:
body
test
conditional jump to start
end:
Note that the condition has been duplicated. An alternate approach is:
unconditional jump to end
start:
body
end:
test
conditional jump to start
... which trades away the duplicate code for an additional jump.
Either way, it's still worse than a normal do-while loop.
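The three layouts above can be mimicked in C with goto, which makes the jumps explicit. This is a sketch for illustration only: the function names are hypothetical, and a real compiler is free to emit something different.

```c
/* do-while layout: body, test, conditional jump to start. */
int count_do_while(int n) {
    int iterations = 0;
start:
    iterations++;               /* body */
    n--;
    if (n > 0) goto start;      /* test + conditional jump to start */
    return iterations;
}

/* while layout with the test duplicated in front of the loop. */
int count_while_duplicated_test(int n) {
    int iterations = 0;
    if (!(n > 0)) goto end;     /* test + conditional jump to end */
start:
    iterations++;               /* body */
    n--;
    if (n > 0) goto start;      /* test + conditional jump to start */
end:
    return iterations;
}

/* while layout with an unconditional jump to a single test. */
int count_while_jump_to_test(int n) {
    int iterations = 0;
    goto test;                  /* unconditional jump to the test */
start:
    iterations++;               /* body */
    n--;
test:
    if (n > 0) goto start;      /* test + conditional jump to start */
    return iterations;
}
```

For n >= 1 all three return the same count. For n == 0 the two while layouts return 0, while the do-while layout still runs the body once.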
That said, compilers can do what they want. And if a compiler can prove that the loop is always entered at least once, then it has done the work for you.
But things are a bit weird for the particular example in the question because it has an empty loop body. Since there is no body, there's no logical difference between while and do-while.
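A small sketch of why the empty body removes the difference: when all the work lives in the condition, both forms evaluate that condition exactly the same number of times. The function names are hypothetical, and dst is assumed to be large enough to hold src including its terminator.

```c
#include <stddef.h>

/* Copies src (including the '\0') with an empty-bodied do-while.
   Returns the number of bytes copied. */
size_t copy_do_while(char *dst, const char *src) {
    const char *begin = src;
    do {
    } while ((*dst++ = *src++) != '\0');
    return (size_t)(src - begin);
}

/* The same copy with an empty-bodied while. */
size_t copy_while(char *dst, const char *src) {
    const char *begin = src;
    while ((*dst++ = *src++) != '\0') {
    }
    return (size_t)(src - begin);
}
```

In both versions the first thing that executes is the condition, so there is no skipped initial check for the do-while form to benefit from.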
FWIW, I tested this in Visual Studio 2012:
- With the empty body, it does actually generate the same code for while and do-while. So that part is likely a remnant of the old days when compilers weren't as great.
- But with a non-empty body, VS2012 manages to avoid duplication of the condition code, but still generates an extra conditional jump.
So it's ironic that while the example in the question highlights why a do-while loop could be faster in the general case, the example itself doesn't seem to give any benefit on a modern compiler.
Considering how old the comment was, we can only guess at why it would matter. It's very possible that the compilers at the time weren't capable of recognizing that the body was empty. (Or if they did, they didn't use the information.)
Is there any evidence that most (or any) compilers would generate better (e.g. more efficient) code?
Not much, unless you look at the actual generated assembly of an actual, specific compiler on a specific platform with some specific optimization settings.
This was probably worth worrying about decades ago (when zlib was written), but certainly not nowadays, unless you have found, by real profiling, that this removes a bottleneck from your code.
In a nutshell (tl;dr):
I'm interpreting the comment in the OP's code a little differently: I think the "better code" they claim to have observed was due to moving the actual work into the loop "condition". I completely agree, however, that it's very compiler-specific and that the comparison they made, while able to produce slightly different code, is mostly pointless and probably obsolete, as I show below.
Details:
It's hard to say what the original author meant by his comment about this do {} while producing better code, but I'd like to speculate in another direction than what was raised here. We believe that the difference between do {} while and while {} loops is pretty slim (one less branch, as Mystical said), but there's something even "funnier" in this code: putting all the work inside this crazy condition and keeping the internal part empty (do {}).
I've tried the following code on gcc 4.8.1 (-O3), and it gives an interesting difference -
#include <stdio.h>
int main (){
    char buf[10];
    char *str = "hello";
    char *src = str, *dst = buf;
    char res;
    do { // loop 1
        res = (*dst++ = *src++);
    } while (res);
    printf ("%s\n", buf);
    src = str;
    dst = buf;
    do { // loop 2
    } while (*dst++ = *src++);
    printf ("%s\n", buf);
    return 0;
}
After compiling -
00000000004003f0 <main>:
...
; loop 1
400400: 48 89 ce mov %rcx,%rsi
400403: 48 83 c0 01 add $0x1,%rax
400407: 0f b6 50 ff movzbl 0xffffffffffffffff(%rax),%edx
40040b: 48 8d 4e 01 lea 0x1(%rsi),%rcx
40040f: 84 d2 test %dl,%dl
400411: 88 16 mov %dl,(%rsi)
400413: 75 eb jne 400400 <main+0x10>
...
;loop 2
400430: 48 83 c0 01 add $0x1,%rax
400434: 0f b6 48 ff movzbl 0xffffffffffffffff(%rax),%ecx
400438: 48 83 c2 01 add $0x1,%rdx
40043c: 84 c9 test %cl,%cl
40043e: 88 4a ff mov %cl,0xffffffffffffffff(%rdx)
400441: 75 ed jne 400430 <main+0x40>
...
So the first loop executes 7 instructions per iteration while the second executes 6, even though they're supposed to do the same work. Now, I can't really tell if there's some compiler smartness behind this. Probably not, and it's just coincidental, but I haven't checked how it interacts with other compiler options this project might be using.
On clang 3.3 (-O3), on the other hand, both loops generate this 5-instruction code:
400520: 8a 88 a0 06 40 00 mov 0x4006a0(%rax),%cl
400526: 88 4c 04 10 mov %cl,0x10(%rsp,%rax,1)
40052a: 48 ff c0 inc %rax
40052d: 48 83 f8 05 cmp $0x5,%rax
400531: 75 ed jne 400520 <main+0x20>
This just goes to show that compilers differ considerably, and that they are advancing far faster than some programmers may have anticipated several years ago. It also means that this comment is pretty meaningless, and is probably still there because no one ever checked whether it still makes sense.
Bottom line: if you want to optimize to the best possible code (and you know what it should look like), do it directly in assembly and cut the "middle-man" (the compiler) out of the equation, but take into account that newer compilers and newer hardware might make this optimization obsolete. In most cases it's far better to just let the compiler do that level of work for you, and focus on optimizing the big stuff.
Another point that should be made: instruction count (assuming this is what the code in the question was after) is by no means a good measure of code efficiency. Not all instructions are created equal, and some of them (simple reg-to-reg moves, for example) are really cheap as they get optimized by the CPU. Other optimizations might actually hurt the CPU's internal optimizations, so in the end only proper benchmarking counts.
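For what it's worth, a very rough timing harness for the two loop styles could look like the sketch below. Everything here is an assumption made for illustration: the function names are hypothetical, `clock()` measures CPU time only coarsely, and an optimizing compiler may hoist or merge the repeated calls, so a serious benchmark needs more care than this.

```c
#include <stddef.h>
#include <time.h>

/* Loop 1 style: the work is in the body, the condition tests a flag. */
size_t copy_explicit_body(char *dst, const char *src) {
    const char *begin = src;
    char res;
    do {
        res = (*dst++ = *src++);
    } while (res);
    return (size_t)(src - begin);
}

/* Loop 2 style: empty body, all the work is in the condition. */
size_t copy_in_condition(char *dst, const char *src) {
    const char *begin = src;
    do {
    } while ((*dst++ = *src++) != '\0');
    return (size_t)(src - begin);
}

/* Calls fn `reps` times and returns the elapsed CPU time in seconds.
   The byte total is written to *sink so the calls have a visible result. */
double time_copy(size_t (*fn)(char *, const char *),
                 char *dst, const char *src, int reps, size_t *sink) {
    size_t total = 0;
    clock_t begin = clock();
    for (int r = 0; r < reps; r++) {
        total += fn(dst, src);
    }
    clock_t end = clock();
    *sink = total;
    return (double)(end - begin) / CLOCKS_PER_SEC;
}
```

Calling time_copy once with each function, on the same input and with a large repetition count, gives two numbers to compare. Any difference should be confirmed over several runs and across optimization levels before drawing conclusions.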