Do compilers produce better code for do-while loops versus other types of loops?

There's a comment in the zlib compression library (which is used in the Chromium project among many others) which implies that a do-while loop in C generates "better" code on most compilers. Here is the snippet of code where it appears.

do {
} while (*(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
         *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
         *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
         *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
         scan < strend);
/* The funny "do {}" generates better code on most compilers */

https://code.google.com/p/chromium/codesearch#chromium/src/third_party/zlib/deflate.c&l=1225

Is there any evidence that most (or any) compilers would generate better (e.g. more efficient) code?

Update: Mark Adler, one of the original authors, gave a bit of context in the comments.


First of all:

A do-while loop is not the same as a while-loop or a for-loop.

  • while and for loops may not run the loop body at all.
  • A do-while loop always runs the loop body at least once - it skips the initial condition check.

So that's the logical difference. That said, not everyone strictly adheres to this. It is quite common for while or for loops to be used even when it is guaranteed that it will always loop at least once. (Especially in languages with foreach loops.)

So to avoid comparing apples and oranges, I'll proceed assuming that the loop will always run at least once. Furthermore, I won't mention for loops again since they are essentially while loops with a bit of syntax sugar for a loop counter.

So I'll be answering the question:

If a while loop is guaranteed to loop at least once, is there any performance gain from using a do-while loop instead.


A do-while skips the first condition check. So there is one less branch and one less condition to evaluate.

If the condition is expensive to check, and you know you're guaranteed to loop at least once, then a do-while loop could be faster.

And while this is considered a micro-optimization at best, it is one that the compiler can't always do: Specifically when the compiler is unable to prove that the loop will always enter at least once.


In other words, a while-loop:

while (condition){
    body
}

Is effectively the same as this:

if (condition){
    do{
        body
    }while (condition);
}

If you know that you will always loop at least once, that if-statement is extraneous.


Likewise at the assembly level, this is roughly how the different loops compile to:

do-while loop:

start:
    body
    test
    conditional jump to start

while-loop:

    test
    conditional jump to end
start:
    body
    test
    conditional jump to start
end:

Note that the condition has been duplicated. An alternate approach is:

    unconditional jump to end
start:
    body
end:
    test
    conditional jump to start

... which trades away the duplicate code for an additional jump.

Either way, it's still worse than a normal do-while loop.

That said, compilers can do what they want. And if they can prove that the loop always enters once, then it has done the work for you.


But things are bit weird for the particular example in the question because it has an empty loop body. Since there is no body, there's no logical difference between while and do-while.

FWIW, I tested this in Visual Studio 2012:

  • With the empty body, it does actually generate the same code for while and do-while. So that part is likely a remnant of the old days when compilers weren't as great.

  • But with a non-empty body, VS2012 manages to avoid duplication of the condition code, but still generates an extra conditional jump.

So it's ironic that while the example in the question highlights why a do-while loop could be faster in the general case, the example itself doesn't seem to give any benefit on a modern compiler.

Considering how old the comment was, we can only guess at why it would matter. It's very possible that the compilers at the time weren't capable of recognizing that the body was empty. (Or if they did, they didn't use the information.)


Is there any evidence that most (or any) compilers would generate better (e.g. more efficient) code?

Not much, unless you look at the actual generated assembly of an actual, specific compiler on a specific platform with some specific optimization settings.

This was probably worth worrying about decades ago (when ZLib has been written), but certainly not nowadays, unless you found, by real profiling, that this removes a bottleneck from your code.


In a nutshell (tl;dr):

I'm interpreting the comment in OPs' code a little differently, I think the "better code" they claim to have observed was due to moving the actual work into the loop "condition". I completely agree however that it's very compiler specific and that the comparison they made, while being able to produce a slightly different code, is mostly pointless and probably obsolete, as I show below.


Details:

It's hard to say what the original author meant by his comment about this do {} while producing better code, but i'd like to speculate in another direction than what was raised here - we believe that the difference between do {} while and while {} loops is pretty slim (one less branch as Mystical said), but there's something even "funnier" in this code and that's putting all the work inside this crazy condition, and keeping the internal part empty (do {}).

I've tried the following code on gcc 4.8.1 (-O3), and it gives an interesting difference -

#include "stdio.h" 
int main (){
    char buf[10];
    char *str = "hello";
    char *src = str, *dst = buf;

    char res;
    do {                            // loop 1
        res = (*dst++ = *src++);
    } while (res);
    printf ("%s\n", buf);

    src = str;
    dst = buf;
    do {                            // loop 2
    } while (*dst++ = *src++);
    printf ("%s\n", buf);

    return 0; 
}

After compiling -

00000000004003f0 <main>:
  ... 
; loop 1  
  400400:       48 89 ce                mov    %rcx,%rsi
  400403:       48 83 c0 01             add    $0x1,%rax
  400407:       0f b6 50 ff             movzbl 0xffffffffffffffff(%rax),%edx
  40040b:       48 8d 4e 01             lea    0x1(%rsi),%rcx
  40040f:       84 d2                   test   %dl,%dl
  400411:       88 16                   mov    %dl,(%rsi)
  400413:       75 eb                   jne    400400 <main+0x10>
  ...
;loop 2
  400430:       48 83 c0 01             add    $0x1,%rax
  400434:       0f b6 48 ff             movzbl 0xffffffffffffffff(%rax),%ecx
  400438:       48 83 c2 01             add    $0x1,%rdx
  40043c:       84 c9                   test   %cl,%cl
  40043e:       88 4a ff                mov    %cl,0xffffffffffffffff(%rdx)
  400441:       75 ed                   jne    400430 <main+0x40>
  ...

So the first loop does 7 instructions while the second does 6, even though they're supposed to do the same work. Now, I can't really tell if there's some compiler smartness behind this, probably not and it's just coincidental but I haven't checked how it interacts with other compiler options this project might be using.


On clang 3.3 (-O3) on the other hand, both loops generate this 5 instructions code :

  400520:       8a 88 a0 06 40 00       mov    0x4006a0(%rax),%cl
  400526:       88 4c 04 10             mov    %cl,0x10(%rsp,%rax,1)
  40052a:       48 ff c0                inc    %rax
  40052d:       48 83 f8 05             cmp    $0x5,%rax
  400531:       75 ed                   jne    400520 <main+0x20>

Which just goes to show that compilers are quite different, and advancing at a far faster rate than some programmer may have anticipated several years ago. It also means that this comment is pretty meaningless and probably there because no one had ever checked if it still makes sense.


Bottom line - if you want to optimize to the best possible code (and you know how it should look like), do it directly in assembly and cut the "middle-man" (compiler) from the equation, but take into account that newer compilers and newer HW might make this optimization obsolete. In most cases it's far better to just let the compiler do that level of work for you, and focus on optimizing the big stuff.

Another point that should be made - instruction count (assuming this is what the original OPs' code was after), is by no means a good measurement for code efficiency. Not all instructions were created equal, and some of them (simple reg-to-reg moves for e.g.) are really cheap as they get optimized by the CPU. Other optimization might actually hurt CPU internal optimizations, so eventually only proper benchmarking counts.