Is the `if` statement redundant before modulo and before assign operations?

If you want to understand what the compiler is doing, you'll need to just pull up some assembly. I recommend this site (I already entered code from the question)): https://godbolt.org/g/FwZZOb.

The first example is more interesting.

int div(unsigned int num, unsigned int num2) {
    if( num >= num2 ) return num % num2;
    return num;
}

int div2(unsigned int num, unsigned int num2) {
    return num % num2;
}

Generates:

div(unsigned int, unsigned int):          # @div(unsigned int, unsigned int)
        mov     eax, edi
        cmp     eax, esi
        jb      .LBB0_2
        xor     edx, edx
        div     esi
        mov     eax, edx
.LBB0_2:
        ret

div2(unsigned int, unsigned int):         # @div2(unsigned int, unsigned int)
        xor     edx, edx
        mov     eax, edi
        div     esi
        mov     eax, edx
        ret

Basically, the compiler will not optimize away the branch, for very specific and logical reasons. If integer division was about the same cost as comparison, then the branch would be pretty pointless. But integer division (which modulus is performed together with typically) is actually very expensive: http://www.agner.org/optimize/instruction_tables.pdf. The numbers vary greatly by architecture and integer size but it typically could be a latency of anywhere from 15 to close to 100 cycles.

By taking a branch before performing the modulus, you can actually save yourself a lot of work. Notice though: the compiler also does not transform the code without a branch into a branch at the assembly level. That's because the branch has a downside too: if the modulus ends up being necessary anyway, you just wasted a bit of time.

There's no way to make a reasonable determination about the correct optimization without knowing the relative frequency with which idx < idx_max will be true. So the compilers (gcc and clang do the same thing) opt to map the code in a relatively transparent way, leaving this choice in the hands of the developer.

So that branch might have been a very reasonable choice.

The second branch should be completely pointless, because comparison and assignment are of comparable cost. That said, you can see in the link that compilers will still not perform this optimization if they have a reference to the variable. If the value is a local variable (as in your demonstrated code) then the compiler will optimize the branch away.

In sum the first piece of code is perhaps a reasonable optimization, the second, probably just a tired programmer.

There are a number of situations where writing a variable with a value it already holds may be slower than reading it, finding out already holds the desired value, and skipping the write. Some systems have a processor cache which sends all write requests to memory immediately. While such designs aren't commonplace today, they used to be quite common since they can offer a substantial fraction of the performance boost that full read/write caching can offer, but at a small fraction of the cost.

Code like the above can also be relevant in some multi-CPU situations. The most common such situation would be when code running simultaneously on two or more CPU cores will be repeatedly hitting the variable. In a multi-core caching system with a strong memory model, a core that wants to write a variable must first negotiate with other cores to acquire exclusive ownership of the cache line containing it, and must then negotiate again to relinquish such control the next time any other core wants to read or write it. Such operations are apt to be very expensive, and the costs will have to be borne even if every write is simply storing the value the storage already held. If the location becomes zero and is never written again, however, both cores can hold the cache line simultaneously for non-exclusive read-only access and never have to negotiate further for it.

In almost all situations where multiple CPUs could be hitting a variable, the variable should at minimum be declared volatile. The one exception, which might be applicable here, would be in cases where all writes to a variable that occur after the start of main() will store the same value, and code would behave correctly whether or not any store by one CPU was visible in another. If doing some operation multiple times would be wasteful but otherwise harmless, and the purpose of the variable is to say whether it needs to be done, then many implementations may be able to generate better code without the volatile qualifier than with, provided that they don't try to improve efficiency by making the write unconditional.

Incidentally, if the object were accessed via pointer, there would be another possible reason for the above code: if a function is designed to accept either a const object where a certain field is zero, or a non-const object which should have that field set to zero, code like the above might be necessary to ensure defined behavior in both cases.

Regards first block of code: this is a micro-optimization based on Chandler Carruth's recommendations for Clang (see here for more info), however it doesn't necessarily hold that it would be a valid micro-optimization in this form (using if rather than ternary) or on any given compiler.

Modulo is a reasonably expensive operation, if the code is being executed often and there is a strong statistical lean to one side or the other of the conditional, the CPU's branch prediction (given a modern CPU) will significantly reduce the cost of the branch instruction.

It seems a bad idea to use the if there, to me.

You are right. Whether or not idx >= idx_max, it will be under idx_max after idx %= idx_max. If idx < idx_max, it will be unchanged, whether the if is followed or not.

While you might think branching around the modulo might save time, real culprit, I'd say, is that when branches are followed, pipelining modern CPU's have to reset their pipeline, and that costs a relative lot of time. Better not to have to follow a branch, than do an integer modulo, which costs roughly as much time as an integer division.

EDIT: It turns out that the modulus is pretty slow vs. the branch, as suggested by others here. Here's a guy examining this exact same question: CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!" (suggested in another SO question linked to in another answer to this question).

This guy writes compilers, and thought it would be faster without the branch; but his benchmarks proved him wrong. Even when the branch was taken only 20% of the time, it tested faster.

Another reason not to have the if: One less line of code to maintain, and for someone else to puzzle out what it means. The guy in the above link actually created a "faster modulus" macro. IMHO, this or an inline function is the way to go for performance-critical applications, because your code will be ever so much more understandable without the branch, but will execute as fast.

Finally, the guy in the above video is planning to make this optimization known to compiler writers. Thus, the if will probably be added for you, if not in the code. Hence, just the mod alone will do, when this comes about.

Is the `if` statement redundant before modulo and before assign operations?

Related

Recent Posts