Why does iteration over an inclusive range generate longer assembly in Rust?

These two loops are equivalent in C++ and Rust:

#include <cstdint>
uint64_t sum1(uint64_t n) {  
    uint64_t sum = 0;
    for (uint64_t j = 0; j <= n; ++j) {
        sum += 1;
    }
    return sum;
}

pub fn sum1(num: u64) -> u64 {
    let mut sum: u64 = 0;
    for j in 0u64..=num {
        sum += 1;
    }
    return sum;
}

However the C++ version generates a very terse assembly:

sum1(unsigned long):                               # @sum1(unsigned long)
        xor     eax, eax
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rax, 1
        cmp     rax, rdi
        jbe     .LBB0_1
        ret

while Rust's version is very long with two checks in the loop instead of one:

example::sum1:
        xor     eax, eax
        xor     ecx, ecx
.LBB0_1:
        mov     rdx, rcx
        cmp     rcx, rdi
        adc     rcx, 0
        add     rax, 1
        cmp     rdx, rdi
        jae     .LBB0_3
        cmp     rcx, rdi
        jbe     .LBB0_1
.LBB0_3:
        ret

Godbolt: https://godbolt.org/z/xYW94qxjK

What is Rust intrinsically trying to prevent that C++ is carefree about?

Overflow in the iterator state.

The C++ version will loop forever when given a large enough input:

#include <cstdint>

[[gnu::noinline]]
std::uint64_t sum1(std::uint64_t n) {  
    std::uint64_t sum = 0;
    for (std::uint64_t j = 0; j <= n; ++j) {
        sum += 1;
    }
    return sum;
}

#include <iostream>

int main() {
    std::cout << sum1(UINT64_C(0xffffffff'ffffffff)) << std::endl;

    return 0;
}

This is because when the loop counter j finally reaches 0xffffffff'ffffffff, incrementing it wraps around to 0, which means the loop invariant j <= n is always fulfilled and the loop never exits.

Strictly speaking, invoking sum1 with 0xffffffff'ffffffff infamously triggers undefined behaviour, though not because of overflow itself, but since infinite loops without externally-visible side effects are UB ([intro.progress]/1). This is why I applied the [[gnu::noinline]] attribute to the function to prevent the compiler from taking ‘advantage’ of that in optimisation passes.

The Rust version, on the other hand, is not only perfectly well-defined, but iterates exactly as many times as the cardinality of the range:

use std::num::Wrapping;

fn sum1(num: u64) -> u64 {
    let mut sum = Wrapping(0);
    for _ in 0..=num {
        sum += Wrapping(1);
    }
    return sum.0;
}

fn main() {
    println!("{}", sum1(0xffffffff_ffffffff));
}

The above program (slightly modified to avoid getting bogged down in debug versus release mode differences with respect to the summation) will terminate after exactly 18446744073709551616 iterations and print 0; the Rust version carefully maintains iterator state so that overflow does not happen in the iterator.

@user3840170 correctly identified the difference: overflow check in Rust, and not in C++.

Still, the question remains as to why there are 2 checks per loop in the Rust version instead of 1, and the answer to that is that LLVM is not sufficiently smart and/or the RangeInclusive design is not well adapted to LLVM¹.

The optimal code generation for short loops, is to split the loop, transforming:

for j in 0u64..=num {
    sum += 1;
}

Into:

for j in 0u64..num { // equivalent to for (auto j = 0; j < num; ++j)
    sum += 1;
}

if 0 <= num {
    sum += 1;
}

This would lead to having a single branch in the main loop, and allow LLVM to switch this to a closed formula².

The Loop Splitting optimization applies to RangeInclusive and most other Chain iterators, as indeed a RangeInclusive can be thought of as a chain of a once iterator and half-exclusive range iterator (in either order). It is not always a win: like inlining, it implies duplicating the code of the loop body, which if large may lead to a significant overhead in code size.

Unfortunately, LLVM fails to split the loop. I am not sure if it's because the optimization is missing altogether, or it just fails to apply it here for some reason. I'm looking forward to the rustc_codegen_gcc backend, as GCC 7 added Loop Splitting to GCC, and it may be able to generate better code there.

¹See this comment I left over performance issues with RangeInclusive. I spent significant time banging my head over the issue in 2019, and I dearly wish RangeInclusive (and all ranges) were NOT Iterator; it's a costly mistake.

²Chain iterators, in general, perform much better using .for_each(...), specifically because there the loop is (manually) split. See the playground for (0..=num).for_each(|_| sum += 1) being reduced to num + 1.

These two loops are equivalent in C++ and Rust

Your two code snippets don't share the same behavior. for (uint64_t j = 0; j <= n; ++j) doesn't handle n == uint64_t::MAX (make it infinite looping) while for j in 0u64..=num do (will never go into an infinite loop).

A rust equivalent code could be:

pub fn sum1(num: u64) -> u64 {
    let mut sum: u64 = 0;
    let mut j = 0;
    while j <= num {
        sum = sum.wrapping_add(1);
        j = j.wrapping_add(1);
    }
    sum
}

currently produce the following asm godbolt:

example::sum1:
        xor     eax, eax
.LBB0_1:
        add     rax, 1
        cmp     rax, rdi
        jbe     .LBB0_1
        ret

Why does iteration over an inclusive range generate longer assembly in Rust?

Related

Recent Posts