Why does iteration over an inclusive range generate longer assembly in Rust?
These two loops are equivalent in C++ and Rust:
#include <cstdint>
uint64_t sum1(uint64_t n) {
uint64_t sum = 0;
for (uint64_t j = 0; j <= n; ++j) {
sum += 1;
}
return sum;
}
pub fn sum1(num: u64) -> u64 {
let mut sum: u64 = 0;
for j in 0u64..=num {
sum += 1;
}
return sum;
}
However the C++ version generates a very terse assembly:
sum1(unsigned long): # @sum1(unsigned long)
xor eax, eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
add rax, 1
cmp rax, rdi
jbe .LBB0_1
ret
while Rust's version is very long with two checks in the loop instead of one:
example::sum1:
xor eax, eax
xor ecx, ecx
.LBB0_1:
mov rdx, rcx
cmp rcx, rdi
adc rcx, 0
add rax, 1
cmp rdx, rdi
jae .LBB0_3
cmp rcx, rdi
jbe .LBB0_1
.LBB0_3:
ret
Godbolt: https://godbolt.org/z/xYW94qxjK
What is Rust intrinsically trying to prevent that C++ is carefree about?
Overflow in the iterator state.
The C++ version will loop forever when given a large enough input:
#include <cstdint>
[[gnu::noinline]]
std::uint64_t sum1(std::uint64_t n) {
std::uint64_t sum = 0;
for (std::uint64_t j = 0; j <= n; ++j) {
sum += 1;
}
return sum;
}
#include <iostream>
int main() {
std::cout << sum1(UINT64_C(0xffffffff'ffffffff)) << std::endl;
return 0;
}
This is because when the loop counter j
finally reaches 0xffffffff'ffffffff
, incrementing it wraps around to 0, which means the loop invariant j <= n
is always fulfilled and the loop never exits.
Strictly speaking, invoking sum1
with 0xffffffff'ffffffff
infamously triggers undefined behaviour, though not because of overflow itself, but since infinite loops without externally-visible side effects are UB ([intro.progress]/1). This is why I applied the [[gnu::noinline]]
attribute to the function to prevent the compiler from taking ‘advantage’ of that in optimisation passes.
The Rust version, on the other hand, is not only perfectly well-defined, but iterates exactly as many times as the cardinality of the range:
use std::num::Wrapping;
fn sum1(num: u64) -> u64 {
let mut sum = Wrapping(0);
for _ in 0..=num {
sum += Wrapping(1);
}
return sum.0;
}
fn main() {
println!("{}", sum1(0xffffffff_ffffffff));
}
The above program (slightly modified to avoid getting bogged down in debug versus release mode differences with respect to the summation) will terminate after exactly 18446744073709551616 iterations and print 0; the Rust version carefully maintains iterator state so that overflow does not happen in the iterator.
@user3840170 correctly identified the difference: overflow check in Rust, and not in C++.
Still, the question remains as to why there are 2 checks per loop in the Rust version instead of 1, and the answer to that is that LLVM is not sufficiently smart and/or the RangeInclusive
design is not well adapted to LLVM1.
The optimal code generation for short loops, is to split the loop, transforming:
for j in 0u64..=num {
sum += 1;
}
Into:
for j in 0u64..num { // equivalent to for (auto j = 0; j < num; ++j)
sum += 1;
}
if 0 <= num {
sum += 1;
}
This would lead to having a single branch in the main loop, and allow LLVM to switch this to a closed formula2.
The Loop Splitting optimization applies to RangeInclusive
and most other Chain
iterators, as indeed a RangeInclusive
can be thought of as a chain of a once iterator and half-exclusive range iterator (in either order). It is not always a win: like inlining, it implies duplicating the code of the loop body, which if large may lead to a significant overhead in code size.
Unfortunately, LLVM fails to split the loop. I am not sure if it's because the optimization is missing altogether, or it just fails to apply it here for some reason. I'm looking forward to the rustc_codegen_gcc
backend, as GCC 7 added Loop Splitting to GCC, and it may be able to generate better code there.
1See this comment I left over performance issues with RangeInclusive
. I spent significant time banging my head over the issue in 2019, and I dearly wish RangeInclusive
(and all ranges) were NOT Iterator
; it's a costly mistake.
2Chain iterators, in general, perform much better using .for_each(...)
, specifically because there the loop is (manually) split. See the playground for (0..=num).for_each(|_| sum += 1)
being reduced to num + 1
.
These two loops are equivalent in C++ and Rust
Your two code snippets don't share the same behavior. for (uint64_t j = 0; j <= n; ++j)
doesn't handle n == uint64_t::MAX
(make it infinite looping) while for j in 0u64..=num
do (will never go into an infinite loop).
A rust equivalent code could be:
pub fn sum1(num: u64) -> u64 {
let mut sum: u64 = 0;
let mut j = 0;
while j <= num {
sum = sum.wrapping_add(1);
j = j.wrapping_add(1);
}
sum
}
currently produce the following asm godbolt:
example::sum1:
xor eax, eax
.LBB0_1:
add rax, 1
cmp rax, rdi
jbe .LBB0_1
ret