Why is code using intermediate variables faster than code without?

Solution 1:

My results were similar to yours: the code that uses intermediate variables was consistently at least 10-20 % faster in Python 3.4. However when I used IPython on the very same Python 3.4 interpreter, I got these results:

In [1]: %timeit -n10000 -r20 tuple(range(2000)) == tuple(range(2000))
10000 loops, best of 20: 74.2 µs per loop

In [2]: %timeit -n10000 -r20 a = tuple(range(2000));  b = tuple(range(2000)); a==b
10000 loops, best of 20: 75.7 µs per loop

Notably, I never managed to get even close to the 74.2 µs for the former when I used -mtimeit from the command line.

So this Heisenbug turned out to be something quite interesting. I decided to run the command with strace and indeed there is something fishy going on:

% strace -o withoutvars python3 -m timeit "tuple(range(2000)) == tuple(range(2000))"
10000 loops, best of 3: 134 usec per loop
% strace -o withvars python3 -mtimeit "a = tuple(range(2000));  b = tuple(range(2000)); a==b"
10000 loops, best of 3: 75.8 usec per loop
% grep mmap withvars|wc -l
46
% grep mmap withoutvars|wc -l
41149

Now that is a good reason for the difference. The code that does not use variables causes the mmap system call be called almost 1000x more than the one that uses intermediate variables.

The withoutvars is full of mmap/munmap for a 256k region; these same lines are repeated over and over again:

mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
munmap(0x7f32e56de000, 262144)          = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
munmap(0x7f32e56de000, 262144)          = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
munmap(0x7f32e56de000, 262144)          = 0

The mmap call seems to be coming from the function _PyObject_ArenaMmap from Objects/obmalloc.c; the obmalloc.c also contains the macro ARENA_SIZE, which is #defined to be (256 << 10) (that is 262144); similarly the munmap matches the _PyObject_ArenaMunmap from obmalloc.c.

obmalloc.c says that

Prior to Python 2.5, arenas were never free()'ed. Starting with Python 2.5, we do try to free() arenas, and use some mild heuristic strategies to increase the likelihood that arenas eventually can be freed.

Thus these heuristics and the fact that Python object allocator releases these free arenas as soon as they're emptied lead to python3 -mtimeit 'tuple(range(2000)) == tuple(range(2000))' triggering pathological behaviour where one 256 kiB memory area is re-allocated and released repeatedly; and this allocation happens with mmap/munmap, which is comparatively costly as they're system calls - furthermore, mmap with MAP_ANONYMOUS requires that the newly mapped pages must be zeroed - even though Python wouldn't care.

The behaviour is not present in the code that uses intermediate variables, because it is using slightly more memory and no memory arena can be freed as some objects are still allocated in it. That is because timeit will make it into a loop not unlike

for n in range(10000)
    a = tuple(range(2000))
    b = tuple(range(2000))
    a == b

Now the behaviour is that both a and b will stay bound until they're *reassigned, so in the second iteration, tuple(range(2000)) will allocate a 3rd tuple, and the assignment a = tuple(...) will decrease the reference count of the old tuple, causing it to be released, and increase the reference count of the new tuple; then the same happens to b. Therefore after the first iteration there are always at least 2 of these tuples, if not 3, so the thrashing doesn't occur.

Most notably it cannot be guaranteed that the code using intermediate variables is always faster - indeed in some setups it might be that using intermediate variables will result in extra mmap calls, whereas the code that compares return values directly might be fine.


Someone asked that why this happens, when timeit disables garbage collection. It is indeed true that timeit does it:

Note

By default, timeit() temporarily turns off garbage collection during the timing. The advantage of this approach is that it makes independent timings more comparable. This disadvantage is that GC may be an important component of the performance of the function being measured. If so, GC can be re-enabled as the first statement in the setup string. For example:

However, the garbage collector of Python is only there to reclaim cyclic garbage, i.e. collections of objects whose references form cycles. It is not the case here; instead these objects are freed immediately when the reference count drops to zero.

Solution 2:

The first question here has to be, is it reproducable? For some of us at least it definitely is though other people say they aren't seeing the effect. This on Fedora, with the equality test changed to is as actually doing a comparison seems irrelevant to the result, and the range pushed up to 200,000 as that seems to maximise the effect:

$ python3 -m timeit "a = tuple(range(200000));  b = tuple(range(200000)); a is b"
100 loops, best of 3: 7.03 msec per loop
$ python3 -m timeit "a = tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 10.2 msec per loop
$ python3 -m timeit "tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 10.2 msec per loop
$ python3 -m timeit "a = b = tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 9.99 msec per loop
$ python3 -m timeit "a = b = tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 10.2 msec per loop
$ python3 -m timeit "tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 10.1 msec per loop
$ python3 -m timeit "a = tuple(range(200000));  b = tuple(range(200000)); a is b"
100 loops, best of 3: 7 msec per loop
$ python3 -m timeit "a = tuple(range(200000));  b = tuple(range(200000)); a is b"
100 loops, best of 3: 7.02 msec per loop

I note that variations between the runs, and the order in which the expressions are run make very little difference to the result.

Adding assignments to a and b into the slow version doesn't speed it up. In fact as we might expect assigning to local variables has negligible effect. The only thing that does speed it up is splitting the expression entirely in two. The only difference this should be making is that it reduces the maximum stack depth used by Python while evaluating the expression (from 4 to 3).

That gives us the clue that the effect is related to stack depth, perhaps the extra level pushes the stack across into another memory page. If so we should see that making other changes that affect the stack will change (most likely kill the effect), and in fact that is what we see:

$ python3 -m timeit -s "def foo():
   tuple(range(200000)) is tuple(range(200000))" "foo()"
100 loops, best of 3: 10 msec per loop
$ python3 -m timeit -s "def foo():
   tuple(range(200000)) is tuple(range(200000))" "foo()"
100 loops, best of 3: 10 msec per loop
$ python3 -m timeit -s "def foo():
   a = tuple(range(200000));  b = tuple(range(200000)); a is b" "foo()"
100 loops, best of 3: 9.97 msec per loop
$ python3 -m timeit -s "def foo():
   a = tuple(range(200000));  b = tuple(range(200000)); a is b" "foo()"
100 loops, best of 3: 10 msec per loop

So, I think the effect is entirely due to how much Python stack is consumed during the timing process. It is still weird though.