Why is copying a shuffled list much slower?

Copying a shuffled range(10**6) list ten times takes me about 0.18 seconds (these are five runs):

0.175597017661
0.173731403198
0.178601711594
0.180330912952
0.180811964451

Copying the unshuffled list ten times takes me about 0.05 seconds:

0.058402235973
0.0505464636856
0.0509734306934
0.0526022752744
0.0513324916184

Here's my testing code:

from timeit import timeit
import random

a = range(10**6)
random.shuffle(a)    # Remove this for the second test.
a = list(a)          # Just an attempt to "normalize" the list.
for _ in range(5):
    print timeit(lambda: list(a), number=10)

I also tried copying with a[:]; the results were similar (i.e., a big speed difference).

Why the big speed difference? I know and understand the speed difference in the famous "Why is it faster to process a sorted array than an unsorted array?" example, but here my processing has no decisions. It's just blindly copying the references inside the list, no?

I'm using Python 2.7.12 on Windows 10.

Edit: I tried Python 3.5.2 as well now; the results were almost the same (shuffled consistently around 0.17 seconds, unshuffled consistently around 0.05 seconds). Here's the code for that:

from timeit import timeit
import random

a = list(range(10**6))
random.shuffle(a)    # Remove this for the unshuffled test.
a = list(a)
for _ in range(5):
    print(timeit(lambda: list(a), number=10))

The interesting bit is that it depends on the order in which the integers are first created. For example, instead of shuffling, create a random sequence with random.randint:

from timeit import timeit
import random

a = [random.randint(0, 10**6) for _ in range(10**6)]
for _ in range(5):
    print(timeit(lambda: list(a), number=10))

This is as fast as copying your list(range(10**6)) (first and fast example).

However, when you shuffle, your integers are no longer in the order in which they were first created, and that is what makes the copy slow.

A quick intermezzo:

  • All Python objects are on the heap, so every reference to an object is a pointer.
  • Copying a list is a shallow operation.
  • However, Python uses reference counting, so when an object is put in a new container its reference count must be incremented (Py_INCREF in list_slice). Python therefore really needs to go to where the object is; it can't just copy the reference.
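
The third point can be observed directly. Below is a minimal sketch, assuming CPython (sys.getrefcount reports CPython's reference count); the helper name refcount_delta_after_copy is made up for illustration. It copies a one-element list and reports how much the element's reference count changed.

```python
import sys

def refcount_delta_after_copy():
    # Hypothetical helper, for illustration only.
    item = object()              # a fresh object, so nothing else refers to it
    original = [item]
    before = sys.getrefcount(item)
    copied = list(original)      # shallow copy: same object, one more reference
    after = sys.getrefcount(item)
    return after - before, copied[0] is item

delta, same_object = refcount_delta_after_copy()
print(delta, same_object)
```

On a standard CPython build the delta should be 1 and the copied list should hold the very same object, which means the copy had to write to the object itself and not only to the new list.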

So when you copy your list you get each item of that list and put it "as is" in the new list. When your next item was created shortly after the current one there is a good chance (no guarantee!) that it's saved next to it on the heap.

Let's assume that whenever your computer loads an item into the cache it also loads the x next-in-memory items (cache locality). Then your computer can perform the reference count increment for x+1 items while they are all in the cache!

With the shuffled sequence it still loads the next-in-memory items but these aren't the ones next-in-list. So it can't perform the reference-count increment without "really" looking for the next item.

TL;DR: The actual speed depends on what happened before the copy: in what order the items were created and in what order they are in the list.


You can verify this by looking at the id:

CPython implementation detail: This is the address of the object in memory.

a = list(range(10**6, 10**6+100))
for item in a:
    print(id(item))

Just to show a short excerpt:

1496489995888
1496489995920  # +32
1496489995952  # +32
1496489995984  # +32
1496489996016  # +32
1496489996048  # +32
1496489996080  # +32
1496489996112
1496489996144
1496489996176
1496489996208
1496489996240
1496507297840
1496507297872
1496507297904
1496507297936
1496507297968
1496507298000
1496507298032
1496507298064
1496507298096
1496507298128
1496507298160
1496507298192

So these objects are really "next to each other on the heap". With shuffle they aren't:

import random
a = list(range(10**6, 100+10**6))
random.shuffle(a)
last = None
for item in a:
    if last is not None:
        print('diff', id(item) - id(last))
    last = item

Which shows these are not really next to each other in memory:

diff 736
diff -64
diff -17291008
diff -128
diff 288
diff -224
diff 17292032
diff -1312
diff 1088
diff -17292384
diff 17291072
diff 608
diff -17290848
diff 17289856
diff 928
diff -672
diff 864
diff -17290816
diff -128
diff -96
diff 17291552
diff -192
diff 96
diff -17291904
diff 17291680
diff -1152
diff 896
diff -17290528
diff 17290816
diff -992
diff 448
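
The same comparison can be condensed into a single number. This is only a sketch for Python 3 on CPython (where id() is the address): locality_fraction is a hypothetical helper, and the default max_gap of 64 bytes is an assumed cache-line size, not something measured on the machine above.

```python
import random

def locality_fraction(items, max_gap=64):
    # Hypothetical helper: fraction of consecutive list items whose
    # addresses (id() in CPython) are at most max_gap bytes apart.
    # max_gap=64 is an assumed cache-line size.
    if len(items) < 2:
        return 1.0
    close = sum(
        1
        for prev, cur in zip(items, items[1:])
        if abs(id(cur) - id(prev)) <= max_gap
    )
    return close / (len(items) - 1)

ordered = list(range(10**6, 10**6 + 10000))
shuffled = ordered[:]
random.shuffle(shuffled)
print(locality_fraction(ordered), locality_fraction(shuffled))
```

The exact values depend on the allocator and the interpreter version, but if the explanation above holds, the first number should be much closer to 1 than the second.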

Important note:

I haven't thought this up myself. Most of the information can be found in Ricky Stewart's blog post.

This answer is based on the "official" CPython implementation of Python. The details in other implementations (Jython, PyPy, IronPython, ...) may be different. Thanks @JörgWMittag for pointing this out.


When you shuffle the list items, they have worse locality of reference, leading to worse cache performance.

You might think that copying the list just copies the references, not the objects, so their locations on the heap shouldn't matter. However, copying still involves accessing each object in order to modify the refcount.
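
If locality really is the cause, then putting the shuffled items back into address order should restore it. Here is a minimal sketch under that assumption, for Python 3 on CPython only (it relies on id() being the memory address); reorder_by_address is a hypothetical name, not an existing function.

```python
import random
from timeit import timeit

def reorder_by_address(items):
    # Hypothetical helper: the same objects, ordered by memory address
    # (id() is the address in CPython only).
    return sorted(items, key=id)

a = list(range(10**6))
random.shuffle(a)
b = reorder_by_address(a)

print('shuffled  ', timeit(lambda: list(a), number=10))
print('by address', timeit(lambda: list(b), number=10))
```

Whether the second timing gets close to the unshuffled one will depend on the machine and on how the allocator laid out the objects, so treat the numbers as an experiment rather than a guarantee.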


As explained by others, copying doesn't just copy the references; it also increases the reference counts inside the objects, so the objects are accessed and the cache plays a role.

Here I just want to add more experiments, not so much about shuffled vs. unshuffled (where accessing one element might miss the cache but bring the following elements into the cache so they get hit), but about repeated elements, where later accesses of the same element might hit the cache because the element is still there.

Testing a normal range (this is Python 2, where range returns a list):

>>> from timeit import timeit
>>> a = range(10**7)
>>> [timeit(lambda: list(a), number=100) for _ in range(3)]
[5.1915339142808925, 5.1436351868889645, 5.18055115701749]

A list of the same size but with just one element repeated over and over again is faster because it hits the cache all the time:

>>> a = [0] * 10**7
>>> [timeit(lambda: list(a), number=100) for _ in range(3)]
[4.125743135926939, 4.128927210087596, 4.0941229388550795]

And it doesn't seem to matter what number it is:

>>> a = [1234567] * 10**7
>>> [timeit(lambda: list(a), number=100) for _ in range(3)]
[4.124106479141709, 4.156590225249886, 4.219242600790949]

Interestingly, it gets even faster when I instead repeat the same two or four elements:

>>> a = [0, 1] * (10**7 / 2)
>>> [timeit(lambda: list(a), number=100) for _ in range(3)]
[3.130586101607932, 3.1001001764957294, 3.1318465707127814]

>>> a = [0, 1, 2, 3] * (10**7 / 4)
>>> [timeit(lambda: list(a), number=100) for _ in range(3)]
[3.096105435911994, 3.127148431279352, 3.132872673690855]

I guess something doesn't like the same single counter being increased all the time. Maybe there is some pipeline stall because each increase has to wait for the result of the previous increase, but this is a wild guess.

Anyway, trying this for even larger numbers of repeated elements:

from timeit import timeit
for e in range(26):
    n = 2**e
    a = range(n) * (2**25 / n)
    times = [timeit(lambda: list(a), number=20) for _ in range(3)]
    print '%8d ' % n, '  '.join('%.3f' % t for t in times), ' => ', sum(times) / 3

The output (first column is the number of different elements, for each I test three times and then take the average):

       1  2.871  2.828  2.835  =>  2.84446732686
       2  2.144  2.097  2.157  =>  2.13275338734
       4  2.129  2.297  2.247  =>  2.22436720645
       8  2.151  2.174  2.170  =>  2.16477771575
      16  2.164  2.159  2.167  =>  2.16328197911
      32  2.102  2.117  2.154  =>  2.12437970598
      64  2.145  2.133  2.126  =>  2.13462250728
     128  2.135  2.122  2.137  =>  2.13145065221
     256  2.136  2.124  2.140  =>  2.13336283943
     512  2.140  2.188  2.179  =>  2.1688431668
    1024  2.162  2.158  2.167  =>  2.16208440826
    2048  2.207  2.176  2.213  =>  2.19829998424
    4096  2.180  2.196  2.202  =>  2.19291917834
    8192  2.173  2.215  2.188  =>  2.19207065277
   16384  2.258  2.232  2.249  =>  2.24609975704
   32768  2.262  2.251  2.274  =>  2.26239771771
   65536  2.298  2.264  2.246  =>  2.26917420394
  131072  2.285  2.266  2.313  =>  2.28767871168
  262144  2.351  2.333  2.366  =>  2.35030805124
  524288  2.932  2.816  2.834  =>  2.86047313113
 1048576  3.312  3.343  3.326  =>  3.32721167007
 2097152  3.461  3.451  3.547  =>  3.48622758473
 4194304  3.479  3.503  3.547  =>  3.50964316455
 8388608  3.733  3.496  3.532  =>  3.58716466865
16777216  3.583  3.522  3.569  =>  3.55790996695
33554432  3.550  3.556  3.512  =>  3.53952594744

So from about 2.8 seconds for a single (repeated) element it drops to about 2.2 seconds for 2, 4, 8, 16, ... different elements and stays at about 2.2 seconds until the hundreds of thousands. I think this uses my L2 cache (4 × 256 KB, I have an i7-6700).

Then over a few steps, the times go up to 3.5 seconds. I think this uses a mix of my L2 cache and my L3 cache (8 MB) until that's "exhausted" as well.

At the end it stays at around 3.5 seconds, I guess because my caches don't help with the repeated elements anymore.
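
To relate the number of distinct elements to cache sizes, it can help to estimate how many bytes those distinct objects occupy. The following is a rough sketch for Python 3 on CPython; distinct_object_bytes is a hypothetical helper, and the estimate ignores allocator overhead as well as the list's own array of pointers, which is also read during the copy. It uses 2**20 list slots, fewer than the 2**25 in the experiment above, to keep it quick.

```python
import sys

def distinct_object_bytes(items):
    # Hypothetical helper: total sys.getsizeof() of the distinct objects
    # (distinct by identity) that a list refers to.
    seen = {}
    for obj in items:
        seen[id(obj)] = obj      # keep one reference per distinct object
    return sum(sys.getsizeof(obj) for obj in seen.values())

for e in (0, 10, 18, 20):
    n = 2**e
    a = list(range(n)) * (2**20 // n)   # n distinct ints, repeated
    print('%8d distinct -> %d bytes' % (n, distinct_object_bytes(a)))
```

Comparing these byte counts with the L2 and L3 sizes of the machine gives a rough idea of where the step in the timings might be expected; the object sizes differ between Python 2 and Python 3, so the figures are not directly transferable to the table above.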