Surprising results with Python timeit: Counter() vs defaultdict() vs dict()
Yes, this is expected; the Counter()
constructor uses Counter.update()
which uses self.get()
to load initial values rather than rely on __missing__
.
Moreover, the defaultdict
__missing__
factory is handled entirely in C code, especially when using another type like int()
that is itself implemented in C. The Counter
source is pure Python and as such the Counter.__missing__
method requires a Python frame to execute.
Because dict.get()
is still handled in C, the constructor approach is the faster approach for a Counter()
, provided you use the same trick Counter.update()
uses and alias the self.get
lookup as a local first:
>>> import timeit
>>> import random
>>> import sys
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=9, releaselevel='final', serial=0)
>>> to_count = [random.randint(0, 100) for r in range(60)]
>>> timeit.timeit('for i in to_count: c[i] += 1',
... 'from collections import Counter; from __main__ import to_count; c = Counter()',
... number=10000)
0.2510359287261963
>>> timeit.timeit('for i in to_count: c[i] = c_get(i, 0) + 1',
... 'from collections import Counter; from __main__ import to_count; c = Counter(); c_get = c.get',
... number=10000)
0.20978617668151855
Both defaultdict
and Counter
are helpful classes built for their functionality, not their performance; not relying on the __missing__
hook can be faster still:
>>> timeit.timeit('for i in to_count: d[i] = d_get(i, 0) + 1',
... 'from __main__ import to_count; d = {}; d_get = d.get',
... number=10000)
0.11437392234802246
That's a regular dictionary using an aliased dict.get()
method for maximum speed. But then you'll also have to re-implement the bag behaviour of Counter
, or the Counter.most_common()
method. The defaultdict
use cases go way beyond counting.
In Python 3.2, updating a Counter()
got a speed boost by adding a C library that handles this case; see issue 10667. Testing on Python 3.4, the Counter()
constructor now beats the aliased dict.get
case:
>>> timeit.timeit('Counter(to_count)',
... 'from collections import Counter; from __main__ import to_count',
... number=100000)
0.8332311600097455
>>> timeit.timeit('for i in to_count: d[i] = d_get(i, 0) + 1',
... 'from __main__ import to_count; d = {}; d_get = d.get',
... number=100000)
0.961191965994658
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=4, micro=2, releaselevel='final', serial=0)
(Note: to get a meaningful timing result the number of iterations was increased from 10k to 100k; so if you compare these against the dict.get()
case above you need to take the timing there times ten, at 1.144 seconds).