OrderedDict vs defaultdict vs dict [closed]
In python's library, we now have two Python Implementation of dictionaries which subclasses dict
over and above the native dict
type.
Python's advocates have always preferred to defaultdict
over using dict.setdefault
where possible. Even the doc quotes that This technique is simpler and faster than an equivalent technique using dict.setdefault():
In similar ways, as dictionaries does not maintain order, using OrderedDict
over using dict
followed by sorting the items is preferred when ever possible for the alternative usage.
In both the above case, the code is definitely cleaner but at the cost of performance penalty.
While answering and commenting on one of the question python unique list based on item, I stumbled upon the performance penalty over the native dict
when using defaultdict
and OrderedDict
. It also seems the size of the data is also not immaterial to the performance advantage dict
solution has over others.
I believe There should be one-- and preferably only one --obvious way to do it.
, so what is the preferred way?
There is not one single answer and not one true and only dict. Among many variables, it depends on:
- Size of the data set;
- Number of unique keys vs the number of duplicate keys in the set of data mappings;
- Speed of the underlying factory for defaultdict;
- Speed of OrderDict vs some later ordering step;
- Version of Python.
I am loathe to generalize, but here are some generalities:
- The statement
This technique is simpler and faster than an equivalent technique using dict.setdefault()
is just flat wrong. It depends on the data; -
setdefault
is faster and simpler with small data sets; -
defaultdict
is faster for larger data sets with more homogenous key sets (ie, how short the dict is after adding elements); -
setdefault
has an advantage with more heterogeneous key sets; - these results are different for Python 3 vs Python 2;
-
OrderedDict
is slower in all cases other than an algorithm that depends on order and order is not easy to reconstruct or sort; - Python 3 is generally faster for most
dict
operations; - Python 3.6's dict is now ordered by insertion order (reducing the usefulness of
OrderedDict
).
The only truth: It Depends! All three technique are useful.
Here is some timing code to show:
from __future__ import print_function
from collections import defaultdict
from collections import OrderedDict
try:
t=unichr(100)
except NameError:
unichr=chr
def f1(li):
'''defaultdict'''
d = defaultdict(list)
for k, v in li:
d[k].append(v)
return d.items()
def f2(li):
'''setdefault'''
d={}
for k, v in li:
d.setdefault(k, []).append(v)
return d.items()
def f3(li):
'''OrderedDict'''
d=OrderedDict()
for k, v in li:
d.setdefault(k, []).append(v)
return d.items()
if __name__ == '__main__':
import timeit
import sys
print(sys.version)
few=[('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
fmt='{:>12}: {:10.2f} micro sec/call ({:,} elements, {:,} keys)'
for tag, m, n in [('small',5,10000), ('medium',20,1000), ('bigger',1000,100), ('large',5000,10)]:
for f in [f1,f2,f3]:
s = few*m
res=timeit.timeit("{}(s)".format(f.__name__), setup="from __main__ import {}, s".format(f.__name__), number=n)
st=fmt.format(f.__doc__, res/n*1000000, len(s), len(f(s)))
print(st)
s = [(unichr(i%0x10000),i) for i in range(1,len(s)+1)]
res=timeit.timeit("{}(s)".format(f.__name__), setup="from __main__ import {}, s".format(f.__name__), number=n)
st=fmt.format(f.__doc__, res/n*1000000, len(s), len(f(s)))
print(st)
print()
Python 2.7 result:
2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]
defaultdict: 10.20 micro sec/call (25 elements, 3 keys)
defaultdict: 21.08 micro sec/call (25 elements, 25 keys)
setdefault: 13.41 micro sec/call (25 elements, 3 keys)
setdefault: 18.24 micro sec/call (25 elements, 25 keys)
OrderedDict: 49.47 micro sec/call (25 elements, 3 keys)
OrderedDict: 102.16 micro sec/call (25 elements, 25 keys)
defaultdict: 28.28 micro sec/call (100 elements, 3 keys)
defaultdict: 79.78 micro sec/call (100 elements, 100 keys)
setdefault: 45.68 micro sec/call (100 elements, 3 keys)
setdefault: 68.66 micro sec/call (100 elements, 100 keys)
OrderedDict: 117.78 micro sec/call (100 elements, 3 keys)
OrderedDict: 343.17 micro sec/call (100 elements, 100 keys)
defaultdict: 1123.60 micro sec/call (5,000 elements, 3 keys)
defaultdict: 4250.44 micro sec/call (5,000 elements, 5,000 keys)
setdefault: 2089.86 micro sec/call (5,000 elements, 3 keys)
setdefault: 3803.03 micro sec/call (5,000 elements, 5,000 keys)
OrderedDict: 4399.16 micro sec/call (5,000 elements, 3 keys)
OrderedDict: 16279.14 micro sec/call (5,000 elements, 5,000 keys)
defaultdict: 5609.39 micro sec/call (25,000 elements, 3 keys)
defaultdict: 25351.60 micro sec/call (25,000 elements, 25,000 keys)
setdefault: 10267.00 micro sec/call (25,000 elements, 3 keys)
setdefault: 24091.51 micro sec/call (25,000 elements, 25,000 keys)
OrderedDict: 22091.98 micro sec/call (25,000 elements, 3 keys)
OrderedDict: 94028.00 micro sec/call (25,000 elements, 25,000 keys)
Python 3.3 result:
3.3.2 (default, May 21 2013, 11:50:47)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))]
defaultdict: 8.58 micro sec/call (25 elements, 3 keys)
defaultdict: 21.18 micro sec/call (25 elements, 25 keys)
setdefault: 10.42 micro sec/call (25 elements, 3 keys)
setdefault: 14.58 micro sec/call (25 elements, 25 keys)
OrderedDict: 45.43 micro sec/call (25 elements, 3 keys)
OrderedDict: 92.69 micro sec/call (25 elements, 25 keys)
defaultdict: 20.47 micro sec/call (100 elements, 3 keys)
defaultdict: 77.48 micro sec/call (100 elements, 100 keys)
setdefault: 34.22 micro sec/call (100 elements, 3 keys)
setdefault: 54.86 micro sec/call (100 elements, 100 keys)
OrderedDict: 107.37 micro sec/call (100 elements, 3 keys)
OrderedDict: 318.98 micro sec/call (100 elements, 100 keys)
defaultdict: 714.70 micro sec/call (5,000 elements, 3 keys)
defaultdict: 3892.92 micro sec/call (5,000 elements, 5,000 keys)
setdefault: 1502.91 micro sec/call (5,000 elements, 3 keys)
setdefault: 2888.08 micro sec/call (5,000 elements, 5,000 keys)
OrderedDict: 3912.95 micro sec/call (5,000 elements, 3 keys)
OrderedDict: 14863.02 micro sec/call (5,000 elements, 5,000 keys)
defaultdict: 3649.02 micro sec/call (25,000 elements, 3 keys)
defaultdict: 22313.17 micro sec/call (25,000 elements, 25,000 keys)
setdefault: 7447.28 micro sec/call (25,000 elements, 3 keys)
setdefault: 18426.88 micro sec/call (25,000 elements, 25,000 keys)
OrderedDict: 19202.17 micro sec/call (25,000 elements, 3 keys)
OrderedDict: 85946.45 micro sec/call (25,000 elements, 25,000 keys)
I feel that your assumption - only one preferable way - does not hold. I see at least two cases with different requirements:
In maintenance-intensive code (e.g. an option parser of an evolving utility class) I would always go for cleaner code, so that others and I can implement new features more easily. The performance is not critical, as only small quantities (e.g. a dict of user settings) are processed.
while in
the implementation of a performance-critical algorithm in a data processing task, I would not mind writing a bit more verbose code for much faster execution. If the alorithm is unlikely to change, the little less readable code won't become an issue.