Set vs. frozenset performance
I was tinkering around with Python's set
and frozenset
collection types.
Initially, I assumed that frozenset
would provide a better lookup performance than set
, as its immutable and thus could exploit the structure of the stored items.
However, this does not seem to be the case, regarding the following experiment:
import random
import time
import sys
def main(n):
numbers = []
for _ in xrange(n):
numbers.append(random.randint(0, sys.maxint))
set_ = set(numbers)
frozenset_ = frozenset(set_)
start = time.time()
for number in numbers:
number in set_
set_duration = time.time() - start
start = time.time()
for number in numbers:
number in frozenset_
frozenset_duration = time.time() - start
print "set : %.3f" % set_duration
print "frozenset: %.3f" % frozenset_duration
if __name__ == "__main__":
n = int(sys.argv[1])
main(n)
I executed this code using both CPython and PyPy, which gave the following results:
> pypy set.py 100000000
set : 6.156
frozenset: 6.166
> python set.py 100000000
set : 16.824
frozenset: 17.248
It seems that frozenset
is actually slower regarding the lookup performance, both in CPython and in PyPy. Does anybody have an idea why this is the case? I did not look into the implementations.
The frozenset
and set
implementations are largely shared; a set
is simply a frozenset
with mutating methods added, with the exact same hashtable implementation. See the Objects/setobject.c
source file; the top-level PyFrozenSet_Type
definition shares functions with the PySet_Type
definition.
There is no optimisation for a frozenset here, as there is no need to calculate the hashes for the items in the frozenset
when you are testing for membership. The item that you use to test against the set still needs to have their hash calculated, in order to find the right slot in the set hashtable so you can do an equality test.
As such, your timing results are probably off due to other processes running on your system; you measured wall-clock time, and did not disable Python garbage collection nor did you repeatedly test the same thing.
Try to run your test using the timeit
module, with one value from numbers
and one not in the set:
import random
import sys
import timeit
numbers = [random.randrange(sys.maxsize) for _ in range(10000)]
set_ = set(numbers)
fset = frozenset(numbers)
present = random.choice(numbers)
notpresent = -1
test = 'present in s; notpresent in s'
settime = timeit.timeit(
test,
'from __main__ import set_ as s, present, notpresent')
fsettime = timeit.timeit(
test,
'from __main__ import fset as s, present, notpresent')
print('set : {:.3f} seconds'.format(settime))
print('frozenset: {:.3f} seconds'.format(fsettime))
This repeats each test 1 million times and produces:
set : 0.050 seconds
frozenset: 0.050 seconds
The reason for the two different datatypes is not for performance, it is functional. Because frozensets are immutable they can be used as a key in dictionaries. Sets cannot be used for this purpose.