Most efficient property to hash for numpy array
You can simply hash the underlying buffer, if you make it read-only:
>>> a = random.randint(10, 100, 100000)
>>> a.flags.writeable = False
>>> %timeit hash(a.data)
100 loops, best of 3: 2.01 ms per loop
>>> %timeit hash(a.tostring())
100 loops, best of 3: 2.28 ms per loop
For very large arrays, hash(str(a))
is a lot faster, but then it only takes a small part of the array into account.
>>> %timeit hash(str(a))
10000 loops, best of 3: 55.5 us per loop
>>> str(a)
'[63 30 33 ..., 96 25 60]'
You can try xxhash
via its Python binding. For large arrays this is much faster than hash(x.tostring())
.
Example IPython session:
>>> import xxhash
>>> import numpy
>>> x = numpy.random.rand(1024 * 1024 * 16)
>>> h = xxhash.xxh64()
>>> %timeit hash(x.tostring())
1 loops, best of 3: 208 ms per loop
>>> %timeit h.update(x); h.intdigest(); h.reset()
100 loops, best of 3: 10.2 ms per loop
And by the way, on various blogs and answers posted to Stack Overflow, you'll see people using sha1
or md5
as hash functions. For performance reasons this is usually not acceptable, as those "secure" hash functions are rather slow. They're useful only if hash collision is one of the top concerns.
Nevertheless, hash collisions happen all the time. And if all you need is implementing __hash__
for data-array objects so that they can be used as keys in Python dictionaries or sets, I think it's better to concentrate on the speed of __hash__
itself and let Python handle the hash collision[1].
[1] You may need to override __eq__
too, to help Python manage hash collision. You would want __eq__
to return a boolean, rather than an array of booleans as is done by numpy
.