Hashing a dictionary?
For caching purposes I need to generate a cache key from GET arguments which are present in a dict.
Currently I'm using sha1(repr(sorted(my_dict.items())))
(sha1()
is a convenience method that uses hashlib internally) but I'm curious if there's a better way.
Solution 1:
Using sorted(d.items())
isn't enough to get us a stable repr. Some of the values in d
could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:
json.dumps(d, sort_keys=True)
That said, if the hashes need to be stable across different machines or Python versions, I'm not certain that this is bulletproof. You might want to add the separators
and ensure_ascii
arguments to protect yourself from any changes to the defaults there. I'd appreciate comments.
Solution 2:
If your dictionary is not nested, you could make a frozenset with the dict's items and use hash()
:
hash(frozenset(my_dict.items()))
This is much less computationally intensive than generating the JSON string or representation of the dictionary.
UPDATE: Please see the comments below, why this approach might not produce a stable result.
Solution 3:
EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O'Connor's significantly simpler (and faster) solution (which also works for hashing nested dictionaries).
Although an answer has been accepted, the title of the question is "Hashing a python dictionary", and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)
Nested Dictionaries
If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won't work in this case, and you'll have to implement some sort of recursive mechanism to retrieve the hash.
Here is one such mechanism:
import copy
def make_hash(o):
"""
Makes a hash from a dictionary, list, tuple or set to any level, that contains
only other hashable types (including any lists, tuples, sets, and
dictionaries).
"""
if isinstance(o, (set, tuple, list)):
return tuple([make_hash(e) for e in o])
elif not isinstance(o, dict):
return hash(o)
new_o = copy.deepcopy(o)
for k, v in new_o.items():
new_o[k] = make_hash(v)
return hash(tuple(frozenset(sorted(new_o.items()))))
Bonus: Hashing Objects and Classes
The hash()
function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:
class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789
The hash is the same, even after I've altered foo. This is because the identity of foo hasn't changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__
attribute:
class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785
Alas, when you attempt to do the same thing with the class itself:
print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'
The class __dict__
property is not a normal dictionary:
print (type(Foo.__dict__)) # type <'dict_proxy'>
Here is a similar mechanism as previous that will handle classes appropriately:
import copy
DictProxyType = type(object.__dict__)
def make_hash(o):
"""
Makes a hash from a dictionary, list, tuple or set to any level, that
contains only other hashable types (including any lists, tuples, sets, and
dictionaries). In the case where other kinds of objects (like classes) need
to be hashed, pass in a collection of object attributes that are pertinent.
For example, a class can be hashed in this fashion:
make_hash([cls.__dict__, cls.__name__])
A function can be hashed like so:
make_hash([fn.__dict__, fn.__code__])
"""
if type(o) == DictProxyType:
o2 = {}
for k, v in o.items():
if not k.startswith("__"):
o2[k] = v
o = o2
if isinstance(o, (set, tuple, list)):
return tuple([make_hash(e) for e in o])
elif not isinstance(o, dict):
return hash(o)
new_o = copy.deepcopy(o)
for k, v in new_o.items():
new_o[k] = make_hash(v)
return hash(tuple(frozenset(sorted(new_o.items()))))
You can use this to return a hash tuple of however many elements you'd like:
# -7666086133114527897
print (make_hash(func.__code__))
# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))
# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))
NOTE: all of the above code assumes Python 3.x. Did not test in earlier versions, although I assume make_hash()
will work in, say, 2.7.2. As far as making the examples work, I do know that
func.__code__
should be replaced with
func.func_code