Why can a floating point dictionary key overwrite an integer key with the same value?

First of all: the behaviour is documented explicitly in the docs for the hash function:

hash(object)

Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).

Secondly, a limitation of hashing is pointed out in the docs for object.__hash__

object.__hash__(self)

Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value;

This is not unique to python. Java has the same caveat: if you implement hashCode then, in order for things to work correctly, you must implement it in such a way that: x.equals(y) implies x.hashCode() == y.hashCode().

So, python decided that 1.0 == 1 holds, hence it's forced to provide an implementation for hash such that hash(1.0) == hash(1). The side effect is that 1.0 and 1 act exactly in the same way as dict keys, hence the behaviour.

In other words the behaviour in itself doesn't have to be used or useful in any way. It is necessary. Without that behaviour there would be cases where you could accidentally overwrite a different key.

If we had 1.0 == 1 but hash(1.0) != hash(1) we could still have a collision. And if 1.0 and 1 collide, the dict will use equality to be sure whether they are the same key or not and kaboom the value gets overwritten even if you intended them to be different.

The only way to avoid this would be to have 1.0 != 1, so that the dict is able to distinguish between them even in case of collision. But it was deemed more important to have 1.0 == 1 than to avoid the behaviour you are seeing, since you practically never use floats and ints as dictionary keys anyway.

Since python tries to hide the distinction between numbers by automatically converting them when needed (e.g. 1/2 -> 0.5) it makes sense that this behaviour is reflected even in such circumstances. It's more consistent with the rest of python.

This behaviour would appear in any implementation where the matching of the keys is at least partially (as in a hash map) based on comparisons.

For example if a dict was implemented using a red-black tree or an other kind of balanced BST, when the key 1.0 is looked up the comparisons with other keys would return the same results as for 1 and so they would still act in the same way.

Hash maps require even more care because of the fact that it's the value of the hash that is used to find the entry of the key and comparisons are done only afterwards. So breaking the rule presented above means you'd introduce a bug that's quite hard to spot because at times the dict may seem to work as you'd expect it, and at other times, when the size changes, it would start to behave incorrectly.

Note that there would be a way to fix this: have a separate hash map/BST for each type inserted in the dictionary. In this way there couldn't be any collisions between objects of different type and how == compares wouldn't matter when the arguments have different types.

However this would complicate the implementation, it would probably be inefficient since hash maps have to keep quite a few free locations in order to have O(1) access times. If they become too full the performances decrease. Having multiple hash maps means wasting more space and also you'd need to first choose which hash map to look at before even starting the actual lookup of the key.

If you used BSTs you'd first have to lookup the type and the perform a second lookup. So if you are going to use many types you'd end up with twice the work (and the lookup would take O(log n) instead of O(1)).

You should consider that the dict aims at storing data depending on the logical numeric value, not on how you represented it.

The difference between ints and floats is indeed just an implementation detail and not conceptual. Ideally the only number type should be an arbitrary precision number with unbounded accuracy even sub-unity... this is however hard to implement without getting into troubles... but may be that will be the only future numeric type for Python.

So while having different types for technical reasons Python tries to hide these implementation details and int->float conversion is automatic.

It would be much more surprising if in a Python program if x == 1: ... wasn't going to be taken when x is a float with value 1.

Note that also with Python 3 the value of 1/2 is 0.5 (the division of two integers) and that the types long and non-unicode string have been dropped with the same attempt to hide implementation details.

In python:

1==1.0
True

This is because of implicit casting

However:

1 is 1.0
False

I can see why automatic casting between float and int is handy, It is relatively safe to cast int into float, and yet there are other languages (e.g. go) that stay away from implicit casting.

It is actually a language design decision and a matter of taste more than different functionalities

Dictionaries are implemented with a hash table. To look up something in a hash table, you start at the position indicated by the hash value, then search different locations until you find a key value that's equal or an empty bucket.

If you have two key values that compare equal but have different hashes, you may get inconsistent results depending on whether the other key value was in the searched locations or not. For example this would be more likely as the table gets full. This is something you want to avoid. It appears that the Python developers had this in mind, since the built-in hash function returns the same hash for equivalent numeric values, no matter if those values are int or float. Note that this extends to other numeric types, False is equal to 0 and True is equal to 1. Even fractions.Fraction and decimal.Decimal uphold this property.

The requirement that if a == b then hash(a) == hash(b) is documented in the definition of object.__hash__():

Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.

TL;DR: a dictionary would break if keys that compared equal did not map to the same value.

Frankly, the opposite is dangerous! 1 == 1.0, so it's not improbable to imagine that if you had them point to different keys and tried to access them based on an evaluated number then you'd likely run into trouble with it because the ambiguity is hard to figure out.

Dynamic typing means that the value is more important than what the technical type of something is, since the type is malleable (which is a very useful feature) and so distinguishing both ints and floats of the same value as distinct is unnecessary semantics that will only lead to confusion.

Why can a floating point dictionary key overwrite an integer key with the same value?

Related

Recent Posts