It seems to be common knowledge that hash tables can achieve O(1), but that has never made sense to me. Can someone please explain it? Here are two situations that come to mind:

A. The value is an int smaller than the size of the hash table. Therefore, the value is its own hash, so there is no hash table. But if there was, it would be O(1) and still be inefficient.

B. You have to calculate a hash of the value. In this situation, the order is O(n) for the size of the data being looked up. The lookup might be O(1) after you do O(n) work, but that still comes out to O(n) in my eyes.

And unless you have a perfect hash or a large hash table, there are probably several items per bucket. So, it devolves into a small linear search at some point anyway.

I think hash tables are awesome, but I do not get the O(1) designation unless it is just supposed to be theoretical.

Wikipedia's article for hash tables consistently references constant lookup time and totally ignores the cost of the hash function. Is that really a fair measure?


Edit: To summarize what I learned:

  • It is technically true because the hash function is not required to use all the information in the key and so could be constant time, and because a large enough table can bring collisions down to near constant time.

  • It is true in practice because over time it just works out as long as the hash function and table size are chosen to minimize collisions, even though that often means not using a constant time hash function.


Solution 1:

You have two variables here, m and n, where m is the length of the input and n is the number of items in the hash.

The O(1) lookup performance claim makes at least two assumptions:

  • Your objects can be equality compared in O(1) time.
  • There will be few hash collisions.

If your objects are variable size and an equality check requires looking at all bits then performance will become O(m). The hash function however does not have to be O(m) - it can be O(1). Unlike a cryptographic hash, a hash function for use in a dictionary does not have to look at every bit in the input in order to calculate the hash. Implementations are free to look at only a fixed number of bits.

For sufficiently many items the number of items will become greater than the number of possible hashes and then you will get collisions causing the performance rise above O(1), for example O(n) for a simple linked list traversal (or O(n*m) if both assumptions are false).

In practice though the O(1) claim while technically false, is approximately true for many real world situations, and in particular those situations where the above assumptions hold.

Solution 2:

You have to calculate the hash, so the order is O(n) for the size of the data being looked up. The lookup might be O(1) after you do O(n) work, but that still comes out to O(n) in my eyes.

What? To hash a single element takes constant time. Why would it be anything else? If you're inserting n elements, then yes, you have to compute n hashes, and that takes linear time... to look an element up, you compute a single hash of what you're looking for, then find the appropriate bucket with that. You don't re-compute the hashes of everything that's already in the hash table.

And unless you have a perfect hash or a large hash table there are probably several items per bucket so it devolves into a small linear search at some point anyway.

Not necessarily. The buckets don't necessarily have to be lists or arrays, they can be any container type, such as a balanced BST. That means O(log n) worst case. But this is why it's important to choose a good hashing function to avoid putting too many elements into one bucket. As KennyTM pointed out, on average, you will still get O(1) time, even if occasionally you have to dig through a bucket.

The trade off of hash tables is of course the space complexity. You're trading space for time, which seems to be the usual case in computing science.


You mention using strings as keys in one of your other comments. You're concerned about the amount of time it takes to compute the hash of a string, because it consists of several chars? As someone else pointed out again, you don't necessarily need to look at all the chars to compute the hash, although it might produce a better hash if you did. In that case, if there are on average m chars in your key, and you used all of them to compute your hash, then I suppose you're right, that lookups would take O(m). If m >> n then you might have a problem. You'd probably be better off with a BST in that case. Or choose a cheaper hashing function.

Solution 3:

The hash is fixed size - looking up the appropriate hash bucket is a fixed cost operation. This means that it is O(1).

Calculating the hash does not have to be a particularly expensive operation - we're not talking cryptographic hash functions here. But that's by the by. The hash function calculation itself does not depend on the number n of elements; while it might depend on the size of the data in an element, this is not what n refers to. So the calculation of the hash does not depend on n and is also O(1).

Solution 4:

Hashing is O(1) only if there are only constant number of keys in the table and some other assumptions are made. But in such cases it has advantage.

If your key has an n-bit representation, your hash function can use 1, 2, ... n of these bits. Thinking about a hash function that uses 1 bit. Evaluation is O(1) for sure. But you are only partitioning the key space into 2. So you are mapping as many as 2^(n-1) keys into the same bin. using BST search this takes up to n-1 steps to locate a particular key if nearly full.

You can extend this to see that if your hash function uses K bits your bin size is 2^(n-k).

so K-bit hash function ==> no more than 2^K effective bins ==> up to 2^(n-K) n-bit keys per bin ==> (n-K) steps (BST) to resolve collisions. Actually most hash functions are much less "effective" and need/use more than K bits to produce 2^k bins. So even this is optimistic.

You can view it this way -- you will need ~n steps to be able to uniquely distinguish a pair of keys of n bits in the worst case. There is really no way to get around this information theory limit, hash table or not.

However, this is NOT how/when you use hash table!

The complexity analysis assumes that for n-bit keys, you could have O(2^n) keys in the table (e.g. 1/4 of all possible keys). But most if not all of the time we use hash table, we only have a constant number of the n-bit keys in the table. If you only want a constant number of keys in the table, say C is your maximum number, then you could form a hash table of O(C) bins, that guarantees expected constant collision (with a good hash function); and a hash function using ~logC of the n bits in the key. Then every query is O(logC) = O(1). This is how people claim "hash table access is O(1)"/

There are a couple of catches here -- first, saying you don't need all the bits may only be a billing trick. First you cannot really pass the key value to the hash function, because that would be moving n bits in the memory which is O(n). So you need to do e.g. a reference passing. But you still need to store it somewhere already which was an O(n) operation; you just don't bill it to the hashing; you overall computation task cannot avoid this. Second, you do the hashing, find the bin, and found more than 1 keys; your cost depends on your resolution method -- if you do comparison based (BST or List), you will have O(n) operation (recall key is n-bit); if you do 2nd hash, well, you have the same issue if 2nd hash has collision. So O(1) is not 100% guaranteed unless you have no collision (you can improve the chance by having a table with more bins than keys, but still).

Consider the alternative, e.g. BST, in this case. there are C keys, so a balanced BST will be O(logC) in depth, so a search takes O(logC) steps. However the comparison in this case would be an O(n) operation ... so it appears hashing is a better choice in this case.