Performant Haskell hashed structure.
I am writing program that does alot of table lookups. As such, I was perusing the Haskell documentation when I stumbled upon Data.Map
(of course), but also Data.HashMap
and Data.Hashtable
. I am no expert on hashing algorithms and after inspecting the packages they all seem really similar. As such I was wondering:
1: what are the major differences, if any?
2: Which would be the most performant with a high volume of lookups on maps/tables of ~4000 key-value pairs?
Solution 1:
1: What are the major differences, if any?
-
Data.Map.Map
is a balanced binary tree internally, so its time complexity for lookups is O(log n). I believe it's a "persistent" data structure, meaning it's implemented such that mutative operations yield a new copy with only the relevant parts of the structure updated. -
Data.HashMap.Map
is aData.IntMap.IntMap
internally, which in turn is implemented as Patricia tree; its time complexity for lookups is O(min(n, W)) where W is the number of bits in an integer. It is also "persistent.". New versions (>= 0.2) use hash array mapped tries. According to the documentation: "Many operations have a average-case complexity of O(log n). The implementation uses a large base (i.e. 16) so in practice these operations are constant time." -
Data.HashTable.HashTable
is an actual hash table, with time complexity O(1) for lookups. However, it is a mutable data structure -- operations are done in-place -- so you're stuck in theIO
monad if you want to use it.
2: Which would be the most performant with a high volume of lookups on maps/tables of ~4000 key-value pairs?
The best answer I can give you, unfortunately, is "it depends." If you take the asymptotic complexities literally, you get O(log 4000) = about 12 for Data.Map
, O(min(4000, 64)) = 64 for Data.HashMap
and O(1) = 1 for Data.HashTable
. But it doesn't really work that way... You have to try them in the context of your code.
Solution 2:
The obvious difference between Data.Map
and Data.HashMap
is that the former needs keys in Ord
, the latter Hashable
keys. Most of the common keys are both, so that's not a deciding criterion. I have no experience whatsoever with Data.HashTable
, so I can't comment on that.
The APIs of Data.HashMap
and Data.Map
are very similar, but Data.Map
exports more functions, some, like alter
are absent in Data.HashMap
, others are provided in strict and non-strict variants, while Data.HashMap
(I assume you meant the hashmap from unordered-containers) provides lazy and strict APIs in separate modules. If you are using only the common part of the API, switching is really painless.
Concerning performance, Data.HashMap
of unordered-containers has pretty fast lookup, last I measured, it was clearly faster than Data.IntMap
or Data.Map
, that holds in particular for the (not yet released) HAMT branch of unordered-containers. I think for inserts, it was more or less on par with Data.IntMap
and somewhat faster than Data.Map
, but I'm a bit fuzzy on that.
Both are sufficiently performant for most tasks, for those tasks where they aren't, you'll probably need a tailor-made solution anyway. Considering that you ask specifically about lookups, I would give Data.HashMap
the edge.
Solution 3:
Data.HashTable
's documentation now says "use the hashtables package". There's a nice blog post explaining why hashtables is a good package here. It uses the ST monad.