what is the difference between set and unordered_set in C++?
Solution 1:
I think you've generally answered your own question, however, this:
Not as compact as a tree. (for practical purposes the load factor is never 1)
is not necessarily true. Each node of a tree (we'll assume it's a red-black tree) for a type T utilizes space that is equal to at least 2 * pointer_size + sizeof(T) + sizeof(bool). This may be 3 * pointer_size depending on whether the tree stores a parent pointer for each tree node.
Compare this to a hash map: there will be wasted array space for each hash map due to the fact that the load factor is < 1, as you've said. However, assuming the hash map uses singly linked lists for chaining (and really, there's no good reason not to), each element inserted takes only sizeof(T) + pointer_size.
Note that this analysis ignores any overhead which may come from extra space used by alignment.
For any element T which has a small size (so, any basic type), the size of the pointers and other overhead dominates. At a load factor of > 0.5 (for example) the std::unordered_set may indeed use up less memory than the equivalent std::set.
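As a rough illustration of that arithmetic (a back-of-the-envelope sketch only; the actual node layouts and allocator overhead are implementation-specific and not what any particular standard library is guaranteed to allocate):

#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t ptr = sizeof(void*);
    // red-black tree node for an int: left/right/parent pointers + color flag + payload
    std::size_t tree_node_estimate = 3 * ptr + sizeof(int) + sizeof(bool);
    // chained hash node for an int: next pointer + payload, plus a share of the
    // bucket array (one pointer per bucket, divided by the load factor)
    double load_factor = 0.5;
    double hash_element_estimate = ptr + sizeof(int) + ptr / load_factor;
    std::printf("tree node    ~ %zu bytes\n", tree_node_estimate);
    std::printf("hash element ~ %.1f bytes\n", hash_element_estimate);
}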
The other big missing point is the fact that iterating through a std::set is guaranteed to produce an ordering from smallest to largest, based on the given comparison function, while iterating through an std::unordered_set will return the values in a "random" order.
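For example (a small sketch; the unordered_set order you actually see depends on the hash function and bucket count, so it is unspecified):

#include <iostream>
#include <set>
#include <unordered_set>

int main() {
    std::set<int> ordered{3, 1, 2};
    std::unordered_set<int> unordered{3, 1, 2};

    for (int x : ordered) std::cout << x << ' ';   // always: 1 2 3
    std::cout << '\n';
    for (int x : unordered) std::cout << x << ' '; // some unspecified order
    std::cout << '\n';
}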
Solution 2:
Another difference (though not performance-related) is that set insertion doesn't invalidate iterators, while unordered_set insertion can if it triggers a rehash. In practice it's a pretty minor concern, since references to the actual elements remain valid.
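A small sketch of that behaviour (exactly when the rehash happens is implementation-defined, so it is only observed indirectly here via bucket_count()):

#include <iostream>
#include <set>
#include <unordered_set>

int main() {
    std::set<int> s{1};
    auto it = s.find(1);                    // set iterators survive later insertions
    for (int i = 2; i < 1000; ++i) s.insert(i);
    std::cout << *it << '\n';               // still valid, prints 1

    std::unordered_set<int> u{1};
    auto buckets_before = u.bucket_count();
    for (int i = 2; i < 1000; ++i) u.insert(i);
    // A changed bucket count means a rehash occurred, which invalidates any
    // iterators obtained before it (references to the elements stay valid).
    std::cout << std::boolalpha << (u.bucket_count() != buckets_before) << '\n';
}

If the final size is known up front, calling reserve(n) on the unordered_set avoids the intermediate rehashes (and the iterator invalidation) during those insertions.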
Solution 3:
Yuushi addresses spatial efficiency and other points well already; just a few other parts of the question I'll comment on...
The O(1) for hash tables comes from the assumption that there are no collisions.
That's not true. What O(1) means is not that the first lookup attempt will always succeed, it's that there is - on average - a constant number of attempts needed, rather than something that grows as the number of values grows. For example, with an unordered_set or ..._map, the max_load_factor defaults to 1.0 on construction, and if the load factor approaches that with a good hash function, the average number of elements that hash to any one bucket will be around 2 regardless of how many values are in the table.
Even with a load factor of .5, every second insertion leads to a collision.
True, but it doesn't get as dire as you might intuitively expect: that average chain length of 2 at a 1.0 load factor is not bad.
It could be observed that the load factor of a hash table is inversely proportional to the number of operations required for accessing an element in it. The more we reduce the number of operations, the sparser the hash table.
There's definitely a correlation, but it's not inverse: a lower (sparser) load factor means fewer collisions and fewer operations per access, while a higher load factor means more.
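A sketch of how the load factor and bucket count interact (the exact numbers printed are implementation-specific; only the general trend matters here):

#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> u;
    std::cout << "default max_load_factor: " << u.max_load_factor() << '\n'; // 1.0

    for (int i = 0; i < 100000; ++i) u.insert(i);
    std::cout << "load factor " << u.load_factor()
              << " over " << u.bucket_count() << " buckets\n";

    // Lowering max_load_factor forces a sparser table: more buckets,
    // shorter chains, fewer comparisons per lookup, more memory.
    u.max_load_factor(0.5f);
    u.rehash(0); // rehash so the new maximum is satisfied
    std::cout << "load factor " << u.load_factor()
              << " over " << u.bucket_count() << " buckets\n";
}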
Solution 4:
In some cases set is more convenient. For example, using vector as the key:
set<vector<int>> s;
s.insert({1, 2});
s.insert({1, 3});
s.insert({1, 2}); // duplicate, not inserted again
for (const auto& vec : s)
    cout << vec << endl; // I have overloaded operator<< for vector
// 1 2
// 1 3
The reason vector<int> can be used in a set is that vector overloads operator<.
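That comparison is lexicographic, which is why the set above orders {1, 2} before {1, 3}; a minimal check:

#include <cassert>
#include <vector>
using namespace std;

int main() {
    vector<int> a{1, 2}, b{1, 3};
    assert(a < b);    // lexicographic: {1,2} comes before {1,3}
    assert(!(b < a)); // and the reverse comparison is false
}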
But if you use unordered_set<vector<int>> you have to create a hash function for vector<int>, because vector doesn't have a std::hash specialization, so you have to define one like:
struct VectorHash {
    size_t operator()(const std::vector<int>& v) const {
        std::hash<int> hasher;
        size_t seed = 0;
        for (int i : v) {
            seed ^= hasher(i) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};
void two() {
    //unordered_set<vector<int>> s; // error: vector<int> doesn't have a hash function
    unordered_set<vector<int>, VectorHash> s;
    s.insert({1, 2});
    s.insert({1, 3});
    s.insert({1, 2}); // duplicate, not inserted again
    for (const auto& vec : s)
        cout << vec << endl;
    // 1 2
    // 1 3  (order not guaranteed for unordered_set)
}
You can see that in some cases unordered_set is more complicated to use.
Mainly cited from: https://stackoverflow.com/a/29855973/6329006
For more differences between unordered_set and set, see this: https://stackoverflow.com/a/52203931/6329006