Why is HashSet<Point> so much slower than HashSet<string>?

I wanted to store some pixels locations without allowing duplicates, so the first thing comes to mind is HashSet<Point> or similar classes. However this seems to be very slow compared to something like HashSet<string>.

For example, this code:

HashSet<Point> points = new HashSet<Point>();
using (Bitmap img = new Bitmap(1000, 1000))
{
    for (int x = 0; x < img.Width; x++)
    {
        for (int y = 0; y < img.Height; y++)
        {
            points.Add(new Point(x, y));
        }
    }
}

takes about 22.5 seconds.

While the following code (which is not a good choice for obvious reasons) takes only 1.6 seconds:

HashSet<string> points = new HashSet<string>();
using (Bitmap img = new Bitmap(1000, 1000))
{
    for (int x = 0; x < img.Width; x++)
    {
        for (int y = 0; y < img.Height; y++)
        {
            points.Add(x + "," + y);
        }
    }
}

So, my questions are:

Is there a reason for that? I checked this answer, but 22.5 sec is way more than the numbers shown in that answer.
Is there a better way to store points without duplicates?

There are two perf problems induced by the Point struct. Something you can see when you add Console.WriteLine(GC.CollectionCount(0)); to the test code. You'll see that the Point test requires ~3720 collections but the string test only needs ~18 collections. Not for free. When you see a value type induce so many collections then you need to conclude "uh-oh, too much boxing".

At issue is that HashSet<T> needs an IEqualityComparer<T> to get its job done. Since you did not provide one, it needs to fall back to one returned by EqualityComparer.Default<T>(). That method can do a good job for string, it implements IEquatable. But not for Point, it is a type that harks from .NET 1.0 and never got the generics love. All it can do is use the Object methods.

The other issue is that Point.GetHashCode() does not do a stellar job in this test, too many collisions, so it hammers Object.Equals() pretty heavily. String has an excellent GetHashCode implementation.

You can solve both problems by providing the HashSet with a good comparer. Like this one:

class PointComparer : IEqualityComparer<Point> {
    public bool Equals(Point x, Point y) {
        return x.X == y.X && x.Y == y.Y;
    }

    public int GetHashCode(Point obj) {
        // Perfect hash for practical bitmaps, their width/height is never >= 65536
        return (obj.Y << 16) ^ obj.X;
    }
}

And use it:

HashSet<Point> list = new HashSet<Point>(new PointComparer());

And it is now about 150 times faster, easily beating the string test.

The main reason for the performance drop is all the boxing going on (as already explained in Hans Passant's answer).

Apart from that, the hash code algorithm worsens the problem, because it causes more calls to Equals(object obj) thus increasing the amount of boxing conversions.

Also note that the hash code of Point is computed by x ^ y. This produces very little dispersion in your data range, and therefore the buckets of the HashSet are overpopulated — something that doesn't happen with string, where the dispersion of the hashes is much larger.

You can solve that problem by implementing your own Point struct (trivial) and using a better hash algorithm for your expected data range, e.g. by shifting the coordinates:

(x << 16) ^ y

For some good advice when it comes to hash codes, read Eric Lippert's blog post on the subject.

Why is HashSet<Point> so much slower than HashSet<string>?

Related

Recent Posts