List<T>.Contains() is very slow?

Could anyone explain me why the generics List.Contains() function is so slow?

I have a List<long> with about a million numbers, and the code that is constantly checking if there's a specific number within these numbers.

I tried doing the same thing using Dictionary<long, byte> and the Dictionary.ContainsKey() function, and it was about 10-20 times faster than with the List.

Of course, I don't really want to use Dictionary for that purpose, because it wasn't meant to be used that way.

So, the real question here is, is there any alternative to the List<T>.Contains(), but not as whacky as Dictionary<K,V>.ContainsKey() ?

If you are just checking for existence, HashSet<T> in .NET 3.5 is your best option - dictionary-like performance, but no key/value pair - just the values:

    HashSet<int> data = new HashSet<int>();
    for (int i = 0; i < 1000000; i++)
    {
        data.Add(rand.Next(50000000));
    }
    bool contains = data.Contains(1234567); // etc

List.Contains is a O(n) operation.

Dictionary.ContainsKey is a O(1) operation, since it uses the hashcode of the objects as a key, which gives you a quicker search ability.

I don't think that it 's a good idea to scan through a List which contains a million entries to find a few entries.

Isn't it possible to save those millon entities into a RDBMS for instance, and perform queries on that database ?

If it is not possible, then I would use a Dictionary anyway if you want to do key-lookups.

I think I have the answer! Yes, it's true that Contains() on a list (array) is O(n), but if the array is short and you are using value types, it still should be quite fast. But using the CLR Profiler [free download from Microsoft], I discovered that Contains() is boxing values in order to compare them, which requires heap allocation, which is VERY expensive (slow). [Note: This is .Net 2.0; other .Net versions not tested.]

Here's the full story and solution. We have an enumeration called "VI" and made a class called "ValueIdList" which is an abstract type for a list (array) of VI objects. The original implementation was in the ancient .Net 1.1 days, and it used an encapsulated ArrayList. We discovered recently in http://blogs.msdn.com/b/joshwil/archive/2004/04/13/112598.aspx that a generic list (List<VI>) performs much better than ArrayList on value types (like our enum VI) because the values don't have to be boxed. It's true and it worked... almost.

The CLR Profiler revealed a surprise. Here's a portion of the Allocation Graph:

ValueIdList::Contains bool(VI) 5.5MB (34.81%)
Generic.List::Contains bool(<UNKNOWN>) 5.5MB (34.81%)
Generic.ObjectEqualityComparer<T>::Equals bool (<UNKNOWN> <UNKNOWN>) 5.5MB (34.88%)
Values.VI 7.7MB (49.03%)

As you can see, Contains() surprisingly calls Generic.ObjectEqualityComparer.Equals(), which apparently requires the boxing of a VI value, which requires expensive heap allocation. It's odd that Microsoft would eliminate boxing on the list, only to require it again for a simple operation such as this.

Our solution was to re-write the Contains() implementation, which in our case was easy to do since we were already encapsulating the generic list object (_items). Here's the simple code:

public bool Contains(VI id) 
{
  return IndexOf(id) >= 0;
}

public int IndexOf(VI id) 
{ 
  int i, count;

  count = _items.Count;
  for (i = 0; i < count; i++)
    if (_items[i] == id)
      return i;
  return -1;
}

public bool Remove(VI id) 
{
  int i;

  i = IndexOf(id);
  if (i < 0)
    return false;
  _items.RemoveAt(i);

  return true;
}

The comparison of VI values is now being done in our own version of IndexOf() which requires no boxing, and it's very fast. Our particular program sped up by 20% after this simple re-write. O(n)... no problem! Just avoid the wasted memory usage!

Dictionary isn't that bad, because the keys in a dictionary are designed to be found fast. To find a number in a list it needs to iterate through the whole list.

Of course the dictionary only works if your numbers are unique and not ordered.

I think there is also a HashSet<T> class in .NET 3.5, it also allows only unique elements.

This is not exactly an answer to your question, but I have a class that increases the performance of Contains() on a collection. I subclassed a Queue and added a Dictionary that maps hashcodes to lists of objects. The Dictionary.Contains() function is O(1) whereas List.Contains(), Queue.Contains(), and Stack.Contains() are O(n).

The value-type of the dictionary is a queue holding objects with the same hashcode. The caller can supply a custom class object that implements IEqualityComparer. You could use this pattern for Stacks or Lists. The code would need just a few changes.

/// <summary>
/// This is a class that mimics a queue, except the Contains() operation is O(1) rather     than O(n) thanks to an internal dictionary.
/// The dictionary remembers the hashcodes of the items that have been enqueued and dequeued.
/// Hashcode collisions are stored in a queue to maintain FIFO order.
/// </summary>
/// <typeparam name="T"></typeparam>
private class HashQueue<T> : Queue<T>
{
    private readonly IEqualityComparer<T> _comp;
    public readonly Dictionary<int, Queue<T>> _hashes; //_hashes.Count doesn't always equal base.Count (due to collisions)

    public HashQueue(IEqualityComparer<T> comp = null) : base()
    {
        this._comp = comp;
        this._hashes = new Dictionary<int, Queue<T>>();
    }

    public HashQueue(int capacity, IEqualityComparer<T> comp = null) : base(capacity)
    {
        this._comp = comp;
        this._hashes = new Dictionary<int, Queue<T>>(capacity);
    }

    public HashQueue(IEnumerable<T> collection, IEqualityComparer<T> comp = null) :     base(collection)
    {
        this._comp = comp;

        this._hashes = new Dictionary<int, Queue<T>>(base.Count);
        foreach (var item in collection)
        {
            this.EnqueueDictionary(item);
        }
    }

    public new void Enqueue(T item)
    {
        base.Enqueue(item); //add to queue
        this.EnqueueDictionary(item);
    }

    private void EnqueueDictionary(T item)
    {
        int hash = this._comp == null ? item.GetHashCode() :     this._comp.GetHashCode(item);
        Queue<T> temp;
        if (!this._hashes.TryGetValue(hash, out temp))
        {
            temp = new Queue<T>();
            this._hashes.Add(hash, temp);
        }
        temp.Enqueue(item);
    }

    public new T Dequeue()
    {
        T result = base.Dequeue(); //remove from queue

        int hash = this._comp == null ? result.GetHashCode() : this._comp.GetHashCode(result);
        Queue<T> temp;
        if (this._hashes.TryGetValue(hash, out temp))
        {
            temp.Dequeue();
            if (temp.Count == 0)
                this._hashes.Remove(hash);
        }

        return result;
    }

    public new bool Contains(T item)
    { //This is O(1), whereas Queue.Contains is (n)
        int hash = this._comp == null ? item.GetHashCode() : this._comp.GetHashCode(item);
        return this._hashes.ContainsKey(hash);
    }

    public new void Clear()
    {
        foreach (var item in this._hashes.Values)
            item.Clear(); //clear collision lists

        this._hashes.Clear(); //clear dictionary

        base.Clear(); //clear queue
    }
}

My simple testing shows that my HashQueue.Contains() runs much faster than Queue.Contains(). Running the test code with count set to 10,000 takes 0.00045 seconds for the HashQueue version and 0.37 seconds for the Queue version. With a count of 100,000, the HashQueue version takes 0.0031 seconds whereas the Queue takes 36.38 seconds!

Here's my testing code:

static void Main(string[] args)
{
    int count = 10000;

    { //HashQueue
        var q = new HashQueue<int>(count);

        for (int i = 0; i < count; i++) //load queue (not timed)
            q.Enqueue(i);

        System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            bool contains = q.Contains(i);
        }
        sw.Stop();
        Console.WriteLine(string.Format("HashQueue, {0}", sw.Elapsed));
    }

    { //Queue
        var q = new Queue<int>(count);

        for (int i = 0; i < count; i++) //load queue (not timed)
            q.Enqueue(i);

        System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            bool contains = q.Contains(i);
        }
        sw.Stop();
        Console.WriteLine(string.Format("Queue,     {0}", sw.Elapsed));
    }

    Console.ReadLine();
}

List<T>.Contains() is very slow?

Related

Recent Posts