List<T>.Contains() is very slow?
Could anyone explain me why the generics List.Contains()
function is so slow?
I have a List<long>
with about a million numbers, and the code that is constantly checking if there's a specific number within these numbers.
I tried doing the same thing using Dictionary<long, byte>
and the Dictionary.ContainsKey()
function, and it was about 10-20 times faster than with the List.
Of course, I don't really want to use Dictionary for that purpose, because it wasn't meant to be used that way.
So, the real question here is, is there any alternative to the List<T>.Contains()
, but not as whacky as Dictionary<K,V>.ContainsKey()
?
If you are just checking for existence, HashSet<T>
in .NET 3.5 is your best option - dictionary-like performance, but no key/value pair - just the values:
HashSet<int> data = new HashSet<int>();
for (int i = 0; i < 1000000; i++)
{
data.Add(rand.Next(50000000));
}
bool contains = data.Contains(1234567); // etc
List.Contains is a O(n) operation.
Dictionary.ContainsKey is a O(1) operation, since it uses the hashcode of the objects as a key, which gives you a quicker search ability.
I don't think that it 's a good idea to scan through a List which contains a million entries to find a few entries.
Isn't it possible to save those millon entities into a RDBMS for instance, and perform queries on that database ?
If it is not possible, then I would use a Dictionary anyway if you want to do key-lookups.
I think I have the answer! Yes, it's true that Contains() on a list (array) is O(n), but if the array is short and you are using value types, it still should be quite fast. But using the CLR Profiler [free download from Microsoft], I discovered that Contains() is boxing values in order to compare them, which requires heap allocation, which is VERY expensive (slow). [Note: This is .Net 2.0; other .Net versions not tested.]
Here's the full story and solution. We have an enumeration called "VI" and made a class called "ValueIdList" which is an abstract type for a list (array) of VI objects. The original implementation was in the ancient .Net 1.1 days, and it used an encapsulated ArrayList. We discovered recently in http://blogs.msdn.com/b/joshwil/archive/2004/04/13/112598.aspx that a generic list (List<VI>) performs much better than ArrayList on value types (like our enum VI) because the values don't have to be boxed. It's true and it worked... almost.
The CLR Profiler revealed a surprise. Here's a portion of the Allocation Graph:
- ValueIdList::Contains bool(VI) 5.5MB (34.81%)
- Generic.List::Contains bool(<UNKNOWN>) 5.5MB (34.81%)
- Generic.ObjectEqualityComparer<T>::Equals bool (<UNKNOWN> <UNKNOWN>) 5.5MB (34.88%)
- Values.VI 7.7MB (49.03%)
As you can see, Contains() surprisingly calls Generic.ObjectEqualityComparer.Equals(), which apparently requires the boxing of a VI value, which requires expensive heap allocation. It's odd that Microsoft would eliminate boxing on the list, only to require it again for a simple operation such as this.
Our solution was to re-write the Contains() implementation, which in our case was easy to do since we were already encapsulating the generic list object (_items). Here's the simple code:
public bool Contains(VI id)
{
return IndexOf(id) >= 0;
}
public int IndexOf(VI id)
{
int i, count;
count = _items.Count;
for (i = 0; i < count; i++)
if (_items[i] == id)
return i;
return -1;
}
public bool Remove(VI id)
{
int i;
i = IndexOf(id);
if (i < 0)
return false;
_items.RemoveAt(i);
return true;
}
The comparison of VI values is now being done in our own version of IndexOf() which requires no boxing, and it's very fast. Our particular program sped up by 20% after this simple re-write. O(n)... no problem! Just avoid the wasted memory usage!
Dictionary isn't that bad, because the keys in a dictionary are designed to be found fast. To find a number in a list it needs to iterate through the whole list.
Of course the dictionary only works if your numbers are unique and not ordered.
I think there is also a HashSet<T>
class in .NET 3.5, it also allows only unique elements.
This is not exactly an answer to your question, but I have a class that increases the performance of Contains() on a collection. I subclassed a Queue and added a Dictionary that maps hashcodes to lists of objects. The Dictionary.Contains()
function is O(1) whereas List.Contains()
, Queue.Contains()
, and Stack.Contains()
are O(n).
The value-type of the dictionary is a queue holding objects with the same hashcode. The caller can supply a custom class object that implements IEqualityComparer. You could use this pattern for Stacks or Lists. The code would need just a few changes.
/// <summary>
/// This is a class that mimics a queue, except the Contains() operation is O(1) rather than O(n) thanks to an internal dictionary.
/// The dictionary remembers the hashcodes of the items that have been enqueued and dequeued.
/// Hashcode collisions are stored in a queue to maintain FIFO order.
/// </summary>
/// <typeparam name="T"></typeparam>
private class HashQueue<T> : Queue<T>
{
private readonly IEqualityComparer<T> _comp;
public readonly Dictionary<int, Queue<T>> _hashes; //_hashes.Count doesn't always equal base.Count (due to collisions)
public HashQueue(IEqualityComparer<T> comp = null) : base()
{
this._comp = comp;
this._hashes = new Dictionary<int, Queue<T>>();
}
public HashQueue(int capacity, IEqualityComparer<T> comp = null) : base(capacity)
{
this._comp = comp;
this._hashes = new Dictionary<int, Queue<T>>(capacity);
}
public HashQueue(IEnumerable<T> collection, IEqualityComparer<T> comp = null) : base(collection)
{
this._comp = comp;
this._hashes = new Dictionary<int, Queue<T>>(base.Count);
foreach (var item in collection)
{
this.EnqueueDictionary(item);
}
}
public new void Enqueue(T item)
{
base.Enqueue(item); //add to queue
this.EnqueueDictionary(item);
}
private void EnqueueDictionary(T item)
{
int hash = this._comp == null ? item.GetHashCode() : this._comp.GetHashCode(item);
Queue<T> temp;
if (!this._hashes.TryGetValue(hash, out temp))
{
temp = new Queue<T>();
this._hashes.Add(hash, temp);
}
temp.Enqueue(item);
}
public new T Dequeue()
{
T result = base.Dequeue(); //remove from queue
int hash = this._comp == null ? result.GetHashCode() : this._comp.GetHashCode(result);
Queue<T> temp;
if (this._hashes.TryGetValue(hash, out temp))
{
temp.Dequeue();
if (temp.Count == 0)
this._hashes.Remove(hash);
}
return result;
}
public new bool Contains(T item)
{ //This is O(1), whereas Queue.Contains is (n)
int hash = this._comp == null ? item.GetHashCode() : this._comp.GetHashCode(item);
return this._hashes.ContainsKey(hash);
}
public new void Clear()
{
foreach (var item in this._hashes.Values)
item.Clear(); //clear collision lists
this._hashes.Clear(); //clear dictionary
base.Clear(); //clear queue
}
}
My simple testing shows that my HashQueue.Contains()
runs much faster than Queue.Contains()
. Running the test code with count set to 10,000 takes 0.00045 seconds for the HashQueue version and 0.37 seconds for the Queue version. With a count of 100,000, the HashQueue version takes 0.0031 seconds whereas the Queue takes 36.38 seconds!
Here's my testing code:
static void Main(string[] args)
{
int count = 10000;
{ //HashQueue
var q = new HashQueue<int>(count);
for (int i = 0; i < count; i++) //load queue (not timed)
q.Enqueue(i);
System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < count; i++)
{
bool contains = q.Contains(i);
}
sw.Stop();
Console.WriteLine(string.Format("HashQueue, {0}", sw.Elapsed));
}
{ //Queue
var q = new Queue<int>(count);
for (int i = 0; i < count; i++) //load queue (not timed)
q.Enqueue(i);
System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < count; i++)
{
bool contains = q.Contains(i);
}
sw.Stop();
Console.WriteLine(string.Format("Queue, {0}", sw.Elapsed));
}
Console.ReadLine();
}