What guarantees are there on the run-time complexity (Big-O) of LINQ methods?
I've recently started using LINQ quite a bit, and I haven't really seen any mention of run-time complexity for any of the LINQ methods. Obviously, there are many factors at play here, so let's restrict the discussion to the plain IEnumerable
LINQ-to-Objects provider. Further, let's assume that any Func
passed in as a selector / mutator / etc. is a cheap O(1) operation.
It seems obvious that all the single-pass operations (Select
, Where
, Count
, Take/Skip
, Any/All
, etc.) will be O(n), since they only need to walk the sequence once; although even this is subject to laziness.
Things are murkier for the more complex operations; the set-like operators (Union
, Distinct
, Except
, etc.) work using GetHashCode
by default (afaik), so it seems reasonable to assume they're using a hash-table internally, making these operations O(n) as well, in general. What about the versions that use an IEqualityComparer
?
OrderBy
would need a sort, so most likely we're looking at O(n log n). What if it's already sorted? How about if I say OrderBy().ThenBy()
and provide the same key to both?
I could see GroupBy
(and Join
) using either sorting, or hashing. Which is it?
Contains
would be O(n) on a List
, but O(1) on a HashSet
- does LINQ check the underlying container to see if it can speed things up?
And the real question - so far, I've been taking it on faith that the operations are performant. However, can I bank on that? STL containers, for example, clearly specify the complexity of every operation. Are there any similar guarantees on LINQ performance in the .NET library specification?
More question (in response to comments):
Hadn't really thought about overhead, but I didn't expect there to be very much for simple Linq-to-Objects. The CodingHorror post is talking about Linq-to-SQL, where I can understand parsing the query and making SQL would add cost - is there a similar cost for the Objects provider too? If so, is it different if you're using the declarative or functional syntax?
Solution 1:
There are very, very few guarantees, but there are a few optimizations:
Extension methods that use indexed access, such as
ElementAt
,Skip
,Last
orLastOrDefault
, will check to see whether or not the underlying type implementsIList<T>
, so that you get O(1) access instead of O(N).The
Count
method checks for anICollection
implementation, so that this operation is O(1) instead of O(N).Distinct
,GroupBy
Join
, and I believe also the set-aggregation methods (Union
,Intersect
andExcept
) use hashing, so they should be close to O(N) instead of O(N²).Contains
checks for anICollection
implementation, so it may be O(1) if the underlying collection is also O(1), such as aHashSet<T>
, but this is depends on the actual data structure and is not guaranteed. Hash sets override theContains
method, that's why they are O(1).OrderBy
methods use a stable quicksort, so they're O(N log N) average case.
I think that covers most if not all of the built-in extension methods. There really are very few performance guarantees; Linq itself will try to take advantage of efficient data structures but it isn't a free pass to write potentially inefficient code.
Solution 2:
I've long known that .Count()
returns .Count
if the enumeration is an IList
.
But I was always a bit weary about the run-time complexity of the Set operations: .Intersect()
, .Except()
, .Union()
.
Here's the decompiled BCL (.NET 4.0/4.5) implementation for .Intersect()
(comments mine):
private static IEnumerable<TSource> IntersectIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
Set<TSource> set = new Set<TSource>(comparer);
foreach (TSource source in second) // O(M)
set.Add(source); // O(1)
foreach (TSource source in first) // O(N)
{
if (set.Remove(source)) // O(1)
yield return source;
}
}
Conclusions:
- the performance is O(M + N)
- the implementation doesn't take advantage when the collections already are sets. (It may not be necessarily straightforward, because the used
IEqualityComparer<T>
also needs to match.)
For completeness, here are the implementations for .Union()
and .Except()
.
Spoiler alert: they, too, have O(N+M) complexity.
private static IEnumerable<TSource> UnionIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
Set<TSource> set = new Set<TSource>(comparer);
foreach (TSource source in first)
{
if (set.Add(source))
yield return source;
}
foreach (TSource source in second)
{
if (set.Add(source))
yield return source;
}
}
private static IEnumerable<TSource> ExceptIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
Set<TSource> set = new Set<TSource>(comparer);
foreach (TSource source in second)
set.Add(source);
foreach (TSource source in first)
{
if (set.Add(source))
yield return source;
}
}
Solution 3:
All you can really bank on is that the Enumerable methods are well-written for the general case and won't use naive algorithms. There is probably third-party stuff (blogs, etc.) that describe the algorithms actually in use, but these are not official or guaranteed in the sense that STL algorithms are.
To illustrate, here is the reflected source code (courtesy of ILSpy) for Enumerable.Count
from System.Core:
// System.Linq.Enumerable
public static int Count<TSource>(this IEnumerable<TSource> source)
{
checked
{
if (source == null)
{
throw Error.ArgumentNull("source");
}
ICollection<TSource> collection = source as ICollection<TSource>;
if (collection != null)
{
return collection.Count;
}
ICollection collection2 = source as ICollection;
if (collection2 != null)
{
return collection2.Count;
}
int num = 0;
using (IEnumerator<TSource> enumerator = source.GetEnumerator())
{
while (enumerator.MoveNext())
{
num++;
}
}
return num;
}
}
As you can see, it goes to some effort to avoid the naive solution of simply enumerating every element.
Solution 4:
I just broke out reflector and they do check the underlying type when Contains
is called.
public static bool Contains<TSource>(this IEnumerable<TSource> source, TSource value)
{
ICollection<TSource> is2 = source as ICollection<TSource>;
if (is2 != null)
{
return is2.Contains(value);
}
return source.Contains<TSource>(value, null);
}
Solution 5:
The correct answer is "it depends". it depends on what type the underlying IEnumerable is. i know that for some collections (like collections that implement ICollection or IList) there are special codepaths that are used, However the actual implementation is not guaranteed to do anything special. for example i know that ElementAt() has a special case for indexable collections, similarly with Count(). But in general you should probably assume the worst case O(n) performance.
In generaly i don't think you are going to find the kind of performance guarantees you want, though if you do run into a particular performance problem with a linq operator you can always just reimplement it for your particular collection. Also there are many blogs and extensibility projects which extend Linq to Objects to add these kinds of performance guarantees. check out Indexed LINQ which extends and adds to the operator set for more performance benefits.