Except has similar effect to Distinct?

I just discovered that Except() will remove all elements in the second list from the first, but it also has the effect that it makes all elements in the returned result distinct.

Simple way around I am using is Where(v => !secondList.Contains(v))

Can anyone explain to me why this is the behavior, and if possible point me to the documentation that explains this?


Solution 1:

The documentation for the Except function states:

Produces the set difference of two sequences by using the default equality comparer to compare values.

The set difference of two sets is defined as the members of the first set that do not appear in the second set.

The important word here is set, which is defined as:

...an abstract data structure that can store certain values, without any particular order, and no repeated values...

Because Except is documented as a set-based operation, it also has the effect of making the resulting values distinct.

Solution 2:

You wrote:

Simple way around I am using is Where(v => !secondList.Contains(v))

When you do this, there is still Distict done with secondList.

For example:

var firstStrings = new [] { "1", null, null, null, "3", "3" };
var secondStrings = new [] { "1", "1", "1", null, null, "4" };
var resultStrings = firstStrings.Where(v => !secondStrings.Contains(v)); // 3, 3  

I created an extension method to have no distinct at all. Examle of usage:

var result2Strings = firstStrings.ExceptAll(secondStrings).ToList(); // null, 3, 3

This is what it does:

enter image description here

This is the source:

public static IEnumerable<TSource> ExceptAll<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    // Do not call reuse the overload method because that is a slower imlementation
    if (first == null) { throw new ArgumentNullException("first"); }
    if (second == null) { throw new ArgumentNullException("second"); }

    var secondList = second.ToList();
    return first.Where(s => !secondList.Remove(s));
}

public static IEnumerable<TSource> ExceptAll<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    if (first == null) { throw new ArgumentNullException("first"); }
    if (second == null) { throw new ArgumentNullException("second"); }
    var comparerUsed = comparer ?? EqualityComparer<TSource>.Default;

    var secondList = second.ToList();
    foreach (var item in first)
    {
        if (secondList.Contains(item, comparerUsed))
        {
            secondList.Remove(item);
        }
        else
        {
            yield return item;
        }
    }
}

Edit: A faster implemetation, based on the comment of DigEmAll

public static IEnumerable<TSource> ExceptAll<TSource>(
        this IEnumerable<TSource> first,
        IEnumerable<TSource> second)
{
    return ExceptAll(first, second, null);
}

public static IEnumerable<TSource> ExceptAll<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    if (first == null) { throw new ArgumentNullException("first"); }
    if (second == null) { throw new ArgumentNullException("second"); }


    var secondCounts = new Dictionary<TSource, int>(comparer ?? EqualityComparer<TSource>.Default);
    int count;
    int nullCount = 0;

    // Count the values from second
    foreach (var item in second)
    {
        if (item == null)
        {
            nullCount++;
        }
        else
        {
            if (secondCounts.TryGetValue(item, out count))
            {
                secondCounts[item] = count + 1;
            }
            else
            {
                secondCounts.Add(item, 1);
            } 
        }
    }

    // Yield the values from first
    foreach (var item in first)
    {
        if (item == null)
        {
            nullCount--;
            if (nullCount < 0)
            {
                yield return item;
            } 
        }
        else
        {
            if (secondCounts.TryGetValue(item, out count))
            {
                if (count == 0)
                {
                    secondCounts.Remove(item);
                    yield return item;
                }
                else
                {
                    secondCounts[item] = count - 1;
                }
            }
            else
            {
                yield return item;
            }
        }
    }
}

More info on my blog (also variant for Intersect and Union)

Solution 3:

Given A = [1, 2, 2, 3, 3, 3] and B = [3].

  • A.Except(B); returns [1, 2] as Greg Beech explained in his response
  • A.ExceptAll(B); from Alex Siepman response, returns [1, 2, 2, 3, 3] (and I find the name ambiguous).
  • A.Where(v => !B.Contains(v)) from OP work around returns [1, 2, 2]

I suppose that OP work around is the desired behavior, and this one has not be treated.

The main issue with OP work around is that List<T>.Contains(T) is O(n) and Where is also O(n) making the solution O(n²) in time (for A and B of equivalent sizes) and O(1) in memory.

We can make it O(n) in time and O(n) in memory by using hash set:

// I accept any better name for this method
public static IEnumerable<TSource> ExceptFrom<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    if (first == null)
        throw new ArgumentNullException(nameof(first));

    if (second == null)
        throw new ArgumentNullException(nameof(second));

    var secondSet = second as HashSet<TSource> ?? // this trick ignore the comparer
                    second.ToHashSet(comparer ?? EqualityComparer<TSource>.Default);

    // Contains is O(1) for HashSet.
    return first.Where(v => !secondSet.Contains(v));
}