Getting the "diff" between two arrays in C#?

Let's say I have these two arrays:

var array1 = new[] {"A", "B", "C"};
var array2 = new[] {"A", "C", "D"};

I would like to get the differences between the two. I know I could write this in just a few lines of code, but I want to make sure I'm not missing a built in language feature or a LINQ extension method.

Ideally, I would end up with the following three results:

  • Items not in array1, but are in array2 ("D")
  • Items not in array2, but are in array1 ("B")
  • Items that are in both

Thanks in advance!


If you've got LINQ available to you, you can use Except and Distinct. The sets you asked for in the question are respectively:

- array2.Except(array1)
- array1.Except(array2)
- array1.Intersect(array2)

from the MSDN 101 LINQ samples....

public void Linq52() {
    int[] numbersA = { 0, 2, 4, 5, 6, 8, 9 };
    int[] numbersB = { 1, 3, 5, 7, 8 };

    IEnumerable<int> aOnlyNumbers = numbersA.Except(numbersB);

    Console.WriteLine("Numbers in first array but not second array:");
    foreach (var n in aOnlyNumbers) {
        Console.WriteLine(n);
    }
}

Here are the benchmarks of LINQ extension methods. The results were obtained during the development of a real program.

The tests: 2 lists (lst1 and lst2) each approximately 250000 objects. Each object (class Key) contains a string and an integer. The second list mostly contains the same entries as the first one, but some new entries are added and some are removed.

I tested the Except extension method.

var except = lst2.Except(lst1);

List lst = except.ToList();

These 2 lines produced 600 items list of “new additions”. I timed it using the StopWatch object. The speed is astonishing:220 ms. The computer I used is by no means a “speedy Gonzales”. Core 2 Duo T7700 – 2.4GHz.

Note:

Here is the class Key, which implements IEquatable i-face.

public class Key : IEquatable<Key>
{
    public int Index { get; private set; }
    public string Name { get; private set; }

    public Key(string keyName, int sdIndex)
    {
        this.Name = keyName;
        this.Index = sdIndex;
    }

 // IEquatable implementation
    public bool Equals(Key other)
    {
        //Check whether the compared object is null.
        if (Object.ReferenceEquals(other, null)) return false;
        //Check whether the compared object references the same data.
        if (Object.ReferenceEquals(this, other)) return true;
        //Check whether the products' properties are equal.
        return Index.Equals(other.Index) && Name.Equals(other.Name);
    }

    // If Equals() returns true for a pair of objects 
    // then GetHashCode() must return the same value for these objects.
    public override int GetHashCode()
    {
        //Get hash code for the name field if it is not null.
        int hashKeyName = Name == null ? 0 : Name.GetHashCode();
        //Get hash code for the index field.
        int hashKeyIndex = Index.GetHashCode();
        //Calculate the hash code for the Key.
        return hashKeyName ^ hashKeyIndex;
    }
}

I've had to do things similar to this with very large sets of data. If you're dealing with a few thousand or so, use the Linq stuff since it's much clearer. But if you know that your arrays are pre-sorted, running a merge like this can do it significantly faster, since it only makes one pass through the data and doesn't need to allocate as much memory as the Linq version.

int iA = 0;
int iB = 0;
List<int> inA = new List<int>();
List<int> inB = new List<int>();
List<int> inBoth = new List<int>();
while (iA < numbersA.Length && iB < numbersB.Length)
{
    if (numbersA[iA] < numbersB[iB])
    {
        inA.Add(numbersA[iA++]);
    }
    else if (numbersA[iA] == numbersB[iB])
    {
        inBoth.Add(numbersA[iA++]);
        ++iB;
    }
    else
    {
        inB.Add(numbersB[iB++]);
    }
}
while (iA < numbersA.Length)
{
    inA.Add(numbersA[iA++]);
}
while (iB < numbersB.Length)
{
    inB.Add(numbersB[iB++]);
}

Again, this is really only needed if you are dealing with hundreds of thousands of values.