Efficiently merge string arrays in .NET, keeping distinct values
I'm using .NET 3.5. I have two string arrays, which may share one or more values:
string[] list1 = new string[] { "apple", "orange", "banana" };
string[] list2 = new string[] { "banana", "pear", "grape" };
I'd like a way to merge them into one array with no duplicate values:
{ "apple", "orange", "banana", "pear", "grape" }
I can do this with LINQ:
string[] result = list1.Concat(list2).Distinct().ToArray();
but I imagine that's not very efficient for large arrays.
Is there a better way?
Use Union:

string[] result = list1.Union(list2).ToArray();
From MSDN: "This method excludes duplicates from the return set. This is different behavior to the Concat<TSource> method, which returns all the elements in the input sequences including duplicates."
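For example, with the arrays from the question (Union keeps the order of first occurrence, so the output matches the expected result above):

using System;
using System.Linq;

class Program
{
    static void Main()
    {
        string[] list1 = { "apple", "orange", "banana" };
        string[] list2 = { "banana", "pear", "grape" };

        string[] result = list1.Union(list2).ToArray();

        // Prints: apple, orange, banana, pear, grape
        Console.WriteLine(string.Join(", ", result));
    }
}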
Why do you imagine that it would be inefficient? As far as I'm aware, both Concat and Distinct are evaluated lazily, using a HashSet behind the scenes for Distinct to keep track of the elements which have already been returned.
I'm not sure how you'd manage to make it more efficient than that in a general way :)
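To see the laziness in action, here's a minimal sketch with a hypothetical Log helper (the name is just for illustration) that traces when each source element is actually read:

using System;
using System.Collections.Generic;
using System.Linq;

class LazyDemo
{
    // Hypothetical tracing wrapper: reports each element as it is pulled through.
    static IEnumerable<string> Log(IEnumerable<string> source, string name)
    {
        foreach (string item in source)
        {
            Console.WriteLine("{0} yielded {1}", name, item);
            yield return item;
        }
    }

    static void Main()
    {
        string[] list1 = { "apple", "orange", "banana" };
        string[] list2 = { "banana", "pear", "grape" };

        // Building the query reads nothing - it's just a description of the work.
        var query = Log(list1, "list1").Concat(Log(list2, "list2")).Distinct();
        Console.WriteLine("Query built; no elements read yet");

        // Elements stream through one at a time as we iterate.
        foreach (string item in query)
        {
            Console.WriteLine("Result: {0}", item);
        }
    }
}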
EDIT: Distinct actually uses Set (an internal class) instead of HashSet, but the gist is still correct. This is a really good example of just how neat LINQ is. The simplest answer is pretty much as efficient as you can achieve without more domain knowledge.
The effect is the equivalent of:
public static IEnumerable<T> DistinctConcat<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    // HashSet<T>.Add returns false for an element that's already present,
    // so each value is yielded at most once, in first-seen order.
    HashSet<T> returned = new HashSet<T>();

    foreach (T element in first)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }

    foreach (T element in second)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
}
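Called with the arrays from the question (ToArray requires a using System.Linq directive):

string[] result = DistinctConcat(list1, list2).ToArray();
// { "apple", "orange", "banana", "pear", "grape" }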