What is the best way to remove duplicates from a datatable?

I have checked the whole site and googled on the net but was unable to find a simple solution to this problem.

I have a datatable which has about 20 columns and 10K rows. I need to remove the duplicate rows in this datatable based on 4 key columns. Doesn't .Net have a function which does this? The function closest to what I am looking for was datatable.DefaultView.ToTable(true, array of columns to display), But this function does a distinct on all the columns.

It would be great if someone could help me with this.

EDIT: I am sorry for not being clear on this. This datatable is being created by reading a CSV file and not from a DB. So using an SQL query is not an option.

You can use Linq to Datasets. Check this. Something like this:

// Fill the DataSet.
DataSet ds = new DataSet();
ds.Locale = CultureInfo.InvariantCulture;
FillDataSet(ds);

List<DataRow> rows = new List<DataRow>();

DataTable contact = ds.Tables["Contact"];

// Get 100 rows from the Contact table.
IEnumerable<DataRow> query = (from c in contact.AsEnumerable()
                              select c).Take(100);

DataTable contactsTableWith100Rows = query.CopyToDataTable();

// Add 100 rows to the list.
foreach (DataRow row in contactsTableWith100Rows.Rows)
    rows.Add(row);

// Create duplicate rows by adding the same 100 rows to the list.
foreach (DataRow row in contactsTableWith100Rows.Rows)
    rows.Add(row);

DataTable table =
    System.Data.DataTableExtensions.CopyToDataTable<DataRow>(rows);

// Find the unique contacts in the table.
IEnumerable<DataRow> uniqueContacts =
    table.AsEnumerable().Distinct(DataRowComparer.Default);

Console.WriteLine("Unique contacts:");
foreach (DataRow uniqueContact in uniqueContacts)
{
    Console.WriteLine(uniqueContact.Field<Int32>("ContactID"));
}

How can I remove duplicate rows?. (Adjust the query there to join on your 4 key columns)

EDIT: with your new information I believe the easiest way would be to implement IEqualityComparer<T> and use Distinct on your data rows. Otherwise if you're working with IEnumerable/IList instead of DataTable/DataRow, it is certainly possible with some LINQ-to-objects kung-fu.

EDIT: example IEqualityComparer

public class MyRowComparer : IEqualityComparer<DataRow>
{

    public bool Equals(DataRow x, DataRow y)
    {
        return (x.Field<int>("ID") == y.Field<int>("ID")) &&
            string.Compare(x.Field<string>("Name"), y.Field<string>("Name"), true) == 0 &&
          ... // extend this to include all your 4 keys...
    }

    public int GetHashCode(DataRow obj)
    {
        return obj.Field<int>("ID").GetHashCode() ^ obj.Field<string>("Name").GetHashCode() etc.
    }
}

You can use it like this:

var uniqueRows = myTable.AsEnumerable().Distinct(MyRowComparer);

If you have access to Linq I think you should be able to use the built in group functionality on the in memory collection and pick out the duplicate rows

Search Google for Linq Group by for examples

It should be taken into account that Table.AcceptChanges() must be called to complete the deletion. Otherwise deleted row is still present in DataTable with RowState set to Deleted. And Table.Rows.Count is not changed after deletion.

What is the best way to remove duplicates from a datatable?

Related

Recent Posts