Fastest way to remove duplicate documents in mongodb

dropDups: true option is not available in 3.0.

I have solution with aggregation framework for collecting duplicates and then removing in one go.

It might be somewhat slower than system level "index" changes. But it is good by considering way you want to remove duplicate documents.

a. Remove all documents in one go

var duplicates = [];

db.collectionName.aggregate([
  { $match: { 
    name: { "$ne": '' }  // discard selection criteria
  }},
  { $group: { 
    _id: { name: "$name"}, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }},
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // For faster processing if set is larger
)               // You can display result until this and check duplicates 
.forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting
    doc.dups.forEach( function(dupId){ 
        duplicates.push(dupId);   // Getting all duplicate ids
        }
    )
})

// If you want to Check all "_id" which you are deleting else print statement not needed
printjson(duplicates);     

// Remove all duplicates in one go    
db.collectionName.remove({_id:{$in:duplicates}})

b. You can delete documents one by one.

db.collectionName.aggregate([
  // discard selection criteria, You can remove "$match" section if you want
  { $match: { 
    source_references.key: { "$ne": '' }  
  }},
  { $group: { 
    _id: { source_references.key: "$source_references.key"}, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }}, 
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // For faster processing if set is larger
)               // You can display result until this and check duplicates 
.forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting
    db.collectionName.remove({_id : {$in: doc.dups }});  // Delete remaining duplicates
})

Assuming you want to permanently delete docs that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option:

db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true})

As the docs say, use extreme caution with this as it will delete data from your database. Back up your database first in case it doesn't do exactly as you're expecting.

UPDATE

This solution is only valid through MongoDB 2.x as the dropDups option is no longer available in 3.0 (docs).

Create collection dump with mongodump

Clear collection

Add unique index

Restore collection with mongorestore

I found this solution that works with MongoDB 3.4: I'll assume the field with duplicates is called fieldX

db.collection.aggregate([
{
    // only match documents that have this field
    // you can omit this stage if you don't have missing fieldX
    $match: {"fieldX": {$nin:[null]}}  
},
{
    $group: { "_id": "$fieldX", "doc" : {"$first": "$$ROOT"}}
},
{
    $replaceRoot: { "newRoot": "$doc"}
}
],
{allowDiskUse:true})

Being new to mongoDB, I spent a lot of time and used other lengthy solutions to find and delete duplicates. However, I think this solution is neat and easy to understand.

It works by first matching documents that contain fieldX (I had some documents without this field, and I got one extra empty result).

The next stage groups documents by fieldX, and only inserts the $first document in each group using $$ROOT. Finally, it replaces the whole aggregated group by the document found using $first and $$ROOT.

I had to add allowDiskUse because my collection is large.

You can add this after any number of pipelines, and although the documentation for $first mentions a sort stage prior to using $first, it worked for me without it. " couldnt post a link here, my reputation is less than 10 :( "

You can save the results to a new collection by adding an $out stage...

Alternatively, if one is only interested in a few fields e.g. field1, field2, and not the whole document, in the group stage without replaceRoot:

db.collection.aggregate([
{
    // only match documents that have this field
    $match: {"fieldX": {$nin:[null]}}  
},
{
    $group: { "_id": "$fieldX", "field1": {"$first": "$$ROOT.field1"}, "field2": { "$first": "$field2" }}
}
],
{allowDiskUse:true})

My DB had millions of duplicate records. @somnath's answer did not work as is so writing the solution that worked for me for people looking to delete millions of duplicate records.

/** Create a array to store all duplicate records ids*/
var duplicates = [];

/** Start Aggregation pipeline*/
db.collection.aggregate([
  {
    $match: { /** Add any filter here. Add index for filter keys*/
      filterKey: {
        $exists: false
      }
    }
  },
  {
    $sort: { /** Sort it in such a way that you want to retain first element*/
      createdAt: -1
    }
  },
  {
    $group: {
      _id: {
        key1: "$key1", key2:"$key2" /** These are the keys which define the duplicate. Here document with same value for key1 and key2 will be considered duplicate*/
      },
      dups: {
        $push: {
          _id: "$_id"
        }
      },
      count: {
        $sum: 1
      }
    }
  },
  {
    $match: {
      count: {
        "$gt": 1
      }
    }
  }
],
{
  allowDiskUse: true
}).forEach(function(doc){
  doc.dups.shift();
  doc.dups.forEach(function(dupId){
    duplicates.push(dupId._id);
  })
})

/** Delete the duplicates*/
var i,j,temparray,chunk = 100000;
for (i=0,j=duplicates.length; i<j; i+=chunk) {
    temparray = duplicates.slice(i,i+chunk);
    db.collection.bulkWrite([{deleteMany:{"filter":{"_id":{"$in":temparray}}}}])
}

Fastest way to remove duplicate documents in mongodb

Related

Recent Posts