How to remove duplicates based on a key in MongoDB?

Solution 1:

This answer is obsolete: the dropDups option was removed in MongoDB 3.0, so a different approach will be required in most cases. For example, you could use aggregation, as suggested in "MongoDB duplicate documents even after adding unique key".
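For MongoDB 3.0 and later, an aggregation-based cleanup could look roughly like the sketch below. It groups documents by source_references.key, collects the _id values of each group, and removes all but the first one. The helper name idsToDelete and the field name ids are made up for illustration, and the sketch assumes source_references is a single embedded document rather than an array. The database calls only run where the shell's db global exists.

```javascript
// Pipeline: one output document per key value that occurs more than once.
var duplicatePipeline = [
  { $group: {
      _id: '$source_references.key',   // the key that defines a duplicate
      ids: { $push: '$_id' },          // "ids" is an illustrative field name
      count: { $sum: 1 }
  } },
  { $match: { count: { $gt: 1 } } }
];

// Hypothetical helper: keep the first _id of each group, return the rest.
function idsToDelete(groups) {
  var result = [];
  groups.forEach(function (group) {
    group.ids.slice(1).forEach(function (id) { result.push(id); });
  });
  return result;
}

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== 'undefined') {
  var groups = db.things.aggregate(duplicatePipeline, { allowDiskUse: true }).toArray();
  db.things.remove({ _id: { $in: idsToDelete(groups) } });
}
```

Documents that lack the field are grouped together under null, so put a $match stage in front of $group if those should be kept. If it matters which duplicate survives, add a $sort stage before $group as well.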

If you are certain that the source_references.key identifies duplicate records, you can ensure a unique index with the dropDups:true index creation option in MongoDB 2.6 or older:

db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})

This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.

Important Note: Any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse:true index creation option so the index only applies to documents with a source_references.key field.
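With the sparse option added, the index creation would look something like this on MongoDB 2.6 or older. The variable names are illustrative, and the call is only made where the shell's db global exists.

```javascript
// Unique index with dropDups, plus sparse:true so that documents
// without a source_references.key field are not indexed (and not dropped).
var indexKeys = { 'source_references.key': 1 };
var indexOptions = { unique: true, dropDups: true, sparse: true };

// MongoDB 2.6 or older only: dropDups was removed in 3.0.
if (typeof db !== 'undefined') {
  db.things.ensureIndex(indexKeys, indexOptions);
}
```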

Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.

Solution 2:

This is the easiest query I have used, on MongoDB 3.2:

db.myCollection.find({}, {myCustomKey:1}).sort({_id:1}).forEach(function(doc){
    db.myCollection.remove({_id:{$gt:doc._id}, myCustomKey:doc.myCustomKey});
})

Index myCustomKey before running this to speed it up.
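A minimal sketch of such an index, assuming MongoDB 3.0 or later where createIndex is available. Because the remove filters on myCustomKey by equality and on _id by range, a compound index on both fields is one reasonable choice; a single-field index on myCustomKey alone would also help. The variable name dedupIndex is illustrative.

```javascript
// Equality field first, then the range field used by the remove query.
var dedupIndex = { myCustomKey: 1, _id: 1 };

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== 'undefined') {
  db.myCollection.createIndex(dedupIndex);
}
```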

Solution 3:

While @Stennie's answer is valid, it is not the only way. In fact, the MongoDB manual asks you to be very cautious while doing that. There are two other options:

  1. Let MongoDB do it for you using Map Reduce
    • Another way
  2. Do it programmatically, which is less efficient.
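The programmatic option could be sketched as a single pass that remembers which key values have already been seen. The function name findDuplicateIds, and the collection and field names in the shell part, are hypothetical. Since it loads every document's key into memory, it is only suitable for collections of modest size.

```javascript
// Hypothetical helper: returns the _id of every document whose key value
// has already appeared earlier in the list (the first occurrence is kept).
function findDuplicateIds(docs, keyField) {
  var seen = {};
  var duplicates = [];
  docs.forEach(function (doc) {
    // Stringify so that non-string key values can be used as property names.
    var marker = JSON.stringify(doc[keyField]);
    if (Object.prototype.hasOwnProperty.call(seen, marker)) {
      duplicates.push(doc._id);
    } else {
      seen[marker] = true;
    }
  });
  return duplicates;
}

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== 'undefined') {
  var docs = db.myCollection.find({}, { myCustomKey: 1 }).sort({ _id: 1 }).toArray();
  db.myCollection.remove({ _id: { $in: findDuplicateIds(docs, 'myCustomKey') } });
}
```

Note that documents missing the key field all share the same marker, so all but the first of them would be treated as duplicates.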

Solution 4:

Here is a slightly more 'manual' way of doing it:

Essentially, first get a list of all the unique keys you are interested in.

Then run a search for each of those keys, and delete the extra documents whenever that search returns more than one.

    db.collection.distinct("key").forEach((num) => {
      var i = 0;
      db.collection.find({key: num}).forEach((doc) => {
        // keep the first document for this key, remove every later one by _id
        if (i) db.collection.remove({_id: doc._id});
        i++;
      });
    });