Delete multiple documents based on group by and sort operation
I'm new here with MongoDB I need your help. I want to delete multiple documents from a collection based on certain condition i.e. grouping based on empID and deleting the records except latest record for that empID.
Suppose my dataset looks like:
[
{
_id: 1,
employeeId: "50052196",
name: "abc",
lastUpdatedOn: ISODate("2022-01-10T10:25:38.859+00:00")
},
{
_id: 2,
employeeId: "50052197",
name: "xyz",
lastUpdatedOn: ISODate("2022-01-17T10:25:38.859+00:00")
},
{
_id: 3,
employeeId: "50052198",
name: "pqr",
lastUpdatedOn: ISODate("2022-01-17T10:25:38.859+00:00")
},
{
_id: 4,
employeeId: "50052196",
name: "abc",
lastUpdatedOn: ISODate("2022-01-12T11:30:58.435+00:00")
},
{
_id: 5,
employeeId: "50052196",
name: "abc",
lastUpdatedOn: ISODate("2022-01-15T15:45:00.159+00:00")
}
]
In above dataset, employeeID : 50052196
is duplicate document which got inserted periodically along with field lastUpdatedOn
. As lastUpdatedOn
is nothing but record/document creation date.
Now, I want to keep the document having latest lastUpdatedOn value i.e. lastUpdatedOn:2022-01-15T15:45:00.159+00:00
for employeeID : 50052196
.
I have gone through MongoDB documentation and I found that we are not able to write delete query along with group by and sort operation Or if we choose Aggregate Pipeline then we are not able to delete documents inside aggregate.
I was checking the option of Bulk.find.remove()
, but again I'm not able to figure out how to group the employeeId.
Another approach I had in mind, based on some condition, I will add one field i.e. isActive: true
. And using normal deleteMany()
operation I can delete isActive: false
documents. But as per policy I can not modify the documents.
The insertion operation in the collection is handled by third party-application, which we can not modify.
As the documents are larger in numbers, and I wanted to write a code which will be less time and resource consuming at the same time it should be adhering to best practices. Help is much appreciated. Can anyone please suggest me good option to tackle out this situation. Help is much appreciated.
NOTE: Same type of deletion code I have to write for multiple MongoDB collections. It is kind of clean-up script that I'm thinking of to write.
You can use whatever pipeline you want and add an $out stage at the end, this will overwrite the chosen collection.
for example:
db.collection.aggregate([
{
$sort: {
lastUpdatedOn: -1
}
},
{
$group: {
_id: "$employeeId",
root: {
$first: "$$ROOT"
}
}
},
{
$replaceRoot: {
newRoot: "$root"
}
},
{
$out: "collection"
}
])
Mongo Playground
A more controlled way in code, iterate user by user:
const userIds = await db.collection.distinct('employeeId');
for (let i = 0; i < userIds.length; i++) {
const userId = userIds[i];
const employeeRecords = await db.collection.find({ 'employeeId': userId }).sort({ lastUpdatedOn: -1 }).toArray();
employeeRecords.pop();
await db.collection.deleteMany({ _id: { $in: employeeRecords.map(v => v._id) } });
}
(this should run in parallel. It's written in a for loop for clarity)