MongoDB $lookup (aggregation): Put multiple matches into array rather than create multiple documents?
Using Ubuntu 21.04, MongoDB Community 4.4.9, pymongo in Python 3.9:
I'm merging data from two collections on one shared key, membershipNumber
. membershipNumber
is associated with a different user-level identifier, an_user_id
, in another collection, and should be unique. However, in many cases, there are n an_user_id
s for a single membershipNumber
. Right now, this means that I have many duplicate membershipNumber
s, causing there to be duplicate documents where everything - apart from an_user_id
- is the same in my newly created collection.
In order to circumvent this issue, I want the following to happen:
- whenever there are >1
an_user_id
s which match a givenmembershipNumber
, I want to create an array that holds ALLan_user_ids
that match a givenmembershipNumber
in a newly created collection (using$out
) - that way, every membershipNumber in the collection will be unique.
One question re the practicality of this also remains: Will this mean I'll be able to $merge
or $insert
data which is linked via an_user_id
and from a different collection/aggregation onto this newly created collection?
Any help would be hugely appreciated. Thanks!
Working code that I have (which however doesn't prevent duplication):
p = [
{
'$project' : {
'_id' : 0,
'membershipNumber' : 1,
'address' : 1,
'joinDate' : 1,
'membershipType' : 1
}
},
# THE JOIN!!
{
'$lookup': {
'from': "an_users", # the other collection
'localField' : 'membershipNumber',
'foreignField' : 'memref',
'as': "details"
}
},
# retain unmatchable cases
{
'$unwind' : {
'path' : '$details',
'preserveNullAndEmptyArrays' : True
}
},
{
'$project' : {
'_id' : 0,
'membershipNumber' : 1,
'home' : 1,
'joinDate' : 1,
'membershipType' : 1,
'an_user_id' : '$details.user_id',
}
},
{
'$out' : {
'db' : 'mydb',
'coll' : 'new_coll'
}
}
]
members.aggregate(pipeline=p)
And this is what the (unwanted) duplicate data look like in the new collection:
{
"_id": 1,
"membershipNumber": "123456",
"membershipType": "STD",
"home: "Hogwarts",
"joinDate": {
"$date": "2000-01-01T00:00:00.000Z"
},
"an_user_id": "12345"
},
{
"_id": 2,
"membershipNumber": "123456",
"membershipType": "STD",
"home": "Hogwarts"
"joinDate": {
"$date": "2000-01-01T00:00:00.000Z"
},
"an_user_id": "12346"
}
And this is what I'd like it to look like...
{
"_id": 1,
"membershipNumber": "123456",
"membershipType": "STD",
"home": "Hogwarts"
"joinDate": {
"$date": "2000-01-01T00:00:00.000Z"
},
"an_user_id": ["12345", "12346"]
}
Solution 1:
Not exactly sure how the $out
conditionally comes into play here, but given two collections as follows:
db.foo.insert([
{_id:1, membershipNumber: 1, type: "STD"},
{_id:3, membershipNumber: 5, type: "STD"},
{_id:8, membershipNumber: 8, type: "STD"}
]);
db.foo2.insert([
{_id:1, memref: 1, an_user_id: 1},
{_id:2, memref: 1, an_user_id: 2},
{_id:3, memref: 1, an_user_id: 3},
{_id:4, memref: 5, an_user_id: 5}
// No lookup for memref 8, just to test
]);
Then this pipeline produces the target output. No initial $project
is required.
db.foo.aggregate([
// Call the join field "an_user_id" because we are going to OVERWRITE
// it in the next stage. This avoids creating extra fields that we will
// want to $unset later to minimize clutter.
{$lookup: {from: "foo2",
localField: "membershipNumber",
foreignField: "memref",
as: "an_user_id"}}
// Turn the big array of objects into an array of just an_user_id:
,{$addFields: {an_user_id: {$map: {
input: "$an_user_id",
in: "$$this.an_user_id"
}}
}}
]);