MongoDB $lookup (aggregation): Put multiple matches into array rather than create multiple documents?

Using Ubuntu 21.04, MongoDB Community 4.4.9, pymongo in Python 3.9:

I'm merging data from two collections on one shared key, membershipNumber. membershipNumber is associated with a different user-level identifier, an_user_id, in another collection, and should be unique. However, in many cases, there are n an_user_ids for a single membershipNumber. Right now, this means that I have many duplicate membershipNumbers, causing there to be duplicate documents where everything - apart from an_user_id - is the same in my newly created collection.

In order to circumvent this issue, I want the following to happen:

  • whenever there are >1 an_user_ids which match a given membershipNumber, I want to create an array that holds ALL an_user_ids that match a given membershipNumber in a newly created collection (using $out)
  • that way, every membershipNumber in the collection will be unique.

One question re the practicality of this also remains: Will this mean I'll be able to $merge or $insert data which is linked via an_user_id and from a different collection/aggregation onto this newly created collection?

Any help would be hugely appreciated. Thanks!

Working code that I have (which however doesn't prevent duplication):

p = [
    {
        '$project' : {
            '_id' : 0,
            'membershipNumber' : 1,
            'address' : 1,
            'joinDate' : 1,
            'membershipType' : 1
        }
    },
    # THE JOIN!!    
    {
        '$lookup': {
            'from': "an_users", # the other collection
            'localField' :  'membershipNumber',
            'foreignField' : 'memref',
            'as': "details"
        }
    },
    # retain unmatchable cases    
    {
        '$unwind' : {
            'path' : '$details',
            'preserveNullAndEmptyArrays' : True
        } 
    },
    {
        '$project' : {
            '_id' : 0,
            'membershipNumber' : 1,
            'home' : 1,
            'joinDate' : 1,
            'membershipType' : 1,
            'an_user_id' : '$details.user_id',
        }
    },
    {
        '$out' : {
            'db' : 'mydb',
            'coll' : 'new_coll'
        }
    }
]

members.aggregate(pipeline=p)

And this is what the (unwanted) duplicate data look like in the new collection:

{
    "_id": 1,
    "membershipNumber": "123456",
    "membershipType": "STD",
    "home: "Hogwarts",
    "joinDate": {
        "$date": "2000-01-01T00:00:00.000Z"
    },
    "an_user_id": "12345"
},
{
    "_id": 2,
    "membershipNumber": "123456",
    "membershipType": "STD",
    "home": "Hogwarts"
    "joinDate": {
        "$date": "2000-01-01T00:00:00.000Z"
    },
    "an_user_id": "12346"
}

And this is what I'd like it to look like...

{
    "_id": 1,
    "membershipNumber": "123456",
    "membershipType": "STD",
    "home": "Hogwarts"
    "joinDate": {
        "$date": "2000-01-01T00:00:00.000Z"
    },
    "an_user_id": ["12345", "12346"] 
}


Solution 1:

Not exactly sure how the $out conditionally comes into play here, but given two collections as follows:

db.foo.insert([
    {_id:1, membershipNumber: 1, type: "STD"},
    {_id:3, membershipNumber: 5, type: "STD"},
    {_id:8, membershipNumber: 8, type: "STD"}
]);

db.foo2.insert([
    {_id:1, memref: 1, an_user_id: 1},
    {_id:2, memref: 1, an_user_id: 2},
    {_id:3, memref: 1, an_user_id: 3},
    {_id:4, memref: 5, an_user_id: 5}
    // No lookup for memref 8, just to test                                                            
]);

Then this pipeline produces the target output. No initial $project is required.

db.foo.aggregate([
    // Call the join field "an_user_id" because we are going to OVERWRITE                              
    // it in the next stage.  This avoids creating extra fields that we will                           
    // want to $unset later to minimize clutter.                                                       
    {$lookup: {from: "foo2",
               localField: "membershipNumber",
               foreignField: "memref",
               as: "an_user_id"}}

    // Turn the big array of objects into an array of just an_user_id:                                 
    ,{$addFields: {an_user_id: {$map: {
                                input: "$an_user_id",
                                in: "$$this.an_user_id"
                                }}
    }}
]);