Poor lookup aggregation performance

I have two collections

Posts:

{
    "_Id": "1",
    "_PostTypeId": "1",
    "_AcceptedAnswerId": "192",
    "_CreationDate": "2012-02-08T20:02:48.790",
    "_Score": "10",
    ...
    "_OwnerUserId": "6",
    ...
},
...

and users:

{
    "_Id": "1",
    "_Reputation": "101",
    "_CreationDate": "2012-02-08T19:45:13.447",
    "_DisplayName": "Geoff Dalgas",
    ...
    "_AccountId": "2"
},
...

and I want to find users who write between 5 and 15 posts. This is how my query looks like:

db.posts.aggregate([
    {
        $lookup: {
            from: "users", 
            localField: "_OwnerUserId",
            foreignField: "_AccountId", 
            as: "X"
        }
    },  
    {
        $group: {
            _id: "$X._AccountId", 
            posts: { $sum: 1 }
        }
    },   
    {
        $match : {posts: {$gte: 5, $lte: 15}}
    },  
    {
        $sort: {posts: -1 }
    },
    {
        $project : {posts: 1}
    }
])

and it works terrible slow. For 6k users and 10k posts it tooks over 40 seconds to get response while in relational database I get response in a split second. Where's the problem? I'm just getting started with mongodb and it's quite possible that I messed up this query.


Solution 1:

from https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/

foreignField Specifies the field from the documents in the from collection. $lookup performs an equality match on the foreignField to the localField from the input documents. If a document in the from collection does not contain the foreignField, the $lookup treats the value as null for matching purposes.

This will be performed the same as any other query.

If you don't have an index on the field _AccountId, it will do a full tablescan query for each one of the 10,000 posts. The bulk of the time will be spent in that tablescan.

db.users.ensureIndex("_AccountId", 1) 

speeds up the process so it's doing 10,000 index hits instead of 10,000 table scans.

Solution 2:

In addition to bauman.space's suggestion to put an index on the _accountId field (which is critical), you should also do your $match stage as early as possible in the aggregation pipeline (i.e. as the first stage). Even though it won't use any indexes (unless you index the posts field), it will filter the result set before doing the $lookup (join) stage.

The reason why your query is terribly slow is that for every post, it is doing a non-indexed lookup (sequential read) for every user. That's around 60m reads!

Check out the Pipeline Optimization section of the MongoDB Aggregation Docs.