Mongoose populate vs object nesting

Is there any performance difference (query processing time) between using Mongoose population and direct object nesting? When should each be used?

Mongoose population example:

var personSchema = Schema({
  _id     : Number,
  name    : String,
  stories : [{ type: Schema.Types.ObjectId, ref: 'Story' }]
});

var storySchema = Schema({
  _creator : { type: Number, ref: 'Person' },
  title    : String,
});

Mongoose object nesting example:

var storySchema = Schema({
  _creator : Number,   // embedding personSchema back here would be circular
  title    : String
});

var personSchema = Schema({
  _id     : Number,
  name    : String,
  stories : [storySchema]
});

The first thing to understand about mongoose population is that it is not magic, but just a convenience method that allows you to retrieve related information without doing it all yourself.

The concept is essentially for use where you have decided to place data in a separate collection rather than embedding it. Your main considerations should typically be document size, or related information that is subject to frequent updates which would make maintaining embedded data unwieldy.

The "not magic" part is that, under the covers, when you "reference" another source the populate function makes an additional query/queries to that "related" collection in order to "merge" those results into the parent object you have retrieved. You could do this yourself, but the method is there for convenience to simplify the task. The obvious "performance" consideration is that there is not a single round trip to the database (MongoDB instance) to retrieve all the information. There is always more than one.

As a sample, take two collections:

{ 
    "_id": ObjectId("5392fea00ff066b7d533a765"),
    "customerName": "Bill",
    "items": [
        ObjectId("5392fee10ff066b7d533a766"),
        ObjectId("5392fefe0ff066b7d533a767")
    ]
}

And the items:

{ "_id": ObjectId("5392fee10ff066b7d533a766"), "prod": "ABC", "qty": 1 }
{ "_id": ObjectId("5392fefe0ff066b7d533a767"), "prod": "XYZ", "qty": 2 }

The "best" that can be done by a "referenced" model or the use of populate (under the hood) is this:

var order = db.orders.findOne({ "_id": ObjectId("5392fea00ff066b7d533a765") });
order.items = db.items.find({ "_id": { "$in": order.items } }).toArray();

So there are clearly "at least" two queries and operations in order to "join" that data.
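In application code, the same two-step "merge" can be sketched with plain objects standing in for the two collections (the names and sample data here are illustrative, mirroring the orders/items example above):

```javascript
// In-memory stand-ins for the "orders" and "items" collections.
const orders = [
  { _id: "o1", customerName: "Bill", items: ["i1", "i2"] }
];
const items = [
  { _id: "i1", prod: "ABC", qty: 1 },
  { _id: "i2", prod: "XYZ", qty: 2 }
];

// Query 1: fetch the parent document.
const order = orders.find(o => o._id === "o1");

// Query 2: fetch every referenced child, then overwrite the id array
// with the full documents. This is the merge populate() automates.
order.items = items.filter(i => order.items.includes(i._id));

console.log(order.items.length); // → 2
```

The point is not the merge itself, which is trivial, but that it always costs a second round of querying on top of the first.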

The embedding concept is essentially the MongoDB answer to not supporting "joins"1. Rather than splitting data into normalized collections, you embed the "related" data directly within the document that uses it. The advantages here are a single "read" operation for retrieving the "related" information, and a single point of "write" operations for updating both "parent" and "child" entries. It is, however, often not possible to write to "many" children at once without processing "lists" on the client or otherwise accepting "multiple" write operations, preferably in "batch" processing.

Data then rather looks like this (compared to the example above):

{ 
    "_id": ObjectId("5392fea00ff066b7d533a765"),
    "customerName": "Bill",
    "items": [
        { "_id": ObjectId("5392fee10ff066b7d533a766"), "prod": "ABC", "qty": 1 },
        { "_id": ObjectId("5392fefe0ff066b7d533a767"), "prod": "XYZ", "qty": 2 }
    ]
}

Therefore actually fetching the data is just a matter of:

db.orders.findOne({ "_id": ObjectId("5392fea00ff066b7d533a765") });

The pros and cons of either will always largely depend on the usage pattern of your application. But at a glance:

Embedding

  • Total document size with the embedded data should not exceed 16MB of storage (the BSON limit), and (as a guideline) embedded arrays should not grow to 500 or more entries.

  • Data that is embedded generally does not require frequent changes. You can live with the "duplication" that comes from de-normalization when it does not result in a need to update those "duplicates" with the same information across many parent documents just to make a change.

  • Related data is frequently used in association with the parent. If your "read/write" cases pretty much always need to "read/write" both parent and child, it makes sense to embed the data for atomic operations.
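The "atomic" point above can be sketched with the embedded order document from earlier. This is an in-memory simulation of what a single positional update (e.g. `updateOne` with `"items.$.qty"` via the `$` operator) achieves on the server; the ids are illustrative:

```javascript
// The embedded order document from the example above, in memory.
const order = {
  _id: "o1",
  customerName: "Bill",
  items: [
    { _id: "i1", prod: "ABC", qty: 1 },
    { _id: "i2", prod: "XYZ", qty: 2 }
  ]
};

// Changing a child changes the parent in the same single document,
// so one write operation covers both. The server-side equivalent is:
//   db.orders.updateOne(
//     { "_id": order._id, "items._id": "i1" },
//     { "$set": { "items.$.qty": 3 } }
//   )
const item = order.items.find(i => i._id === "i1");
item.qty = 3;
```

With a referenced model, the same logical change would mean a write to the child collection, and potentially a second write to keep the parent consistent.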

Referencing

  • The related data is always going to exceed the 16MB BSON limit. You can always consider a hybrid approach of "bucketing", but the general hard limit of the main document cannot be breached. Common cases are "post" and "comments" where "comment" activity is expected to be very large.

  • Related data needs regular updating. Or essentially the case where you "normalize" because that data is "shared" among many parents and the "related" data is changed frequently enough that it would be impractical to update embedded items in every "parent" where that "child" item occurs. The easier case is to just reference the "child" and make the change once.

  • There is a clear separation of reads and writes. Where you are not always going to require that "related" information when reading the "parent", or do not always need to alter the "parent" when writing to the child, there can be good reason to separate the model as referenced. Additionally, if there is a general desire to update many "sub-documents" at once, and those "sub-documents" are actually references to another collection, the implementation is quite often more efficient when the data is in a separate collection.
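The "regular updating" bullet can be sketched the same in-memory way, using the person/story shapes from the question (the sample data is illustrative):

```javascript
// Referenced model: the "shared" person data lives exactly once.
const people = [{ _id: 1, name: "Bill" }];
const stories = [
  { _id: "s1", _creator: 1, title: "First" },
  { _id: "s2", _creator: 1, title: "Second" }
];

// Renaming the person is a single write, no matter how many stories
// reference them.
people.find(p => p._id === 1).name = "William";

// Had the person been embedded in every story, the same change would
// mean rewriting each of those story documents instead.
```

This is the essential trade: reads pay for the extra query to join, but writes to shared data happen in one place.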

So there is actually a much wider discussion of the "pros/cons" of either position in the MongoDB documentation on Data Modelling, which covers various use cases and ways of approaching them with either embedding or the referenced model supported by the populate method.

Hopefully the "dot points" are of use, but the general recommendation is to consider the data usage patterns of your application and choose what is best. Having the "option" to embed "should" be the reason you chose MongoDB, but it is actually how your application "uses the data" that decides which method suits which part of your data modelling (it is not "all or nothing") best.

  1. Note that since this was originally written MongoDB introduced the $lookup operator, which does indeed perform "joins" between collections on the server. For the purposes of the general discussion here, whilst "better" in most circumstances than the "multiple query" overhead incurred by populate() and "multiple queries" in general, there is still a "significant overhead" incurred with any $lookup operation.
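For reference, a minimal sketch of such a server-side join for the orders/items sample above. The pipeline object is shown as it would be passed to `db.orders.aggregate(...)`; field names match that sample:

```javascript
// Join each order with its item documents on the server, replacing
// the array of ObjectIds with the full documents in the output.
const pipeline = [
  { $lookup: {
      from: "items",          // the referenced collection
      localField: "items",    // array of _id values on the order
      foreignField: "_id",    // key to match in the items collection
      as: "items"             // overwrite the id array in the result
  } }
];
// Usage: db.orders.aggregate(pipeline)
```

This moves the "merge" inside the server rather than the client, but the work of matching documents between collections still has to happen somewhere.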

The core design principle is "embedded" means "already there" as opposed to "fetching from somewhere else". Essentially the difference between "in your pocket" and "on the shelf", and in I/O terms usually more like "on the shelf in the library downtown", and notably further away for network based requests.