Find largest document size in MongoDB

Is it possible to find the largest document size in MongoDB?

db.collection.stats() shows the average document size, which is not really representative because in my case sizes can differ considerably.


Solution 1:

You can use a small shell script to get this value.

Note: this will perform a full collection scan, which will be slow on large collections.

// Scan every document, tracking the largest BSON size seen so far
let max = 0, id = null;
db.test.find().forEach(doc => {
    const size = Object.bsonsize(doc); // size of this document in bytes
    if (size > max) {
        max = size;
        id = doc._id;
    }
});
print(id, max); // _id and byte size of the largest document

Solution 2:

Note: this will attempt to store the whole result set in memory (because of .toArray). Be careful on big data sets, and do not use it in production! Abishek's answer has the advantage of working over a cursor instead of an in-memory array.

If you also want the _id, try this. Given a collection called "requests":

// Builds a list of {size, _id} pairs, sorts ascending by size, then pops the largest
db.requests.find().toArray()
    .map(function(request) { return { size: Object.bsonsize(request), _id: request._id }; })
    .sort(function(a, b) { return a.size - b.size; })
    .pop();

// { "size" : 3333, "_id" : "someUniqueIdHere" }

Solution 3:

Finding the largest documents in a MongoDB collection can be ~100x faster than the other answers if you use the aggregation framework and a tiny bit of knowledge about the documents in the collection. Also, you'll get the results in seconds, vs. minutes with the other approaches (forEach, or worse, fetching all documents to the client).

You need to know which field(s) in your documents might be the largest, which you almost always will. There are only two practical¹ MongoDB types that can have variable size:

  • arrays
  • strings

The aggregation framework can calculate the length of each. Note that for arrays you won't get the size in bytes, but the length in elements. However, what typically matters is which documents are the outliers, not exactly how many bytes they take.

Here's how it's done for arrays. As an example, let's say we have a collection of users in a social network and we suspect the array friends.ids might be very large (in practice you should probably keep a separate field like friendsCount in sync with the array, but for the sake of example, we'll assume that's not available):

db.users.aggregate([
    { $match: {
        'friends.ids': { $exists: true }
    }},
    { $project: { 
        sizeLargestField: { $size: '$friends.ids' } 
    }},
    { $sort: {
        sizeLargestField: -1
    }},
])
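
On a big collection you'll usually want just the top offenders rather than every document sorted. Appending a $limit stage (here capping at the ten largest, an arbitrary choice) keeps the result small:

db.users.aggregate([
    { $match: { 'friends.ids': { $exists: true } } },
    { $project: { sizeLargestField: { $size: '$friends.ids' } } },
    { $sort: { sizeLargestField: -1 } },
    { $limit: 10 }  // keep only the ten largest
])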

The key is to use the $size aggregation pipeline operator. It only works on arrays though, so what about text fields? We can use the $strLenBytes operator. Let's say we suspect the bio field might also be very large:

db.users.aggregate([
    { $match: {
        bio: { $exists: true }
    }},
    { $project: { 
        sizeLargestField: { $strLenBytes: '$bio' } 
    }},
    { $sort: {
        sizeLargestField: -1
    }},
])

You can also combine $size and $strLenBytes with $sum to rank documents by the total length of several fields, as sketched below. In the vast majority of cases, 20% of the fields will take up 80% of the size (if not 10/90 or even 1/99), and large fields must be either strings or arrays.
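
For example, sticking with the users collection above, a combined metric over the friends.ids array and the bio string might look like this (a sketch: the units are mixed, elements plus bytes, but that's fine for spotting outliers; both fields are matched on existence because $size and $strLenBytes raise errors on missing values):

db.users.aggregate([
    { $match: {
        'friends.ids': { $exists: true },
        bio: { $exists: true }
    }},
    { $project: {
        sizeLargestField: { $sum: [
            { $size: '$friends.ids' },   // array length in elements
            { $strLenBytes: '$bio' }     // string length in bytes
        ]}
    }},
    { $sort: { sizeLargestField: -1 }},
])

As a side note, if your server is on MongoDB 4.4 or newer, the $bsonSize aggregation operator can report the exact BSON size of a whole document in bytes, which sidesteps the per-field approach entirely.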


¹ Technically, the rarely used binData type can also have variable size.
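
If you do need to measure such a field, MongoDB 4.4+ also provides $binarySize, which returns the size in bytes of a string or binData value. A minimal sketch, assuming a hypothetical binData field called payload:

db.users.aggregate([
    { $match: { payload: { $exists: true } } },
    { $project: { sizeLargestField: { $binarySize: '$payload' } } },
    { $sort: { sizeLargestField: -1 } },
])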