How to efficiently export 3 million documents from MongoDB Atlas collection to CSV file using NodeJS
I personally would NEVER recommend using .toArray() to query an indefinite or very large number of documents: you're asking Node.js to fetch 3 million documents from your db and hold the whole thing in one humongosaurus array in memory. Hope you have a high-spec machine. It also won't write anything to the file until that entire array exists in memory, if that ever happens.
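For reference, the pattern I'm advising against looks roughly like this (just a sketch with the same placeholder names as the script below):

// Anti-pattern (sketch): every matching document is buffered in one array
// before a single CSV row is written.
const docs = await db.collection('collection_name_here').find({}).toArray();
csv.writeToPath('./myfile.csv', docs, { headers: true });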
Instead, I would iterate the cursor so that at any given point only one document is in memory, and write each one to the CSV stream, which is piped to the write stream.
Since you are reading 3 million documents, it is going to take time; it's just a matter of how much. At least with this approach you're writing documents to the CSV file as they're read, instead of dumping the whole thing from memory at once.
I wrote a small script that iterates over the whole thing via cursor.next(), document-by-document, and tested it.
const { MongoClient } = require('mongodb');
const csv = require('fast-csv');
const fs = require('fs');
// or as an ES module:
// import { MongoClient } from 'mongodb';

// Connection URL
const url = 'supersecretmongourlhere';
const client = new MongoClient(url);

// Database name
const dbName = 'db_name_here';

const run = async () => {
  try {
    await client.connect();
    console.log('db connected');
    const db = client.db(dbName);

    console.time('X');

    const pipelineStages = [
      {
        $match: {
          // your pipeline here
        }
      },
      {
        $sort: { _id: -1 }
      },
      {
        $limit: 50000
      }
    ];

    const cursor = db.collection('collection_name_here').aggregate(pipelineStages);

    const csvStream = csv.format({ headers: true });
    const writeStream = fs.createWriteStream('./myfile.csv');

    // Writable streams emit 'finish' (not 'end') once everything is flushed.
    csvStream
      .pipe(writeStream)
      .on('finish', () => console.log('DONE'))
      .on('error', (err) => console.error(err));

    // Iterate the cursor document-by-document: only one doc is held in memory at a time.
    while (await cursor.hasNext()) {
      const doc = await cursor.next();
      csvStream.write(doc);
      console.log(`${doc._id} written`);
    }

    console.log('done');
    console.timeEnd('X');

    // Ending the csv stream is enough; the pipe ends the file stream for us.
    csvStream.end();
  } catch (e) {
    console.error(e);
  } finally {
    await client.close();
  }
};

run();
For 50,000 documents it took 1:17.987, i.e. 77,987 ms. That's 77,987 / 50,000 ≈ 1.56 ms per document, so 1.56 ms × 3,000,000 ≈ 4,680 s ≈ 78 minutes.
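If you want to redo that back-of-the-envelope projection from your own sample run, it's just:

// Rough projection from the 50,000-document sample run above
const sampleDocs = 50000;
const sampleMs = 77987;                        // 1:17.987
const msPerDoc = sampleMs / sampleDocs;        // ≈ 1.56 ms per document
const minutesFor3M = (msPerDoc * 3000000) / 60000;
console.log(`~${Math.round(minutesFor3M)} minutes for 3 million docs`); // ~78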
Is that acceptable in your use case?
Yes, I know this isn't a very fast solution, but it will work.
There might be a way to speed it up further by fetching docs from the db in larger batches; I'll look into it if I can.
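In the meantime, here's a rough, untested sketch of that idea: bump the cursor's batchSize so the driver pulls more docs per round trip, and pipe the cursor's readable stream straight through fast-csv into the file so backpressure is handled for you. The batchSize value and the names are just placeholders to tune.

const { MongoClient } = require('mongodb');
const csv = require('fast-csv');
const fs = require('fs');
const { pipeline } = require('stream/promises'); // Node 15+

const runStreamed = async () => {
  const client = new MongoClient('supersecretmongourlhere');
  try {
    await client.connect();
    const db = client.db('db_name_here');

    const pipelineStages = [
      { $match: { /* your pipeline here */ } },
      { $sort: { _id: -1 } }
    ];

    // batchSize controls how many documents the driver fetches per round trip;
    // 5000 is a guess, tune it for your document size.
    const cursor = db
      .collection('collection_name_here')
      .aggregate(pipelineStages, { batchSize: 5000 });

    // cursor.stream() is an object-mode Readable; pipeline() wires the three
    // streams together and propagates errors and backpressure.
    await pipeline(
      cursor.stream(),
      csv.format({ headers: true }),
      fs.createWriteStream('./myfile.csv')
    );
    console.log('DONE');
  } finally {
    await client.close();
  }
};

runStreamed().catch(console.error);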