faster mongoimport, in parallel in airflow?

tl;dr: there seems to be a limit on how fast data is inserted into our mongodb atlas cluster. Inserting data in parallel does not speed this up. How can we speed this up? Is our only option to get a larger mongodb atlas cluster with more Write IOPS? What even are write IOPS?

We replace and re-insert more than 10GB of data daily into our mongodb atlas cluster. We have the following 2 bash commands, wrapped in python functions to parameterize them, which we use with the BashOperator in airflow:

upload single JSON to mongo cluster

def mongoimport_file(mongo_table, file_name):
    # upload single file from /tmp directory into Mongo cluster
    # cleanup: remove .json in /tmp at the end
    uri = 'mongodb+srv://<user>:<pass>@our-cluster.dwxnd.gcp.mongodb.net/ourdb'
    return f"""
        echo INSERT \
        && mongoimport --uri "{uri}" --collection {mongo_table} --drop --file /tmp/{file_name}.json \
        && echo AND REMOVE LOCAL FILE... \
        && rm /tmp/{file_name}.json
    """

upload directory of JSONs to mongo cluster

def mongoimport_dir(mongo_table, dir_name):
    # upload directory of JSONs into mongo cluster
    # cleanup: remove directory at the end
    uri = 'mongodb+srv://<user>:<pass>@our-cluster.dwxnd.gcp.mongodb.net/ourdb'
    return f"""
        echo INSERT \
        && cat /tmp/{dir_name}/*.json | mongoimport --uri "{uri}" --collection {mongo_table} --drop \
        && echo AND REMOVE LOCAL FILES... \
        && rm -rf /tmp/{dir_name}
    """

These are called in airflow using the BashOperator:

import_to_mongo = BashOperator(
    task_id=f'mongo_import_v0__{this_table}',
    bash_command=mongoimport_file(mongo_table = 'tname', file_name = 'fname')
)

Both of these work, although with varying performance:

  • mongoimport_file with a single 5GB file: takes ~30 minutes to mongoimport
  • mongoimport_dir with 100 x 50MB files: takes ~1 hour to mongoimport

There is currently no parallelization with **mongoimport_dir**, and in fact it is slower than importing the same data as a single file.

  1. Within airflow, is it possible to parallelize the mongoimport of our directory of 100 JSONs to achieve a major speedup? If there's a parallel solution using python's pymongo that can't be done with mongoimport, we're happy to switch (although we'd strongly prefer to avoid loading these JSONs into memory). A sketch of the kind of fan-out we mean is included after the config screenshot below.
  2. What is the current bottleneck with importing to mongo? Is it (a) CPUs in our server / docker container, or (b) something in our mongo cluster configuration (cluster RAM, cluster vCPU, cluster max connections, or cluster read / write IOPS - what are these even?)? For reference, here is our mongo config. I assume we can speed up the import by getting a much bigger cluster, but mongodb atlas becomes very expensive very fast: 0.5 vCPUs doesn't sound like much, yet this already runs us $150 / month...

(screenshot: our mongodb atlas cluster configuration)
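
For concreteness, the kind of fan-out we have in mind for question 1 is roughly one BashOperator per file, something like the hypothetical sketch below (file_names is a placeholder list; note that mongoimport_file as written passes --drop, so each task would wipe the collection and a per-file variant would have to handle the drop separately):

from airflow.operators.bash import BashOperator  # airflow 2.x import path

# hypothetical list of the ~100 file names sitting in /tmp
file_names = [f'fname_{i:03d}' for i in range(100)]

# one import task per file; airflow can then run them in parallel,
# subject to the pool / parallelism settings of the deployment
import_tasks = [
    BashOperator(
        task_id=f'mongo_import_v0__tname__{file_name}',
        bash_command=mongoimport_file(mongo_table='tname', file_name=file_name),
    )
    for file_name in file_names
]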


First of all, "What is the current bottleneck with importing to mongo?" and "Is it (a) CPUs in our server / docker container?" - don't believe anyone who claims to know the answer from the screenshot you provided.

Atlas has monitoring tools that will tell you whether the bottleneck is CPU, RAM, disk, or network (or any combination of those) on the database side:

(screenshot: Atlas metrics / monitoring dashboard)

On the client side (airflow), use the system monitor of your host OS to answer that question, and test disk I/O inside docker; some combinations of host OS and docker storage driver have performed quite poorly in the past.
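
Even a crude write test run inside the container gives you a baseline to compare against; a minimal sketch (path and sizes are arbitrary, and a dedicated tool such as fio will give far better numbers):

import os
import time

def rough_write_throughput(path='/tmp/io_test.bin', size_mb=1024, chunk_mb=4):
    # crude sequential-write check: write size_mb of random data and fsync it
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    start = time.monotonic()
    with open(path, 'wb') as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the disk
    elapsed = time.monotonic() - start
    os.remove(path)
    return size_mb / elapsed  # MB/s

print(f'~{rough_write_throughput():.0f} MB/s sequential write in /tmp')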

Next, "What even are write IOPS" - random write operations per second https://cloud.google.com/compute/docs/disks/performance

IOPS are calculated differently by each cloud provider, so try AWS and Azure to compare cost vs. speed. An M10 on AWS gives you 2 vCPUs, though again I doubt you can compare tiers 1:1 between vendors. The good thing is it's all on-demand - spinning up a cluster, testing it, and deleting it will cost you less than a cup of coffee.
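
It also helps to translate the timings from the question into plain throughput before blaming any single component; a quick back-of-the-envelope calculation using the numbers above:

# effective throughput implied by the timings in the question
single_file = 5 * 1024 / (30 * 60)   # 5GB in ~30 min   -> ~2.8 MB/s
directory = 100 * 50 / (60 * 60)     # 100 x 50MB in ~1 h -> ~1.4 MB/s
print(f'single file: ~{single_file:.1f} MB/s, directory: ~{directory:.1f} MB/s')

A couple of MB/s is a number you can compare directly against the Atlas metrics and against the local disk / network figures from the tests above.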

Finally, "If there's a parallel solution using python's pymongo" - I doubt it. mongoimport uses batches of 100,000 documents, so it essentially sends data as fast as the stream is consumed on the receiving end. The limitations on the client side could be network, disk, or CPU. If it is network or disk, parallel import won't improve a thing. A multi-core system could benefit from parallel import if mongoimport were using a single CPU and that were the limiting factor, but by default mongoimport uses all available CPUs: https://github.com/mongodb/mongo-tools/blob/cac1bfbae193d6ba68abb764e613b08285c6f62d/common/options/options.go#L302. You can hardly beat it with pymongo.
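
If you still want to experiment with pymongo, a streaming insert that never loads whole files into memory would look roughly like the sketch below (URI, database / collection names, directory, batch size, and the one-document-per-line assumption are placeholders taken from the question); expect it to be at best on par with mongoimport for the reasons above:

import glob
import json
from pymongo import MongoClient

client = MongoClient('mongodb+srv://<user>:<pass>@our-cluster.dwxnd.gcp.mongodb.net/ourdb')
collection = client['ourdb']['tname']

batch, batch_size = [], 1000
for path in glob.glob('/tmp/dir_name/*.json'):
    with open(path) as f:
        for line in f:  # assumes newline-delimited JSON, one document per line
            if not line.strip():
                continue
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                collection.insert_many(batch, ordered=False)
                batch = []
if batch:
    collection.insert_many(batch, ordered=False)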