mongodb: insert if not exists
Every day, I receive a stock of documents (an update). What I want to do is insert each item that does not already exist.
- I also want to keep track of the first time I inserted them, and the last time I saw them in an update.
- I don't want to have duplicate documents.
- I don't want to remove a document which has previously been saved, but is not in my update.
- 95% (estimated) of the records are unmodified from day to day.
I am using the Python driver (pymongo).
What I currently do is (pseudo-code):
for each document in update:
existing_document = collection.find_one(document)
if not existing_document:
document['insertion_date'] = now
else:
document = existing_document
document['last_update_date'] = now
my_collection.save(document)
My problem is that it is very slow (40 mins for less than 100 000 records, and I have millions of them in the update). I am pretty sure there is something builtin for doing this, but the document for update() is mmmhhh.... a bit terse.... (http://www.mongodb.org/display/DOCS/Updating )
Can someone advise how to do it faster?
Solution 1:
Sounds like you want to do an "upsert". MongoDB has built-in support for this. Pass an extra parameter to your update() call: {upsert:true}. For example:
key = {'key':'value'}
data = {'key2':'value2', 'key3':'value3'};
coll.update(key, data, upsert=True); #In python upsert must be passed as a keyword argument
This replaces your if-find-else-update block entirely. It will insert if the key doesn't exist and will update if it does.
Before:
{"key":"value", "key2":"Ohai."}
After:
{"key":"value", "key2":"value2", "key3":"value3"}
You can also specify what data you want to write:
data = {"$set":{"key2":"value2"}}
Now your selected document will update the value of "key2" only and leave everything else untouched.
Solution 2:
As of MongoDB 2.4, you can use $setOnInsert (http://docs.mongodb.org/manual/reference/operator/setOnInsert/)
Set 'insertion_date' using $setOnInsert and 'last_update_date' using $set in your upsert command.
To turn your pseudocode into a working example:
now = datetime.utcnow()
for document in update:
collection.update_one(
filter={
'_id': document['_id'],
},
update={
'$setOnInsert': {
'insertion_date': now,
},
'$set': {
'last_update_date': now,
},
},
upsert=True,
)
Solution 3:
You could always make a unique index, which causes MongoDB to reject a conflicting save. Consider the following done using the mongodb shell:
> db.getCollection("test").insert ({a:1, b:2, c:3})
> db.getCollection("test").find()
{ "_id" : ObjectId("50c8e35adde18a44f284e7ac"), "a" : 1, "b" : 2, "c" : 3 }
> db.getCollection("test").ensureIndex ({"a" : 1}, {unique: true})
> db.getCollection("test").insert({a:2, b:12, c:13}) # This works
> db.getCollection("test").insert({a:1, b:12, c:13}) # This fails
E11000 duplicate key error index: foo.test.$a_1 dup key: { : 1.0 }
Solution 4:
You may use Upsert with $setOnInsert operator.
db.Table.update({noExist: true}, {"$setOnInsert": {xxxYourDocumentxxx}}, {upsert: true})