Is the MongoDB aggregation framework faster than map/reduce?
Does the aggregation framework introduced in MongoDB 2.2 have any special performance improvements over map/reduce?
If so, why, how, and by how much?
(I have already run a test myself, and the performance was nearly the same.)
Solution 1:
Every test I have personally run (including tests using your own data) shows the aggregation framework being several times faster than map/reduce, and usually an order of magnitude faster.
Taking just 1/10th of the data you posted (and warming the cache first rather than clearing the OS cache, because I want to measure the performance of the aggregation, not how long it takes to page in the data), I got this:
MapReduce: 1,058ms
Aggregation Framework: 133ms
Removing the $match from the aggregation framework query and {query:} from mapReduce (because both would just use an index, and that's not what we want to measure) and grouping the entire dataset by key2, I got:
MapReduce: 18,803ms
Aggregation Framework: 1,535ms
Those results (roughly 8x and 12x faster, respectively) are very much in line with my previous experiments.
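The warm-cache measurement described above can be sketched roughly as follows. This is only an illustration, not the harness actually used for the numbers in this answer: `time_warm` is a hypothetical helper, `run_query` stands for any zero-argument callable (for example, a lambda wrapping a driver call), and taking the best of several runs is an assumption of this sketch.

```python
import time

def time_warm(run_query, warmups=1, repeats=3):
    """Run `run_query` untimed to warm caches, then return the best time in ms."""
    # Untimed warm-up runs so the timed runs do not pay for paging data in.
    for _ in range(warmups):
        run_query()
    best_ms = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best_ms = min(best_ms, elapsed_ms)
    return best_ms
```

With a real driver, `run_query` would wrap the mapReduce or aggregate call being compared, so both are timed under the same warm-cache conditions.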
Solution 2:
My benchmark:
== Data Generation ==
Generate 4 million rows (with Python), each approximately 350 bytes. Each document has these keys:
- key1, key2 (two random columns to test indexing: key1 with a cardinality of 20, and key2 with a cardinality of 2000)
- baddata: a long string to increase the size of each document
- value: a simple number (const 10) to test aggregation
Total data size was about 6 GB in MongoDB (and 2 GB in Postgres).
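import os
import random
from binascii import hexlify

from pymongo import MongoClient  # replaces the Connection class of older pymongo releases

db = MongoClient('127.0.0.1').test  # mongo connection
random.seed(1)
for _ in range(2):
    # 2 x 10 = 20 distinct key1 values, 2 x 1000 = 2000 distinct key2 values
    key1s = [hexlify(os.urandom(10)).decode('ascii') for _ in range(10)]
    key2s = [hexlify(os.urandom(10)).decode('ascii') for _ in range(1000)]
    baddata = 'some long date ' + '*' * 300  # padding to enlarge each document
    for i in range(2000):
        # 2 x 2000 batches of 1000 documents = 4 million documents in total
        data_list = [{
            'key1': random.choice(key1s),
            'key2': random.choice(key2s),
            'baddata': baddata,
            'value': 10,
        } for _ in range(1000)]
        # insert the whole batch (older pymongo releases used save() per document)
        db.testtable.insert_many(data_list)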
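As a sanity check, the sizes quoted in this benchmark can be derived from the loop bounds of the generation script. The sketch below only redoes that arithmetic; the variable names are made up for illustration, and the per-key1 figure assumes `random.choice` spreads documents evenly across the key1 values.

```python
# Loop bounds taken from the generation script.
outer_rounds = 2          # for _ in range(2)
key1s_per_round = 10      # len(key1s)
key2s_per_round = 1000    # len(key2s)
batches_per_round = 2000  # for i in range(2000)
docs_per_batch = 1000     # for _ in range(1000)

total_docs = outer_rounds * batches_per_round * docs_per_batch
key1_cardinality = outer_rounds * key1s_per_round
key2_cardinality = outer_rounds * key2s_per_round

# Expected number of documents matching a single key1 value,
# assuming a uniform spread across key1 values.
docs_per_key1 = total_docs // key1_cardinality

print(total_docs, key1_cardinality, key2_cardinality, docs_per_key1)
```

This gives 4,000,000 documents, cardinalities of 20 and 2000, and about 200,000 documents per key1 value, which matches the "about 200K rows" selected by the test query.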
== Tests ==
I ran several tests, but one is enough to compare the results:
NOTE: The server was restarted and the OS cache was cleared after each query, to eliminate the effect of caching.
QUERY: aggregate all rows with key1=somevalue (about 200K rows) and sum value for each key2
- map/reduce: 10.6 sec
- aggregate: 9.7 sec
- group: 10.3 sec
queries:
map/reduce:
db.testtable.mapReduce(function(){emit(this.key2, this.value);}, function(key, values){var i =0; values.forEach(function(v){i+=v;}); return i; } , {out:{inline: 1}, query: {key1: '663969462d2ec0a5fc34'} })
aggregate:
db.testtable.aggregate({ $match: {key1: '663969462d2ec0a5fc34'}}, {$group: {_id: '$key2', pop: {$sum: '$value'}} })
group:
db.testtable.group({key: {key2:1}, cond: {key1: '663969462d2ec0a5fc34'}, reduce: function(obj,prev) { prev.csum += obj.value; }, initial: { csum: 0 } })
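All three queries above compute the same thing: filter documents by key1, then sum value per key2. A plain-Python reference of that computation can help cross-check that the three approaches return the same totals on a small sample. This is only a sketch on in-memory dicts; `sum_value_by_key2` and `sample_docs` are hypothetical names and do not involve MongoDB at all.

```python
from collections import defaultdict

def sum_value_by_key2(docs, key1_value):
    """Sum `value` per `key2` over the documents whose `key1` matches."""
    totals = defaultdict(int)
    for doc in docs:
        if doc["key1"] == key1_value:           # the $match / query / cond step
            totals[doc["key2"]] += doc["value"]  # the $group + $sum / reduce step
    return dict(totals)

# Tiny made-up sample with the same shape as the benchmark documents.
sample_docs = [
    {"key1": "a", "key2": "x", "value": 10},
    {"key1": "a", "key2": "x", "value": 10},
    {"key1": "a", "key2": "y", "value": 10},
    {"key1": "b", "key2": "x", "value": 10},
]
result = sum_value_by_key2(sample_docs, "a")
print(result)
```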