Using map/reduce for mapping the properties in a collection
Update: follow-up to MongoDB Get names of all keys in collection.
As pointed out by Kristina, one can use MongoDB's map/reduce to list the keys in a collection:
db.things.insert( { type : ['dog', 'cat'] } );
db.things.insert( { egg : ['cat'] } );
db.things.insert( { type : [] });
db.things.insert( { hello : [] } );
mr = db.runCommand({"mapreduce" : "things",
"map" : function() {
for (var key in this) { emit(key, null); }
},
"reduce" : function(key, stuff) {
return null;
},
"out" : "things" + "_keys"})
db[mr.result].distinct("_id")
//output: [ "_id", "egg", "hello", "type" ]
This works fine as long as we only want the keys at the first level of depth. However, it fails to retrieve keys located at deeper levels. If we add a new record:
db.things.insert({foo: {bar: {baaar: true}}})
and run the map-reduce + distinct snippet above again, we get:
[ "_id", "egg", "foo", "hello", "type" ]
But we do not get the bar and baaar keys, which are nested further down in the data structure. The question is: how do I retrieve all keys, no matter how deeply they are nested? Ideally, I would like the script to walk down through every level of depth, producing an output such as:
["_id","egg","foo","foo.bar","foo.bar.baaar","hello","type"]
Thank you in advance!
OK, this is a little more complex because you'll need to use some recursion.
To make the recursion happen, you'll need to be able to store some functions on the server.
Step 1: define some functions and put them server-side
isArray = function (v) {
return v && typeof v === 'object' && typeof v.length === 'number' && !(v.propertyIsEnumerable('length'));
}
m_sub = function(base, value){
for(var key in value) {
emit(base + "." + key, null);
if( isArray(value[key]) || typeof value[key] == 'object'){
m_sub(base + "." + key, value[key]);
}
}
}
db.system.js.save( { _id : "isArray", value : isArray } );
db.system.js.save( { _id : "m_sub", value : m_sub } );
Step 2: define the map and reduce functions
map = function(){
for(var key in this) {
emit(key, null);
if( isArray(this[key]) || typeof this[key] == 'object'){
m_sub(key, this[key]);
}
}
}
reduce = function(key, stuff){ return null; }
Step 3: run the map reduce and look at results
mr = db.runCommand({"mapreduce" : "things", "map" : map, "reduce" : reduce,"out": "things" + "_keys"});
db[mr.result].distinct("_id");
The results you'll get are:
["_id", "_id.isObjectId", "_id.str", "_id.tojson", "egg", "egg.0", "foo", "foo.bar", "foo.bar.baaar", "hello", "type", "type.0", "type.1"]
There's an obvious problem here: we're picking up two kinds of unexpected fields:
1. the _id data
2. the .0 (on egg and type)
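To experiment with the traversal without a server, the recursion can be mimicked locally. This is only a rough sketch: collectKeys is a hypothetical helper, a Set stands in for emit plus distinct, and a plain typeof check replaces the isArray helper. Plain objects carry no ObjectId here, so the _id entries do not show up, but the array indexes do.

```javascript
// Hypothetical local stand-in for the map/m_sub pair: walks plain objects
// and records every dotted key path, the way emit(path, null) would.
function collectKeys(docs) {
  const seen = new Set();

  function walk(base, value) {
    for (const key in value) {
      const path = base === "" ? key : base + "." + key;
      seen.add(path); // plays the role of emit(path, null)
      const child = value[key];
      if (child !== null && typeof child === "object") {
        walk(path, child); // arrays are objects too, so their indexes show up
      }
    }
  }

  for (const doc of docs) walk("", doc);
  return Array.from(seen).sort(); // rough equivalent of distinct("_id")
}

// The same sample documents as in the question, minus the generated _id.
const docs = [
  { type: ["dog", "cat"] },
  { egg: ["cat"] },
  { type: [] },
  { hello: [] },
  { foo: { bar: { baaar: true } } },
];
console.log(collectKeys(docs));
```

The printed list should contain the nested paths foo.bar and foo.bar.baaar alongside egg.0, type.0 and type.1.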
Step 4: Some possible fixes
For problem #1 the fix is relatively easy. Just modify the map function. Change this:
emit(base + "." + key, null); if( isArray...
to this:
if(key != "_id") { emit(base + "." + key, null); if( isArray... }
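One thing worth checking, as far as I can tell from the code above: entries such as _id.str are produced when the top-level loop in map hands this._id to m_sub, and inside m_sub the key being tested is str, not _id. If a guard inside m_sub alone does not remove them, a guard in map may be what is needed. Here is a minimal local sketch of that idea, with emit mocked by an array and a made-up _id object that only imitates an ObjectId:

```javascript
// Local sketch: `emitted` and `emit` stand in for the server-side emit().
const emitted = [];
function emit(key, value) {
  emitted.push(key);
}

function m_sub(base, value) {
  for (const key in value) {
    emit(base + "." + key, null);
    if (value[key] !== null && typeof value[key] === "object") {
      m_sub(base + "." + key, value[key]);
    }
  }
}

// Same shape as the map function, except that _id is emitted
// but never descended into.
function map() {
  for (const key in this) {
    emit(key, null);
    if (key !== "_id" && this[key] !== null && typeof this[key] === "object") {
      m_sub(key, this[key]);
    }
  }
}

// Made-up document; the _id value only imitates an ObjectId
// that exposes enumerable members.
const doc = { _id: { str: "abc123", isObjectId: true }, foo: { bar: 1 } };
map.call(doc);
console.log(emitted);
```

With this guard the top-level _id is still listed, but nothing beneath it is.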
Problem #2 is a little more dicey. You wanted all keys, and technically "egg.0" is a valid key. You can modify m_sub to ignore such numeric keys. But it's also easy to see a situation where this backfires. Say you have an associative array inside of a regular array, then you want that "0" to appear. I'll leave the rest of that solution up to you.
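One possible direction, offered as a sketch and not a definitive answer: skip a numeric key only when its parent really is an array, so that a "0" key inside an associative array survives. The version below assumes Array.isArray is available (it is in a local Node run; the server-side isArray helper could be swapped in), mocks emit with an array, and uses a made-up document. Note the design choice: array elements are still visited, but under the parent's path, so rows.0.name collapses to rows.name.

```javascript
// Local sketch: `emitted` and `emit` stand in for the server-side emit().
const emitted = [];
function emit(key, value) {
  emitted.push(key);
}

function m_sub(base, value) {
  const parentIsArray = Array.isArray(value);
  for (const key in value) {
    const child = value[key];
    // Inside an array, keep the parent's path so "egg.0" is never emitted,
    // but still look inside each element.
    const path = parentIsArray ? base : base + "." + key;
    if (!parentIsArray) {
      emit(path, null);
    }
    if (child !== null && typeof child === "object") {
      m_sub(path, child);
    }
  }
}

// Made-up document: a plain array, plus an array holding an object
// that has a genuine "0" key.
const doc = { egg: ["cat"], rows: [{ "0": "zero", name: "n" }] };
for (const key in doc) {
  emit(key, null);
  if (doc[key] !== null && typeof doc[key] === "object") {
    m_sub(key, doc[key]);
  }
}
console.log(emitted);
```

Here egg.0 is gone, while rows.0 is kept because that "0" belongs to an object, not to the array.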
With Gates VP's and Kristina's answers as inspiration, I created an open source tool called Variety which does exactly this: https://github.com/variety/variety
Hopefully you'll find it to be useful. Let me know if you have questions, or any issues using it.
I solved problem #2 stated by Gates, where for example data.0, data.1 and data.2 were returned. Even though these are valid keys as stated above, I wanted to get rid of them for presentation purposes. I solved it with a quick edit to the m_sub function, as shown below.
const m_sub = function (base, value) {
for (var key in value) {
if(key != "_id" && isNaN(key)){
emit(base + "." + key, null);
if (isArray(value[key]) || typeof value[key] == 'object') {
m_sub(base + "." + key, value[key]);
}
}
}
};
This version also includes the fix for problem #1 from above. The only other change is in the first if statement, where I changed this:
if(key != "_id")
To this using the isNaN(x) function:
if(key != "_id" && isNaN(key))
Hope this helps someone, and if there is a problem with this solution please give feedback!
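One caveat to be aware of: isNaN coerces its argument with Number, so keys such as "1e3", "0x10" or an empty string also count as numeric and would be dropped along with the array indexes. If that matters for your data, a stricter check is one option. The helper below, isArrayIndex, is a hypothetical name and just a sketch of the idea:

```javascript
// Hypothetical helper: true only for canonical array-index strings
// such as "0" or "12", not for anything Number() happens to parse.
function isArrayIndex(key) {
  return /^(0|[1-9]\d*)$/.test(key);
}

// Compare both checks on a few sample keys.
const sampleKeys = ["0", "12", "1e3", "0x10", "", "name"];
for (const key of sampleKeys) {
  console.log(
    JSON.stringify(key),
    "isNaN:", isNaN(key),
    "isArrayIndex:", isArrayIndex(key)
  );
}
```

To try it, the condition in m_sub would become key != "_id" && !isArrayIndex(key).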