Simple explanation of MapReduce?

Related to my CouchDB question.

Can anyone explain MapReduce in terms a numbnuts could understand?


Solution 1:

Going all the way down to the basics for Map and Reduce.


Map is a function which "transforms" items in some kind of list to another kind of item and put them back in the same kind of list.

suppose I have a list of numbers: [1,2,3] and I want to double every number, in this case, the function to "double every number" is function x = x * 2. And without mappings, I could write a simple loop, say

A = [1, 2, 3]
foreach (item in A) A[item] = A[item] * 2

and I'd have A = [2, 4, 6] but instead of writing loops, if I have a map function I could write

A = [1, 2, 3].Map(x => x * 2)

the x => x * 2 is a function to be executed against the elements in [1,2,3]. What happens is that the program takes each item, execute (x => x * 2) against it by making x equals to each item, and produce a list of the results.

1 : 1 => 1 * 2 : 2  
2 : 2 => 2 * 2 : 4  
3 : 3 => 3 * 2 : 6  

so after executing the map function with (x => x * 2) you'd have [2, 4, 6].


Reduce is a function which "collects" the items in lists and perform some computation on all of them, thus reducing them to a single value.

Finding a sum or finding averages are all instances of a reduce function. Such as if you have a list of numbers, say [7, 8, 9] and you want them summed up, you'd write a loop like this

A = [7, 8, 9]
sum = 0
foreach (item in A) sum = sum + A[item]

But, if you have access to a reduce function, you could write it like this

A = [7, 8, 9]
sum = A.reduce( 0, (x, y) => x + y )

Now it's a little confusing why there are 2 arguments (0 and the function with x and y) passed. For a reduce function to be useful, it must be able to take 2 items, compute something and "reduce" that 2 items to just one single value, thus the program could reduce each pair until we have a single value.

the execution would follows:

result = 0
7 : result = result + 7 = 0 + 7 = 7
8 : result = result + 8 = 7 + 8 = 15
9 : result = result + 9 = 15 + 9 = 24

But you don't want to start with zeroes all the time, so the first argument is there to let you specify a seed value specifically the value in the first result = line.

say you want to sum 2 lists, it might look like this:

A = [7, 8, 9]
B = [1, 2, 3]
sum = 0
sum = A.reduce( sum, (x, y) => x + y )
sum = B.reduce( sum, (x, y) => x + y )

or a version you'd more likely to find in the real world:

A = [7, 8, 9]
B = [1, 2, 3]

sum_func = (x, y) => x + y
sum = A.reduce( B.reduce( 0, sum_func ), sum_func )

Its a good thing in a DB software because, with Map\Reduce support you can work with the database without needing to know how the data are stored in a DB to use it, thats what a DB engine is for.

You just need to be able to "tell" the engine what you want by supplying them with either a Map or a Reduce function and then the DB engine could find its way around the data, apply your function, and come up with the results you want all without you knowing how it loops over all the records.

There are indexes and keys and joins and views and a lot of stuffs a single database could hold, so by shielding you against how the data is actually stored, your code are made easier to write and maintain.

Same goes for parallel programming, if you only specify what you want to do with the data instead of actually implementing the looping code, then the underlying infrastructure could "parallelize" and execute your function in a simultaneous parallel loop for you.

Solution 2:

MapReduce is a method to process vast sums of data in parallel without requiring the developer to write any other code other than the mapper and reduce functions.

The map function takes data in and churns out a result, which is held in a barrier. This function can run in parallel with a large number of the same map task. The dataset can then be reduced to a scalar value.

So if you think of it like a SQL statement

SELECT SUM(salary)
FROM employees
WHERE salary > 1000
GROUP by deptname

We can use map to get our subset of employees with salary > 1000 which map emits to the barrier into group size buckets.

Reduce will sum each of those groups. Giving you your result set.

just plucked this from my university study notes of the google paper