Simple Java Map/Reduce framework [closed]

Have you checked out Akka? While Akka is really a distributed, actor-model-based concurrency framework, you can implement a lot of things with very little code. It's easy to divide work into pieces with it, and it automatically takes full advantage of a multi-core machine, as well as being able to use multiple machines to process work. It also feels more natural to me than using threads directly.

I have a Java map reduce example using Akka. It's not the easiest map reduce example, since it makes use of futures, but it should give you a rough idea of what's involved. There are several major things that my map reduce example demonstrates:

  • How to divide the work.
  • How to assign the work: Akka has a really simple messaging system as well as a work partitioner, whose schedule you can configure. Once I learned how to use it, I couldn't stop. It's just so simple and flexible. I was using all four of my CPU cores in no time. This is really great for implementing services.
  • How to know when the work is done and the result is ready to process: This is actually the portion that may be the most difficult and confusing to understand unless you're already familiar with Futures. You don't need to use Futures, since there are other options. I just used them because I wanted something shorter for people to grok.
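
The three steps above can be sketched without Akka at all, using only the JDK's `CompletableFuture` (Java 8+). This is not the Akka example mentioned in this answer, just a rough stand-in that shows the same shape. The class name `FutureMapReduceSketch`, the `sumOfSquares` workload, and the chunk size are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class FutureMapReduceSketch {

    // Map step (hypothetical workload): sum the squares of one chunk.
    static long sumOfSquares(List<Integer> chunk) {
        long sum = 0;
        for (int n : chunk) {
            sum += (long) n * n;
        }
        return sum;
    }

    // chunkSize must be positive.
    static long mapReduce(List<Integer> numbers, int chunkSize) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            // 1. Divide the work into chunks.
            List<CompletableFuture<Long>> partials = new ArrayList<>();
            for (int start = 0; start < numbers.size(); start += chunkSize) {
                List<Integer> chunk = numbers.subList(
                        start, Math.min(start + chunkSize, numbers.size()));
                // 2. Assign each chunk to a worker thread in the pool.
                partials.add(CompletableFuture.supplyAsync(() -> sumOfSquares(chunk), pool));
            }
            // 3. Know when the work is done: allOf completes once every
            //    partial result is ready, and only then does the reduce run.
            CompletableFuture<Long> total = CompletableFuture
                    .allOf(partials.toArray(new CompletableFuture[0]))
                    .thenApply(done -> partials.stream()
                            .mapToLong(CompletableFuture::join)
                            .sum());
            return total.join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            numbers.add(i);
        }
        System.out.println(mapReduce(numbers, 3)); // prints 385
    }
}
```

An actor framework would replace the explicit pool and the `allOf` call with messages between actors, but the divide / assign / collect structure stays the same.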

If you have any questions, StackOverflow actually has an awesome Akka Q&A section.


I think it is worth mentioning that these problems are history as of Java 8. An example:

int heaviestBlueBlock =
    blocks.stream()   // required in the released Java 8 API; use parallelStream() to spread work across cores
          .filter(b -> b.getColor() == BLUE)
          .map(Block::getWeight)
          .reduce(0, Integer::max);

In other words: single-node MapReduce is available in Java 8.
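
For anyone who wants to run this, here is a minimal self-contained version. The `Block` class and `Color` enum are not part of the JDK; they are hypothetical stand-ins written just for this sketch. It uses `parallelStream()` so the filter and map steps can run on several cores.

```java
import java.util.Arrays;
import java.util.List;

class HeaviestBlueBlock {

    // Hypothetical types, defined here only so the example compiles.
    enum Color { RED, BLUE }

    static final class Block {
        private final Color color;
        private final int weight;

        Block(Color color, int weight) {
            this.color = color;
            this.weight = weight;
        }

        Color getColor() { return color; }
        int getWeight() { return weight; }
    }

    static int heaviestBlue(List<Block> blocks) {
        return blocks.parallelStream()
                .filter(b -> b.getColor() == Color.BLUE) // keep blue blocks only
                .mapToInt(Block::getWeight)              // map: block -> weight
                .reduce(0, Integer::max);                // reduce: keep the maximum
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block(Color.RED, 50),
                new Block(Color.BLUE, 12),
                new Block(Color.BLUE, 30));
        System.out.println(heaviestBlue(blocks)); // prints 30
    }
}
```

Note that with an identity of 0, an input with no blue blocks yields 0 rather than an error.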

For more details, see Brian Goetz's presentation about Project Lambda.


I use the following structure:

int procs = Runtime.getRuntime().availableProcessors();
ExecutorService es = Executors.newFixedThreadPool(procs);

// Map: submit one task per unit of work.
List<Future<TaskResult>> results = new ArrayList<>();
for (int i = 0; i < tasks; i++)
    results.add(es.submit(new Task(i)));
// Reduce: consume the futures in submission order.
for (Future<TaskResult> future : results)
    reduce(future);
es.shutdown();
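
A complete version of that structure might look like the following. The `Task` here is a made-up `Callable` that squares its index, and the reduce step simply sums the results; both are placeholders for whatever real work you have.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ExecutorMapReduce {

    // Hypothetical task: "map" index i to i squared.
    static final class Task implements Callable<Integer> {
        private final int index;

        Task(int index) {
            this.index = index;
        }

        @Override
        public Integer call() {
            return index * index;
        }
    }

    static int run(int tasks) {
        int procs = Runtime.getRuntime().availableProcessors();
        ExecutorService es = Executors.newFixedThreadPool(procs);
        try {
            // Map: submit every task up front so they run concurrently.
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                results.add(es.submit(new Task(i)));
            }
            // Reduce: get() blocks until each result is ready.
            int total = 0;
            for (Future<Integer> future : results) {
                total += future.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            es.shutdown(); // let the JVM exit once the work is done
        }
    }

    public static void main(String[] args) {
        System.out.println(run(5)); // prints 30 (0 + 1 + 4 + 9 + 16)
    }
}
```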

I realise this might be a little after the fact, but you might want to have a look at the JSR166y ForkJoin classes from JDK7.

There is a backported library that works under JDK6 without any issues, so you don't have to wait until the next millennium to have a go with it. It sits somewhere between a raw executor and Hadoop, giving you a framework for running map reduce jobs within the current JVM.
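
As a rough idea of what that looks like, here is a small sketch using the `java.util.concurrent` versions of those classes as shipped in JDK7. The `SumTask` class and the threshold of 1,000 elements are my own choices for illustration, not anything prescribed by the library.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class ForkJoinSum {

    // Hypothetical task: sums data[from, to) by splitting the range recursively.
    static final class SumTask extends RecursiveTask<Long> {
        private static final int THRESHOLD = 1_000; // arbitrary cut-off

        private final int[] data;
        private final int from;
        private final int to;

        SumTask(int[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                // Small enough: do the work directly.
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += data[i];
                }
                return sum;
            }
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            left.fork();                     // run the left half asynchronously
            long rightSum = right.compute(); // compute the right half in this thread
            return left.join() + rightSum;   // reduce: combine both halves
        }
    }

    static long sum(int[] data) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new SumTask(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sum(data)); // prints 50005000
    }
}
```

The splitting in `compute()` plays the role of the map phase, and the `join()` plus addition is the reduce.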