Using the multiprocessing module for cluster computing

I'm interested in running a Python program on a computer cluster. In the past I have used Python MPI interfaces, but because of difficulties compiling/installing them, I would prefer a solution that uses built-in modules, such as Python's multiprocessing module.

What I would really like to do is just set up a multiprocessing.Pool instance that would span across the whole computer cluster, and run a Pool.map(...). Is this something that is possible/easy to do?
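For reference, here is a minimal single-node version of the pattern I have in mind (the work function is just a placeholder):

from multiprocessing import Pool

def process_item(x):
    # placeholder for the real per-item work
    return x * x

if __name__ == "__main__":
    # this only uses the cores of one machine; I'd like the same
    # pattern to transparently span all nodes of the cluster
    with Pool() as pool:
        print(pool.map(process_item, range(100)))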

If this is impossible, I'd like to at least be able to start Process instances on any of the nodes from a central script with different parameters for each node.


Solution 1:

If by cluster computing you mean distributed memory systems (multiple nodes rather than SMP) then Python's multiprocessing may not be a suitable choice. It can spawn multiple processes, but they will still be bound within a single node.

What you will need is a framework that handles spawning processes across multiple nodes and provides a mechanism for communication between the processes (pretty much what MPI does).
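To give a flavour of that model, here is a minimal scatter/gather sketch with mpi4py (one of the MPI bindings, so it still requires an MPI installation; the work function is just a placeholder):

from mpi4py import MPI

def work(x):
    # placeholder per-item computation
    return x * x

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# the root rank splits the inputs into one chunk per process
chunks = [list(range(100))[i::size] for i in range(size)] if rank == 0 else None

# each process receives its chunk, works on it, and the results are gathered back
my_chunk = comm.scatter(chunks, root=0)
my_results = [work(x) for x in my_chunk]
all_results = comm.gather(my_results, root=0)

if rank == 0:
    print(all_results)

Launched with something like mpiexec -n 16 python script.py (plus whatever host-file option your MPI implementation uses), the processes are spread across the nodes.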

See the page on Parallel Processing on the Python wiki for a list of frameworks which will help with cluster computing.

From the list, pp, jug, pyro and celery look like sensible options, although I can't personally vouch for any of them since I have no experience with them (I mainly use MPI).

If ease of installation/use is important, I would start by exploring jug. It's easy to install, supports common batch cluster systems, and looks well documented.
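As a rough sketch of what a jug script looks like (the task function is just a placeholder):

# jugfile.py
from jug import TaskGenerator

@TaskGenerator
def process_item(x):
    # placeholder per-item computation
    return x * x

results = [process_item(i) for i in range(100)]

You then start jug execute jugfile.py on each node (e.g. as jobs submitted to your batch system); the processes coordinate through a shared store on the filesystem, so each task runs exactly once.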

Solution 2:

In the past I've used Pyro to do this quite successfully. If you turn on mobile code, it will automatically send required modules over the wire to nodes that don't already have them. Pretty nifty.
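A minimal Pyro4 sketch of a remote call, to give the flavour (this doesn't show the mobile-code part, and the class/method names are made up):

# on a worker node
import Pyro4

@Pyro4.expose
class Worker(object):
    def process(self, x):
        # placeholder per-item computation
        return x * x

daemon = Pyro4.Daemon()        # by default only reachable locally;
                               # pass host=<node hostname> for remote access
uri = daemon.register(Worker)  # yields a URI like PYRO:obj_...@host:port
print("worker uri:", uri)
daemon.requestLoop()

# on the controlling machine, connect with the URI printed by the worker:
# worker = Pyro4.Proxy(uri_printed_by_the_worker)
# print(worker.process(21))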

Solution 3:

I've had luck using SCOOP as an alternative to multiprocessing for single- or multi-computer use. You get the benefit of job submission for clusters, as well as many other features such as nested maps, and it takes only minimal code changes to get working with map().

The source is available on GitHub. A quick example shows just how simple the implementation can be!
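Something along these lines (the work function is a placeholder; workers and hosts are specified on the command line):

# script.py -- run with e.g. "python -m scoop script.py";
# with a hostfile, SCOOP spreads the workers over several machines
from scoop import futures

def process_item(x):
    # placeholder per-item computation
    return x * x

if __name__ == "__main__":
    results = list(futures.map(process_item, range(100)))
    print(results)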

Solution 4:

If you are willing to pip install an open-source package, you should consider Ray, which, among the Python cluster frameworks, is probably the option that comes closest to the single-threaded Python experience. It allows you to parallelize both functions (as tasks) and stateful classes (as actors), and it handles all of the data shipping and serialization as well as exception message propagation automatically. It also allows similar flexibility to normal Python (actors can be passed around, tasks can call other tasks, there can be arbitrary data dependencies, etc.). More about that in the documentation.

As an example, this is how you would do your multiprocessing map example in Ray:

import ray
ray.init()  # start Ray; by default this uses the cores of the local machine

@ray.remote
def mapping_function(x):
    return x + 1

# .remote() submits the tasks asynchronously; ray.get() blocks until all results are ready
results = ray.get([mapping_function.remote(i) for i in range(100)])

The API is a little bit different from Python's multiprocessing API, but should be easier to use. There is a walk-through tutorial that describes how to handle data dependencies and actors, etc.
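For instance, the stateful-actor side of the API looks roughly like this (a minimal sketch; Counter is just an illustrative name, and it assumes ray.init() has already been called as above):

@ray.remote
class Counter(object):
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

counter = Counter.remote()                          # the actor runs in its own process
futures = [counter.increment.remote() for _ in range(5)]
print(ray.get(futures))                             # [1, 2, 3, 4, 5]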

You can install Ray with "pip install ray" and then run the above code on a single node. It is also easy to set up a cluster; see Cloud support and Cluster support.

Disclaimer: I'm one of the Ray developers.