Python multiprocessing pool hangs at join?
I'm trying to run some Python code on several files in parallel. The construct is basically:
def process_file(filename, foo, bar, baz=biz):
    # do stuff that may fail and cause exception

if __name__ == '__main__':
    # setup code setting parameters foo, bar, and biz
    psize = multiprocessing.cpu_count()*2
    pool = multiprocessing.Pool(processes=psize)
    map(lambda x: pool.apply_async(process_file, (x, foo, bar), dict(baz=biz)), sys.argv[1:])
    pool.close()
    pool.join()
I've previously used pool.map to do something similar and it worked great, but I can't seem to use that here because pool.map doesn't (appear to) allow me to pass in extra arguments (and using a lambda to do it won't work, because a lambda can't be pickled and sent to the workers).
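Just to illustrate the pickling limitation (a throwaway snippet, not my real code), trying to pickle a lambda directly fails:

import pickle

add_one = lambda x: x + 1
try:
    pickle.dumps(add_one)
except Exception as exc:
    # On CPython this raises pickle.PicklingError, which is why
    # multiprocessing can't ship a lambda off to a worker process.
    print('lambda is not picklable: %s' % exc)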
So now I'm trying to get things to work using apply_async() directly. My issue is that the code seems to hang and never exit. A few of the files fail with an exception, but I don't see why that would cause join() to hang. Interestingly, if none of the files fail with an exception, everything exits cleanly.
What am I missing?
Edit: When the function (and thus a worker) fails, I see this exception:
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 376, in _handle_results
    task = get()
TypeError: ('__init__() takes at least 3 arguments (1 given)', <class 'subprocess.CalledProcessError'>, ())
If I see even one of these, the parent process hangs forever, never reaping the children and exiting.
Sorry to answer my own question, but I've found at least a workaround, so in case anyone else has a similar issue I want to post it here. I'll accept any better answers out there.
I believe the root of the issue is http://bugs.python.org/issue9400. This tells me two things:
- I'm not crazy: what I'm trying to do really is supposed to work
- At least in Python 2, it is very difficult, if not impossible, to reliably pickle exceptions back to the parent process. Simple ones work, but many others don't.
In my case, my worker function was launching a subprocess that was segfaulting. That raised a CalledProcessError exception, which is not picklable. For some reason, this makes the pool object in the parent go out to lunch and never return from the call to join().
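If you want to check whether a particular exception instance will survive the trip back to the parent, a plain pickle round trip is a decent proxy. This is just an illustrative sketch; FussyError is a made-up stand-in for the kind of exception that breaks:

import pickle

def survives_pickle(exc):
    # True if this exception instance round-trips through pickle, i.e. the
    # kind of thing multiprocessing can hand back to the parent.
    try:
        pickle.loads(pickle.dumps(exc))
        return True
    except Exception:
        return False

# An exception whose __init__ requires arguments it doesn't forward to
# Exception.__init__ is a typical casualty: unpickling re-invokes the
# constructor with the wrong arguments and blows up.
class FussyError(Exception):
    def __init__(self, code, message):
        super(FussyError, self).__init__(message)
        self.code = code

print(survives_pickle(ValueError('boom')))     # True
print(survives_pickle(FussyError(2, 'boom')))  # False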
In my particular case, I don't care what the exception was. At most I want to log it and keep going. To do this, I simply wrap my top-level worker function in a try/except block. If the worker throws any exception, it is caught before it can be returned to the parent process, logged, and then the worker process exits normally since it's no longer trying to send the exception through. See below:
def process_file_wrapped(filename, foo, bar, baz=biz):
    try:
        process_file(filename, foo, bar, baz=baz)
    except:
        print('%s: %s' % (filename, traceback.format_exc()))
Then, I have my initial map function call process_file_wrapped() instead of the original one. Now my code works as intended.
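For completeness, the whole thing ends up looking roughly like this. It's only a sketch: process_file and the foo/bar/biz setup are the same placeholders as in the question, I've swapped the map(lambda ...) for a plain for loop (it does the same job here), and the wrapper takes baz=None instead of a module-level default:

import multiprocessing
import sys
import traceback

def process_file_wrapped(filename, foo, bar, baz=None):
    try:
        process_file(filename, foo, bar, baz=baz)
    except:
        # Nothing unpicklable ever travels back to the parent; just log and move on.
        print('%s: %s' % (filename, traceback.format_exc()))

if __name__ == '__main__':
    # setup code setting parameters foo, bar, and biz goes here
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count() * 2)
    for f in sys.argv[1:]:
        pool.apply_async(process_file_wrapped, (f, foo, bar), dict(baz=biz))
    pool.close()
    pool.join()  # now returns even when some of the files failed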
You can actually use a functools.partial instance instead of a lambda in cases where the object needs to be pickled. partial objects are picklable since Python 2.7 (and in Python 3).
pool.map(functools.partial(process_file, foo=foo, bar=bar, baz=biz), sys.argv[1:])
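Each filename from sys.argv[1:] is then passed to process_file as its first positional argument. A quick way to convince yourself that the partial itself survives pickling (illustrative only; scale is just a stand-in for a module-level worker function):

import functools
import pickle

def scale(x, factor):
    return x * factor

double = functools.partial(scale, factor=2)
restored = pickle.loads(pickle.dumps(double))  # round-trips fine per the note above
print(restored(21))  # 42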
For what it's worth, I had a similar (though not identical) bug where pool.map hung. My use case allowed me to use pool.terminate to solve it (make sure yours does as well before changing anything). I used pool.map before calling terminate, so I know everything had finished; from the docs:
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
If that's your use case, this may be a way to patch it.
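In other words, something along these lines (a sketch only, reusing the partial trick from the other answer and the question's foo/bar/biz placeholders). Note this only sidesteps the hang at close()/join(); if a worker raises an unpicklable exception, map itself can still misbehave, so you may still want the try/except wrapper from the workaround above:

import functools
import multiprocessing
import sys

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count() * 2)
    try:
        # map() blocks until every task is done, so once it returns there is
        # no outstanding work left in the pool.
        pool.map(functools.partial(process_file, foo=foo, bar=bar, baz=biz),
                 sys.argv[1:])
    finally:
        # terminate() stops the workers immediately instead of waiting on
        # close()/join().
        pool.terminate()
        pool.join()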