Python Multiprocessing combined with Multithreading

Solution 1:

1) Is this possible?

Yes.


2) Is there any point into it?

Yes. But generally not the point you're looking for.

First, just about every modern operating system uses a "flat" scheduler; there's no difference between 8 threads scattered across 3 programs and 8 threads across 8 programs.*

* Some programs can get a significant benefit by carefully using intraprocess-only locks or other synchronization primitives in some places where you know you're only sharing with threads from the same program—and, of course, by avoiding shared memory in those places—but you're not going to get that benefit by spreading your jobs across threads and your threads across processes evenly.

Second, even if you were using, say, old SunOS, in the default CPython interpreter, the Global Interpreter Lock (GIL) ensures that only one thread can be running Python code at a time. If you're spending your time running code from a C extension library that explicitly releases the GIL (like some NumPy functions), threads can help, but otherwise, they all just end up serialized anyway.
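To make the GIL point concrete, here is a minimal sketch (the function names are made up for illustration). time.sleep stands in for a blocking call that releases the GIL, while busy_loop is pure-Python arithmetic that holds it. On a default CPython build you would expect the sleeping threads to overlap and the busy threads to take roughly as long as running them one after another, but exact timings depend on your machine and interpreter.

```python
import threading
import time


def blocking_wait(seconds):
    # time.sleep releases the GIL while waiting, much like blocking I/O does.
    time.sleep(seconds)


def busy_loop(n):
    # Pure-Python arithmetic holds the GIL, so threads running this take turns.
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_in_threads(target, arg, n_threads):
    """Run target(arg) in n_threads threads and return the wall-clock time."""
    threads = [threading.Thread(target=target, args=(arg,)) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"4 sleeping threads: {timed_in_threads(blocking_wait, 0.2, 4):.2f} s")
    print(f"1 busy thread:      {timed_in_threads(busy_loop, 2_000_000, 1):.2f} s")
    print(f"4 busy threads:     {timed_in_threads(busy_loop, 2_000_000, 4):.2f} s")
```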

The main case where threads and processes are useful together is where you have both CPU-bound and I/O-bound work. In that case, usually one is feeding the other. If the I/O feeds the CPU, use a single thread pool in the main process to handle the I/O, then use a pool of worker processes to do the CPU work on the results. If it's the other way around, use a pool of worker processes to do the CPU work, then have each worker process use a thread pool to do the I/O.
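As a rough sketch of the first arrangement (I/O feeding CPU), using concurrent.futures from the standard library: fetch and crunch are hypothetical stand-ins for your real I/O and CPU steps, and the pool sizes are arbitrary. The thread pool lives in the main process and overlaps the waiting; each payload is then handed to a worker process for the CPU-heavy part.

```python
import hashlib
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fetch(name):
    """Hypothetical I/O-bound step; time.sleep stands in for a network or disk read."""
    time.sleep(0.1)
    return name.encode("utf-8")


def crunch(payload):
    """Hypothetical CPU-bound step: repeated hashing driven by a Python loop."""
    digest = payload
    for _ in range(20_000):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def run_pipeline(names, io_threads=8, cpu_processes=2):
    with ThreadPoolExecutor(max_workers=io_threads) as io_pool, \
            ProcessPoolExecutor(max_workers=cpu_processes) as cpu_pool:
        # Threads in the main process overlap the waiting; map() keeps input order.
        payloads = io_pool.map(fetch, names)
        # Each payload is pickled and sent to a worker process for the CPU work.
        return list(cpu_pool.map(crunch, payloads))


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module on start.
    for digest in run_pipeline(["a", "b", "c", "d"]):
        print(digest)
```

Note that crunch has to be a module-level function so it can be pickled and sent to the worker processes.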


3) This is my code but it hangs when i try to join the processes.

It's very hard to debug code when you don't give a minimal, complete, verifiable example.

However, I can see one obvious problem.

You're trying to use TQ as a producer-consumer queue, with t1 and t2 as producers and the filePro parent as the consumer. Your consumer doesn't call TQ.task_done() until after t1.join() and t2.join() return, which doesn't happen until those threads are done. But those producers won't finish because they're waiting for you to call TQ.task_done(). So, you've got a deadlock.

And, because the main thread of each of your child processes is deadlocked, those processes never finish, so p1.join() will block forever.

If you really want the main thread to wait until the other threads are done before doing any work, you don't need the producer-consumer idiom; just let the children do their work and exit without calling TQ.join(), and don't bother with TQ.task_done() in the parent. (Note that you're already doing this correctly with PQ.)
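Since the original code isn't shown, here is only a minimal sketch of that "wait first, then work" shape. The names are stand-ins: tq plays the role of TQ and the threads play the role of t1 and t2. The producers just put items and return, so joining them cannot block on task_done().

```python
import queue
import threading


def produce(tq, items):
    # The producer only enqueues and returns; it never calls tq.join().
    for item in items:
        tq.put(item)


def collect_after_threads_finish(batches):
    tq = queue.Queue()
    threads = [threading.Thread(target=produce, args=(tq, batch)) for batch in batches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # returns once the producer has queued everything
    # All producers are done, so the queue can simply be drained.
    results = []
    while True:
        try:
            results.append(tq.get_nowait())
        except queue.Empty:
            break
    return results


if __name__ == "__main__":
    print(sorted(collect_after_threads_finish([[1, 2, 3], [4, 5]])))
```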

If, on the other hand, you want them to work in parallel, don't try to join the child threads until you've finished your loop.
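A minimal sketch of that ordering, again with stand-in names since the original code isn't shown: each producer puts its items and then blocks in tq.join(), and the consumer calls task_done() inside its loop and joins the threads only afterwards. Knowing the expected item count up front is an assumption made to keep the sketch short; a sentinel value per producer would work too.

```python
import queue
import threading


def produce_and_wait(tq, items):
    for item in items:
        tq.put(item)
    # Blocks until every queued item has been marked with task_done().
    tq.join()


def consume_in_parallel(batches):
    tq = queue.Queue()
    threads = [
        threading.Thread(target=produce_and_wait, args=(tq, batch))
        for batch in batches
    ]
    for t in threads:
        t.start()

    expected = sum(len(batch) for batch in batches)
    results = []
    for _ in range(expected):
        item = tq.get()
        results.append(item * 2)  # placeholder for the real per-item work
        tq.task_done()            # acknowledged inside the loop, not after join

    # Only now is it safe to join: the producers' tq.join() calls can return.
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(sorted(consume_in_parallel([[1, 2, 3], [4, 5]])))
```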

Solution 2:

I compared the behaviour of the following 3 approaches on an IO+CPU expensive task and on a strictly CPU expensive blocking task:

  • multiprocessing only
  • multithreading only
  • both combined using fast_map function

Results for IO+CPU expensive tasks show a significant speed improvement when a combination of multiprocessing and multithreading is used. "-1" indicates that the ProcessPoolExecutor failed with a "too many open files" error.

[Figure: benchmark results for IO+CPU expensive tasks; image not included]

Results for strictly CPU expensive tasks show that multiprocessing on its own is slightly faster.

[Figure: benchmark results for strictly CPU expensive tasks; image not included]

The fast_map function spawns a process for each cpu-core*2 and creates a sufficient number of threads in each process to achieve full concurrency (unless the threads_limit argument is supplied). Source code, testing code, and more information are available on the fast_map GitHub page.

If someone wants to play around with it or just practically use it, it can be obtained with:

python3 -m pip install fast_map

And used like:

from fast_map import fast_map
import time

def wait_and_square(x):
    time.sleep(1)
    return x*x

for i in fast_map(wait_and_square, range(8), threads_limit=None):
    print(i)
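If you'd rather see the general idea without installing anything, the "a few processes, each running its own threads" layout can be approximated with the standard library. To be clear, this is not fast_map's actual implementation, just a sketch under my own assumptions: process_thread_map, split_evenly and run_chunk_in_threads are made-up names, the input is split into contiguous chunks, and each process starts one thread per item in its chunk.

```python
import os
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def wait_and_square(x):
    time.sleep(0.2)
    return x * x


def split_evenly(items, parts):
    """Split items into at most `parts` contiguous, roughly equal chunks."""
    items = list(items)
    parts = max(1, min(parts, len(items)))
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_chunk_in_threads(func, chunk):
    """Runs inside one worker process: one thread per item in the chunk."""
    if not chunk:
        return []
    with ThreadPoolExecutor(max_workers=len(chunk)) as pool:
        return list(pool.map(func, chunk))


def process_thread_map(func, items, processes=None):
    processes = processes or (os.cpu_count() or 1)
    chunks = split_evenly(items, processes)
    with ProcessPoolExecutor(max_workers=len(chunks)) as pool:
        futures = [pool.submit(run_chunk_in_threads, func, chunk) for chunk in chunks]
        # Chunks are contiguous and collected in order, so input order is kept.
        return [result for future in futures for result in future.result()]


if __name__ == "__main__":
    print(process_thread_map(wait_and_square, range(8), processes=2))
```

Like anything sent to a process pool, func has to be picklable, so it needs to be a module-level function rather than a lambda.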