Joblib Parallel multiple cpu's slower than single
I've just started using the Joblib module and I'm trying to understand how the Parallel function works. Below is an example of where parallelizing leads to longer runtimes but I don't understand why. My runtime on 1 cpu was 51 sec vs. 217 secs on 2 cpu.
My assumption was that running the loop in parallel would copy lists a and b to each processor. Then dispatch item_n to one cpu and item_n+1 to the other cpu, execute the function and then write the results back to a list (in order). Then grab the next 2 items and so on. I'm obviously missing something.
Is this a poor example or use of joblib? Did I simply structure the code wrong?
Here is the example:
import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
## Create pairs of points for line segments
a = zip(np.random.rand(5000,2),np.random.rand(5000,2))
b = zip(np.random.rand(300,2),np.random.rand(300,2))
## Check if one line segment contains another.
def check_paths(path, paths):
for other_path in paths:
res='no cross'
chck = Path(other_path)
if chck.contains_path(path)==1:
res= 'cross'
break
return res
res = Parallel(n_jobs=2) (delayed(check_paths) (Path(points), a) for points in b)
Solution 1:
In short: I cannot reproduce your problem. If you are on Windows you should use a protector for your main loop: documentation of joblib.Parallel
. The only problem I see is much data copying overhead, but your numbers seem unrealistic to be caused by that.
In long, here are my timings with your code:
On my i7 3770k (4 cores, 8 threads) I get the following results for different n_jobs
:
For-loop: Finished in 33.8521318436 sec
n_jobs=1: Finished in 33.5527760983 sec
n_jobs=2: Finished in 18.9543449879 sec
n_jobs=3: Finished in 13.4856410027 sec
n_jobs=4: Finished in 15.0832719803 sec
n_jobs=5: Finished in 14.7227740288 sec
n_jobs=6: Finished in 15.6106669903 sec
So there is a gain in using multiple processes. However although I have four cores the gain already saturates at three processes. So I guess the execution time is actually limited by memory access rather than processor time.
You should notice that the arguments for each single loop entry are copied to the process executing it. This means you copy a
for each element in b
. That is ineffective. So instead access the global a
. (Parallel
will fork the process, copying all global variables to the newly spawned processes, so a
is accessible). This gives me the following code (with timing and main loop guard as the documentation of joblib
recommends:
import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
import time
import sys
## Check if one line segment contains another.
def check_paths(path):
for other_path in a:
res='no cross'
chck = Path(other_path)
if chck.contains_path(path)==1:
res= 'cross'
break
return res
if __name__ == '__main__':
## Create pairs of points for line segments
a = zip(np.random.rand(5000,2),np.random.rand(5000,2))
b = zip(np.random.rand(300,2),np.random.rand(300,2))
now = time.time()
if len(sys.argv) >= 2:
res = Parallel(n_jobs=int(sys.argv[1])) (delayed(check_paths) (Path(points)) for points in b)
else:
res = [check_paths(Path(points)) for points in b]
print "Finished in", time.time()-now , "sec"
Timing results:
n_jobs=1: Finished in 34.2845709324 sec
n_jobs=2: Finished in 16.6254048347 sec
n_jobs=3: Finished in 11.219119072 sec
n_jobs=4: Finished in 8.61683392525 sec
n_jobs=5: Finished in 8.51907801628 sec
n_jobs=6: Finished in 8.21842098236 sec
n_jobs=7: Finished in 8.21816396713 sec
n_jobs=8: Finished in 7.81841087341 sec
The saturation now slightly moved to n_jobs=4
which is the value to be expected.
check_paths
does several redundant calculations that can easily be eliminated. Firstly for all elements in other_paths=a
the line Path(...)
is executed in every call. Precalculate that. Secondly the string res='no cross'
is written is each loop turn, although it may only change once (followed by a break and return). Move the line in front of the loop. Then the code looks like this:
import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
import time
import sys
## Check if one line segment contains another.
def check_paths(path):
#global a
#print(path, a[:10])
res='no cross'
for other_path in a:
if other_path.contains_path(path)==1:
res= 'cross'
break
return res
if __name__ == '__main__':
## Create pairs of points for line segments
a = zip(np.random.rand(5000,2),np.random.rand(5000,2))
a = [Path(x) for x in a]
b = zip(np.random.rand(300,2),np.random.rand(300,2))
now = time.time()
if len(sys.argv) >= 2:
res = Parallel(n_jobs=int(sys.argv[1])) (delayed(check_paths) (Path(points)) for points in b)
else:
res = [check_paths(Path(points)) for points in b]
print "Finished in", time.time()-now , "sec"
with timings:
n_jobs=1: Finished in 5.33742594719 sec
n_jobs=2: Finished in 2.70858597755 sec
n_jobs=3: Finished in 1.80810618401 sec
n_jobs=4: Finished in 1.40814709663 sec
n_jobs=5: Finished in 1.50854086876 sec
n_jobs=6: Finished in 1.50901818275 sec
n_jobs=7: Finished in 1.51030707359 sec
n_jobs=8: Finished in 1.51062297821 sec
A side node on your code, although I haven't really followed its purpose as this was unrelated to your question, contains_path
will only return True
if this path completely contains the given path.
(see documentation). Therefore your function will basically always return no cross
given the random input.
Solution 2:
In addition to the above answer, and for future reference, there are two aspects to this question, and joblib's recent evolutions helps with both.
Parallel pool creation overhead: The problem here is that creating a parallel pool is costly. It's was especially costly here, as the code not protected by the "main" was run in each job at creation of the Parallel object. In the latest joblib (still beta), Parallel can be used as a context manager to limit the number of time a pool is created, and thus the impact of this overhead.
Dispatching overhead: it is important to keep in mind that dispatching an item of the for loop has an overhead (much bigger than iterating a for loop without parallel). Thus, if these individual computation items are very fast, this overhead will dominate the computation. In the latest joblib, joblib will trace the execution time of each job and start bunching them if they are very fast. This strongly limits the impact of the dispatch overhead in most cases (see the PR for bench and discussion).
Disclaimer: I am the original author of joblib (just saying to warn against potential conflicts of interest in my answer, although here I think that it is irrelevant).