Poor scaling of multiprocessing Pool.map() on a list of large objects: How to achieve better parallel scaling in python?
your work function ends too soon:
In [2]: %timeit func(1)
335 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
so you are basically measuring the overhead of multiprocessing.
change your work function to do more work, like loop 1000 * 1000
times rather than 1000
times, you will see it scales again, 1000000
loops cost roughly 0.4s
on my mac, which high enough compared to the overhead.
below is the test result for different n
on my mac, I use Pool(4)
as I have 4 cores, test runs only once rather than multi times like %timeit
, cause the difference is insignificant:
you could see the speedup ratio is increasing proportionally with n
, the overhead of multiprocessing is shared by each work function call.
the math behind, assume per-call overhead is equal:
if we want ratio > 1
:
approximately equal:
which means, if work function runs too fast compares with per-call overhead, multiprocessing
does not scale.
Before we start
and dive deeper into any hunt for nanoseconds ( and right, it will soon start, as each [ns]
matters as the scaling opens the whole Pandora Box of the problems ), lets agree on the scales - most easy and often "cheap" premature tricks may and often will derail your dreams once the scales of the problem size have grown into realistic scales - the thousands (seen above in both iterators) behave way different for in-cache computing with < 0.5 [ns]
data-fetches, than once having grown beyond the L1/L2/L3-cache-sizes for scales above 1E+5, 1E+6, 1E+9,
above [GB]
s, where each mis-aligned fetch is WAY more EXPENSIVE, than a few 100 [ns]
Q : "... because I have 8 cores, I want to use them to get 8 times faster"
I wish you could, indeed. Yet, sorry for telling the truth straight, the World does not work this way.
See this interactive tool, it will show you both the speedup limits and their principal dependence on the actual production costs of the real-world scaling of the initial problem, as it grows from trivial sizes and these combined effects at scale just click-it and play with the sliders to see it live, in action :
Q : (is)
Pool.map()
really passing the content of those big sub list around processes which cause additional copy?
Yes,
it must do so, by design
plus it does that by passing all that data "through" another "expensive" SER/DES processing,
so as to make it happen delivered "there".
The very same would apply vice-versa whenever you would have tried to return "back" some mastodon-sized result(s), which you did not, here above.
Q : But if so, why doesn't it passing the address of sub-list?
Because the remote ( parameter-receiving ) process is another, fully autonomous process, with its own, separate and protected, address-space we cannot just pass an address-reference "into", and we wanted that to be a fully independent, autonomously working python process ( due to a will to use this trick so as to escape from GIL-lock dancing ), didn't we? Sure we did - this is a central step of our escape from the GIL-Wars ( for better understanding of the GIL-lock pros and cons, may like this and this ( Pg.15+ on CPU-bound processing ).
0.1 ns - NOP
0.3 ns - XOR, ADD, SUB
0.5 ns - CPU L1 dCACHE reference (1st introduced in late 80-ies )
0.9 ns - JMP SHORT
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
?~~~~~~~~~~~ 1 ns - MUL ( i**2 = MUL i, i )~~~~~~~~~ doing this 1,000 x is 1 [us]; 1,000,000 x is 1 [ms]; 1,000,000,000 x is 1 [s] ~~~~~~~~~~~~~~~~~~~~~~~~~
3~4 ns - CPU L2 CACHE reference (2020/Q1)
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
10 ns - DIV
19 ns - CPU L3 CACHE reference (2020/Q1 considered slow on 28c Skylake)
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
10,000 ns - Compress 1K bytes with a Zippy PROCESS
20,000 ns - Send 2K bytes over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within a same DataCenter
?~~~ 2,500,000 ns - Read 10 MB sequentially from MEMORY~~(about an empty python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s), yet an empty python interpreter is indeed not a real-world, production-grade use-case, is it?
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
?~~ 25,000,000 ns - Read 100 MB sequentially from MEMORY~~(somewhat light python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s)
30,000,000 ns - Read 1 MB sequentially from a DISK
?~~ 36,000,000 ns - Pickle.dump() SER a 10 MB object for IPC-transfer and remote DES in spawned process~~~~~~~~ x ( 2 ) for a single 10MB parameter-payload SER/DES + add an IPC-transport costs thereof or NETWORK-grade transport costs, if going into [distributed-computing] model Cluster ecosystem
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
| | | |
| | | ns|
| | us|
| ms|
Q : " what is the correct way to keep parallel scaling when parallel mapping some operations on a list of large things? "
A )
UNDERSTAND THE WAYS TO AVOID OR AT LEAST REDUCE EXPENSES :
Understand all the types of the costs you have to pay and will pay :
-
spend as low process instantiation costs as possible (rather expensive ) best as a one-time cost only
On macOS, the
spawn
start method is now the default. Thefork
start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725. spend as small amount of costs of parameter-passing as you must ( yes, best avoid repetitive passing those "large things" as parameters )
- never waste resources on things that do not perform your job - ( never spawn more processes than was reported by
len( os.sched_getaffinity( 0 ) )
- any process more than this will but wait for its next CPU-core-slot, and will but evict other, cache-efficient process, thus re-paying all the fetch-costs once already paid to re-fetch again all data so to camp-em back in-cache for a soon to get evicted again in-cache computing, while those processes that worked so far this way were right evicted (for what good?) by a naive use of as many asmultiprocessing.cpu_count()
-reported processes, so expensively spawned in the initialPool
-creation )
- better re-use a pre-allocated memory, than keep spending ad-hoc memory allocation costs ALAP
- never share a bit, if The Performance is the goal
- never block, never - be it python
gc
which may block if not avoided, orPool.map()
which blocks either
B )
UNDERSTAND THE WAYS TO INCREASE THE EFFICIENCY :
Understand all efficiency increasing tricks, even at a cost of complexity of code ( a few SLOC-s are easy to show in school-books, yet sacrificing both the efficiency and the performance - in spite of these both being your main enemy in a fight for a sustainable performance throughout the scaling ( of either of problem size or iteration depths, or when growing both of them at the same time ).
Some categories of the real-world costs from A ) have dramatically changed the limits of the theoretically achievable speedups to be expected from going into some form of [PARALLEL]
process orchestrations ( here, making some parts of the code-execution got executed in the spawned sub-processes ), the initial view of which was first formulated by Dr. Gene Amdahl as early as 60+ years ago ( for which there were recently added two principal extensions of both the process instantiation(s) related setup + termination add on costs ( extremely important in py2 always & py3.5+ for MacOS and Windows ) and an atomicity-of-work
, which will be discussed below.
Overhead-strict re-formulation of the Amdahl's Law speedup S:
S = speedup which can be achieved with N processors
s = a proportion of a calculation, which is [SERIAL]
1-s = a parallelizable portion, that may run [PAR]
N = a number of processors ( CPU-cores ) actively participating on [PAR] processing
1
S = __________________________; where s, ( 1 - s ), N were defined above
( 1 - s ) pSO:= [PAR]-Setup-Overhead add-on cost/latency
s + pSO + _________ + pTO pTO:= [PAR]-Terminate-Overhead add-on cost/latency
N
Overhead-strict and resources-aware re-formulation:
1 where s, ( 1 - s ), N
S = ______________________________________________ ; pSO, pTO
| ( 1 - s ) | were defined above
s + pSO + max| _________ , atomicP | + pTO atomicP:= a unit of work,
| N | further indivisible,
a duration of an
atomic-process-block
Prototype on target CPU/RAM device with your python, scaled >>1E+6
Any simplified mock-up example will somehow skew your expectations about how the actual workloads will perform in-vivo. Underestimated RAM-allocations, not seen at small-scales may later surprise at scale, sometimes even throwing the operating system into sluggish states, swapping and thrashing. Some smarter tools ( numba.jit()
) may even analyze the code and shortcut some passages of code, that will never be visited or that does not produce any result, so be warned that simplified examples may lead to surprising observations.
from multiprocessing import Pool
import numpy as np
import os
SCALE = int( 1E9 )
STEP = int( 1E1 )
aLIST = np.random.random( ( 10**3, 10**4 ) ).tolist()
#######################################################################################
# func() does some SCALE'd amount of work, yet
# passes almost zero bytes as parameters
# allocates nothing, but iterator
# returns one byte,
# invariant to any expensive inputs
def func( x ):
for i in range( SCALE ):
i**2
return 1
A few hints on making the strategy of scaling less overhead-costs expensive :
#####################################################################################
# more_work_en_block() wraps some SCALE'd amount of work, sub-list specified
def more_work_en_block( en_block = [ None, ] ):
return [ func( nth_item ) for nth_item in en_block ]
If indeed must pass a big list, better pass larger block, with remote-iterating its parts ( instead of paying transfer-costs for each and every item passed many many more times, than if using sub_blocks
( parameters get SER/DES processed ( ~ the costs of pickle.dumps()
+ pickle.loads()
) [per-each-call], again, at an add-on costs, that decrease the resulting efficiency and worsen the overheads part of the extended, overhead-strict Amdahl's Law )
#####################################################################################
# some_work_en_block() wraps some SCALE'd amount of work, tuple-specified
def some_work_en_block( sub_block = ( [ None, ], 0, 1 ) ):
return more_work_en_block( en_block = sub_block[0][sub_block[1]:sub_block[2]] )
Right-sizing the number of process-instances :
aMaxNumOfProcessesThatMakesSenseToSPAWN = len( os.sched_getaffinity( 0 ) ) # never more
with Pool( aMaxNumOfProcessesThatMakesSenseToSPAWN ) as p:
p.imap_unordered( more_work_en_block, [ ( aLIST,
start,
start + STEP
)
for start in range( 0, len( aLIST ), STEP ) ] )
Last but not least, expect immense performance boosts from smart use of numpy
smart vectorised code, best without repetitive passing of static, pre-copied (during the process instantiation(s), thus paid as the reasonably scaled, here un-avoidable, cost of thereof ) BLOBs, used in the code without passing the same data via parameter-passing, in a vectorised ( CPU-very-efficient ) fashion as read-only data. Some examples on how one can make ~ +500 x
speedup one may read here or here, about but ~ +400 x
speedup or about a case of just about a ~ +100 x
speedup, with some examples of some problem-isolation testing scenarios.
Anyway, the closer will the mock-up code be to your actual workloads, the more sense the benchmarks will get to have ( at scale & in production ).
Good luck on exploring the World, as it is,
not as a dream if it were different,
not as a wish it were different or that we would like it to be
:o)
Facts and Science matter - both + together
Records of Evidence are the core steps forwards to achieve as high performance as possible,
not any Product Marketing,
not any Evangelisation Clans wars,
not any Blog-posts' chatter
At least don't say you was not warned
:o)