Why is linear read-shuffled write not faster than shuffled read-linear write?
I'm currently trying to get a better understanding of memory/cache related performance issues. I read somewhere that memory locality is more important for reading than for writing, because in the former case the CPU has to actually wait for the data whereas in the latter case it can just ship them out and forget about them.
With that in mind, I did the following quick-and-dirty test: I wrote a script that creates an array of N random floats and a permutation, i.e. an array containing the numbers 0 to N-1 in random order. Then it repeatedly either (1) reads the data array linearly and writes it back to a new array in the random access pattern given by the permutation or (2) reads the data array in the permuted order and linearly writes it to a new array.
To my surprise (2) seemed consistently faster than (1). There were, however, problems with my script
- The script is written in python/numpy. This being quite a high-level language it is not clear how pecisely the read/write are implemented.
- I probably did not balance the two cases properly.
Also, some of the answers/comments below suggest that my original expectation isn't correct and that depending on details of the cpu cache either case might be faster.
My question is:
- Which (if any) of the two should be faster?
- What are the relvant cache concepts here; how do they influence the result
A beginner-friendly explanation would be appreciated. Any supporting code should be in C / cython / numpy / numba or python.
Optionally:
- Explain why the absolute durations are nonlinear in problem size (cf. timings below).
- Explain the behavior of my clearly inadequate python experiments.
For reference, my platform is Linux-4.12.14-lp150.11-default-x86_64-with-glibc2.3.4
. Python version is 3.6.5.
Here is the code I wrote:
import numpy as np
from timeit import timeit
def setup():
global a, b, c
a = np.random.permutation(N)
b = np.random.random(N)
c = np.empty_like(b)
def fwd():
c = b[a]
def inv():
c[a] = b
N = 10_000
setup()
timeit(fwd, number=100_000)
# 1.4942631321027875
timeit(inv, number=100_000)
# 2.531870319042355
N = 100_000
setup()
timeit(fwd, number=10_000)
# 2.4054739447310567
timeit(inv, number=10_000)
# 3.2365565397776663
N = 1_000_000
setup()
timeit(fwd, number=1_000)
# 11.131387163884938
timeit(inv, number=1_000)
# 14.19817715883255
As pointed out by @Trilarion and @Yann Vernier my snippets aren't properly balanced, so I replaced them with
def fwd():
c[d] = b[a]
b[d] = c[a]
def inv():
c[a] = b[d]
b[a] = c[d]
where d = np.arange(N)
(I shuffle everything both ways to hopefully reduce across trial caching effects). I also replaced timeit
with repeat
and reduced the numbers of repeats by a factor of 10.
Then I get
[0.6757169323973358, 0.6705542299896479, 0.6702114241197705] #fwd
[0.8183442652225494, 0.8382121799513698, 0.8173762648366392] #inv
[1.0969422250054777, 1.0725746559910476, 1.0892365919426084] #fwd
[1.0284497970715165, 1.025063106790185, 1.0247828317806125] #inv
[3.073981977067888, 3.077839042060077, 3.072118630632758] #fwd
[3.2967213969677687, 3.2996009718626738, 3.2817375687882304] #inv
So there still seems to be a difference, but it is much more subtle and can now go either way depending on the problem size.
This is a complex problem closely related to architectural features of modern processors and your intuition that random read are slower than random writes because the CPU has to wait for the read data is not verified (most of the time). There are several reasons for that I will detail.
Modern processors are very efficient to hide read latency
while memory writes are more expensive than memory reads
especially in a multicore environment
Reason #1 Modern processors are efficient to hide read latency.
Modern superscalar can execute several instructions simultaneously, and change instruction execution order (out of order execution). While first reason for these features is to increase instruction thoughput, one of the most interesting consequence is the ability of processors to hide latency of memory writes (or of complex operators, branches, etc).
To explain that, let us consider a simple code that copies array into another one.
for i in a:
c[i] = b[i]
One compiled, code executed by the processor will be somehow like that
#1. (iteration 1) c[0] = b[0]
1a. read memory at b[0] and store result in register c0
1b. write register c0 at memory address c[0]
#2. (iteration 2) c[1] = b[1]
2a. read memory at b[1] and store result in register c1
2b. write register c1 at memory address c[1]
#1. (iteration 2) c[2] = b[2]
3a. read memory at b[2] and store result in register c2
3b. write register c2 at memory address c[2]
# etc
(this is terribly oversimplified and the actual code is more complex and has to deal with loop management, address computation, etc, but this simplistic model is presently sufficient).
As said in the question, for reads, the processor has to wait for the actual data. Indeed, 1b need the data fetched by 1a and cannot execute as long as 1a is not completed. Such a constraint is called a dependency and we can say that 1b is dependent on 1a. Dependencies is a major notion in modern processors. Dependencies express the algorithm (eg I write b to c) and must absolutely be respected. But, if there is no dependency between instructions, processors will try to execute other pending instructions in order to keep there operative pipeline always active. This can lead to execution out-of-order, as long as dependencies are respected (similar to the as-if rule).
For the considered code, there no dependency between high level instruction 2. and 1. (or between asm instructions 2a and 2b and previous instructions). Actually the final result would even be identical is 2. is executed before 1., and the processor will try to execute 2a and 2b, before completion of 1a and 1b. There is still a dependency between 2a and 2b, but both can be issued. And similarly for 3a. and 3b., and so on. This is a powerful mean to hide memory latency. If for some reason 2., 3. and 4. can terminate before 1. loads its data, you may even not notice at all any slowdown.
This instruction level parallelism is managed by a set of "queues" in the processor.
a queue of pending instructions in the reservation stations RS (type 128 μinstructions in recent pentiums). As soon as resources required by the instruction is available (for instance value of register c1 for instruction 1b), the instruction can execute.
a queue of pending memory accesses in memory order buffer MOB before the L1 cache. This is required to deal with memory aliases and to insure sequentiality in memory writes or loads at the same address (typ. 64 loads, 32 stores)
a queue to enforce sequentiality when writing back results in registers (reorder buffer or ROB of 168 entries) for similar reasons.
and some other queues at instruction fetch, for μops generation, write and miss buffers in the cache, etc
At one point execution of the previous program there will be many pending stores instructions in RS, several loads in MOB and instructions waiting to retire in the ROB.
As soon as a data becomes available (for instance a read terminates) depending instructions can execute and that frees positions in the queues. But if no termination occurs, and one of these queues is full, the functional unit associated with this queue stalls (this can also happen at instruction issue if the processor is missing register names). Stalls are what creates performance loss and to avoid it, queue filling must be limited.
This explains the difference between linear and random memory accesses.
In a linear access, 1/ the number of misses will be smaller because of the better spatial locality and because caches can prefetch accesses with a regular pattern to reduce it further and 2/ whenever a read terminates, it will concern a complete cache line and can free several pending load instructions limiting the filling of instructions queues. This ways the processor is permanently busy and memory latency is hidden.
For a random access, the number of misses will be higher, and only a single load can be served when data arrives. Hence instructions queues will saturate rapidly, the processor stalls and memory latency can no longer be hidden by executing other instructions.
The processor architecture must be balanced in terms of throughput in order to avoid queue saturation and stalls. Indeed there are be generally tens of instructions at some stage of execution in a processor and global throughput (ie the ability to serve instruction requests by the memory (or functional units)) is the main factor that will determine performances. The fact than some of these pending instructions are waiting for a memory value has a minor effect...
...except if you have long dependency chains.
There is a dependency when an instruction has to wait for the completion of a previous one. Using the result of a read is a dependency. And dependencies can be a problem when involved in a dependency chain.
For instance, consider the code for i in range(1,100000): s += a[i]
. All the memory reads are independent, but there is a dependency chain for the accumulation in s
. No addition can happen until the previous one has terminated. These dependencies will make the reservation stations rapidly filled and create stalls in the pipeline.
But reads are rarely involved in dependency chains. It is still possible to imagine pathological code where all reads are dependent of the previous one (for instance for i in range(1,100000): s = a[s]
), but they are uncommon in real code. And the problem comes from the dependency chain, not from the fact that it is a read; the situation would be similar (and even probably worse) with compute bound dependent code like for i in range(1,100000): x = 1.0/x+1.0
.
Hence, except in some situations, computation time is more related to throughput than to read dependency, thanks to the fact that superscalar out or order execution hides latency. And for what concerns throughput, writes are worse then reads.
Reason #2: Memory writes (especially random ones) are more expensive than memory reads
This is related to the way caches behave. Cache are fast memory that store a part of the memory (called a line) by the processor. Cache lines are presently 64 bytes and allow to exploit spatial locality of memory references: once a line is stored, all data in the line are immediately available. The important aspect here is that all transfers between the cache and the memory are lines.
When a processor performs a read on a data, the cache checks if the line to which the data belongs is in the cache. If not, the line is fetched from memory, stored in the cache and the desired data is sent back to the processor.
When a processor writes a data to memory, the cache also checks for the line presence. If the line is not present, the cache cannot send its data to memory (because all transfers are line based) and does the following steps:
- cache fetches the line from memory and writes it in the cache line.
- data is written in the cache and the complete line is marked as modified (dirty)
- when a line is suppressed from the cache, it checks for the modified flag, and if the line has been modified, it writes it back to memory (write back cache)
Hence, every memory write must be preceded by a memory read to get the line in the cache. This adds an extra operation, but is not very expensive for linear writes. There will be a cache miss and a memory read for the first written word, but successive writes will just concern the cache and be hits.
But the situation is very different for random writes. If the number of misses is important, every cache miss implies a read followed by only a small number of writes before the line is ejected from the cache, which significantly increases write cost. If a line is ejected after a single write, we can even consider that a write is twice the temporal cost of a read.
It is important to note that increasing the number of memory accesses (either reads or writes) tends to saturate the memory access path and to globally slow down all transfers between the processor and memory.
In either case, writes are always more expensive than reads. And multicores augment this aspect.
Reason #3: Random writes create cache misses in multicores
Not sure this really applies to the situation of the question. While numpy BLAS routines are multithreaded, I do not think basic array copy is. But it is closely related and is another reason why writes are more expensive.
The problem with multicores is to ensure proper cache coherence in such a way that a data shared by several processors is properly updated in the cache of every core. This is done by mean of a protocol such as MESI that updates a cache line before writing it, and invalidates other cache copies (read for ownership).
While none of the data is actually shared between cores in the question (or a parallel version of it), note that the protocol applies to cache lines. Whenever a cache line is to be modified, it is copied from the cache holding the most recent copy, locally updated and all other copies are invalidated. Even if cores are accessing different parts of the cache line. Such a situation is called a false sharing and it is an important issue for multicore programming.
Concerning the problem of random writes, cache lines are 64 bytes and can hold 8 int64, and if the computer has 8 cores, every core will process on the average 2 values. Hence there is an important false sharing that will slow down writes.
We did some performance evaluations. It was performed in C in order to include an evaluation of the impact of parallelization. We compared 5 functions that process int64 arrays of size N.
Just a copy of b to c (
c[i] = b[i]
) (implemented by the compiler withmemcpy()
)Copy with a linear index
c[i] = b[d[i]]
whered[i]==i
(read_linear
)Copy with a random index
c[i] = b[a[i]]
wherea
is a random permutation of 0..N-1 (read_random
is equivalent tofwd
in the original question)Write linear
c[d[i]] = b[i]
whered[i]==i
(write_linear
)Write random
c[a[i]] = b[i]
witha
random permutation of 0..N-1 (write_random
is equivalent toinv
in the question)
Code has been compiled with gcc -O3 -funroll-loops -march=native -malign-double
on
a skylake processor. Performances are measured with _rdtsc()
and are
given in cycles per iteration. The function are executed several times (1000-20000 depending on array size), 10 experiments are performed and the smallest time is kept.
Array sizes range from 4000 to 1200000. All code has been measured with a sequential and a parallel version with openmp.
Here is a graph of the results. Functions are with different colors, with the sequential version in thick lines and the parallel one with thin ones.
Direct copy is (obviously) the fastest and is implemented by gcc with
the highly optimized memcpy()
. It is a mean to get an estimation of data throughput with memory. It ranges from 0.8 cycles per iteration (CPI) for small matrices to 2.0 CPI for large ones.
Read linear performances are approximately twice longer than memcpy, but there are 2 reads and a write, vs 1 read and a write for the direct copy. More the index adds some dependency. Min value is 1.56 CPI and max value 3.8 CPI. Write linear is slightly longer (5-10%).
Reads and writes with a random index are the purpose of the original question and deserve a longer comments. Here are the results.
size 4000 6000 9000 13496 20240 30360 45536 68304 102456 153680 230520 345776 518664 777992 1166984
rd-rand 1.86821 2.52813 2.90533 3.50055 4.69627 5.10521 5.07396 5.57629 6.13607 7.02747 7.80836 10.9471 15.2258 18.5524 21.3811
wr-rand 7.07295 7.21101 7.92307 7.40394 8.92114 9.55323 9.14714 8.94196 8.94335 9.37448 9.60265 11.7665 15.8043 19.1617 22.6785
small values (<10k): L1 cache is 32k and can hold a 4k array of uint64. Note, that due to the randomness of the index, after ~1/8 of iterations L1 cache will be completely filled with values of the random index array (as cache lines are 64 bytes and can hold 8 array elements). Accesses to the other linear arrays we will rapidly generate many L1 misses and we have to use the L2 cache. L1 cache access is 5 cycles, but it is pipelined and can serve a couple of values per cycle. L2 access is longer and requires 12 cycles. The amount of misses is similar for random reads and writes, but we see than we fully pay the double access required for writes when array size is small.
medium values (10k-100k): L2 cache is 256k and it can hold a 32k int64 array. After that, we need to go to L3 cache (12Mo). As size increases, the number of misses in L1 and L2 increases and the computation time accordingly. Both algorithms have a similar number of misses, mostly due to random reads or writes (other accesses are linear and can be very efficiently prefetched by the caches). We retrieve the factor two between random reads and writes already noted in B.M. answer. It can be partly explained by the double cost of writes.
large values (>100k): the difference between methods is progressively reduced. For these sizes, a large part of information is stored in L3 cache. L3 size is sufficient to hold a full array of 1.5M and lines are less likely to be ejected. Hence, for writes, after the initial read, a larger number of writes can be done without line ejection, and the relative cost of writes vs read is reduced. For these large sizes, there are also many other factors that need to be considered. For instance, caches can only serve a limited number of misses (typ. 16) and when the number of misses is large, this may be the limiting factor.
One word on parallel omp version of random reads and writes. Except for small sizes, where having the random index array spread over several caches may not be an advantage, they are systematically ~ twice faster. For large sizes, we clearly see that the gap between random reads and writes increases due to false sharing.
It is almost impossible to do quantitative predictions with the complexity of present computer architectures, even for simple code, and even qualitative explanations of the behaviour are difficult and must take into account many factors. As mentioned in other answers, software aspects related to python can also have an impact. But, while it may happen in some situations, most of the time, one cannot consider that reads are more expensive because of data dependency.
- First a refutation of your intuition :
fwd
beatsinv
even without numpy mecanism.
It is the case for this numba version:
import numba
@numba.njit
def fwd_numba(a,b,c):
for i in range(N):
c[a[i]]=b[i]
@numba.njit
def inv_numba(a,b,c):
for i in range(N):
c[i]=b[a[i]]
Timings for N= 10 000:
%timeit fwd()
%timeit inv()
%timeit fwd_numba(a,b,c)
%timeit inv_numba(a,b,c)
62.6 µs ± 3.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
144 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16.6 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34.9 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- Second, Numpy has to deal with fearsome problems of alignement and (cache-) locality.
It's essentially a wrapper on low level procedures from BLAS/ATLAS/MKL tuned for that. Fancy indexing is a nice high-level tool but heretic for these problems; there is no direct traduction of this concept at low level.
- Third, numpy dev docs : details fancy indexing. In particular:
Unless there is only a single indexing array during item getting, the validity of the indices is checked beforehand. Otherwise it is handled in the inner loop itself for optimization.
We are in this case here. I think this can explain the difference, and why set is slower than get.
It explains also why hand made numba is often faster : it doesn't check anything and crashes on inconsistent index.
Your two NumPy snippets b[a]
and c[a] = b
seem like reasonable heuristics for measuring shuffled/linear read/write speeds, as I'll try to argue by looking at the underlying NumPy code in the first section below.
Regarding the question of which ought to be faster, it seems plausible that shuffled-read-linear-write could typically win (as the benchmarks seem to show), but the difference in speed may be affected by how "shuffled" the shuffled index is, and one or more of:
- The CPU cache read/update policies (write-back vs. write-through, etc.).
- How the CPU chooses to (re)order the instructions it needs to execute (pipelining).
- The CPU recognising memory access patterns and pre-fetching data.
- Cache eviction logic.
Even making assumptions about which policies are in place, these effects are difficult to model and reason about analytically and so I'm not sure a general answer applicable to all processors is possible (although I am not an expert in hardware).
Nevertheless, in the second section below I'll attempt to reason about why the shuffled-read-linear-write is apparently faster, given some assumptions.
"Trivial" Fancy Indexing
The purpose of this section is to go through the NumPy source code to determine if there any obvious explanations for the timings, and also get as clear an idea as possible of what happens when A[B]
or A[B] = C
is executed.
The iteration routine underpinning the fancy-indexing for getitem and setitem operations in this question is "trivial":
-
B
is a single-indexing array with a single stride -
A
andB
have the same memory order (both C-contiguous or both Fortran-contiguous)
Furthermore, in our case both A
and B
are Uint Aligned:
Strided copy code: Here, "uint alignment" is used instead. If the itemsize [N] of an array is equal to 1, 2, 4, 8 or 16 bytes and the array is uint aligned then instead [of using buffering] numpy will do
*(uintN*)dst) = *(uintN*)src)
for appropriate N. Otherwise numpy copies by doingmemcpy(dst, src, N)
.
The point here is that use of an internal buffer to ensure alignment is avoided. The underlying copying implemented with *(uintN*)dst) = *(uintN*)src)
is as straightfoward as "put the X bytes from offset src into the X bytes at offset dst".
Compilers will likely translate this very simply into mov
instructions (on x86 for example), or similar.
The core low-level code which performs the getting and setting of items is in the functions mapiter_trivial_get
and mapiter_trivial_set
. These functions are produced in lowlevel_strided_loops.c.src, where the templating and macros make it somewhat challenging to read (an occasion to be grateful for higher-level languages).
Persevering, we can eventually see that there is little difference between getitem and setitem. Here is a simplified version of the main loop for exposition. The macro lines determine whether were running getitem or setitem:
while (itersize--) {
char * self_ptr;
npy_intp indval = *((npy_intp*)ind_ptr);
#if @isget@
if (check_and_adjust_index(&indval, fancy_dim, 0, _save) < 0 ) {
return -1;
}
#else
if (indval < 0) {
indval += fancy_dim;
}
#endif
self_ptr = base_ptr + indval * self_stride; /* offset into array being indexed */
#if @isget@
*(npy_uint64 *)result_ptr = *(npy_uint64 *)self_ptr;
#else
*(npy_uint64 *)self_ptr = *(npy_uint64 *)result_ptr;
#endif
ind_ptr += ind_stride; /* move to next item of index array */
result_ptr += result_stride; /* move to next item of result array */
As we might expect, this simply amounts to some arithmetic to get the correct offset into the arrays, and then copying bytes from one memory location to another.
Extra index checks for setitem
One thing worth mentioning is that for setitem, the validity of the indices (whether they are all inbounds for the target array) is checked before copying begins (via check_and_adjust_index
), which also replaces negative indices with corresponding positive indices.
In the snippet above you can see check_and_adjust_index
called for getitem in the main loop, while a simpler (possibly redundant) check for negative indices occurs for setitem.
This extra preliminary check could conceivably have a small but negative impact on the speed of setitem (A[B] = C
).
Cache misses
Because the code for both code snippets is so similar, suspicion falls on the CPU and how it handles access to the underlying arrays of memory.
The CPU caches small blocks of memory (cache lines) that have been recently accessed in the anticipation that it will probably soon need to access that region of memory again.
For context, cache lines are generally 64 bytes. The L1 (fastest) data cache on my ageing laptop's CPU is 32KB (enough to hold around 500 int64 values from the array, but keep in mind that the CPU will be doing other things requiring other memory while the NumPy snippet executes):
$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
$ cat /sys/devices/system/cpu/cpu0/cache/index0/size
32K
As you are probably already aware, for reading/writing memory sequentially caching works well because 64 bytes blocks of memory are fetched as needed and stored closer to the CPU. Repeated access to that block of memory is quicker than fetching from RAM (or a slower higher-level cache). In fact, the CPU may even preemptively fetch the next cache line before it is even requested by the program.
On the other hand, randomly accessing memory is likely to cause frequent cache misses. Here, the region of memory with the required address is not in the fast cache near the CPU and instead must be accessed from a higher-level cache (slower) or the actual memory (much slower).
So which is faster for the CPU to handle: frequent data read misses, or data write misses?
Let's assume the CPU's write policy is write-back, meaning that a modified memory is written back to the cache. The cache is marked as being modified (or "dirty"), and the change will only be written back to main memory once the the line is evicted from the cache (the CPU can still read from a dirty cache line).
If we are writing to random points in a large array, the expectation is that many of the cache lines in the CPU's cache will become dirty. A write through to main memory will be needed as each one is evicted which may occur often if the cache is full.
However, this write through should happen less frequently when writing data sequentially and reading it at random, as we expect fewer cache lines to become dirty and data written back to main memory or slower caches less regularly.
As mentioned, this is a simplified model and there may be many other factors that influence the CPU's performance. Someone with more expertise than me may well be able to improve this model.