Why is writing to memory much slower than reading it?
With your programs, I get
(write) Bandwidth = 6.076 GB/s
(read) Bandwidth = 10.916 GB/s
on a desktop (Core i7, x86-64, GCC 4.9, GNU libc 2.19) machine with six 2GB DIMMs. (I don't have any more detail than that to hand, sorry.)
However, this program reports write bandwidth of 12.209 GB/s
:
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <emmintrin.h>
static void
nt_memset(char *buf, unsigned char val, size_t n)
{
/* this will only work with aligned address and size */
assert((uintptr_t)buf % sizeof(__m128i) == 0);
assert(n % sizeof(__m128i) == 0);
__m128i xval = _mm_set_epi8(val, val, val, val,
val, val, val, val,
val, val, val, val,
val, val, val, val);
for (__m128i *p = (__m128i*)buf; p < (__m128i*)(buf + n); p++)
_mm_stream_si128(p, xval);
_mm_sfence();
}
/* same main() as your write test, except calling nt_memset instead of memset */
The magic is all in _mm_stream_si128
, aka the machine instruction movntdq
, which writes a 16-byte quantity to system RAM, bypassing the cache (the official jargon for this is "non-temporal store"). I think this pretty conclusively demonstrates that the performance difference is all about the cache behavior.
N.B. glibc 2.19 does have an elaborately hand-optimized memset
that makes use of vector instructions. However, it does not use non-temporal stores. That's probably the Right Thing for memset
; in general, you clear memory shortly before using it, so you want it to be hot in the cache. (I suppose an even cleverer memset
might switch to non-temporal stores for really huge block clear, on the theory that you could not possibly want all of that in the cache, because the cache simply isn't that big.)
Dump of assembler code for function memset:
=> 0x00007ffff7ab9420 <+0>: movd %esi,%xmm8
0x00007ffff7ab9425 <+5>: mov %rdi,%rax
0x00007ffff7ab9428 <+8>: punpcklbw %xmm8,%xmm8
0x00007ffff7ab942d <+13>: punpcklwd %xmm8,%xmm8
0x00007ffff7ab9432 <+18>: pshufd $0x0,%xmm8,%xmm8
0x00007ffff7ab9438 <+24>: cmp $0x40,%rdx
0x00007ffff7ab943c <+28>: ja 0x7ffff7ab9470 <memset+80>
0x00007ffff7ab943e <+30>: cmp $0x10,%rdx
0x00007ffff7ab9442 <+34>: jbe 0x7ffff7ab94e2 <memset+194>
0x00007ffff7ab9448 <+40>: cmp $0x20,%rdx
0x00007ffff7ab944c <+44>: movdqu %xmm8,(%rdi)
0x00007ffff7ab9451 <+49>: movdqu %xmm8,-0x10(%rdi,%rdx,1)
0x00007ffff7ab9458 <+56>: ja 0x7ffff7ab9460 <memset+64>
0x00007ffff7ab945a <+58>: repz retq
0x00007ffff7ab945c <+60>: nopl 0x0(%rax)
0x00007ffff7ab9460 <+64>: movdqu %xmm8,0x10(%rdi)
0x00007ffff7ab9466 <+70>: movdqu %xmm8,-0x20(%rdi,%rdx,1)
0x00007ffff7ab946d <+77>: retq
0x00007ffff7ab946e <+78>: xchg %ax,%ax
0x00007ffff7ab9470 <+80>: lea 0x40(%rdi),%rcx
0x00007ffff7ab9474 <+84>: movdqu %xmm8,(%rdi)
0x00007ffff7ab9479 <+89>: and $0xffffffffffffffc0,%rcx
0x00007ffff7ab947d <+93>: movdqu %xmm8,-0x10(%rdi,%rdx,1)
0x00007ffff7ab9484 <+100>: movdqu %xmm8,0x10(%rdi)
0x00007ffff7ab948a <+106>: movdqu %xmm8,-0x20(%rdi,%rdx,1)
0x00007ffff7ab9491 <+113>: movdqu %xmm8,0x20(%rdi)
0x00007ffff7ab9497 <+119>: movdqu %xmm8,-0x30(%rdi,%rdx,1)
0x00007ffff7ab949e <+126>: movdqu %xmm8,0x30(%rdi)
0x00007ffff7ab94a4 <+132>: movdqu %xmm8,-0x40(%rdi,%rdx,1)
0x00007ffff7ab94ab <+139>: add %rdi,%rdx
0x00007ffff7ab94ae <+142>: and $0xffffffffffffffc0,%rdx
0x00007ffff7ab94b2 <+146>: cmp %rdx,%rcx
0x00007ffff7ab94b5 <+149>: je 0x7ffff7ab945a <memset+58>
0x00007ffff7ab94b7 <+151>: nopw 0x0(%rax,%rax,1)
0x00007ffff7ab94c0 <+160>: movdqa %xmm8,(%rcx)
0x00007ffff7ab94c5 <+165>: movdqa %xmm8,0x10(%rcx)
0x00007ffff7ab94cb <+171>: movdqa %xmm8,0x20(%rcx)
0x00007ffff7ab94d1 <+177>: movdqa %xmm8,0x30(%rcx)
0x00007ffff7ab94d7 <+183>: add $0x40,%rcx
0x00007ffff7ab94db <+187>: cmp %rcx,%rdx
0x00007ffff7ab94de <+190>: jne 0x7ffff7ab94c0 <memset+160>
0x00007ffff7ab94e0 <+192>: repz retq
0x00007ffff7ab94e2 <+194>: movq %xmm8,%rcx
0x00007ffff7ab94e7 <+199>: test $0x18,%dl
0x00007ffff7ab94ea <+202>: jne 0x7ffff7ab950e <memset+238>
0x00007ffff7ab94ec <+204>: test $0x4,%dl
0x00007ffff7ab94ef <+207>: jne 0x7ffff7ab9507 <memset+231>
0x00007ffff7ab94f1 <+209>: test $0x1,%dl
0x00007ffff7ab94f4 <+212>: je 0x7ffff7ab94f8 <memset+216>
0x00007ffff7ab94f6 <+214>: mov %cl,(%rdi)
0x00007ffff7ab94f8 <+216>: test $0x2,%dl
0x00007ffff7ab94fb <+219>: je 0x7ffff7ab945a <memset+58>
0x00007ffff7ab9501 <+225>: mov %cx,-0x2(%rax,%rdx,1)
0x00007ffff7ab9506 <+230>: retq
0x00007ffff7ab9507 <+231>: mov %ecx,(%rdi)
0x00007ffff7ab9509 <+233>: mov %ecx,-0x4(%rdi,%rdx,1)
0x00007ffff7ab950d <+237>: retq
0x00007ffff7ab950e <+238>: mov %rcx,(%rdi)
0x00007ffff7ab9511 <+241>: mov %rcx,-0x8(%rdi,%rdx,1)
0x00007ffff7ab9516 <+246>: retq
(This is in libc.so.6
, not the program itself -- the other person who tried to dump the assembly for memset
seems only to have found its PLT entry. The easiest way to get the assembly dump for the real memset
on a Unixy system is
$ gdb ./a.out
(gdb) set env LD_BIND_NOW t
(gdb) b main
Breakpoint 1 at [address]
(gdb) r
Breakpoint 1, [address] in main ()
(gdb) disas memset
...
.)
The main difference in the performance comes from the caching policy of your PC/memory region. When you read from a memory and the data is not in the cache, the memory must be first fetched to the cache through memory bus before you can perform any computation with the data. However, when you write to memory there are different write policies. Most likely your system is using write-back cache (or more precisely "write allocate"), which means that when you write to a memory location that's not in the cache, the data is first fetched from the memory to the cache and eventually written back to memory when the data is evicted from cache, which means round-trip for the data and 2x bus bandwidth usage upon writes. There is also write-through caching policy (or "no-write allocate") which generally means that upon cache-miss at writes the data isn't fetched to the cache, and which should give closer to the same performance for both reads and writes.
The difference -- at least on my machine, with an AMD processor -- is that the read program is using vectorized operations. Decompiling the two yields this for the writing program:
0000000000400610 <main>:
...
400628: e8 73 ff ff ff callq 4005a0 <clock@plt>
40062d: 49 89 c4 mov %rax,%r12
400630: 89 de mov %ebx,%esi
400632: ba 00 ca 9a 3b mov $0x3b9aca00,%edx
400637: 48 89 ef mov %rbp,%rdi
40063a: e8 71 ff ff ff callq 4005b0 <memset@plt>
40063f: 0f b6 55 00 movzbl 0x0(%rbp),%edx
400643: b9 64 00 00 00 mov $0x64,%ecx
400648: be 34 08 40 00 mov $0x400834,%esi
40064d: bf 01 00 00 00 mov $0x1,%edi
400652: 31 c0 xor %eax,%eax
400654: 48 83 c3 01 add $0x1,%rbx
400658: e8 a3 ff ff ff callq 400600 <__printf_chk@plt>
But this for the reading program:
00000000004005d0 <main>:
....
400609: e8 62 ff ff ff callq 400570 <clock@plt>
40060e: 49 d1 ee shr %r14
400611: 48 89 44 24 18 mov %rax,0x18(%rsp)
400616: 4b 8d 04 e7 lea (%r15,%r12,8),%rax
40061a: 4b 8d 1c 36 lea (%r14,%r14,1),%rbx
40061e: 48 89 44 24 10 mov %rax,0x10(%rsp)
400623: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
400628: 4d 85 e4 test %r12,%r12
40062b: 0f 84 df 00 00 00 je 400710 <main+0x140>
400631: 49 8b 17 mov (%r15),%rdx
400634: bf 01 00 00 00 mov $0x1,%edi
400639: 48 8b 74 24 10 mov 0x10(%rsp),%rsi
40063e: 66 0f ef c0 pxor %xmm0,%xmm0
400642: 31 c9 xor %ecx,%ecx
400644: 0f 1f 40 00 nopl 0x0(%rax)
400648: 48 83 c1 01 add $0x1,%rcx
40064c: 66 0f ef 06 pxor (%rsi),%xmm0
400650: 48 83 c6 10 add $0x10,%rsi
400654: 49 39 ce cmp %rcx,%r14
400657: 77 ef ja 400648 <main+0x78>
400659: 66 0f 6f d0 movdqa %xmm0,%xmm2 ;!!!! vectorized magic
40065d: 48 01 df add %rbx,%rdi
400660: 66 0f 73 da 08 psrldq $0x8,%xmm2
400665: 66 0f ef c2 pxor %xmm2,%xmm0
400669: 66 0f 7f 04 24 movdqa %xmm0,(%rsp)
40066e: 48 8b 04 24 mov (%rsp),%rax
400672: 48 31 d0 xor %rdx,%rax
400675: 48 39 dd cmp %rbx,%rbp
400678: 74 04 je 40067e <main+0xae>
40067a: 49 33 04 ff xor (%r15,%rdi,8),%rax
40067e: 4c 89 ea mov %r13,%rdx
400681: 49 89 07 mov %rax,(%r15)
400684: b9 64 00 00 00 mov $0x64,%ecx
400689: be 04 0a 40 00 mov $0x400a04,%esi
400695: e8 26 ff ff ff callq 4005c0 <__printf_chk@plt>
40068e: bf 01 00 00 00 mov $0x1,%edi
400693: 31 c0 xor %eax,%eax
Also, note that your "homegrown" memset
is actually optimized down to a call to memset
:
00000000004007b0 <my_memset>:
4007b0: 48 85 d2 test %rdx,%rdx
4007b3: 74 1b je 4007d0 <my_memset+0x20>
4007b5: 48 83 ec 08 sub $0x8,%rsp
4007b9: 40 0f be f6 movsbl %sil,%esi
4007bd: e8 ee fd ff ff callq 4005b0 <memset@plt>
4007c2: 48 83 c4 08 add $0x8,%rsp
4007c6: c3 retq
4007c7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
4007ce: 00 00
4007d0: 48 89 f8 mov %rdi,%rax
4007d3: c3 retq
4007d4: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4007db: 00 00 00
4007de: 66 90 xchg %ax,%ax
I can't find any references regarding whether or not memset
uses vectorized operations, the disassembly of memset@plt
is unhelpful here:
00000000004005b0 <memset@plt>:
4005b0: ff 25 72 0a 20 00 jmpq *0x200a72(%rip) # 601028 <_GLOBAL_OFFSET_TABLE_+0x28>
4005b6: 68 02 00 00 00 pushq $0x2
4005bb: e9 c0 ff ff ff jmpq 400580 <_init+0x20>
This question suggests that since memset
is designed to handle every case, it might be missing some optimizations.
This guy definitely seems convinced that you need to roll your own assembler memset
to take advantage of SIMD instructions. This question does, too.
I'm going to take a shot in the dark and guess that it's not using SIMD operations because it can't tell whether or not it's going to be operating on something that's a multiple of the size of one vectorized operation, or there's some alignment-related issue.
However, we can confirm that it's not an issue of cache efficiency by checking with cachegrind
. The write program produces the following:
==19593== D refs: 6,312,618,768 (80,386 rd + 6,312,538,382 wr)
==19593== D1 misses: 1,578,132,439 ( 5,350 rd + 1,578,127,089 wr)
==19593== LLd misses: 1,578,131,849 ( 4,806 rd + 1,578,127,043 wr)
==19593== D1 miss rate: 24.9% ( 6.6% + 24.9% )
==19593== LLd miss rate: 24.9% ( 5.9% + 24.9% )
==19593==
==19593== LL refs: 1,578,133,467 ( 6,378 rd + 1,578,127,089 wr)
==19593== LL misses: 1,578,132,871 ( 5,828 rd + 1,578,127,043 wr) <<
==19593== LL miss rate: 9.0% ( 0.0% + 24.9% )
and the read program produces:
==19682== D refs: 6,312,618,618 (6,250,080,336 rd + 62,538,282 wr)
==19682== D1 misses: 1,578,132,331 (1,562,505,046 rd + 15,627,285 wr)
==19682== LLd misses: 1,578,131,740 (1,562,504,500 rd + 15,627,240 wr)
==19682== D1 miss rate: 24.9% ( 24.9% + 24.9% )
==19682== LLd miss rate: 24.9% ( 24.9% + 24.9% )
==19682==
==19682== LL refs: 1,578,133,357 (1,562,506,072 rd + 15,627,285 wr)
==19682== LL misses: 1,578,132,760 (1,562,505,520 rd + 15,627,240 wr) <<
==19682== LL miss rate: 4.1% ( 4.1% + 24.9% )
While the read program has a lower LL miss rate because it performs many more reads (an extra read per XOR
operation), the total number of misses is the same. So whatever the issue is, it's not there.
Caching and locality almost certainly explain most of the effects you are seeing.
There isn't any caching or locality on writes, unless you want a non-deterministic system. Most write times are measured as the time it takes for the data to get all the way to the storage medium (whether this is a hard drive or a memory chip), whereas reads can come from any number of cache layers that are faster than the storage medium.