Why is memmove faster than memcpy?
Your memmove calls are shuffling memory along by 2 to 128 bytes, while your memcpy source and destination are completely different. Somehow that's accounting for the performance difference: if you copy to the same place, you'll see that memcpy ends up possibly a smidge faster, e.g. on ideone.com:
memmove (002) 0.0610362
memmove (004) 0.0554264
memmove (008) 0.0575859
memmove (016) 0.057326
memmove (032) 0.0583542
memmove (064) 0.0561934
memmove (128) 0.0549391
memcpy 0.0537919
Hardly anything in it, though: there's no evidence that writing back to an already-faulted-in memory page has much impact, and we're certainly not seeing a halving of time... but it does show that nothing is making memcpy unnecessarily slower when compared apples-for-apples.
When you are using memcpy, the writes need to go into the cache. When you use memmove to copy a small step forward, the memory you are copying over will already be in the cache (because it was read 2, 4, 16 or 128 bytes "back"). Try doing a memmove where the destination is several megabytes away (> 4 × cache size), and I suspect (but can't be bothered to test) that you'll get similar results.
I guarantee that it is ALL about cache maintenance when you do large memory operations.
Historically, memmove and memcpy were the same function: they worked in the same way and had the same implementation. It was then realised that memcpy doesn't need to be (and frequently wasn't) defined to handle overlapping regions in any particular way.
The end result is that memmove was defined to handle overlapping regions in a particular way even if this impacts performance. memcpy is supposed to use the best algorithm available for non-overlapping regions. The implementations are normally almost identical.
The problem you have run into is that there are so many variations of the x86 hardware that it is impossible to tell which method of shifting memory around will be the fastest. And even if you think you have a result in one circumstance something as simple as having a different 'stride' in the memory layout can cause vastly different cache performance.
You can either benchmark what you're actually doing or ignore the problem and rely on the benchmarks done for the C library.
Edit: Oh, and one last thing; shifting lots of memory contents around is VERY slow. I would guess your application would run faster with something like a simple B-Tree implementation to handle your integers. (Oh you are, okay)
Edit2: To summarise my expansion in the comments: the microbenchmark is the issue here; it isn't measuring what you think it is. The tasks given to memcpy and memmove differ significantly from each other. If the task given to memcpy is repeated several times with memmove or memcpy, the end results will not depend on which memory shifting function you use UNLESS the regions overlap.
"memcpy is more efficient than memmove." In your case, you most probably are not doing the exact same thing while you run the two functions.
In general, USE memmove only if you have to. USE it when there is a very reasonable chance that the source and destination regions are over-lapping.
Reference: https://www.youtube.com/watch?v=Yr1YnOVG-4g (Dr. Jerry Cain, Stanford Intro Systems Lecture 7, at 36:00)