REP MOVSB for overlapped memory
rep movsb always acts exactly as if it ran the byte-at-a-time loop below. Sometimes it can run fast (wide loads/stores) and still be equivalent; sometimes it has to run slowly to preserve the exact semantics, in case dst is close to src in the direction of DF.
    char *rdi, *rsi;
    size_t rcx;                      // incoming register "args"
    for ( ; rcx != 0 ; rcx--) {      // rep movsb. Interruptible after a complete iteration
        *rdi = *rsi;
        rdi += (DF == 0 ? 1 : -1);
        rsi += (DF == 0 ? 1 : -1);
    }
If run with dst = src+1, DF=0, and count = 16 for example, that loop (and thus rep movsb) would repeat the first byte 16 times. Each load would read the value stored by the previous store.
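To make the dst = src+1 case concrete, here is a minimal sketch in portable C that models the loop above. The names emulate_rep_movsb and overlap_demo are made up for illustration, and the df parameter stands in for the DF flag; it models only the architectural behaviour, not how real microcode runs.

```c
#include <stddef.h>

/* Hypothetical helper: models rep movsb one byte at a time.
   df == 0 walks upward, df != 0 walks downward, like the DF flag. */
static void emulate_rep_movsb(unsigned char *rdi, const unsigned char *rsi,
                              size_t rcx, int df)
{
    ptrdiff_t step = (df == 0) ? 1 : -1;
    for (; rcx != 0; rcx--) {
        *rdi = *rsi;    /* each load may see the previous iteration's store */
        rdi += step;
        rsi += step;
    }
}

/* Fills buf with 'A', 'B', 'C', ... (17 bytes), then copies 16 bytes
   forward with dst = src + 1, the overlapping case described above. */
static void overlap_demo(unsigned char buf[17])
{
    for (int i = 0; i < 17; i++)
        buf[i] = (unsigned char)('A' + i);
    emulate_rep_movsb(buf + 1, buf, 16, 0);
}
```

After overlap_demo runs, all 17 bytes of the buffer hold 'A': the first byte has been smeared across the whole destination.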
That's a valid implementation of memcpy, because ISO C doesn't define the behaviour in the overlap case.
But it's not a valid implementation of memmove, which is required to copy as if it read all of the source before overwriting the destination (cppreference). So in this case, memmove has to copy all the bytes forward by 1.
The standard way to achieve that without actually bouncing all the data to a temporary buffer and back is to detect if overlap would be a problem for going forwards, and if so work backwards from the ends of the buffers.
Or, on systems where copying backwards is just as efficient, just branch on an unsigned dst > src compare without bringing the size into it. See "Should pointer comparisons be signed or unsigned in 64-bit x86?" re: the details of how one would do a comparison for possible overlap like dst+size > src.
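The direction choice can be sketched in C roughly as follows. This is an illustration, not how any particular libc does it: sketch_memmove is a hypothetical name, the loops are byte-at-a-time for clarity, and the unsigned-subtraction overlap test assumes a flat address space where converting pointers to uintptr_t preserves their ordering and src + n does not wrap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: a byte-at-a-time memmove that only copies
   backward when a forward copy would clobber unread source bytes. */
static void *sketch_memmove(void *dst_v, const void *src_v, size_t n)
{
    unsigned char *dst = dst_v;
    const unsigned char *src = src_v;

    /* Unsigned (dst - src) < n holds exactly when src <= dst < src + n.
       If dst is below src, the subtraction wraps to a huge value, so the
       forward path is taken. (Assumes a flat address space.) */
    if ((uintptr_t)dst - (uintptr_t)src < n) {
        /* Like DF=1: start at the last byte and work down. */
        for (size_t i = n; i != 0; i--)
            dst[i - 1] = src[i - 1];
    } else {
        /* Like DF=0: the usually faster forward direction. */
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }
    return dst_v;
}
```

The simpler variant mentioned above would replace the condition with (uintptr_t)dst > (uintptr_t)src, at the cost of going backward even when the buffers are far apart.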
Performance
And yes, as AMD says, in current CPUs from AMD and Intel, it's much faster for DF=0, with DF=1 falling back to an actual byte-at-a-time microcode loop, instead of using fast-strings / ERMSB microcode that goes 16, 32, or 64 bytes at a time.
For medium-sized copies and larger (a couple KiB or more), rep movsb on aligned src and dst with DF=0 is similar in speed to an unrolled SIMD loop with the widest vectors the CPU supports, on most CPUs, within maybe 10 or 20%. (Also assuming that dst is far enough ahead of src to not cause overlap with wide SIMD loads in the microcode, or that it's below src. You could test what distance produces a fallback to a slow path.)