How much faster are SSE4.2 string instructions than SSE2 for memcmp?
Solution 1:
You called your function strcmp
, but what you've actually implemented is an alignment-required memcmp(const void *a, const void *b, size_t words)
. Both movdqa
and pcmpeqw xmm0, [mem]
will fault if the pointer isn't 16B-aligned. (Actually, if a+4
isn't 16B-aligned, because you do the first 4 scalar and increment by 4 bytes.)
With the right startup code and movdqu
, you could handle arbitrary alignments (reaching an alignment boundary for the pointer you want to use as a memory operand to pcmpeqw
). For convenience, you could require that both pointers are wide-char-aligned to start with, but you don't need to (especially since you're just returning true/false, not negative / 0 /
positive
as a sort order.)
You're asking about performance of SSE2 pcmpeqw
vs. pcmpistrm
, right? (The explicit-length SSE4.2 instructions like pcmpestrm
have worse throughput than the implicit-length versions, so use the implicit-length versions in your main loop when you're not close to the end of the string. See Agner Fog's instruction tables and microarch guide).
For memcmp (or carefully-implemented strcmp), the best you can do with SSE4.2 is slower than the best you can do with SSE2 (or SSSE3) on most CPUs. Maybe useful for very short strings, but not for the main loop of memcmp.
On Nehalem: pcmpistri
is 4 uops, 2c throughput (with a memory operand), so with no other loop overhead, it can keep up with memory. (Nehalem only has 1 load port). pcmpestri
has 6c throughput: 3x slower.
On Sandybridge through Skylake, pcmpistri xmm0, [eax]
has 3c throughput, so it's a factor of 3 too slow to keep up with 1 vector per clock (2 load ports). pcmpestri
has 4c throughput on most of those, so it's not as much worse. (Maybe useful for the last partial-vector, but not in the main loop).
On Silvermont/KNL, pcmpistrm
is the fastest, and runs at one per 14 cycle throughput, so it's total garbage for simple stuff.
On AMD Jaguar, pcmpistri
is 2c throughput, so it might actually be usable (only one load port). pcmpestri
is 5c throughput, so it sucks.
On AMD Ryzen, pcmpistri
is also 2c throughput, so it's crap there. (2 load ports and 5 uops per clock front-end throughput (or 6 uops if any (or all?) are from multi-uop instructions) mean you can go faster.
On AMD Bulldozer-family, pcmpistri
has 3c throughput until Steamroller, where it's 5c. pcmpestri
has 10c throughput. They're micro-coded as 7 or 27 m-ops, so AMD didn't spend a lot of silicon on them.
On most CPUs, they're only worth it if you're taking full advantage of them for stuff you can't do with just pcmpeq
/pmovmskb
. But if you can use AVX2 or especially AVX512BW, even doing complicated things might be faster with more instructions on wider vectors. (There are no wider versions of the SSE4.2 string instructions.) Maybe the SSE4.2 string instructions are still useful for functions that usually deal with short strings, because wide vector loops usually need more startup / cleanup overhead. Also, in a program that doesn't spend much time in SIMD loops, using AVX or AVX512 in one small function will still reduce your max turbo clock speed for the next millisecond or so, and could easily be a net loss.
A good inner loop should bottleneck on load throughput, or come as close as possible. movqdu
/ pcmpeqw [one-register]
/ pmovmskb
/ macro-fused-cmp+jcc is only 4 fused-domain uops, so this is almost achievable on Sandybridge-family CPUs
See https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for an implementation and some benchmarks, but that's for C-style implicit-length strings where you have to check for 0
bytes. It looks like you're using explicit-length strings, so after checking that the lengths are equal, it's just memcmp
. (Or I guess if you need to find the sort order instead of just equal / not-equal, you'd have to memcmp out to the end of the shorter string.)
For strcmp with 8-bit strings, on most CPUs it's faster not to use the SSE4.2 string instructions. See the comments on the strchr.com article for some benchmarks (of that implicit-length string version). glibc for example doesn't use the SSE4.2 string instructions for strcmp
, because they're not faster on most CPUs. They might be a win for strstr
though.
glibc has several SSE2/SSSE3 asm strcmp
and memcmp
implementations. (It's LGPLed, so you can't just copy it into non-GPL projects, but have a look at what they do.) Some of the string functions (like strlen) only branch per 64 bytes, and then come back to sort out which byte within the cache line had the hit. But their memcmp implementation just unrolls with movdqu / pcmpeqb
. You can use pcmpeqw
since you want to know the position of the first 16-bit element that's different, rather than the first byte.
Your SSE2 implementation could be even faster. You should use the indexed addressing mode with movdqa since it won't micro-fuse with pcmpeqw (on Intel Sandybridge/Ivybridge; fine on Nehalem or Haswell+), but pcmpeqw xmm0, [eax]
will stay micro-fused without unlaminating.
You should unroll a couple times to reduce loop overhead. You should combine the pointer-increment with the loop counter so you cmp/jb
instead of sub/ja
: macro-fusion on more CPUs, and avoids writing a register (reducing the amount of physical registers needed for register-renaming).
Your inner loop, on Intel Sandybridge/Ivybridge, will run
@@To1:
movdqa xmm0, DQWORD PTR [eax] // 1 uop
pcmpeqw xmm0, DQWORD PTR [eax+edx] // 2 uops on Intel SnB/IvB, 1 on Nehalem and earlier or Haswell and later.
pmovmskb ebx, xmm0 // 1 uop
cmp ebx, 65535
jne @@Final // 1 uop (macro-fused with cmp)
add eax, 16 // 1 uop
sub ecx, 8
{ AnsiChar : sub ecx, 16 }
ja @@To1 // 1 uop (macro-fused with sub on SnB and later, otherwise 2)
This is 7 fused-domain uops, so it can only issue from the front-end at best 7/4 cycles per iteration on mainstream Intel CPUs. This is very far from bottlenecking on 2 loads per clock. On Haswell and later, it's 6/4 cycles per iteration, because indexed addressing modes can stay micro-fused with 2-operand load-modify instruction like pcmpeqw
, but not anything else (like pabsw xmm0, [eax+edx]
(doesn't read destination) or AVX vpcmpeqw xmm0, xmm0, [eax+edx]
(3 operands)). See Micro fusion and addressing modes.
This could be more efficient for small strings with better setup/cleanup, too.
In your pointer-setup code, you could save a cmp
if you check for NULL pointers first. You can sub
/ jne
to subtract and check for both equal with the same macro-fused compare and branch. (It will only macro-fuse on Intel Sandybridge-family, and only Haswell can make 2 macro-fusions in a single decode block. But Haswell/Broadwell/Skylake CPUs are common and becoming ever more common, and this has no downside for other CPUs unless equal-pointers is so common that doing that check first matters.)
In your return path: Always use xor eax,eax
to zero a register whenever possible, not mov eax, 0
.
You don't seem to avoid reading from past the end of the string. You should test your function with strings that end right at the end of a page, where the next page is unmapped.
xor ebx, [eax + edx]
has zero advantages over cmp
for the early-out scalar test. cmp/jnz
can macro-fuse with the jcc, but xor
can't.
You load a mask to handle the cleanup to cover the case where you read past the end of the string. You could probably still use the usual bsf
to find the first difference in the bitmap. I guess invert it with not
to find the first position that didn't compare equal, and check that that's less than the remaining string length.
Or you could generate the mask on the fly with mov eax, -1
and shr
, I think. Or for loading it, you can sometimes use a sliding window into a ...,0,0,0,-1,-1,-1,...
array, but you need sub-byte offsets so that doesn't work. (It works well for vector masks, if you wanted to mask and redo the pmovmskb
. Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all).
Your way isn't bad, as long as it doesn't cache miss. I'd probably go for generating the mask on the fly. Maybe before the loop in another register, because you can mask to get count % 8
, so the mask-generation can happen in parallel with the loop.