What's the difference between logical SSE intrinsics?
Solution 1:
- Is there any difference between using one or another intrinsic (with appropriate type casting). Won't there be any hidden costs like longer execution in some specific situation?
Yes, there can be performance reasons to choose one vs. the other.
1: Sometimes there is an extra cycle or two of latency (forwarding delay) if the output of an integer execution unit needs to be routed to the input of an FP execution unit, or vice versa. It takes a LOT of wires to move 128b of data to any of many possible destinations, so CPU designers have to make tradeoffs, like only having a direct path from every FP output to every FP input, not to ALL possible inputs.
See this answer, or Agner Fog's microarchitecture doc for bypass-delays. Search for "Data bypass delays on Nehalem" in Agner's doc; it has some good practical examples and discussion. He has a section on it for every microarch he has analysed.
However, the delays for passing data between the different domains or different types of registers are smaller on the Sandy Bridge and Ivy Bridge than on the Nehalem, and often zero. -- Agner Fog's micro arch doc
Remember that latency doesn't matter if it isn't on the critical path of your code (except sometimes on Haswell/Skylake where it infects later use of the produced value, long after actual bypass :/). Using pshufd
instead of movaps + shufps
can be a win if uop throughput is your bottleneck, rather than latency of your critical path.
2: The ...ps
version takes 1 fewer byte of code than the other two for legacy-SSE encoding. (Not AVX). This will align the following instructions differently, which can matter for the decoders and/or uop cache lines. Generally smaller is better for better code density in I-cache and fetching code from RAM, and packing into the uop cache.
3: Recent Intel CPUs can only run the FP versions on port5.
-
Merom (Core2) and Penryn:
orps
can run on p0/p1/p5, but integer-domain only. Presumably all 3 versions decoded into the exact same uop. So the cross-domain forwarding delay happens. (AMD CPUs do this too: FP bitwise instructions run in the ivec domain.) -
Nehalem / Sandybridge / IvB / Haswell / Broadwell:
por
can run on p0/p1/p5, butorps
can run only on port5. p5 is also needed by shuffles, but the FMA, FP add, and FP mul units are on ports 0/1. -
Skylake:
por
andorps
both have 3-per-cycle throughput. Intel's optimization manual has some info about bypass forwarding delays: to/from FP instructions it depends on which port the uop ran on. (Usually still port 5 because the FP add/mul/fma units are on ports 0 and 1.) See also Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says - "bypass" latency can affect every use of the register until it's overwritten.
Note that on SnB/IvB (AVX but not AVX2), only p5 needs to handle 256b logical ops, as vpor ymm, ymm
requires AVX2. This was probably not the reason for the change, since Nehalem did this.
How to choose wisely:
Keep in mind that compilers can use por
for _mm_or_pd
if they want, so some of this applies mostly to hand-written asm. But some compilers are somewhat faithful to the intrinsics you choose.
If logical op throughput on port5 could be a bottleneck, then use the integer versions, even on FP data. This is especially true if you want to use integer shuffles or other data-movement instructions.
AMD CPUs always use the integer domain for logicals, so if you have multiple integer-domain things to do, do them all at once to minimize round-trips between domains. Shorter latencies will get things cleared out of the reorder buffer faster, even if a dep chain isn't the bottleneck for your code.
If you just want to set/clear/flip a bit in FP vectors between FP add and mul instructions, use the ...ps
logicals, even on double-precision data, because single and double FP are the same domain on every CPU in existence, and the ...ps
versions are one byte shorter (without AVX).
There are practical / human-factor reasons for using the ...pd
versions, though, with intrinsics. Readability of your code by other humans is a factor: They'll wonder why you're treating your data as singles when it's actually doubles. For C/C++ intrinsics, littering your code with casts between __m128
and __m128d
is not worth it. (And hopefully a compiler will use orps
for _mm_or_pd
anyway, if compiling without AVX where it will actually save a byte.)
If tuning on the level of insn alignment matters, write in asm directly, not intrinsics! (Having the instruction one byte longer might align things better for uop cache line density and/or decoders, but with prefixes and addressing modes you can extend instructions in general)
For integer data, use the integer versions. Saving one instruction byte isn't worth the bypass-delay between paddd
or whatever, and integer code often keeps port5 fully occupied with shuffles. For Haswell, many shuffle / insert / extract / pack / unpack instructions became p5 only, instead of p1/p5 for SnB/IvB. (Ice Lake finally added a shuffle unit on another port for some more common shuffles.)
- These intrinsics maps to three different x86 instructions (
por
,orps
,orpd
). Does anyone have any ideas why Intel is wasting precious opcode space for several instructions which do the same thing?
If you look at the history of these instruction sets, you can kind of see how we got here.
por (MMX): 0F EB /r
orps (SSE): 0F 56 /r
orpd (SSE2): 66 0F 56 /r
por (SSE2): 66 0F EB /r
MMX existed before SSE, so it looks like opcodes for SSE (...ps
) instructions were chosen out of the same 0F xx
space. Then for SSE2, the ...pd
version added a 66
operand-size prefix to the ...ps
opcode, and the integer version added a 66
prefix to the MMX version.
They could have left out orpd
and/or por
, but they didn't. Perhaps they thought that future CPU designs might have longer forwarding paths between different domains, and so using the matching instruction for your data would be a bigger deal. Even though there are separate opcodes, AMD and early Intel treated them all the same, as int-vector.
Related / near duplicate:
- What is the point of SSE2 instructions such as orpd? also summarizes the history. (But I wrote it 5 years later.)
- Difference between the AVX instructions vxorpd and vpxor
- Does using mix of pxor and xorps affect performance?
Solution 2:
According to Intel and AMD optimization guidelines mixing op types with data types produces a performance hit as the CPU internally tags 64 bit halves of the register for a particular data type. This seems to mostly effect pipe-lining as the instruction is decoded and the uops are scheduled. Functionally they produce the same result. The newer versions for the integer data types have larger encoding and take up more space in the code segment. So if code size is a problem use the old ops as these have smaller encoding.
Solution 3:
I think all three are effectively the same, i.e. 128 bit bitwise operations. The reason different forms exist is probably historical, but I'm not certain. I guess it's possible that there may be some additional behaviour in the floating point versions, e.g. when there are NaNs, but this is pure guesswork. For normal inputs the instructions seem to be interchangeable, e.g.
#include <stdio.h>
#include <emmintrin.h>
#include <pmmintrin.h>
#include <xmmintrin.h>
int main(void)
{
__m128i a = _mm_set1_epi32(1);
__m128i b = _mm_set1_epi32(2);
__m128i c = _mm_or_si128(a, b);
__m128 x = _mm_set1_ps(1.25f);
__m128 y = _mm_set1_ps(1.5f);
__m128 z = _mm_or_ps(x, y);
printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
printf("x = %vf, y = %vf, z = %vf\n", x, y, z);
c = (__m128i)_mm_or_ps((__m128)a, (__m128)b);
z = (__m128)_mm_or_si128((__m128i)x, (__m128i)y);
printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
printf("x = %vf, y = %vf, z = %vf\n", x, y, z);
return 0;
}
Terminal:
$ gcc -Wall -msse3 por.c -o por
$ ./por
a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000
a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000