Using AVX CPU instructions: Poor performance without "/arch:AVX"
2021 update: Modern versions of MSVC don't need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did.
The behavior that you are seeing is the result of expensive state-switching.
See page 102 of Agner Fog's manual:
http://www.agner.org/optimize/microarchitecture.pdf
Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high penalty (~70 clock cycles).
When you compile without /arch:AVX, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you'll get code that has both SSE and AVX instructions - which will have those state-switching penalties. (VS2010 knows this, so it emits that warning you're seeing.)
Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVX tells the compiler to use all AVX.
It sounds like you're trying to make multiple code paths: one for SSE, and one for AVX.
For this, I suggest you separate your SSE and AVX code into two different compilation units (one compiled with /arch:AVX and one without). Then link them together and make a dispatcher that chooses based on the hardware it's running on.
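A minimal sketch of such a dispatcher for MSVC is below. The functions transform_sse() and transform_avx() are hypothetical placeholders for the two code paths, each defined in its own source file (the AVX one built with /arch:AVX, the SSE one and this dispatcher without); the AVX check uses CPUID and XGETBV because both the CPU and the OS must support the YMM state.

// Minimal dispatch sketch (MSVC). transform_sse() / transform_avx() are
// hypothetical names for the two separately compiled code paths.
#include <intrin.h>       // __cpuid
#include <immintrin.h>    // _xgetbv

void transform_sse(float* data, int n);   // built without /arch:AVX
void transform_avx(float* data, int n);   // built with /arch:AVX

static bool cpu_supports_avx()
{
    int info[4];
    __cpuid(info, 1);
    const bool osxsave = (info[2] & (1 << 27)) != 0;  // OS uses XSAVE/XRSTOR
    const bool avx     = (info[2] & (1 << 28)) != 0;  // CPU supports AVX
    if (!osxsave || !avx)
        return false;
    // The OS must also save/restore XMM and YMM state (XCR0 bits 1 and 2).
    return (_xgetbv(0) & 0x6) == 0x6;
}

void transform(float* data, int n)   // the dispatcher
{
    static const bool has_avx = cpu_supports_avx();
    if (has_avx)
        transform_avx(data, n);
    else
        transform_sse(data, n);
}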
If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.
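For illustration, a sketch of that VS2010-era pattern (the function names and the SSE-only library routine are made up; modern compilers emit the vzeroupper for you):

#include <immintrin.h>

// Hypothetical SSE-only routine living in separately compiled (non-AVX) code.
void legacy_sse_postprocess(float* data, int n);

void process(float* data, int n)
{
    // AVX part: 8 floats per iteration (remainder handling omitted).
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, _mm256_set1_ps(2.0f)));
    }

    // Clear the upper halves of the YMM registers before any legacy SSE
    // instructions run, so no AVX<->SSE transition penalty is paid.
    _mm256_zeroupper();

    legacy_sse_postprocess(data, n);
}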
tl;dr for old versions of MSVC only
Use _mm256_zeroupper(); or _mm256_zeroall(); around sections of code using AVX (before or after, depending on function arguments). Only use option /arch:AVX for source files with AVX, rather than for an entire project, to avoid breaking support for legacy-encoded SSE-only code paths.
In modern MSVC (and the other mainstream compilers, GCC/clang/ICC), the compiler knows when to use a vzeroupper asm instruction. Forcing extra vzeroupper instructions with intrinsics can hurt performance when inlining. See Do I need to use _mm256_zeroupper in 2021?
Cause
I think the best explanation is in the Intel article, "Avoiding AVX-SSE Transition Penalties" (PDF). The abstract states:
Transitioning between 256-bit Intel® AVX instructions and legacy Intel® SSE instructions within a program may cause performance penalties because the hardware must save and restore the upper 128 bits of the YMM registers.
Separating your AVX and SSE code into different compilation units may NOT help if you switch between calling code from both SSE-enabled and AVX-enabled object files, because the transition may occur when AVX instructions or assembly are mixed with any of (from the Intel paper):
- 128-bit intrinsic instructions
- SSE inline assembly
- C/C++ floating point code that is compiled to Intel® SSE
- Calls to functions or libraries that include any of the above
This means there may even be penalties when linking with external code using SSE.
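As an illustration of the third bullet (the function below is made up): when a file containing AVX intrinsics is compiled without VEX encoding for ordinary floating point code, even a scalar expression between two intrinsic calls can be emitted as a legacy SSE instruction and force a transition.

#include <immintrin.h>
#include <math.h>

// Illustrative only: if this file is compiled without /arch:AVX, the scalar
// math below may be emitted as legacy SSE (e.g. sqrtss/mulss) while the
// intrinsics around it are AVX, so the YMM uppers are dirty at the
// transition points.
__m256 mixed(const float* a, float scale)
{
    __m256 v = _mm256_loadu_ps(a);            // AVX: dirties the YMM uppers

    float norm = sqrtf(scale) * 0.5f;         // scalar SSE -> state transition

    return _mm256_mul_ps(v, _mm256_set1_ps(norm));  // back to AVX -> another transition
}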
Details
There are 3 processor states defined by the AVX instructions, and one of the states is where all of the YMM registers are split, allowing the lower half to be used by SSE instructions. The Intel document "Intel® AVX State Transitions: Migrating SSE Code to AVX" provides a diagram of these states:
When in state B (AVX-256 mode), all bits of the YMM registers are in use. When an SSE instruction is called, a transition to state C must occur, and this is where there is a penalty. The upper half of all YMM registers must be saved into an internal buffer before SSE can start, even if they happen to be zeros. The cost of the transitions is on the "order of 50-80 clock cycles on Sandy Bridge hardware". There is also a penalty going from C -> A, as diagrammed in Figure 2.
You can also find details about the state switching penalty causing this slowdown on page 130, Section 9.12, "Transitions between VEX and non-VEX modes" in Agner Fog's optimization guide (version updated 2014-08-07), referenced in Mystical's answer. According to his guide, any transition to/from this state takes "about 70 clock cycles on Sandy Bridge". Just as the Intel document states, this is an avoidable transition penalty.
Skylake has a different dirty-upper mechanism, which causes false dependencies for legacy-SSE instructions with dirty uppers rather than one-time penalties. See Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
Resolution
To avoid the transition penalties you can either remove all legacy SSE code, instruct the compiler to convert all SSE instructions to their VEX-encoded 128-bit form (if the compiler is capable), or put the YMM registers in a known zero state before transitioning between AVX and SSE code. Essentially, to maintain the separate SSE code path, you must zero out the upper 128 bits of all 16 YMM registers (issuing a VZEROUPPER instruction) after any code that uses AVX instructions. Zeroing these bits manually forces a transition to state A, and avoids the expensive penalty since the YMM values do not need to be stored in an internal buffer by hardware. The intrinsic that performs this instruction is _mm256_zeroupper. The description for this intrinsic is very informative:
This intrinsic is useful to clear the upper bits of the YMM registers when transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions. There is no transition penalty if an application clears the upper bits of all YMM registers (sets to ‘0’) via VZEROUPPER, the corresponding instruction for this intrinsic, before transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions.
In Visual Studio 2010+ (maybe even older), you get this intrinsic with immintrin.h.
Note that zeroing out the bits with other methods does not eliminate the penalty - the VZEROUPPER or VZEROALL instructions must be used.
One automatic solution implemented by the Intel Compiler is to insert a VZEROUPPER at the beginning of each function containing Intel AVX code if none of the arguments are a YMM register or __m256/__m256d/__m256i datatype, and at the end of functions if the returned value is not a YMM register or __m256/__m256d/__m256i datatype.
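Sketched by hand below (this is not actual compiler output, and the function is made up), the rule means a function like this one, whose arguments and return value are all scalar, would get a VZEROUPPER at both marked points:

#include <immintrin.h>

// Neither the arguments nor the return value are __m256/__m256d/__m256i,
// so the Intel Compiler would insert VZEROUPPER at the two points marked.
void scale_buffer(float* data, int n, float factor)
{
    // <-- VZEROUPPER inserted here (function entry)
    const __m256 f = _mm256_set1_ps(factor);
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, f));
    }
    // <-- VZEROUPPER inserted here (function exit), so the caller can safely
    //     continue with legacy SSE code
}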
In the wild
This VZEROUPPER solution is used by FFTW to generate a library with both SSE and AVX support. See simd-avx.h:
/* Use VZEROUPPER to avoid the penalty of switching from AVX to SSE.
See Intel Optimization Manual (April 2011, version 248966), Section
11.3 */
#define VLEAVE _mm256_zeroupper
Then VLEAVE(); is called at the end of every function using intrinsics for AVX instructions.