SIMD instructions lowering CPU frequency

I read this article. It talked about why AVX-512 instruction:

Intel’s latest processors have advanced instructions (AVX-512) that may cause the core, or maybe the rest of the CPU to run slower because of how much power they use.

I think on Agner's blog also mentioned something similar (but I can't find the exact post).

I wonder what other instructions supported by Skylake have the similar effect that they will lower the power to maximize the throughput later? All the v prefixed instructions (such as vmovapd, vmulpd, vaddpd, vsubpd, vfmadd213pd)?

I am trying to compile a list of instructions to avoid when compiling my C++ application for Xeon Skylake.

The frequency impact depends on the width of the operation and the specific instruction used.

There are three frequency levels, so-called licenses, from fastest to slowest: L0, L1 and L2. L0 is the "nominal" speed you'll see written on the box: when the chip says "3.5 GHz turbo", they are referring to the single-core L0 turbo. L1 is a lower speed sometimes called AVX turbo or AVX2 turbo⁵, originally associated with AVX and AVX2 instructions¹. L2 is a lower speed than L1, sometimes called "AVX-512 turbo".

The exact speeds for each license also depend on the number of active cores. For up to date tables, you can usually consult WikiChip. For example, the table for the Xeon Gold 5120 is here:

Xeon Gold 5120 Frequencies

The Normal, AVX2 and AVX512 rows correspond to the L0, L1 and L2 licenses respectively. Note that the relative slowdown for L1 and L2 licenses generally gets worse as the number of cores increase: for 1 or 2 active cores the L1 and L2 speeds are 97% and 91% of L0, but for 13 or 14 cores they are 85% and 62% respectively. This varies by chip, but the general trend is usually the same.

Those preliminaries out of the way, let's get to what I think you are asking: which instructions cause which licenses to be activated?

Here's a table, showing the implied license for instructions based on their width and their categorization as light or heavy:

   Width    Light   Heavy  
 --------- ------- ------- 
  Scalar    L0      N/A
  128-bit   L0      L0     
  256-bit   L0      L1*    
  512-bit   L1      L2*

*soft transition (see below)

So we immediately see that all scalar (non-SIMD) instructions and all 128-bit wide instructions² always run at full speed in the L0 license.

256-bit instructions will run in L0 or L1, depending on whether they are light or heavy, and 512-bit instructions will run in L1 or L2 on the same basis.

So what is this light and heavy thing?

Light vs Heavy

It's easiest to start by explaining heavy instructions.

Heavy instructions are all SIMD instructions that need to run on the FP/FMA unit. Basically that's the majority of the FP instructions (those usually ending in ps or pd, like addpd) as well as integer multiplication instructions which largely start with vpmul or vpmad since SIMD integer multiplication actually runs on the SIMD unit, as well as vplzcnt(q|d) which apparently also runs on the FMA unit.

Given that, light instructions are everything else. In particular, integer arithmetic other than multiplication, logical instructions, shuffles/blends (including FP) and SIMD load and store are light.

Transitions

The L1 and L2 entries in the Heavy column are marked with an asterisk, like L1*. That's because these instructions cause a soft transition when they occur. The other L1 entry (for 512-bit light instructions) causes a hard transition. Here we'll discuss the two transition types.

Hard Transition

A hard transition occurs immediately as soon as any instruction with the given license executes⁴. The CPU stops, takes some halt cycles and enters the new mode.

Soft Transition

Unlike hard transitions, a soft transition doesn't occur immediately as soon as any instruction is executed. Rather, the instructions initially execute with a reduced throughput (as slow as 1/4 their normal rate), without changing the frequency. If the CPU decides that "enough" heavy instructions are executing per unit time, and a specific threshold is reached, a transition to the higher-numbered license occurs.

That is, the CPU understands that if only a few heavy instructions arrive, or even if many arrive but they aren't dense when considering other non-heavy instructions, it may not be worth reducing the frequency.

Guidelines

Given the above, we can establish some reasonable guidelines. You never have to be scared of 128-bit instructions, since they never cause license related³ downclocking.

Furthermore, you never have to be worried about light 256-bit wide instructions either, since they also don't cause downclocking. If you aren't using a lot of vectorized FP math, you aren't likely to be using heavy instructions, so this would apply to you. Indeed, compilers already liberally insert 256-bit instructions when you use the appropriate -march option, especially for data movement and auto-vectorized loops.

Using heavy AVX/AVX2 instructions and light AVX-512 instructions is trickier, because you will run in the L1 licenses. If only a small part of your process (say 10%) can take advantage, it probably isn't worth slowing down the rest of your application. The penalties associated with L1 are generally moderate - but check the details for your chip.

Using heavy AVX-512 instructions is even trickier, because the L2 license comes with serious frequency penalties on most chips. On the other hand, it is important to note that only FP and integer multiply instructions fall into the heavy category, so as a practical matter a lot of integer 512-bit wide use will only incur the L1 license.

¹ Although, as we'll see, this a bit of a misnomer because AVX-512 instructions can set the speed to this license, and some AVX/2 instructions don't.

² 128-bit wide means using xmm registers, regardless of what instruction set they were introduced in - mainstream AVX-512 contains 128-bit variants for most/all new instructions.

³ Note the weasel clause license related - you may certainly suffer other causes of downclocking, such as thermal, power or current limits, and it is possible that 128-bit instructions could trigger this, but I think it is fairly unlikely on a desktop or server system (low power, small form factor devices are another matter).

⁴ Evidently, we are talking only about transitions to a higher-level license, e.g., from L0 to L1 when a hard-transition L1 instruction executes. If you are already in L1 or L2 nothing happens - there is no transition if you are already in the same level and you don't transition to lower-numbered levels based on any specific instruction but rather running for a certain time without any instructions of the higher-numbered level.

⁵ Out of the two AVX2 turbo is more common, which I never really understood because 256-bit instructions are as much associated with AVX as compared to AVX2, and most of the heavy instructions which actually trigger AVX turbo (L1 license) are actually FP instructions in AVX, not AVX2. The only exception is AVX2 integer multiplies.

It's not the instruction mnemonic that matters, it's 512-bit vector width at all that matters.

You can use the 256-bit version of AVX-512VL instructions, e.g. vpternlogd ymm0, ymm1, ymm2 without incurring the AVX-512 turbo penalty.

Related: Dynamically determining where a rogue AVX-512 instruction is executing is about a case where one AVX-512 instruction in glibc init code or something left a dirty upper ZMM that gimped max turbo for the rest of the process lifetime. (Or until a vzeroupper maybe)

Although there can be other turbo impacts from light / heavy use of 256-bit FP math instructions, and some of that is due to heat. But usually 256-bit is worth it on modern CPUs.

Anyway, this is why gcc -march=skylake-avx512 defaults to -mprefer-vector-width=256. For any given workload, it's worth trying -mprefer-vector-width=512 and maybe also 128, depending on how much or how little of the work can usefully auto-vectorize.

Tell GCC to tune for your CPU (e.g. -march=native) and it will hopefully make good choices. Although on a desktop Skylake-X, the turbo penalty is smaller than a Xeon. And if your code does actually benefit from 512-bit vectorization, it can be worth it to pay the penalty.

(Also beware the other major effect of Skylake-family CPUs going into 512-bit vector mode: the vector ALUs on port 1 shut down, so only scalar instructions like popcnt or add can use port 1. So vpand and vpaddb etc. throughput drops from 3 to 2 per clock. And if you're on an SKX with two 512-bit FMA units, the extra one on port 5 powers up, so then FMAs compete with shuffles.)