Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%
Solution 1:
IACA Analysis
Using IACA (the Intel Architecture Code Analyzer) reveals that macro-op fusion is indeed occurring, and that it is not the problem. It is Mysticial who is correct: The problem is that the store isn't using Port 7 at all.
IACA reports the following:
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.55 Cycles Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.5 | 1.5 1.0 | 1.5 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [rdi+rax*1]
| 2 | 0.5 | 0.5 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [rsi+rax*1]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | CP | vmovaps ymmword ptr [rdx+rax*1], ymm1
| 1 | | | | | | | 1.0 | | | add rax, 0x20
| 0F | | | | | | | | | | jnz 0xffffffffffffffec
Total Num Of Uops: 6
In particular, the reported block throughput in cycles (1.5) jibes well with an efficiency of 66%: the ideal is one iteration per clock cycle (two loads and one store every clock), so 1/1.5 ≈ 66.7%.
A post on IACA's own website about this very phenomenon (Tue, 03/11/2014 - 12:39) was met with this reply from an Intel employee (Tue, 03/11/2014 - 23:20):
Port7 AGU can only work on stores with simple memory address (no index register). This is why the above analysis doesn't show port7 utilization.
This firmly settles why Port 7 wasn't being used.
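To illustrate the restriction, here is a quick sketch (not from the original code; the displacement and register choice are arbitrary) contrasting the two addressing modes for the store:

vmovaps [rdx+rax*1], ymm1   ; base + index: the store's AGU uop must run on Port 2 or 3
vmovaps [rdx+0x20], ymm1    ; base + displacement only: the store's AGU uop is eligible for Port 7

With the indexed form used in the loop, every store competes with the two loads for Ports 2 and 3, which is exactly the PORT2_AGU/PORT3_AGU bottleneck IACA reports above.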
Now, contrast the above with a 32x unrolled loop (it turns out unroll16 should actually be called unroll32):
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 32.00 Cycles Throughput Bottleneck: PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA, Port4, Port7
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 16.0 0.0 | 16.0 | 32.0 32.0 | 32.0 32.0 | 32.0 | 2.0 | 2.0 | 32.0 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x20]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x20]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x20], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x40]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x40]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x40], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x60]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x60]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x60], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x80]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x80]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x80], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0xa0]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xa0]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0xa0], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0xc0]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xc0]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0xc0], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0xe0]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xe0]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0xe0], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x100]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x100]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x100], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x120]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x120]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x120], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x140]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x140]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x140], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x160]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x160]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x160], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x180]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x180]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x180], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x1a0]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1a0]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x1a0], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x1c0]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1c0]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x1c0], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x1e0]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1e0]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x1e0], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x200]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x200]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x200], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x220]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x220]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x220], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x240]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x240]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x240], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x260]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x260]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x260], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x280]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x280]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x280], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x2a0]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2a0]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x2a0], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x2c0]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2c0]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x2c0], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x2e0]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2e0]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x2e0], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x300]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x300]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x300], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x320]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x320]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x320], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x340]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x340]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x340], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x360]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x360]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x360], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x380]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x380]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x380], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x3a0]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3a0]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x3a0], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x3c0]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3c0]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x3c0], ymm1
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [r9+0x3e0]
| 2^ | | 1.0 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3e0]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovaps ymmword ptr [r11+0x3e0], ymm1
| 1 | | | | | | 1.0 | | | | add r9, 0x400
| 1 | | | | | | | 1.0 | | | add r10, 0x400
| 1 | | | | | | 1.0 | | | | add r11, 0x400
| 1 | | | | | | | 1.0 | | | cmp r9, rcx
| 0F | | | | | | | | | | jnz 0xfffffffffffffcaf
Total Num Of Uops: 164
We see here micro-fusion and correct scheduling of the store to Port 7.
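For reference, this is roughly the loop shape that produces the scheduling above, reconstructed from the IACA listing (the label is made up, only the first two of the 32 unrolled iterations are shown, and the real unroll32 code may differ in detail):

.L3:
vmovaps     ymm1, [r9]             ; load x (AGU + data on Port 2 or 3)
vfmadd231ps ymm1, ymm2, [r10]      ; load y folded into the FMA (load on Port 2/3, FMA on Port 0/1)
vmovaps     [r11], ymm1            ; store z: simple address, so AGU on Port 7, data on Port 4
vmovaps     ymm1, [r9+0x20]
vfmadd231ps ymm1, ymm2, [r10+0x20]
vmovaps     [r11+0x20], ymm1
; ... 30 more copies with displacements increasing by 0x20 up to 0x3e0 ...
add  r9, 0x400                     ; advance all three pointers by 32 iterations * 32 bytes
add  r10, 0x400
add  r11, 0x400
cmp  r9, rcx
jnz  .L3

Because every store uses only a base register plus a constant displacement, its address generation can go to Port 7, leaving Ports 2 and 3 free to sustain two loads per cycle.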
Manual Analysis (see edit above)
I can now answer the second of your questions: "Is this possible without unrolling and if so how can it be done?" The answer is no.
I padded the arrays x, y and z on the left and right with plenty of buffer for the experiment below, and changed the inner loop to the following:
.L2:
vmovaps ymm1, [rdi+rax] ; 1L
vmovaps ymm0, [rsi+rax] ; 2L
vmovaps [rdx+rax], ymm2 ; S1
add rax, 32 ; ADD
jne .L2 ; JMP
This intentionally does not use FMA (only loads and stores), and all load/store instructions have no dependencies, so there should be no hazards whatsoever preventing their issue to any execution port.
I then tested every single permutation of the first and second loads (1L and 2L), the store (S1) and the add (A) while leaving the conditional jump (J) at the end, and for each of these I tested every possible combination of offsets of x, y and z by 0 or -32 bytes (to correct for the fact that reordering the add rax, 32 before one of the r+r indexes would cause the load or store to target the wrong address). The loop was aligned to 32 bytes. The tests were run on a 2.4GHz i7-4700MQ with TurboBoost disabled by means of echo '0' > /sys/devices/system/cpu/cpufreq/boost under Linux, and using 2.4 for the frequency constant. Here are the efficiency results for all 24 permutations:
Cases: 0 1 2 3 4 5 6 7
L1 L2 S L1 L2 S L1 L2 S L1 L2 S L1 L2 S L1 L2 S L1 L2 S L1 L2 S
-0 -0 -0 -0 -0 -32 -0 -32 -0 -0 -32 -32 -32 -0 -0 -32 -0 -32 -32 -32 -0 -32 -32 -32
________________________________________________________________________________________________
12SAJ: 65.34% 65.34% 49.63% 65.07% 49.70% 65.05% 49.22% 65.07%
12ASJ: 48.59% 64.48% 48.74% 49.69% 48.75% 49.69% 48.99% 48.60%
1A2SJ: 49.69% 64.77% 48.67% 64.06% 49.69% 49.69% 48.94% 49.69%
1AS2J: 48.61% 64.66% 48.73% 49.71% 48.77% 49.69% 49.05% 48.74%
1S2AJ: 49.66% 65.13% 49.49% 49.66% 48.96% 64.82% 49.02% 49.66%
1SA2J: 64.44% 64.69% 49.69% 64.34% 49.69% 64.41% 48.75% 64.14%
21SAJ: 65.33%* 65.34% 49.70% 65.06% 49.62% 65.07% 49.22% 65.04%
21ASJ: Hypothetically =12ASJ
2A1SJ: Hypothetically =1A2SJ
2AS1J: Hypothetically =1AS2J
2S1AJ: Hypothetically =1S2AJ
2SA1J: Hypothetically =1SA2J
S21AJ: 48.91% 65.19% 49.04% 49.72% 49.12% 49.63% 49.21% 48.95%
S2A1J: Hypothetically =S1A2J
SA21J: Hypothetically =SA12J
SA12J: 64.69% 64.93% 49.70% 64.66% 49.69% 64.27% 48.71% 64.56%
S12AJ: 48.90% 65.20% 49.12% 49.63% 49.03% 49.70% 49.21%* 48.94%
S1A2J: 49.69% 64.74% 48.65% 64.48% 49.43% 49.69% 48.66% 49.69%
A2S1J: Hypothetically =A1S2J
A21SJ: Hypothetically =A12SJ
A12SJ: 64.62% 64.45% 49.69% 64.57% 49.69% 64.45% 48.58% 63.99%
A1S2J: 49.72% 64.69% 49.72% 49.72% 48.67% 64.46% 48.95% 49.72%
AS21J: Hypothetically =AS12J
AS12J: 48.71% 64.53% 48.76% 49.69% 48.76% 49.74% 48.93% 48.69%
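To make the offset correction concrete, here is one of the higher-plateau entries written out in full, using the same registers as the loop above (a sketch of the 12ASJ permutation under Case 1, where the add has been hoisted ahead of the store and the store's displacement is therefore corrected by -32):

.L2:
vmovaps ymm1, [rdi+rax]      ; 1L
vmovaps ymm0, [rsi+rax]      ; 2L
add     rax, 32              ; A, now executed before the store
vmovaps [rdx+rax-32], ymm2   ; S, -32 compensates for the earlier add
jne     .L2                  ; J

This corresponds to the 64.48% entry in the 12ASJ row.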
We can notice a few things from the table:
- Several plateaux of results, but two main ones only: Just under 50% and around 65%.
- L1 and L2 can permute freely between each other without affecting the result.
- Offsetting the accesses by -32 bytes can change efficiency.
- The patterns we are interested in (Load 1, Load 2, Store 1 and Jump with the Add anywhere around them and the -32 offsets properly applied) are all the same, and all in the higher plateau:
  - 12SAJ Case 0 (no offsets applied), with efficiency 65.34% (the highest)
  - 12ASJ Case 1 (S-32), with efficiency 64.48%
  - 1A2SJ Case 3 (2L-32, S-32), with efficiency 64.06%
  - A12SJ Case 7 (1L-32, 2L-32, S-32), with efficiency 63.99%
- There always exists at least one "case" for every permutation that allows execution at the higher plateau of efficiency. In particular, Case 1 (where only the store is offset by -32) seems to guarantee this.
- Cases 2, 4 and 6 guarantee execution at the lower plateau. They have in common that either or both of the loads are offset by -32 while the store isn't.
- For cases 0, 3, 5 and 7, it depends on the permutation.
Whence we may draw at least a few conclusions:
- Execution ports 2 and 3 really don't care which load address they generate and load from.
- Macro-op fusion of the add and jmp appears unimpacted by any permutation of the instructions (in particular under Case 1 offsetting), leading me to believe that @Evgeny Kluev's conclusion is incorrect: the distance of the add from the jne does not appear to impact their fusion. I'm reasonably certain now that the Haswell ROB handles this correctly.
  - What Evgeny was seeing (going from 12SAJ with efficiency 65% to the others with efficiency 49% within Case 0) was an effect due solely to the value of the addresses loaded and stored from, and not due to an inability of the core to macro-fuse the add and branch.
  - Further, macro-op fusion must be occurring at least some of the time, since the average loop time is 1.5 CC. If macro-op fusion did not occur, this would be 2 CC minimum.
- Having tested all valid and invalid permutations of instructions within the not-unrolled loop, we've seen nothing higher than 65.34%. This answers empirically with a "no" the question of whether it is possible to use the full bandwidth without unrolling.
I will hypothesize several possible explanations:
- We're seeing some weird perversion due to the value of the addresses relative to each other.
  - If so, then there would exist a set of offsets of x, y and z that would allow maximum throughput. Quick random tests on my part seem not to support this.
- We're seeing the loop run in one-two-step mode; the loop iterations alternate running in one clock cycle, then two.
- This could be macro-op fusion being affected by the decoders. From Agner Fog: "Fuseable arithmetic/logic instructions cannot be decoded in the last of the four decoders on Sandy Bridge and Ivy Bridge processors. I have not tested whether this also applies to the Haswell."
  - Alternately, every other clock cycle an instruction is issued to the "wrong" port, blocking the next iteration for one extra clock cycle. Such a situation would be self-correcting in the next clock cycle but would remain oscillatory.
  - If somebody has access to the Intel performance counters, he should look at the events UOPS_EXECUTED_PORT.PORT_[0-7]. If oscillation is not occurring, all ports that are used will be pegged equally during the relevant stretch of time; else if oscillation is occurring, there will be a 50% split. Especially important is to look at the ports Mysticial pointed out (0, 1, 6 and 7).
And here's what I think is not happening:
- I don't believe that the fused arithmetic+branch uop is blocking execution by going to port 0, since predicted-taken branches are sent exclusively to port 6 (see Agner Fog's Instruction Tables under Haswell -> Control transfer instructions). After a few iterations of the loop above, the branch predictor will learn that this branch is a loop and will always predict it as taken.
I believe this is a problem that will be solved with Intel's performance counters.