Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%

Solution 1:

IACA Analysis

Using IACA (the Intel Architecture Code Analyzer) reveals that macro-op fusion is indeed occurring, and that it is not the problem. It is Mysticial who is correct: The problem is that the store isn't using Port 7 at all.

IACA reports the following:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.55 Cycles       Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.5  | 1.5    1.0  | 1.5    1.0  | 1.0  | 0.0  | 1.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [rdi+rax*1]
|   2    | 0.5       | 0.5 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [rsi+rax*1]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     | CP | vmovaps ymmword ptr [rdx+rax*1], ymm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | add rax, 0x20
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffffec
Total Num Of Uops: 6

In particular, the reported block throughput of 1.55 cycles jibes well with the measured efficiency of roughly 65% (ideal is one iteration per cycle, and 1/1.55 ≈ 64.5%).
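As a quick sanity check on that number (my arithmetic, not IACA's): the block is 6 fused-domain uops per iteration (IACA's "Total Num Of Uops: 6", which already counts the macro-fused add/jnz pair as one), and the Haswell front end issues at most 4 fused-domain uops per cycle. The best it can sustain is therefore 6/4 = 1.5 cycles per iteration, i.e. 1/1.5 ≈ 66.7% of the ideal one-iteration-per-cycle pace.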

A post on IACA's own website about this very phenomenon (Tue, 03/11/2014 - 12:39) drew the following reply from an Intel employee later that day (Tue, 03/11/2014 - 23:20):

Port7 AGU can only work on stores with simple memory address (no index register). This is why the above analysis doesn't show port7 utilization.

This firmly settles why Port 7 wasn't being used.
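To make the restriction concrete, here is a minimal sketch (registers and offsets are illustrative, not taken from the benchmark) of which store forms qualify:

vmovaps [rdx+rax], ymm1   ; base+index: address must be generated on Port 2 or 3
vmovaps [rdx], ymm1       ; base only ("simple"): eligible for the Port 7 AGU
vmovaps [rdx+0x20], ymm1  ; base+displacement: also simple, also Port 7-eligible

The loop under test uses the first, indexed form for its store, so all three memory operations of every iteration must share the two AGUs on Ports 2 and 3.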

Now, contrast the above with a 32x unrolled loop (it turns out unroll16 should actually be called unroll32):

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 32.00 Cycles       Throughput Bottleneck: PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA, Port4, Port7

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 16.0   0.0  | 16.0 | 32.0   32.0 | 32.0   32.0 | 32.0 | 2.0  | 2.0  | 32.0 |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x20]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x20]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x20], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x40]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x40]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x40], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x60]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x60]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x60], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x80]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x80]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x80], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xa0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xa0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xa0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xc0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xc0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xc0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xe0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xe0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xe0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x100]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x100]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x100], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x120]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x120]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x120], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x140]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x140]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x140], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x160]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x160]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x160], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x180]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x180]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x180], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1e0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x200]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x200]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x200], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x220]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x220]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x220], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x240]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x240]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x240], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x260]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x260]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x260], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x280]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x280]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x280], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2e0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x300]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x300]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x300], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x320]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x320]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x320], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x340]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x340]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x340], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x360]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x360]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x360], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x380]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x380]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x380], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3e0], ymm1
|   1    |           |     |           |           |     | 1.0 |     |     |    | add r9, 0x400
|   1    |           |     |           |           |     |     | 1.0 |     |    | add r10, 0x400
|   1    |           |     |           |           |     | 1.0 |     |     |    | add r11, 0x400
|   1    |           |     |           |           |     |     | 1.0 |     |    | cmp r9, rcx
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffcaf
Total Num Of Uops: 164

We see here micro-fusion and correct scheduling of the store to Port 7.
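The enabler is the addressing mode: every store in the unrolled body has the form

vmovaps [r11+0x20], ymm1  ; base + displacement, no index register

which counts as a "simple memory address" in the sense of the Intel reply quoted above, so its address-generation uop can be scheduled on Port 7, leaving Ports 2 and 3 free for the two loads each cycle. The price is carrying three separate pointers (r9, r10 and r11) instead of one shared index, with the three pointer increments amortized over 32 unrolled iterations.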

Manual Analysis (see edit above)

I can now answer the second of your questions: "Is this possible without unrolling and if so how can it be done?" The answer is no.

I padded the arrays x, y and z on the left and right with plenty of buffer for the experiment below, and changed the inner loop to the following:

.L2:
vmovaps         ymm1, [rdi+rax] ; 1L
vmovaps         ymm0, [rsi+rax] ; 2L
vmovaps         [rdx+rax], ymm2 ; S1
add             rax, 32         ; ADD
jne             .L2             ; JMP
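For this loop to terminate, rax must start out negative. The setup isn't shown above, but a plausible version (an assumption on my part; x, y, z and LEN are placeholders) is the usual negative-index idiom:

lea             rdi, [x+LEN]    ; point LEN bytes past the start of each array
lea             rsi, [y+LEN]
lea             rdx, [z+LEN]
mov             rax, -LEN       ; counts up towards zero; jne falls through at the end

so that add rax, 32 both advances the index and sets the flags consumed by the jne.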

This intentionally does not use FMA (only loads and stores), and the load/store instructions have no dependencies on each other (note that the store reads ymm2, which neither load writes), so there should be no hazards of any kind preventing their issue to the execution ports.

I then tested every single permutation of the first and second loads (1L and 2L), the store (S1) and the add (A), always leaving the conditional jump (J) at the end: 4! = 24 permutations in total. For each of these I tested every possible combination of offsets of x, y and z by 0 or -32 bytes (to correct for the fact that reordering the add rax, 32 before one of the [reg+rax] addresses would cause the load or store to target the wrong address). The loop was aligned to 32 bytes. The tests were run on a 2.4 GHz i7-4700MQ with TurboBoost disabled by means of echo '0' > /sys/devices/system/cpu/cpufreq/boost under Linux, and using 2.4 for the frequency constant. Here are the efficiency results for all 24 permutations:

Cases: 0           1           2           3           4           5           6           7
       L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   
       -0  -0  -0  -0  -0  -32 -0  -32 -0  -0  -32 -32 -32 -0  -0  -32 -0  -32 -32 -32 -0  -32 -32 -32
       ________________________________________________________________________________________________
12SAJ: 65.34%      65.34%      49.63%      65.07%      49.70%      65.05%      49.22%      65.07%
12ASJ: 48.59%      64.48%      48.74%      49.69%      48.75%      49.69%      48.99%      48.60%
1A2SJ: 49.69%      64.77%      48.67%      64.06%      49.69%      49.69%      48.94%      49.69%
1AS2J: 48.61%      64.66%      48.73%      49.71%      48.77%      49.69%      49.05%      48.74%
1S2AJ: 49.66%      65.13%      49.49%      49.66%      48.96%      64.82%      49.02%      49.66%
1SA2J: 64.44%      64.69%      49.69%      64.34%      49.69%      64.41%      48.75%      64.14%
21SAJ: 65.33%*     65.34%      49.70%      65.06%      49.62%      65.07%      49.22%      65.04%
21ASJ: Hypothetically =12ASJ
2A1SJ: Hypothetically =1A2SJ
2AS1J: Hypothetically =1AS2J
2S1AJ: Hypothetically =1S2AJ
2SA1J: Hypothetically =1SA2J
S21AJ: 48.91%      65.19%      49.04%      49.72%      49.12%      49.63%      49.21%      48.95%
S2A1J: Hypothetically =S1A2J
SA21J: Hypothetically =SA12J
SA12J: 64.69%      64.93%      49.70%      64.66%      49.69%      64.27%      48.71%      64.56%
S12AJ: 48.90%      65.20%      49.12%      49.63%      49.03%      49.70%      49.21%*     48.94%
S1A2J: 49.69%      64.74%      48.65%      64.48%      49.43%      49.69%      48.66%      49.69%
A2S1J: Hypothetically =A1S2J
A21SJ: Hypothetically =A12SJ
A12SJ: 64.62%      64.45%      49.69%      64.57%      49.69%      64.45%      48.58%      63.99%
A1S2J: 49.72%      64.69%      49.72%      49.72%      48.67%      64.46%      48.95%      49.72%
AS21J: Hypothetically =AS12J
AS12J: 48.71%      64.53%      48.76%      49.69%      48.76%      49.74%      48.93%      48.69%

We can notice a few things from the table:

  • The results form several plateaux, but only two main ones: just under 50% and around 65%.
  • L1 and L2 can permute freely between each other without affecting the result.
  • Offsetting the accesses by -32 bytes can change efficiency.
  • The patterns we are interested in (Load 1, Load 2, Store 1 and Jump with the Add anywhere around them and the -32 offsets properly applied) are all the same, and all in the higher plateau:
    • 12SAJ Case 0 (No offsets applied), with efficiency 65.34% (the highest)
    • 12ASJ Case 1 (S-32), with efficiency 64.48%
    • 1A2SJ Case 3 (2L-32, S-32), with efficiency 64.06%
    • A12SJ Case 7 (1L-32, 2L-32, S-32), with efficiency 63.99%
  • There always exists at least one "case" for every permutation that allows execution at the higher plateau of efficiency. In particular, Case 1 (where S-32) seems to guarantee this; a reconstructed example of such a variant follows this list.
  • Cases 2, 4 and 6 guarantee execution at the lower plateau. They have in common that either or both of the loads are offset by -32 while the store isn't.
  • For cases 0, 3, 5 and 7, it depends on the permutation.
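To make the permutation/offset bookkeeping concrete, here is my reconstruction of 12ASJ under Case 1 (the tests applied the -32 by offsetting the z pointer; writing it as an explicit displacement is equivalent):

.L2:
vmovaps         ymm1, [rdi+rax]     ; 1L
vmovaps         ymm0, [rsi+rax]     ; 2L
add             rax, 32             ; A, hoisted above the store
vmovaps         [rdx+rax-32], ymm2  ; S1, -32 so it targets the same address as before
jne             .L2                 ; J

Per the table this variant runs at 64.48%, on the higher plateau despite the add sitting between the loads and the store.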

Whence we may draw at least a few conclusions:

  • Execution ports 2 and 3 really don't care which load address they generate and load from.
  • Macro-op fusion of the add and jmp appears unaffected by any permutation of the instructions (in particular under Case 1 offsetting), leading me to believe that @Evgeny Kluev's conclusion is incorrect: the distance of the add from the jne does not appear to impact their fusion. I'm reasonably certain now that the Haswell ROB handles this correctly.
    • What Evgeny was seeing (going from 12SAJ at 65% efficiency to the others at 49% within Case 0) was an effect due solely to the values of the addresses loaded from and stored to, and not to any inability of the core to macro-fuse the add and branch.
    • Further, macro-op fusion must be occurring at least some of the time, since the average loop time is 1.5 CC; if macro-op fusion never occurred, this would be at least 2 CC.
  • Having tested all valid and invalid permutations of instructions within the not-unrolled loop, we've seen nothing higher than 65.34%. This answers empirically with a "no" the question of whether it is possible to use the full bandwidth without unrolling.

I will hypothesize several possible explanations:

  • We're seeing some weird perversion due to the values of the addresses relative to each other.
    • If so then there would exist a set of offsets of x, y and z that would allow maximum throughput. Quick random tests on my part seem not to support this.
  • We're seeing the loop run in one-two-step mode: the loop iterations alternate between running in one clock cycle and in two.

    • This could be macro-op fusion being affected by the decoders. From Agner Fog:

      Fuseable arithmetic/logic instructions cannot be decoded in the last of the four decoders on Sandy Bridge and Ivy Bridge processors. I have not tested whether this also applies to the Haswell.

    • Alternatively, every other clock cycle an instruction is issued to the "wrong" port, blocking the next iteration for one extra clock cycle. Such a situation would correct itself on the next clock cycle but would remain oscillatory.
      • If somebody has access to the Intel performance counters, they should look at the events UOPS_EXECUTED_PORT.PORT_[0-7]. If oscillation is not occurring, all ports that are used will be pegged equally during the relevant stretch of time; if oscillation is occurring, there will be a 50% split. Especially important is to look at the ports Mysticial pointed out (0, 1, 6 and 7); a possible perf invocation is sketched below.
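For anyone wanting to run that check under Linux, a sketch of the invocation (the event names are my understanding of what perf exposes for these counters on Haswell; verify against perf list before trusting them):

perf stat -e uops_executed_port.port_0,uops_executed_port.port_1 \
          -e uops_executed_port.port_2,uops_executed_port.port_3 \
          -e uops_executed_port.port_4,uops_executed_port.port_5 \
          -e uops_executed_port.port_6,uops_executed_port.port_7 \
          ./tests_fma

Roughly equal steady-state counts on the busy ports would argue against oscillation; a systematic 50% split between alternating ports would support it.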

And here's what I think is not happening:

  • I don't believe that the fused arithmetic+branch uop is blocking execution by going to port 0, since predicted-taken branches are sent exclusively to port 6 (see Agner Fog's Instruction Tables under Haswell -> Control transfer instructions). After a few iterations of the loop above, the branch predictor will learn that this branch is a loop and will always predict it taken.

I believe this is a problem that will be solved with Intel's performance counters.