Load address calculation when using AVX2 gather instructions

The AVX2 intrinsics documentation lists gathered load instructions such as VPGATHERDD:

__m128i _mm_i32gather_epi32 (int const * base, __m128i index, const int scale);

What isn't clear to me from the documentation is whether the calculated load address is an element address or a byte address, i.e. is the load address for element i:

load_addr = base + index[i] * scale;               // (1) element addressing ?

or:

load_addr = (char *)base + index[i] * scale;       // (2) byte addressing ?

From the Intel docs it looks like it might be (2), but this doesn't make much sense given that the smallest element size for gathered loads is 32 bits. Why would you want to load from misaligned addresses (i.e. use scale < 4)?


Gather instructions do not have any alignment requirements, so it would be too restrictive not to allow byte addressing.

Another reason is consistency. With SIB addressing the result is obviously a byte address:

MOV eax, [rcx + rdx * 2]

Since VPGATHERDD is just a vectorized variant of this MOV instruction, we should not expect anything different with VSIB addressing:

VPGATHERDD ymm0, [rcx + ymm2 * 2], ymm3

As for a real-life use of byte addressing, consider a 24-bit color image in which the pixels are packed 3 bytes apart. We could load 8 pixels with a single VPGATHERDD instruction, but only if the "scale" field in VSIB is 1 (so that the indices can be the byte offsets 0, 3, 6, ...) and VPGATHERDD uses byte addressing.
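
To make the pixel case concrete, here is a portable scalar sketch of what such a gather would compute, assuming byte addressing with scale 1. It does not use the intrinsic itself; the names load_le32 and gather_rgb24_model are hypothetical helpers invented for this illustration. Each lane reads a full 32-bit value, so the fourth byte belongs to the next pixel and is masked off, and the buffer needs one byte of padding after the last pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read 4 bytes as a little-endian dword, the way an
 * x86 load would, without requiring any alignment. */
uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical scalar model of a scale-1 gather over packed 24-bit pixels.
 * Lane i uses index 3*i as a byte offset. The buffer must hold at least
 * 3*count + 1 bytes because every lane reads 4 bytes. */
void gather_rgb24_model(const uint8_t *pixels, uint32_t *out, int count)
{
    assert(pixels != NULL && out != NULL);
    for (int i = 0; i < count; i++) {
        int32_t index = 3 * i;                  /* per-lane index        */
        const uint8_t *addr = pixels + index * 1; /* scale == 1: bytes   */
        out[i] = load_le32(addr) & 0x00FFFFFFu; /* drop next pixel's byte */
    }
}
```

With element addressing (scale fixed at 4) lane i could only reach byte offsets 0, 4, 8, ..., so the same eight pixels could not be fetched in one gather.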


Judging by the description in Intel's AVX programming reference document available here, it looks like the gather instructions use byte addressing. Specifically, see the following quotes from the description of the VPGATHERDD instruction (on page 389):

DISP: optional 1, 2, 4 byte displacement;
DATA_ADDR = BASE_ADDR + (SignExtend(VINDEX[i+31:i])*SCALE + DISP;

Since you can use 1/2/4 byte displacements, I would assume that the overall memory address is a byte address. While it may not be a common application, there could be cases where you would want to read a 32- or 64-bit value from a misaligned address. That's one of the more flexible aspects of the x86 architecture compared to something like ARM: you can perform misaligned accesses if you want, instead of triggering a CPU exception as some other architectures do.
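
The quoted formula can be modelled in plain C for a single 32-bit lane. This is only a sketch under the assumption that the formula is evaluated in bytes, as argued above; gather_lane_model is a hypothetical name, not an Intel intrinsic, and memcpy stands in for the unaligned hardware load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of one gather lane:
 * DATA_ADDR = BASE_ADDR + SignExtend(index) * SCALE + DISP, all in bytes. */
int32_t gather_lane_model(const void *base, int32_t index, int scale,
                          int32_t disp)
{
    /* The scale field can only encode 1, 2, 4 or 8. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    const unsigned char *addr =
        (const unsigned char *)base + (ptrdiff_t)index * scale + disp;

    int32_t value;
    memcpy(&value, addr, sizeof value); /* no alignment requirement */
    return value;
}
```

For an int32_t table, scale 4 with index i reaches table[i], which is the interpretation (2) from the question with scale set to the element size. With scale 1 the index is a raw byte offset, so index 4*i reaches the same element, and any other index reads a misaligned dword.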