What's the actual effect of successful unaligned accesses on x86?

I always hear that unaligned accesses are bad because they will either cause runtime errors and crash the program or slow memory accesses down. However, I can't find any actual data on how much they slow things down.

Suppose I'm on x86 and have some (as yet unknown) share of unaligned accesses - what's the worst slowdown actually possible, and how do I estimate it without eliminating all unaligned accesses and comparing the run times of the two versions of the code?


Solution 1:

It depends on the instruction(s). For most x86 SSE load/store instructions (the aligned variants, such as MOVAPS/MOVDQA), an unaligned address causes a fault, which means it'll probably crash your program or lead to lots of round trips to your exception handler (which means almost all performance is lost). The unaligned load/store variants (MOVUPS/MOVDQU) run at roughly double the cycle count, IIRC, as they perform partial reads/writes, so two operations are required to complete the access (unless you are lucky and it's in cache, which greatly reduces the penalty).
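
For illustration, here's a minimal C sketch with SSE2 intrinsics (the buffer name and offsets are made up for the example): the aligned load faults on a misaligned address, while the unaligned variant merely risks being slower.

    #include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_loadu_si128 */
    #include <stdint.h>

    _Alignas(16) static uint8_t buf[32];

    void demo(void)
    {
        __m128i a = _mm_load_si128((const __m128i *)buf);        /* MOVDQA: fine, address is 16-byte aligned */
        __m128i b = _mm_loadu_si128((const __m128i *)(buf + 1)); /* MOVDQU: works, but may be slower */
        /* _mm_load_si128((const __m128i *)(buf + 1)) would emit MOVDQA on a
           misaligned address and raise #GP, i.e. crash the program. */
        (void)a; (void)b;
    }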

For general x86 load/store instructions, the penalty is speed: more cycles are required to do the read or write. Misalignment may also affect caching, leading to cache-line splitting, where a single access straddles a cache-line boundary. It also breaks atomicity of reads and writes (which x86 guarantees for aligned reads/writes up to the natural word size; barriers and propagation are a separate matter, but using a LOCK'ed instruction on unaligned data may cause an exception or greatly increase the already massive penalty the bus lock incurs), which is a no-no for concurrent programming.
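
To make the atomicity point concrete, here's a sketch (the buffer layout and the offset 62 are made up; it assumes 64-byte cache lines): the 32-bit store below spans two cache lines, so another core may observe it half-written, whereas the same store to an aligned address would be atomic on x86.

    #include <stdint.h>

    _Alignas(64) static uint8_t buf[128];  /* assume 64-byte cache lines */

    void torn_write(void)
    {
        /* Illustration only (the cast technically violates strict aliasing):
           bytes 62..65 straddle the boundary between the first and second
           cache line, so this store is not atomic and a concurrent reader
           may see a torn value; a LOCK'ed RMW here would need a bus lock. */
        uint32_t *slot = (uint32_t *)(buf + 62);
        *slot = 0xDEADBEEFu;
    }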

Intel's x86 & x64 optimization manual goes into great detail about each of the aforementioned problems, their side effects and how to remedy them.

Agner Fog's optimization manuals should have the exact numbers you are looking for in terms of raw cycle throughput.

Solution 2:

On some Intel micro-architectures, a load that is split by a cacheline boundary takes a dozen cycles longer than usual, and a load that is split by a page boundary takes over 200 cycles longer. It's bad enough that if loads are going to be consistently misaligned in a loop, it's worth doing two aligned loads and merging the results manually, even if palignr is not an option. Even SSE's unaligned loads won't save you, unless they are split exactly down the middle.
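
As a sketch of that two-aligned-loads-and-merge idea (assuming SSSE3 is available and the misalignment offset is a compile-time constant, here 4 bytes; palignr requires a constant shift count):

    #include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (palignr) */
    #include <stdint.h>

    /* Load 16 bytes from p, where p is known to be exactly 4 bytes past a
       16-byte boundary, using two aligned loads merged with palignr. */
    static __m128i load_offset4(const uint8_t *p)
    {
        const __m128i *base = (const __m128i *)(p - 4);  /* 16-byte aligned */
        __m128i lo = _mm_load_si128(base);               /* bytes p[-4..11] */
        __m128i hi = _mm_load_si128(base + 1);           /* bytes p[12..27] */
        return _mm_alignr_epi8(hi, lo, 4);               /* bytes p[0..15]  */
    }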

On AMD CPUs this was never a problem, and the problem mostly disappeared with Nehalem, but there are still a lot of Core 2s out there too.

Solution 3:

Estimating speed on modern processors is extremely complicated in general, and not just for unaligned accesses.

Modern processors have pipelined architectures, out-of-order and possibly parallel execution of instructions, and many other features that may impact execution time.

If the unaligned access is not supported, you get an exception. But if it is supported, you may or may not get a slowdown depending on a lot of factors. These factors include which other instructions you were executing both before and after the unaligned one (because the processor may be able to start fetching your data while executing previous instructions, or to go ahead and perform subsequent instructions while it waits).

Another very important difference is whether the unaligned access crosses a cacheline boundary. While in general a 2x access to the cache may be needed for an unaligned access, the real slowdown comes when the access crosses a cacheline boundary and causes a double cache miss. In the worst possible case, a 2-byte unaligned read may require the processor to flush two cachelines out to memory and then read two cachelines back from memory. That's a whole lot of data moving.
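
If you want to see how often your accesses straddle a line, a small helper like this sketch can be instructive (the function name is made up; it assumes the common 64-byte line size):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Does an n-byte access at address p straddle a cache-line boundary?
       Assumes 64-byte lines, the common size on current x86 parts. */
    static bool crosses_cache_line(const void *p, size_t n)
    {
        const uintptr_t line = 64;
        uintptr_t a = (uintptr_t)p;
        return (a / line) != ((a + n - 1) / line);
    }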

The general rule for optimization also applies here: first code, then measure, and then, if and only if there is a problem, figure out a solution.
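
In that spirit, a toy measurement sketch (all names and sizes are made up; clock() is coarse, so treat the numbers as rough) comparing aligned against deliberately misaligned 8-byte reads could look like:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    enum { N = 1 << 20, ITERS = 200 };
    _Alignas(64) static uint8_t buf[N + 64];

    /* Sum 8-byte words starting at buf + offset; memcpy keeps the access
       well-defined regardless of alignment and compiles to a plain load. */
    static uint64_t sum_at(size_t offset)
    {
        uint64_t s = 0, v;
        for (size_t i = 0; i + 8 <= N; i += 8) {
            memcpy(&v, buf + offset + i, 8);
            s += v;
        }
        return s;
    }

    static double time_it(size_t offset)
    {
        uint64_t sink = 0;
        clock_t t0 = clock();
        for (int r = 0; r < ITERS; r++)
            sink += sum_at(offset);
        clock_t t1 = clock();
        fprintf(stderr, "sink=%llu\n", (unsigned long long)sink); /* keep the work alive */
        return (double)(t1 - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        printf("aligned:    %.3f s\n", time_it(0));
        printf("misaligned: %.3f s\n", time_it(61)); /* every 8th load straddles a 64-byte line */
        return 0;
    }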