Mis-aligned pointers on x86

Can someone provide an example were casting a pointer from one type to another fails due to mis-alignment?

In the comments to this answer, bothie states that doing something like

char * foo = ...;
int bar = *(int *)foo;

might lead to errors even on x86 if alignment-checking is enabled.

I tried to produce an error condition after setting the alignment-check flag via set $ps |= (1<<18) in GDB, but nothing happened.

What does a working (ie non-working ;)) example look like?

None of the code snippets from the answers fail on my system - I'll try it with a different compiler version and on a different pc later.

Btw, my own test code looked like this (now also using asm to set AC flag and unaligned read and write):

#include <assert.h>

int main(void)
{
    #ifndef NOASM
    __asm__(
        "pushf\n"
        "orl $(1<<18),(%esp)\n"
        "popf\n"
    );
    #endif

    volatile unsigned char foo[] = { 1, 2, 3, 4, 5, 6 };
    volatile unsigned int bar = 0;

    bar = *(int *)(foo + 1);
    assert(bar == 0x05040302);

    bar = *(int *)(foo + 2);
    assert(bar == 0x06050403);

    *(int *)(foo + 1) = 0xf1f2f3f4;
    assert(foo[1] == 0xf4 && foo[2] == 0xf3 && foo[3] == 0xf2 &&
        foo[4] == 0xf1);

    return 0;
}

The assertion passes without problems, even though the generated code definitely contains the unaligned access mov -0x17(%ebp), %edx and movl $0xf1f2f3f4,-0x17(%ebp).

So will setting AC trigger a SIGBUS or not? I couldn't get it to work on my Intel dual core laptop under Windows XP with none of the GCC versions I tested (MinGW-3.4.5, MinGW-4.3.0, Cygwin-3.4.4), whereas codelogic and Jonathan Leffler mentioned failures on x86...

Solution 1:

The situations are uncommon where unaligned access will cause problems on an x86 (beyond having the memory access take longer). Here are some of the ones I've heard about:

You might not count this as x86 issue, but SSE operations benefit from alignment. Aligned data can be used as a memory source operand to save instructions. Unaligned-load instructions like movups are slower than movaps on microarchitectures before Nehalem, but on Nehalem and later (and AMD Bulldozer-family), unaligned 16-byte loads/stores are about as efficient as unaligned 8-byte loads/stores; single uop and no penalty at all if the data happens to be aligned at runtime or doesn't cross a cache-line boundary, otherwise efficient hardware support for cache-line splits. 4k splits are very expensive (~100 cycles) until Skylake (down to ~10 cycles like a cache line split). See https://agner.org/optimize/ and performance links in the x86 tag wiki for more info.
interlocked operations (like lock add [mem], eax) are very slow if they aren't sufficiently aligned, especially if they cross a cache-line boundary so they can't just use a cache-lock inside the CPU core. On older (buggy) SMP systems, they might actually fail to be atomic (see https://blogs.msdn.com/oldnewthing/archive/2004/08/30/222631.aspx).
and another possibility discussed by Raymond Chen is when dealing with devices that have hardware banked memory (admittedly an oddball situation) - https://blogs.msdn.com/oldnewthing/archive/2004/08/27/221486.aspx
I recall (but don't have a reference for - so I'm not sure about this one) similar problems with unaligned accesses that straddle page boundaries that also involve a page fault. I'll see if I can dig up a reference for this.

And I learned something new when looking into this question (I was wondering about the "$ps |= (1<<18)" GDB command that was mentioned in a couple places). I didn't realize that x86 CPUs (starting with the 486 it seems) have the ability to cause an exception when a misaligned access is performed.

From Jeffery Richter's "Programming Applications for Windows, 4th Ed":

Let's take a closer look at how the x86 CPU handles data alignment. The x86 CPU contains a special bit flag in its EFLAGS register called the AC (alignment check) flag. By default, this flag is set to zero when the CPU first receives power. When this flag is zero, the CPU automatically does whatever it has to in order to successfully access misaligned data values. However, if this flag is set to 1, the CPU issues an INT 17H interrupt whenever there is an attempt to access misaligned data. The x86 version of Windows 2000 and Windows 98 never alters this CPU flag bit. Therefore, you will never see a data misalignment exception occur in an application when it is running on an x86 processor.

This was news to me.

Of course the big problem with misaligned accesses is that when you eventually go to compile the code for a non-x86/x64 processor you end up having to track down and fix a whole bunch of stuff, since virtually all other 32-bit or larger processors are sensitive to alignment issues.

Solution 2:

If you read up on the Core I7 architecture (specifically, their optimization literature), Intel has actually put a TON of hardware in there to make misaligned memory accesses nearly free. As far as I can tell, only a misalignment that crosses a cache line boundary has any extra cost at all - and even then it is minimal. AMD also has very little trouble with misaligned accesses (cycle-wise) as far as I remember (it's been a while though).

For what it's worth, I did set that flag in eflags (the AC bit - alignment check) when I was getting carried away optimizing a project that I was working on. It turns out that windows is FULL of misaligned accesses - so many that I wasn't able to locate any misaligned memory accesses in our code, I was bombarded with so many misaligned accesses in libraries and windows code that I didn't have time to continue.

Perhaps we can learn that when CPUs make things free or very low cost, programmers WILL become complacent and do things that have a little extra overhead. Perhaps Intel's engineers did some of that investigation, and found that typical x86 desktop software does millions of misaligned accesses per second, so they put incredibly fast misaligned access hardware in CoreI7.

HTH

Solution 3:

There is an additional condition, not mentioned, for EFLAGS.AC to actually take effect. CR0.AM must be set to prevent INT 17h from tripping on older OSes predating the 486 that have no handler for this exception. Unfortunately, Windows do not set it by default, you need to write a kernel-mode driver to set it.

Solution 4:

char *foo is probably aligned to int boundaries. Try this:

int bar = *(int *)(foo + 1);

Solution 5:

char *foo = "....";
foo++;
int *bar = (int *)foo;

The compiler would put foo on a word boundary, and then when you increment it it's at a word+1, which is invalid for a int pointer.