Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?

Is it legal to reinterpret_cast a float* to a __m256* and access float objects through a different pointer type?

constexpr size_t _m256_float_step_sz = sizeof(__m256) / sizeof(float);
alignas(__m256) float stack_store[100 * _m256_float_step_sz ]{};
__m256& hwvec1 = *reinterpret_cast<__m256*>(&stack_store[0 * _m256_float_step_sz]);

using arr_t = float[_m256_float_step_sz];
arr_t& arr1 = *reinterpret_cast<float(*)[_m256_float_step_sz]>(&hwvec1);

Do hwvec1 and arr1 depend on undefined behaviors?

Do they violate strict aliasing rules? [basic.lval]/11

Or there is only one defined way of intrinsic:

__m256 hwvec2 = _mm256_load_ps(&stack_store[0 * _m256_float_step_sz]);
_mm256_store_ps(&stack_store[1 * _m256_float_step_sz], hwvec2);

godbolt

Solution 1:

ISO C++ doesn't define __m256, so we need to look at what does define their behaviour on the implementations that support them.

Intel's intrinsics define vector-pointers like __m256* as being allowed to alias anything else, the same way ISO C++ defines char* as being allowed to alias.

So yes, it's safe to dereference a __m256* instead of using a _mm256_load_ps() aligned-load intrinsic.

But especially for float/double, it's often easier to use the intrinsics because they take care of casting from float*, too. For integers, the AVX512 load/store intrinsics are defined as taking void*, but before that you need an extra (__m256i*) which is just a lot of clutter.

In gcc, this is implemented by defining __m256 with a may_alias attribute: from gcc7.3's avxintrin.h (one of the headers that <immintrin.h> includes):

/* The Intel API is flexible enough that we must allow aliasing with other
   vector types, and their scalar components.  */
typedef float __m256 __attribute__ ((__vector_size__ (32),
                                     __may_alias__));
typedef long long __m256i __attribute__ ((__vector_size__ (32),
                                          __may_alias__));
typedef double __m256d __attribute__ ((__vector_size__ (32),
                                       __may_alias__));

/* Unaligned version of the same types.  */
typedef float __m256_u __attribute__ ((__vector_size__ (32),
                                       __may_alias__,
                                       __aligned__ (1)));
typedef long long __m256i_u __attribute__ ((__vector_size__ (32),
                                            __may_alias__,
                                            __aligned__ (1)));
typedef double __m256d_u __attribute__ ((__vector_size__ (32),
                                         __may_alias__,
                                         __aligned__ (1)));

(In case you were wondering, this is why dereferencing a __m256* is like _mm256_store_ps, not storeu.)

GNU C native vectors without may_alias are allowed to alias their scalar type, e.g. even without the may_alias, you could safely cast between float* and a hypothetical v8sf type. But may_alias makes it safe to load from an array of int[], char[], or whatever.

I'm talking about how GCC implements Intel's intrinsics only because that's what I'm familiar with. I've heard from gcc developers that they chose that implementation because it was required for compatibility with Intel.

Other behaviour Intel's intrinsics require to be defined

Using Intel's API for _mm_storeu_si128( (__m128i*)&arr[i], vec); requires you to create potentially-unaligned pointers which would fault if you deferenced them. And _mm_storeu_ps to a location that isn't 4-byte aligned requires creating an under-aligned float*.

Just creating unaligned pointers, or pointers outside an object, is UB in ISO C++, even if you don't dereference them. I guess this allows implementations on exotic hardware which do some kinds of checks on pointers when creating them (possibly instead of when dereferencing), or maybe which can't store the low bits of pointers. (I have no idea if any specific hardware exists where more efficient code is possible because of this UB.)

But implementations which support Intel's intrinsics must define the behaviour, at least for the __m* types and float*/double*. This is trivial for compilers targeting any normal modern CPU, including x86 with a flat memory model (no segmentation); pointers in asm are just integers kept in the same registers as data. (m68k has address vs. data registers, but it never faults from keeping bit-patterns that aren't valid addresses in A registers, as long as you don't deref them.)

Going the other way: element access of a vector.

Note that may_alias, like the char* aliasing rule, only goes one way: it is not guaranteed to be safe to use int32_t* to read a __m256. It might not even be safe to use float* to read a __m256. Just like it's not safe to do char buf[1024]; int *p = (int*)buf;.

Reading/writing through a char* can alias anything, but when you have a char object, strict-aliasing does make it UB to read it through other types. (I'm not sure if the major implementations on x86 do define that behaviour, but you don't need to rely on it because they optimize away memcpy of 4 bytes into an int32_t. You can and should use memcpy to express an unaligned load from a char[] buffer, because auto-vectorization with a wider type is allowed to assume 2-byte alignment for int16_t*, and make code that fails if it's not: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?)

To insert/extract vector elements, use shuffle intrinsics, SSE2 _mm_insert_epi16 / _mm_extract_epi16 or SSE4.1 insert / _mm_extract_epi8/32/64. For float, there are no insert/extract intrinsics that you should use with scalar float.

Or store to an array and read the array. (print a __m128i variable). This does actually optimize away to vector extract instructions.

GNU C vector syntax provides the [] operator for vectors, like __m256 v = ...; v[3] = 1.25;. MSVC defines vector types as a union with a .m128_f32[] member for per-element access.

There are wrapper libraries like Agner Fog's (GPL licensed) Vector Class Library which provide portable operator[] overloads for their vector types, and operator + / - / * / << and so on. It's quite nice, especially for integer types where having different types for different element widths make v1 + v2 work with the right size. (GNU C native vector syntax does that for float/double vectors, and defines __m128i as a vector of signed int64_t, but MSVC doesn't provide operators on the base __m128 types.)

You can also use union type-punning between a vector and an array of some type, which is safe in ISO C99, and in GNU C++, but not in ISO C++. I think it's officially safe in MSVC, too, because I think the way they define __m128 as a normal union.

There's no guarantee you'll get efficient code from any of these element-access methods, though. Do not use inside inner loops, and have a look at the resulting asm if performance matters.

Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?

Solution 1:

Other behaviour Intel's intrinsics require to be defined

Going the other way: element access of a vector.

Related

Recent Posts