CPU dispatcher for Visual Studio for AVX and SSE

I work with two computers, one with AVX support and one without. It would be convenient to have my code detect the instruction set supported by the CPU at run time and choose the appropriate code path. I've followed Agner Fog's suggestions for making a CPU dispatcher (http://www.agner.org/optimize/#vectorclass). However, on my machine without AVX, compiling and linking the code in Visual Studio with AVX enabled causes it to crash when I run it.

For example, I have two source files: one compiled for the SSE2 instruction set that contains some SSE2 instructions, and another compiled for AVX that contains some AVX instructions. Even if my main function only references the SSE2 functions, the program still crashes merely because the build contains a source file compiled with AVX enabled. Any clues as to how I can fix this?

Edit: Okay, I think I isolated the problem. I'm using Agner Fog's vector class and I have defined three source files as:

//file sse2.cpp - compiled with /arch:SSE2
#include "vectorclass.h"
float func_sse2(const float* a) {
    Vec8f v1 = Vec8f().load(a);
    float sum = horizontal_add(v1);
    return sum;
}
//file avx.cpp - compiled with /arch:AVX
#include "vectorclass.h"
float func_avx(const float* a) {
    Vec8f v1 = Vec8f().load(a);
    float sum = horizontal_add(v1);
    return sum;
}
//file foo.cpp - compiled with /arch:SSE2
#include <stdio.h>
extern float func_sse2(const float* a);
extern float func_avx(const float* a);
int main() {
    float (*fp)(const float*a); 
    float a[] = {1,2,3,4,5,6,7,8};
    int iset = 6;
    if(iset>=7) { 
        fp = func_avx;  
    }
    else { 
        fp = func_sse2;
    }
    float sum = (*fp)(a);
    printf("sum %f\n", sum);
}

This crashes. If I instead use Vec4f in func_sse2 it does not crash. I don't understand this. I can use Vec8f with SSE2 by itself as long as I don't have another source file compiled with AVX. Agner Fog's manual says:

"There is no advantage in using the 256-bit floating point vector classes (Vec8f, Vec4d) unless the AVX instruction set is specified, but it can be convenient to use these classes anyway if the same source code is used with and without AVX. Each 256-bit vector will simply be split up into two 128-bit vectors when compiling without AVX."

However, when I have two source files that use Vec8f, one compiled with SSE2 and one compiled with AVX, I get a crash.

Edit2: I can get it to work from the command line

>cl -c sse2.cpp
>cl -c /arch:AVX avx.cpp
>cl foo.cpp sse2.obj avx.obj
>foo.exe

Edit3: This, however, crashes

>cl -c sse2.cpp
>cl -c /arch:AVX avx.cpp
>cl foo.cpp avx.obj sse2.obj
>foo.exe

Another clue. Apparently, the order of linking matters. It crashes if avx.obj is before sse2.obj but if sse2.obj is before avx.obj it does not crash. I'm not sure if it chooses the correct code path (I don't have access to my AVX system right now) but at least it does not crash.


I realise that this is an old question and that the person who asked it appears to be no longer around, but I hit the same problem yesterday. Here's what I worked out.

When compiled, both your sse2.cpp and avx.cpp files produce object files that contain not only your function but also any inline or template functions it requires (e.g. Vec8f::load). These functions are also compiled using the requested instruction set.

This means that your sse2.obj and avx.obj object files will both contain definitions of Vec8f::load, each compiled using its respective instruction set.

However, since the compiler treats Vec8f::load as externally visible, it puts it in a 'COMDAT' section of the object file with a 'selectany' (aka 'pick any') label. This tells the linker that if it sees multiple definitions of this symbol, for example in two different object files, it is allowed to pick any one it likes. (It does this to reduce duplicate code in the final executable, which would otherwise be inflated in size by multiple definitions of template and inline functions.)

The problem you are having is directly related to this in that the order of the object files passed to the linker is affecting which one it picks. Specifically here, it appears to be picking the first definition it sees.

If that is avx.obj, then the AVX-compiled version of Vec8f::load will always be used. This will crash on a machine that doesn't support that instruction set. On the other hand, if sse2.obj is first, then the SSE2-compiled version will always be used. This won't crash, but it will only use SSE2 instructions even if AVX is supported.

That this is the case can be seen if you look at the linker 'map' file output (produced using the /map option.) Here are the relevant (edited) excerpts -

//
// link with sse2.obj before avx.obj
//
0001:00000080  _main                             foo.obj
0001:00000330  ?func_sse2@@YAMPBM@Z              sse2.obj
0001:00000420  ??0Vec256fe@@QAE@XZ               sse2.obj
0001:00000440  ??0Vec4f@@QAE@ABT__m128@@@Z       sse2.obj
0001:00000470  ??0Vec8f@@QAE@XZ                  sse2.obj <-- sse2 version used
0001:00000490  ??BVec4f@@QBE?AT__m128@@XZ        sse2.obj
0001:000004c0  ?get_high@Vec8f@@QBE?AVVec4f@@XZ  sse2.obj
0001:000004f0  ?get_low@Vec8f@@QBE?AVVec4f@@XZ   sse2.obj
0001:00000520  ?load@Vec8f@@QAEAAV1@PBM@Z        sse2.obj <-- sse2 version used
0001:00000680  ?func_avx@@YAMPBM@Z               avx.obj
0001:00000740  ??BVec8f@@QBE?AT__m256@@XZ        avx.obj

//
// link with avx.obj before sse2.obj
//
0001:00000080  _main                             foo.obj
0001:00000270  ?func_avx@@YAMPBM@Z               avx.obj
0001:00000330  ??0Vec8f@@QAE@XZ                  avx.obj <-- avx version used
0001:00000350  ??BVec8f@@QBE?AT__m256@@XZ        avx.obj
0001:00000380  ?load@Vec8f@@QAEAAV1@PBM@Z        avx.obj <-- avx version used
0001:00000580  ?func_sse2@@YAMPBM@Z              sse2.obj
0001:00000670  ??0Vec256fe@@QAE@XZ               sse2.obj
0001:00000690  ??0Vec4f@@QAE@ABT__m128@@@Z       sse2.obj
0001:000006c0  ??BVec4f@@QBE?AT__m128@@XZ        sse2.obj
0001:000006f0  ?get_high@Vec8f@@QBE?AVVec4f@@XZ  sse2.obj
0001:00000720  ?get_low@Vec8f@@QBE?AVVec4f@@XZ   sse2.obj
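
To make the effect of link order concrete, here is a toy model of "first definition wins" for COMDAT symbols. It is only a sketch of the behaviour seen in the map excerpts above, not a description of how link.exe is implemented. The ObjectFile struct and the resolve_comdats function are made-up names for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy model only: an "object file" is a name plus the COMDAT symbols it defines.
struct ObjectFile {
    std::string name;
    std::vector<std::string> comdat_symbols;
};

// Map each COMDAT symbol to the first object file in link order that defines it.
// Later definitions of the same symbol are silently discarded.
std::map<std::string, std::string>
resolve_comdats(const std::vector<ObjectFile>& link_order) {
    std::map<std::string, std::string> chosen;
    for (const ObjectFile& obj : link_order) {
        for (const std::string& sym : obj.comdat_symbols) {
            chosen.emplace(sym, obj.name);  // emplace leaves an existing entry untouched
        }
    }
    return chosen;
}
```

If sse2.obj defines Vec8f::load and Vec8f::get_low while avx.obj defines only Vec8f::load, the model resolves Vec8f::load to whichever file comes first, whereas Vec8f::get_low comes from sse2.obj in both orders. That is the same pattern as in the two map files.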

As for fixing it, that's another matter. In this case, the following blunt hack should work by forcing the AVX version to have its own, differently named versions of the template functions. This will increase the size of the resulting executable, as it will contain multiple versions of the same function even where the SSE2 and AVX versions are identical.

// avx.cpp
namespace AVXWrapper {
#include "vectorclass.h"
}
using namespace AVXWrapper;

float func_avx(const float* a)
{
    ...
}

There are some important limitations though - (a) if the included file manages any form of global state it will no longer be truly global as you will have 2 'semi-global' versions, and (b) you won't be able to pass vectorclass variables as parameters between other code and functions defined in avx.cpp.
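
Both limitations can be seen in a small single-file sketch. The two namespaces below stand in for the same header included once per translation unit under different names. SSE2Wrapper, Vec and load_count are hypothetical names that have nothing to do with the real vectorclass.h.

```cpp
#include <type_traits>

// Stand-in for the header as seen by the SSE2 translation unit (hypothetical).
namespace SSE2Wrapper {
    struct Vec { float data[8]; };
    // "Global" state kept by the header: a function-local static counter.
    inline int& load_count() { static int n = 0; return n; }
}

// Stand-in for the same header wrapped in a namespace for the AVX translation unit.
namespace AVXWrapper {
    struct Vec { float data[8]; };
    inline int& load_count() { static int n = 0; return n; }
}

// Limitation (b): textually identical classes in different namespaces are
// unrelated types, so one cannot be passed where the other is expected.
static_assert(!std::is_same<SSE2Wrapper::Vec, AVXWrapper::Vec>::value,
              "the two Vec types are distinct");

// Limitation (a): each namespace has its own copy of the state.
// Returns true if bumping the AVX counter leaves the SSE2 counter unchanged.
inline bool counters_are_independent() {
    const int before = SSE2Wrapper::load_count();
    ++AVXWrapper::load_count();
    return SSE2Wrapper::load_count() == before;
}
```

The static_assert compiles only because the two Vec types differ, and counters_are_independent() returns true because the two counters are separate objects.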


The fact that the link order matters makes me think that there might be some kind of initialization code in the obj file. If the initialization code is communal, then only the first one is taken. I can't reproduce it, but you should be able to see it in an assembly listing (compile with /c /Fatestavx.asm).


Put the SSE and AVX functions in different .cpp files and be sure to compile the SSE version without /arch:AVX.
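
For the run-time selection itself, a minimal sketch could look like the following. It assumes the dispatcher file is compiled without /arch:AVX. The helper names cpu_has_avx and select_sum are made up, and func_sse2 / func_avx are plain C++ stand-ins here for the functions that would really live in the separately compiled files. The non-MSVC branch is only there so the sketch builds with other compilers.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#endif

// Returns true if AVX code can be run on this machine (hypothetical helper).
bool cpu_has_avx() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4];
    __cpuid(regs, 1);                                 // leaf 1: feature bits in ECX (regs[2])
    const bool osxsave = (regs[2] & (1 << 27)) != 0;  // OS uses XSAVE/XGETBV
    const bool avx     = (regs[2] & (1 << 28)) != 0;  // CPU implements AVX
    if (!osxsave || !avx) return false;
    // XCR0 bits 1 and 2: the OS preserves XMM and YMM registers.
    return (_xgetbv(0) & 0x6) == 0x6;
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;  // unknown compiler or non-x86 target: stay on the baseline path
#endif
}

// Stand-ins: in a real build these are defined in sse2.cpp and avx.cpp.
float func_sse2(const float* a) {
    float sum = 0.0f;
    for (int i = 0; i < 8; ++i) sum += a[i];
    return sum;
}
float func_avx(const float* a) {
    float sum = 0.0f;
    for (int i = 0; i < 8; ++i) sum += a[i];
    return sum;
}

using SumFn = float (*)(const float*);

// Pick the implementation once; callers keep the returned pointer.
SumFn select_sum() {
    return cpu_has_avx() ? func_avx : func_sse2;
}
```

Usage would be something like `SumFn fp = select_sum(); float sum = fp(a);`. Note that this alone does not avoid the COMDAT problem described above: any inline or template code shared by both files still needs to end up under different names.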