Arm Neon Intrinsics vs hand assembly

Solution 1:

My experience is that the intrinsics haven't really been worth the trouble. It's too easy for the compiler to inject extra register unload/load steps between your intrinsics. The effort to get it to stop doing that is more complicated than just writing the stuff in raw NEON. I've seen this kind of stuff in pretty recent compilers (including clang 3.1).

At this level, I find you really need to control exactly what's happening. You can have all kinds of stalls if you do things in just barely the wrong order. Doing it in intrinsics feels like surgery with welder's gloves on. If the code is so performance critical that I need intrinsics at all, then intrinsics aren't good enough. Maybe others have difference experiences here.

Solution 2:

I've had to use NEON intrinsics in several projects for portability. The truth is that GCC doesn't generate good code from NEON intrinsics. This is not a weakness of using intrinsics, but of the GCC tools. The ARM compiler from Microsoft produces great code from NEON intrinsics and there is no need to use assembly language in that case. Portability and practicality will dictate which you should use. If you can handle writing assembly language then write asm. For my personal projects I prefer to write time-critical code in ASM so that I don't have to worry about a buggy/inferior compiler messing up my code.

Update: The Apple LLVM compiler falls in between GCC (worst) and Microsoft (best). It doesn't do great with instruction interleaving nor optimal register usage, but at least it generates reasonable code (unlike GCC in some situations).

Update2: The Apple LLVM compiler for ARMv8 has been improved dramatically. It now does a great job generating ARMv8 code from C and intrinsics.

Solution 3:

So this question is four years old, now, and still shows up in search results...

In 2016 things are much better.

A lot of simple code that I've transcribed from assembly to intrinsics is now optimised better by the compilers than by me because I'm too lazy to do the pipeline work (for how many different pipelines now?), while the compilers just needs me to pass the right --mtune=.

For complex code where register allocation can get tight, GCC and Clang both can still produce slower than handwritten code by a factor of two... or three(ish). It's mostly on register spills, so you should know from the structure of your code whether that's a risk.

But they both sometimes have disappointing accidents. I'd say that right now that's worth the risk (although I'm paid to take risk), and if you do get hit by something then file a bug. That way things will keep on getting better.