Portable loop unrolling with template parameter in C++ with GCC/ICC

I am working on a high-performance parallel computational fluid dynamics code that involves a lot of lightweight loops and therefore gains approximately 30% in performance if all important loops are fully unrolled.

This can be done easily for a fixed number of iterations by using compiler directives: #pragma GCC unroll (16) is recognized by both compilers I am targeting, the Intel C++ compiler ICC and GCC, while #pragma unroll (16) is sadly ignored by GCC. With ICC I can also use template parameters or preprocessor macros as the unroll factor (similar to what you can do with nvcc). For instance,

template <int N>
// ...
#pragma unroll (N)
for (int i = 0; i < N; ++i) {
// ...
}

or

#define N 16

#pragma unroll (N)
for (int i = 0; i < N; ++i) {
// ...
}

throw no error or warning when compiled with ICC (-Wall -w2 -w3), while the corresponding syntax #pragma GCC unroll (N) fails to compile with GCC 9.2.1 20191102 on Ubuntu 18.04 (-Wall -pedantic):

error: ‘#pragma GCC unroll’ requires an assignment-expression that evaluates to a non-negative integral constant less than 65535
#pragma GCC unroll (N)

Is somebody aware of a way to make loop unrolling based on a template parameter with compiler directives work in a portable way (at least working with GCC and ICC)? I actually only need full unrolling of the entire loop, so something like #pragma GCC unroll (all) would already help me a lot.

I am aware that there are more or less complex strategies for unrolling loops with template metaprogramming, but since the loops in my application may be nested and can contain more complicated loop bodies, I feel that such a strategy would over-complicate my code and reduce readability.
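For reference, here is a minimal sketch of the kind of template-based unrolling I mean, assuming C++17 (fold expressions). The names unroll, unrollImpl and sumFirst are made up for illustration and do not come from my actual code:

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: calls f(0), f(1), ..., f(N-1) by expanding an index
// pack, so the source contains no runtime loop at all.
template <typename F, std::size_t... Is>
constexpr void unrollImpl(F&& f, std::index_sequence<Is...>) {
  (f(Is), ...);  // C++17 fold expression over the comma operator
}

template <std::size_t N, typename F>
constexpr void unroll(F&& f) {
  unrollImpl(std::forward<F>(f), std::make_index_sequence<N>{});
}

// Hypothetical example body: sum of the first N entries of an array.
template <std::size_t N>
double sumFirst(const double* a) {
  double s = 0.0;
  unroll<N>([&](std::size_t i) { s += a[i]; });
  return s;
}
```

Whether this ends up as straight-line code still depends on the optimizer inlining the lambda calls, and every nested loop needs another lambda, which is exactly the clutter I would like to avoid.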


Solution 1:

Sadly, there currently does not seem to be a consistent way to do so.

I ended up using a preprocessor macro in combination with _Pragma(string-literal) and a stringification macro (similar to this). With the Intel C++ Compiler ICC or Clang, the macro unrolls by the template parameter; with GCC it falls back to a constant unroll factor; with any other compiler it expands to nothing. Here is a small example:

/// Helper macros for stringification
#define TO_STRING_HELPER(X)   #X
#define TO_STRING(X)          TO_STRING_HELPER(X)

// Define loop unrolling depending on the compiler
#if defined(__ICC) || defined(__ICL)
  #define UNROLL_LOOP(n)      _Pragma(TO_STRING(unroll (n)))
#elif defined(__clang__)
  #define UNROLL_LOOP(n)      _Pragma(TO_STRING(unroll (n)))
#elif defined(__GNUC__) && !defined(__clang__)
  #define UNROLL_LOOP(n)      _Pragma(TO_STRING(GCC unroll (16)))
#elif defined(_MSC_BUILD)
  #pragma message ("Microsoft Visual C++ (MSVC) detected: Loop unrolling not supported!")
  #define UNROLL_LOOP(n)
#else
  #warning "Unknown compiler: Loop unrolling not supported!"
  #define UNROLL_LOOP(n)
#endif

/// Example usage
template <int N>
void exampleContainingLoop() {
  UNROLL_LOOP(N)
  for (int i = 0; i < N; ++i) {
    // ...
  }
    
  return;
}
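
Since the question mentions nested loops, here is a slightly fuller usage sketch. It assumes the same macros as above, repeated in compact form so that the snippet compiles on its own; the kernel matVec is a hypothetical example and not part of the original code:

```cpp
// Stringification helpers and UNROLL_LOOP, condensed from the macros above
#define TO_STRING_HELPER(X) #X
#define TO_STRING(X) TO_STRING_HELPER(X)

#if defined(__ICC) || defined(__ICL) || defined(__clang__)
  #define UNROLL_LOOP(n) _Pragma(TO_STRING(unroll (n)))
#elif defined(__GNUC__)
  #define UNROLL_LOOP(n) _Pragma(TO_STRING(GCC unroll (16)))
#else
  #define UNROLL_LOOP(n)
#endif

// Hypothetical kernel: y = A * x for a dense, row-major N x N matrix.
// The macro has to sit directly in front of each for statement.
template <int N>
void matVec(const double* A, const double* x, double* y) {
  UNROLL_LOOP(N)
  for (int i = 0; i < N; ++i) {
    double s = 0.0;
    UNROLL_LOOP(N)
    for (int j = 0; j < N; ++j) {
      s += A[i * N + j] * x[j];
    }
    y[i] = s;
  }
}
```

Calling matVec<2>(A, x, y) with A = {1, 2, 3, 4} and x = {1, 1} fills y with {3, 7} on every compiler; only the unrolling hint differs. As far as I understand, the constant 16 in the GCC branch means that loops with more than 16 iterations are only partially unrolled there, so the constant should be chosen to cover the largest N that matters.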