How much overhead is there in calling a function in C++?
A lot of literature talks about using inline functions to "avoid the overhead of a function call". However, I haven't seen quantifiable data. What is the actual overhead of a function call, i.e. what sort of performance increase do we achieve by inlining functions?
On most architectures, the cost consists of saving all (or some, or none) of the registers to the stack, pushing the function arguments onto the stack (or putting them in registers), adjusting the stack pointer, and jumping to the beginning of the new code. When the function is done, the saved registers have to be restored from the stack and control has to return to the caller. This webpage has a description of what's involved in the various calling conventions.
Most C++ compilers are smart enough now to inline functions for you. The inline keyword is just a hint to the compiler. Some will even do inlining across translation units where they decide it's helpful.
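To see the "hint" part in practice, here is a minimal sketch that assumes GCC or Clang. The noinline and always_inline attributes are compiler extensions rather than standard C or C++, and the function names (inc_hint, inc_call, inc_forced, and the sum_with_* helpers) are made up for illustration.

```c
typedef unsigned long ulong;

/* Plain "inline" is only a request: the compiler may still emit a call. */
static inline ulong inc_hint(ulong x) { return x + 1; }

/* GCC/Clang extension: never inline this function, so every use is a
   real call even with optimization on. Handy for measuring call cost. */
__attribute__((noinline)) static ulong inc_call(ulong x) { return x + 1; }

/* GCC/Clang extension: inline this function even when optimization is off. */
__attribute__((always_inline)) static inline ulong inc_forced(ulong x) { return x + 1; }

/* Each helper sums inc(i) for i in [0, cnt) using one of the variants. */
ulong sum_with_hint(ulong cnt) {
    ulong sum = 0;
    for (ulong i = 0; i < cnt; i++)
        sum += inc_hint(i);
    return sum;
}

ulong sum_with_call(ulong cnt) {
    ulong sum = 0;
    for (ulong i = 0; i < cnt; i++)
        sum += inc_call(i);
    return sum;
}

ulong sum_with_forced(ulong cnt) {
    ulong sum = 0;
    for (ulong i = 0; i < cnt; i++)
        sum += inc_forced(i);
    return sum;
}
```

All three helpers return the same sum for the same count. Only the generated code differs, which you can check by looking at the assembly your compiler produces.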
I made a simple benchmark against a simple increment function:
inc.c:
typedef unsigned long ulong;
ulong inc(ulong x){
    return x+1;
}
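main.c:
#include <stdio.h>
#include <stdlib.h>
typedef unsigned long ulong;
#ifdef EXTERN
/* Defined in inc.c, so every call crosses a translation unit boundary. */
ulong inc(ulong);
#else
/* Visible in this translation unit, so the compiler can inline it. */
static inline ulong inc(ulong x){
    return x+1;
}
#endif
int main(int argc, char** argv){
    if (argc < 2)
        return 1;
    ulong i, sum = 0, cnt;
    /* strtoul rather than atoi: cnt is unsigned long, and atoi only returns int. */
    cnt = strtoul(argv[1], NULL, 10);
    for(i=0;i<cnt;i++){
        sum+=inc(i);
    }
    printf("%lu\n", sum);
    return 0;
}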
Running it with a billion iterations on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz gave me:
- 1.4 seconds for the inlining version
- 4.4 seconds for the regularly linked version
(It appears to fluctuate by up to 0.2 seconds, but I'm too lazy to calculate proper standard deviations, nor do I care about them.)
This suggests that the overhead of a function call on this computer is about 3 nanoseconds: the 3 second difference between the two runs, spread over a billion calls.
The fastest operation I have measured on this machine took about 0.3ns, so a function call costs roughly as much as 9 primitive ops, to put it very simplistically.
This overhead increases by about another 2ns per call (total time per call about 6ns) for functions called through a PLT (procedure linkage table), i.e. functions in a shared library.
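For anyone who wants to redo the arithmetic with their own timings, here is a small sketch. The helper names call_overhead_ns and ns_to_cycles are made up, and the cycle conversion assumes the CPU really runs at its nominal clock rate (no turbo boost or frequency scaling), which may not hold on a laptop.

```c
/* Per-call overhead in nanoseconds, from the wall-clock times (in seconds)
   of the externally linked run and the inlined run of the same loop. */
double call_overhead_ns(double extern_secs, double inline_secs, double iterations) {
    return (extern_secs - inline_secs) / iterations * 1e9;
}

/* Convert nanoseconds to clock cycles at a given clock rate in GHz
   (1 GHz is one cycle per nanosecond). */
double ns_to_cycles(double ns, double clock_ghz) {
    return ns * clock_ghz;
}
```

With the numbers above, call_overhead_ns(4.4, 1.4, 1e9) comes out at about 3 ns, and ns_to_cycles(3.0, 2.27) at roughly 6.8 cycles per call.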