In general, if you are facing a design situation that gives you a choice, use templates. I stressed the word design because I think what you need to focus on is the distinction between the use cases of std::function and templates, which are pretty different.

In general, the choice of templates is just an instance of a wider principle: try to specify as many constraints as possible at compile-time. The rationale is simple: if you can catch an error, or a type mismatch, even before your program is generated, you won't ship a buggy program to your customer.

Moreover, as you correctly pointed out, calls to template functions are resolved statically (i.e. at compile time), so the compiler has all the necessary information to optimize and possibly inline the code (which would not be possible if the call were performed through a vtable).

Yes, it is true that template support is not perfect, and C++11 is still lacking a support for concepts; however, I don't see how std::function would save you in that respect. std::function is not an alternative to templates, but rather a tool for design situations where templates cannot be used.

One such use case arises when you need to resolve a call at run-time by invoking a callable object that adheres to a specific signature, but whose concrete type is unknown at compile-time. This is typically the case when you have a collection of callbacks of potentially different types, but which you need to invoke uniformly; the type and number of the registered callbacks is determined at run-time based on the state of your program and the application logic. Some of those callbacks could be functors, some could be plain functions, some could be the result of binding other functions to certain arguments.

std::function and std::bind also offer a natural idiom for enabling functional programming in C++, where functions are treated as objects and get naturally curried and combined to generate other functions. Although this kind of combination can be achieved with templates as well, a similar design situation normally comes together with use cases that require to determine the type of the combined callable objects at run-time.

Finally, there are other situations where std::function is unavoidable, e.g. if you want to write recursive lambdas; however, these restrictions are more dictated by technological limitations than by conceptual distinctions I believe.

To sum up, focus on design and try to understand what are the conceptual use cases for these two constructs. If you put them into comparison the way you did, you are forcing them into an arena they likely don't belong to.


Andy Prowl has nicely covered design issues. This is, of course, very important, but I believe the original question concerns more performance issues related to std::function.

First of all, a quick remark on the measurement technique: The 11ms obtained for calc1 has no meaning at all. Indeed, looking at the generated assembly (or debugging the assembly code), one can see that VS2012's optimizer is clever enough to realize that the result of calling calc1 is independent of the iteration and moves the call out of the loop:

for (int i = 0; i < 1e8; ++i) {
}
calc1([](float arg){ return arg * 0.5f; });

Furthermore, it realises that calling calc1 has no visible effect and drops the call altogether. Therefore, the 111ms is the time that the empty loop takes to run. (I'm surprised that the optimizer has kept the loop.) So, be careful with time measurements in loops. This is not as simple as it might seem.

As it has been pointed out, the optimizer has more troubles to understand std::function and doesn't move the call out of the loop. So 1241ms is a fair measurement for calc2.

Notice that, std::function is able to store different types of callable objects. Hence, it must perform some type-erasure magic for the storage. Generally, this implies a dynamic memory allocation (by default through a call to new). It's well known that this is a quite costly operation.

The standard (20.8.11.2.1/5) encorages implementations to avoid the dynamic memory allocation for small objects which, thankfully, VS2012 does (in particular, for the original code).

To get an idea of how much slower it can get when memory allocation is involved, I've changed the lambda expression to capture three floats. This makes the callable object too big to apply the small object optimization:

float a, b, c; // never mind the values
// ...
calc2([a,b,c](float arg){ return arg * 0.5f; });

For this version, the time is approximately 16000ms (compared to 1241ms for the original code).

Finally, notice that the lifetime of the lambda encloses that of the std::function. In this case, rather than storing a copy of the lambda, std::function could store a "reference" to it. By "reference" I mean a std::reference_wrapper which is easily build by functions std::ref and std::cref. More precisely, by using:

auto func = [a,b,c](float arg){ return arg * 0.5f; };
calc2(std::cref(func));

the time decreases to approximately 1860ms.

I wrote about that a while ago:

http://www.drdobbs.com/cpp/efficient-use-of-lambda-expressions-and/232500059

As I said in the article, the arguments don't quite apply for VS2010 due to its poor support to C++11. At the time of the writing, only a beta version of VS2012 was available but its support for C++11 was already good enough for this matter.


With Clang there's no performance difference between the two

Using clang (3.2, trunk 166872) (-O2 on Linux), the binaries from the two cases are actually identical.

-I'll come back to clang at the end of the post. But first, gcc 4.7.2:

There's already a lot of insight going on, but I want to point out that the result of the calculations of calc1 and calc2 are not the same, due to in-lining etc. Compare for example the sum of all results:

float result=0;
for (int i = 0; i < 1e8; ++i) {
  result+=calc2([](float arg){ return arg * 0.5f; });
}

with calc2 that becomes

1.71799e+10, time spent 0.14 sec

while with calc1 it becomes

6.6435e+10, time spent 5.772 sec

that's a factor of ~40 in speed difference, and a factor of ~4 in the values. The first is a much bigger difference than what OP posted (using visual studio). Actually printing out the value a the end is also a good idea to prevent the compiler to removing code with no visible result (as-if rule). Cassio Neri already said this in his answer. Note how different the results are -- One should be careful when comparing speed factors of codes that perform different calculations.

Also, to be fair, comparing various ways of repeatedly calculating f(3.3) is perhaps not that interesting. If the input is constant it should not be in a loop. (It's easy for the optimizer to notice)

If I add a user supplied value argument to calc1 and 2 the speed factor between calc1 and calc2 comes down to a factor of 5, from 40! With visual studio the difference is close to a factor of 2, and with clang there is no difference (see below).

Also, as multiplications are fast, talking about factors of slow-down is often not that interesting. A more interesting question is, how small are your functions, and are these calls the bottleneck in a real program?

Clang:

Clang (I used 3.2) actually produced identical binaries when I flip between calc1 and calc2 for the example code (posted below). With the original example posted in the question both are also identical but take no time at all (the loops are just completely removed as described above). With my modified example, with -O2:

Number of seconds to execute (best of 3):

clang:        calc1:           1.4 seconds
clang:        calc2:           1.4 seconds (identical binary)

gcc 4.7.2:    calc1:           1.1 seconds
gcc 4.7.2:    calc2:           6.0 seconds

VS2012 CTPNov calc1:           0.8 seconds 
VS2012 CTPNov calc2:           2.0 seconds 

VS2015 (14.0.23.107) calc1:    1.1 seconds 
VS2015 (14.0.23.107) calc2:    1.5 seconds 

MinGW (4.7.2) calc1:           0.9 seconds
MinGW (4.7.2) calc2:          20.5 seconds 

The calculated results of all binaries are the same, and all tests were executed on the same machine. It would be interesting if someone with deeper clang or VS knowledge could comment on what optimizations may have been done.

My modified test code:

#include <functional>
#include <chrono>
#include <iostream>

template <typename F>
float calc1(F f, float x) { 
  return 1.0f + 0.002*x+f(x*1.223) ; 
}

float calc2(std::function<float(float)> f,float x) { 
  return 1.0f + 0.002*x+f(x*1.223) ; 
}

int main() {
    using namespace std::chrono;

    const auto tp1 = high_resolution_clock::now();

    float result=0;
    for (int i = 0; i < 1e8; ++i) {
      result=calc1([](float arg){ 
          return arg * 0.5f; 
        },result);
    }
    const auto tp2 = high_resolution_clock::now();

    const auto d = duration_cast<milliseconds>(tp2 - tp1);  
    std::cout << d.count() << std::endl;
    std::cout << result<< std::endl;
    return 0;
}

Update:

Added vs2015. I also noticed that there are double->float conversions in calc1,calc2. Removing them does not change the conclusion for visual studio (both are a lot faster but the ratio is about the same).


Different isn't the same.

It's slower because it does things that a template can't do. In particular, it lets you call any function that can be called with the given argument types and whose return type is convertible to the given return type from the same code.

void eval(const std::function<int(int)>& f) {
    std::cout << f(3);
}

int f1(int i) {
    return i;
}

float f2(double d) {
    return d;
}

int main() {
    std::function<int(int)> fun(f1);
    eval(fun);
    fun = f2;
    eval(fun);
    return 0;
}

Note that the same function object, fun, is being passed to both calls to eval. It holds two different functions.

If you don't need to do that, then you should not use std::function.