Why is Python recursion so expensive and what can we do about it?
Solution 1:
The issue is that Python has an internal limit on number of recursive function calls.
That limit is configurable as shown in Quentin Coumes' answer. However, too deep a function chain will result in a stack overflow. This underlying limitation¹ applies to both C++ and Python. This limitation also applies to all function calls, not just recursive ones.
In general: You should not write² algorithms that have recursion depth growth with linear complexity or worse. Logarithmically growing recursion is typically fine. Tail-recursive functions are trivial to re-write as iterations. Other recursions may be converted to iteration using external data structures (usually, a dynamic stack).
A related rule of thumb is that you shouldn't have large objects with automatic storage. This is C++-specific since Python doesn't have the concept of automatic storage.
¹ The underlying limitation is the execution stack size. The default size differs between systems, and different function calls consume different amounts of memory, so the limit isn't specified as a number of calls but in bytes instead. This too is configurable on some systems. I wouldn't typically recommend touching that limit due to portability concerns.
² Exceptions to this rule of thumb are certain functional languages that guarantee tail-recursion elimination - such as Haskell - where that rule can be relaxed in case of recursions that are guaranteed to be eliminated. Python is not such a language, and the function in question isn't tail-recursive. While C++ compilers can perform the elimination as an optimization, it isn't guaranteed, and is typically not optimized in debug builds. Hence, the exception doesn't generally apply to C++ either.
Disclaimer: The following is my hypothesis; I don't actually know their rationale: The Python limit is probably a feature that detects potentially infinite recursions, preventing likely unsafe stack overflow crashes and substituting a more controlled RecursionError
.
Why are recursive calls so much cheaper in C++?
C++ is a compiled language. Python is interpreted. (Nearly) everything is cheaper in C++, except the translation from source code to an executable program.
Solution 2:
Let me first answer your direct questions:
Why are recursive calls so much cheaper in C++?
Because C++ has no limitation on recursive call depth, except the size of the stack. And being a fully compiled language, loops (including recursion) are much faster in C++ than in Python (the reason why special Python modules like numpy/scipy directly use C routines). Additionally, most C++ implementations use a special feature called tail recursion elimination (see later in this post) and optimize some recursive code into iterative equivalents. This is nice here but not guaranteed by the standard, so other compilations could result in a program crashing miserably - but tail recursion is probably not involved here.
If the recursion is too deep and exhausts the available stack, you will invoke the well-known Undefined Behaviour where anything can happen, from an immediate crash to a program giving wrong results (IMHO the latter is much worse and cannot be detected...)
Can we somehow modify the Python compiler to make it more receptive to recursion?
No. Python implementation explicitly never uses tail recursion elimination. You could increase the recursion limit, but this is almost always a bad idea (see later in this post why).
Now for the true explanation of the underlying rationale.
Deep recursion is evil, full stop. You should never use it. Recursion is a handy tool when you can make sure that the depth will stay in sane limits. Python uses a soft limit to warn the programmer that something is going wrong before crashing the system. On the other hand, optimizing C and C++ compilers often internally change tail recursion into an iterative loop. But relying on it is highly dangerous because a slight change could prevent that optimization and cause an application crash.
As found in this other SO post, common Python implementations do not implement that tail recursion elimination. So you should not use recursion at a 5000 depth but instead use an iterative algorithm.
As your underlying computation will need all Fibonacci numbers up to the specified one, it is not hard to iteratively compute them. Furthermore, it will be much more efficient!
Solution 3:
A solution is a trampoline: the recursive function, instead of calling another function, returns a function that makes that call with the appropriate arguments. There's a loop one level higher that calls all those functions in a loop until we have the final result. I'm probably not explaining it very well; you can find resources online that do a better job.
The point is that this converts recursion to iteration. I don't think this is faster, maybe it's even slower, but the recursion depth stays low.
An implementation could look like below. I split the pair x
into a
and b
for clarity. I then converted the recursive function to a version that keeps track of a
and b
as arguments, making it tail recursive.
def fib_acc(n, a, b):
if n == 1:
return (a, b)
return lambda: fib_acc(n - 1, (a+b) % 997, (a+2*b) % 997)
def fib(n):
x = fib_acc(n, 1, 2)
while callable(x):
x = x()
return x
if __name__=='__main__':
print(fib(50000)[0])
Solution 4:
"Both will find the answer 996 without issues"
I do see at least one issue : the answer should be 836, not 996.
It seems that both your functions calculate Fibonacci(2*n) % p
, and not Fibonacci(n) % p
.
996 is the result of Fibonacci(1000) % 997
.
Pick a more efficient algorithm
An inefficient algorithm stays an inefficient algorithm, regardless if it's written in C++ or Python.
In order to compute large Fibonacci numbers, there are much faster methods than simple recursion with O(n) calls (see related article).
For large n, this recursive O(log n) Python function should run in circles around your above C++ code:
from functools import lru_cache
@lru_cache(maxsize=None)
def fibonacci(n, p):
"Calculate Fibonacci(n) modulo p"
if n < 3:
return [0, 1, 1][n]
if n % 2 == 0:
m = n // 2
v1 = fibonacci(m - 1, p)
v2 = fibonacci(m, p)
return (2*v1 + v2) * v2 % p
else:
m = (n + 1) // 2
v1 = fibonacci(m, p) ** 2
v2 = fibonacci(m - 1, p) ** 2
return (v1 + v2) % p
print(fibonacci(500, 997))
#=> 836
print(fibonacci(1000, 997))
#=> 996
Try it online!
It will happily calculate fibonacci(10_000_000_000_000_000, 997)
.
It's possible to add recursion level as parameter, in order to see how deep the recursion needs to go, and display it with indentation. Here's an example for n=500
:
# Recursion tree:
500
249
124
61
30
14
6
2
3
1
2
7
4
15
8
31
16
62
125
63
32
250
Try it online!
Your examples would simply look like very long diagonals:
500
499
498
...
...
1
Solution 5:
For Windows executables, the stack size is specified in the header of the executable. For the Windows version of Python 3.7 x64, that size is 0x1E8480 or exactly 2.000.000 bytes.
That version crashes with
Process finished with exit code -1073741571 (0xC00000FD)
and if we look that up we find that it's a Stack Overflow.
What we can see on the (native) stack with a native debugger like WinDbg (enable child process debugging) is
[...]
fa 000000e9`6da1b680 00007fff`fb698a6e python37!PyArg_UnpackStack+0x371
fb 000000e9`6da1b740 00007fff`fb68b841 python37!PyEval_EvalFrameDefault+0x73e
fc 000000e9`6da1b980 00007fff`fb698a6e python37!PyArg_UnpackStack+0x371
fd 000000e9`6da1ba40 00007fff`fb68b841 python37!PyEval_EvalFrameDefault+0x73e
fe 000000e9`6da1bc80 00007fff`fb698a6e python37!PyArg_UnpackStack+0x371
ff 000000e9`6da1bd40 00007fff`fb68b841 python37!PyEval_EvalFrameDefault+0x73e
2:011> ? 000000e9`6da1bd40 - 000000e9`6da1ba40
Evaluate expression: 768 = 00000000`00000300
So Python will use 2 Stack frames per method call and there's an enormous 768 bytes difference in the stack positions.
If you modify that value inside the EXE (make a backup copy) with a hex editor to, let's say 256 MB
you can run the following code
[...]
if __name__=='__main__':
sys.setrecursionlimit(60000)
print(fib(50000)[0])
and it will give 151
as the answer.
In C++, we can also force a Stack Overflow, e.g. by passing 500.000 as the parameter. While debugging, we get
0:000> .exr -1
ExceptionAddress: 00961015 (RecursionCpp!fib+0x00000015)
ExceptionCode: c00000fd (Stack overflow)
[...]
0:000> k
[...]
fc 00604f90 00961045 RecursionCpp!fib+0x45 [C:\...\RecursionCpp.cpp @ 7]
fd 00604fb0 00961045 RecursionCpp!fib+0x45 [C:\...\RecursionCpp.cpp @ 7]
fe 00604fd0 00961045 RecursionCpp!fib+0x45 [C:\...\RecursionCpp.cpp @ 7]
ff 00604ff0 00961045 RecursionCpp!fib+0x45 [C:\...\RecursionCpp.cpp @ 7]
0:000> ? 00604ff0 - 00604fd0
Evaluate expression: 32 = 00000020
which is just 1 stack frame per method call and only 32 bytes difference on the stack. Compared to Python, C++ can do 768/32 = 24x more recursions for the same stack size.
My Microsoft compiler created the executable with the default stack size of 1 MB (Release build, 32 Bit):
The 64 bit version has a stack difference of 64 bit (also Release build).
Tools used:
- Microsoft WinDbg Preview (free)
- Sweetscape 010 Editor (commercial) with the template for PE files