numpy float: 10x slower than builtin in arithmetic operations?
I am getting really weird timings for the following code:
import numpy as np
s = 0
for i in range(10000000):
    s += np.float64(1)  # replace with np.float32 and built-in float
- built-in float: 4.9 s
- float64: 10.5 s
- float32: 45.0 s
Why is float64 twice as slow as float? And why is float32 five times slower than float64?
Is there any way to avoid the penalty of using np.float64, and have numpy functions return the built-in float instead of float64?
I found that using numpy.float64 is much slower than Python's float, and numpy.float32 is even slower (even though I'm on a 32-bit machine). I use numpy.float32 on my 32-bit machine; therefore, every time I use various numpy functions such as numpy.random.uniform, I convert the result to float32 (so that further operations are performed at 32-bit precision).
Is there any way to set a single variable somewhere in the program or on the command line, and make all numpy functions return float32 instead of float64?
EDIT #1:
numpy.float64 is 10 times slower than float in arithmetic calculations. It's so bad that even converting to float and back before the calculations makes the program run 3 times faster. Why? Is there anything I can do to fix it?
I want to emphasize that my timings are not due to any of the following:
- the function calls
- the conversion between numpy and python float
- the creation of objects
I updated my code to make it clearer where the problem lies. With the new code, I see a ten-fold performance hit from using numpy data types:
from datetime import datetime
import numpy as np
START_TIME = datetime.now()
# one of the following lines is uncommented before execution
#s = np.float64(1)
#s = np.float32(1)
#s = 1.0
for i in range(10000000):
    s = (s + 8) * s % 2399232
print(s)
print('Runtime:', datetime.now() - START_TIME)
The timings are:
- float64: 34.56s
- float32: 35.11s
- float: 3.53s
Just for the hell of it, I also tried:
from datetime import datetime
import numpy as np
START_TIME = datetime.now()
s = np.float64(1)
for i in range(10000000):
    s = float(s)
    s = (s + 8) * s % 2399232
    s = np.float64(s)
print(s)
print('Runtime:', datetime.now() - START_TIME)
The execution time is 13.28 s; it's actually 3 times faster to convert the float64 to float and back than to use it as is. Still, the conversion takes its toll, so overall it's more than 3 times slower compared to the pure-Python float.
My machine is:
- Intel Core 2 Duo T9300 (2.5GHz)
- WinXP Professional (32-bit)
- ActiveState Python 3.1.3.5
- Numpy 1.5.1
EDIT #2:
Thank you for the answers, they help me understand how to deal with this problem.
But I still would like to know the precise reason (based on the source code, perhaps) why the code below runs 10 times slower with float64 than with float.
EDIT #3:
I reran the code under Windows 7 x64 (Intel Core i7 930 @ 3.8 GHz).
Again, the code is:
from datetime import datetime
import numpy as np
START_TIME = datetime.now()
# one of the following lines is uncommented before execution
#s = np.float64(1)
#s = np.float32(1)
#s = 1.0
for i in range(10000000):
    s = (s + 8) * s % 2399232
print(s)
print('Runtime:', datetime.now() - START_TIME)
The timings are:
- float64: 16.1s
- float32: 16.1s
- float: 3.2s
Now both np floats (either 64- or 32-bit) are 5 times slower than the built-in float. Still a significant difference, and I'm trying to figure out where it comes from.
END OF EDITS
CPython floats are allocated in chunks
The key problem with comparing numpy scalar allocations to the float type is that CPython always allocates the memory for float and int objects in blocks of size N.
Internally, CPython maintains a linked list of blocks, each large enough to hold N float objects. When you call float(1), CPython checks whether there is space available in the current block; if not, it allocates a new block. Once it has space in the current block, it simply initializes that space and returns a pointer to it.
On my machine each block can hold 41 float objects, so there is some overhead for the first float(1) call, but the next 40 run much faster because the memory is already allocated and ready.
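To make the mechanism concrete, here is a toy pure-Python model of that block scheme (purely illustrative, not real CPython code, which does this in C; N = 41 is just the figure measured above):
# Toy model of CPython's block allocator for floats: memory is grabbed in
# blocks big enough for N objects, and creating a float just claims the
# next free slot in the current block.
N = 41  # slots per block, as measured on the machine above

class Block:
    def __init__(self):
        self.slots = [None] * N  # pre-allocated storage for N float objects
        self.used = 0            # number of slots already handed out

blocks = [Block()]  # the current block is always blocks[-1]

def make_float(value):
    block = blocks[-1]
    if block.used == N:          # current block is full: allocate a new one
        block = Block()
        blocks.append(block)
    block.slots[block.used] = value  # initialize the free slot
    block.used += 1
    return block.slots[block.used - 1]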
Slow numpy.float32 vs. numpy.float64
It appears that numpy has 2 paths it can take when creating a scalar type: fast and slow. This depends on whether the scalar type has a Python base class to which it can defer for argument conversion.
For some reason numpy.float32 is hard-coded to take the slower path (defined by the _WORK0 macro), while numpy.float64 gets a chance to take the faster path (defined by the _WORK1 macro). Note that scalartypes.c.src is a template which generates scalartypes.c at build time.
You can visualize this in Cachegrind. I've included screen captures showing how many more calls are made to construct a float32 vs. a float64:
[Cachegrind screenshot: float64 takes the fast path]
[Cachegrind screenshot: float32 takes the slow path]
Updated: Which type takes the slow or fast path may depend on whether the OS is 32-bit or 64-bit. On my test system, Ubuntu Lucid 64-bit, the float64 type is 10 times faster than float32.
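If you want to reproduce the construction-cost difference yourself, here is a minimal sketch using timeit (the absolute numbers will vary with the platform and NumPy version):
import timeit

# Time one million constructions of each scalar type.
for ctor in ('float(1)', 'np.float64(1)', 'np.float32(1)'):
    t = timeit.timeit(ctor, setup='import numpy as np', number=10**6)
    print(ctor, '->', round(t, 3), 's')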
Operating with Python objects in a heavy loop like that is always slow, whether they are float or np.float32. NumPy is fast for operations on vectors and matrices, because all of the operations are performed on big chunks of data by parts of the library written in C, not by the Python interpreter. Code that runs in the interpreter and/or uses Python objects is always slow, and using non-native types makes it even slower. That's to be expected.
If your app is slow and you need to optimize it, you should either convert your code to a vectorized solution that uses NumPy directly, or use tools such as Cython to create a fast C implementation of the loop.
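As a sketch of what a vectorized solution looks like, using the numpy.random.uniform example from the question (the array size here is arbitrary):
import numpy as np

# Scalar style from the question: one Python-level call and one scalar
# conversion per value, so every iteration pays interpreter overhead.
# values = [np.float32(np.random.uniform()) for _ in range(1000000)]

# Vectorized style: one call produces the whole array, and the cast to
# float32 happens once, inside the C library.
values = np.random.uniform(size=1000000).astype(np.float32)
total = values.sum()  # the reduction also runs in C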
Perhaps that is why you should use NumPy directly instead of writing loops:
import numpy as np

s1 = np.ones(10000000, dtype=np.float)
s2 = np.ones(10000000, dtype=np.float32)
s3 = np.ones(10000000, dtype=np.float64)

np.sum(s1)  # 17.3 ms
np.sum(s2)  # 15.8 ms
np.sum(s3)  # 17.3 ms
The answer is quite simple: the memory allocation might be part of it, but the biggest problem is that arithmetic operations on numpy scalars are done using "ufuncs", which are meant to be fast for several hundred values at once, not just one. There is some overhead in choosing the correct function to call and setting up the loops; that overhead is unnecessary for scalars.
It was easier to have the scalars be converted to 0-d arrays and then passed to the corresponding numpy ufunc than to write separate calculation methods for each of the many different scalar types that NumPy supports.
The intent was that optimized versions of the scalar math would be added to the type objects in C. This could still happen, but it never has, because no one has been motivated enough to do it. Possibly that's because the workaround is to convert numpy scalars to Python scalars, which do have optimized arithmetic.
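That workaround is essentially what the question's EDIT #1 code measures; in its simplest form it looks like this (a sketch reusing the question's benchmark):
import numpy as np

s = np.float64(1)
s = float(s)  # or s.item(); hand the value to Python's optimized float type
for i in range(10000000):
    s = (s + 8) * s % 2399232  # pure-Python float arithmetic from here on
print(s)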