Why is numpy slower than for loop
Solution 1:
With NumPy, one must look to vectorize things and we could certainly do so here.
Taking a closer look at the loop portion, we are iterating along the first axis of the input data samples
twice with that loop startup :
for i, j in itertools.product(range(n_sample), range(n_sample)):
We could convert these iterations into vectorized operations, once we let broadcasting
handle those.
Now, to have a fully vectorized solution, we would need a lot more memory space, specifically (N,N,M)
, where (N,M)
is the shape of the input data.
Another noticeable aspect here is that at each iteration, we aren't doing a lot of work, as we are performing operation on each row and each row contains only 2
elements for the given sample. So, the idea that comes out is that we could run a loop along M
, such that at each iteration, we would compute the prod
and accumulate. Thus, for the given sample, its just two loop iterations.
Getting out of the loop, we would have the accumulated prod
, that just needs summation for disc2
as the final output.
Here's an implementation to fulfil the above mentioned ideas -
prod_arr = 1
for i in range(sample.shape[1]):
si = sample[:,i]
prod_arr *= 1 + 0.5 * np.abs(si[:,None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
0.5 * np.abs(si[:,None] - si)
disc2 = prod_arr.sum()
Runtime test
The stripped down version of the loopy portion from the original approach and the modified versions as approaches are listed below :
def org_app(sample):
disc2 = 0
n_sample = len(sample)
for i, j in itertools.product(range(n_sample), range(n_sample)):
disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * \
np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))
return disc2
def mod_app(sample):
prod_arr = 1
for i in range(sample.shape[1]):
si = sample[:,i]
prod_arr *= 1 + 0.5 * np.abs(si[:,None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
0.5 * np.abs(si[:,None] - si)
disc2 = prod_arr.sum()
return disc2
Timings and verification -
In [10]: sample = np.random.random_sample((100, 2))
In [11]: org_app(sample)
Out[11]: 11934.878683659041
In [12]: mod_app(sample)
Out[12]: 11934.878683659068
In [14]: %timeit org_app(sample)
10 loops, best of 3: 84.4 ms per loop
In [15]: %timeit mod_app(sample)
10000 loops, best of 3: 94.6 µs per loop
About 900x
speedup! Well this should be motivating enough, hopefully to look to vectorize things whenever possible.
Solution 2:
As I mentioned in the comments your solutions are not really optimal and it doesn't really make sense to compare not-ideal approaches.
For one thing iterating or indexing single elements of a NumPy array is really slow. I recently answered a question including a lot of details (if you're interested you might have a look at it: "convert np array to a set takes too long"). So the Python approach could be faster simply by converting the array
to a list
:
def func():
sample = np.random.random_sample((100, 2))
disc1 = 0
n_sample = len(sample)
dim = sample.shape[1]
sample = sample.tolist() # converted to list
for i in range(n_sample):
prod = 1
for item in sample[i]:
sub = abs(item - 0.5)
prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
disc1 += prod
disc2 = 0
for i, j in itertools.product(range(n_sample), range(n_sample)):
prod = 1
for k in range(dim):
a = 0.5 * abs(sample[i][k] - 0.5)
b = 0.5 * abs(sample[j][k] - 0.5)
c = 0.5 * abs(sample[i][k] - sample[j][k])
prod *= 1 + a + b - c
disc2 += prod
c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2
I also replaced the np.abs
calls with normal abs
. The normal abs
has a lower overhead! And also changed some other parts. In the end this is more than 10-20 times faster than your original "normal" approach.
I didn't have time to check the NumPy approach yet and @Divarkar already included a really good and optimized approach. Comparing the two approaches:
def func_numpy():
sample = np.random.random_sample((100, 2))
disc1 = 0
disc2 = 0
n_sample = len(sample)
dim = sample.shape[1]
disc1 = np.sum(np.prod(1 +
0.5 * np.abs(sample - 0.5) -
0.5 * np.abs(sample - 0.5) ** 2,
axis=1))
prod_arr = 1
for i in range(sample.shape[1]):
s0 = sample[:,i]
prod_arr *= (1 +
0.5 * np.abs(s0[:,None] - 0.5) +
0.5 * np.abs(s0 - 0.5) -
0.5 * np.abs(s0[:,None] - s0))
disc2 = prod_arr.sum()
c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2
print('Normal function time: ' ,
timeit.repeat('func()', number=20, repeat=3, setup="from __main__ import func"))
# Normal function time: [1.4846746248249474, 1.5018398493266432, 1.5476674017127152]
print('numpy function time: ',
timeit.repeat('func_numpy()', number=20, repeat=3, setup="from __main__ import func_numpy"))
# numpy function time: [0.020140038561976326, 0.016502230831292763, 0.016452520269695015]
So an optimized NumPy approach can definetly beat an "optimized" Python approach. It's almost 100 times faster. In case you want it even faster you could use numba on a slightly modified version of the pure python code:
import numba as nb
@nb.njit
def func_numba():
sample = np.random.random_sample((100, 2))
disc1 = 0
n_sample = len(sample)
dim = sample.shape[1]
for i in range(n_sample):
prod = 1
for item in sample[i]:
sub = abs(item - 0.5)
prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
disc1 += prod
disc2 = 0
for i in range(n_sample):
for j in range(n_sample):
prod = 1
for k in range(dim):
a = 0.5 * abs(sample[i,k] - 0.5)
b = 0.5 * abs(sample[j,k] - 0.5)
c = 0.5 * abs(sample[i,k] - sample[j,k])
prod *= 1 + a + b - c
disc2 += prod
return (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2
func_numba()
print('numba function time: ' ,
timeit.repeat('func_numba()', number=20, repeat=3, setup="from __main__ import func_numba"))
# numba function time: [0.003022848984983284, 0.0030429566279508435, 0.004060626777572907]
That's almost a factor 8-10 faster than the NumPy approach.