Bin elements per row - Vectorized 2D Bincount for NumPy
I have a NumPy array with integer values. Values of matrix range from 0 to max element in matrix(in other words, all numbers from 0 to max data element presented in it). I need to build effective( effective means fast fully-vectorized solution) for searching number of elements in each row and encode them according to matrix values.
I could not find a similar question, or a question that somehow helped to solve this.
So if i have this data
in input:
# shape is (N0=4, m0=4)
1 1 0 4
2 4 2 1
1 2 3 5
4 4 4 1
desired output is :
# shape(N=N0, m=data.max()+1):
1 2 0 0 1 0
0 1 2 0 1 0
0 1 1 1 0 1
0 1 0 0 3 0
I know how to solve this by simply counting unique values in each row of data
iterating one by one, and then combining results taking in account all possible values in data
array.
While using NumPy for vectorizing this the key problem is that searching each number one by one is slow and assuming that there are a lot of unique numbers presented, this can not be effective solution. Generally both N
and unique numbers count is rather large(by the way, N
seem to be larger than unique numbers count).
Has somebody have great ideas?)
Well that's basically what does np.bincount
does with 1D
arrays. But, we need to use it on each row iteratively (thinking of it simply). To make it vectorized, we could offset each row by that max number. The idea is to have different bins for each row such that they are not affected by other row elements with same numbers.
Hence, the implementation would be -
# Vectorized solution
def bincount2D_vectorized(a):
N = a.max()+1
a_offs = a + np.arange(a.shape[0])[:,None]*N
return np.bincount(a_offs.ravel(), minlength=a.shape[0]*N).reshape(-1,N)
Sample run -
In [189]: a
Out[189]:
array([[1, 1, 0, 4],
[2, 4, 2, 1],
[1, 2, 3, 5],
[4, 4, 4, 1]])
In [190]: bincount2D_vectorized(a)
Out[190]:
array([[1, 2, 0, 0, 1, 0],
[0, 1, 2, 0, 1, 0],
[0, 1, 1, 1, 0, 1],
[0, 1, 0, 0, 3, 0]])
Numba Tweaks
We can bring in numba
for further speedups. Now, numba
allows few tweaks.
First off, it allows JIT compilation.
Also, recently they had introduced experimental
parallel
that automatically parallelizes operations in the function known to have parallel semantics.Final tweak would be to use
prange
as a subsititute forrange
. The docs state that this runs loops in parallel, similar to OpenMP parallel for loops and Cython’s prange.prange
performs well with larger datasets, which probably is because of the overhead needed to setup the parallel work.
So, with these new two tweaks along with the njit
for no-Python mode, we would have three variants -
# Numba solutions
def bincount2D_numba(a, use_parallel=False, use_prange=False):
N = a.max()+1
m,n = a.shape
out = np.zeros((m,N),dtype=int)
# Choose fucntion based on args
func = bincount2D_numba_func0
if use_parallel:
if use_prange:
func = bincount2D_numba_func2
else:
func = bincount2D_numba_func1
# Run chosen function on input data and output
func(a, out, m, n)
return out
@njit
def bincount2D_numba_func0(a, out, m, n):
for i in range(m):
for j in range(n):
out[i,a[i,j]] += 1
@njit(parallel=True)
def bincount2D_numba_func1(a, out, m, n):
for i in range(m):
for j in range(n):
out[i,a[i,j]] += 1
@njit(parallel=True)
def bincount2D_numba_func2(a, out, m, n):
for i in prange(m):
for j in prange(n):
out[i,a[i,j]] += 1
For completeness and testing out later on, the loopy version would be -
# Loopy solution
def bincount2D_loopy(a):
N = a.max()+1
m,n = a.shape
out = np.zeros((m,N),dtype=int)
for i in range(m):
out[i] = np.bincount(a[i], minlength=N)
return out
Runtime test
Case #1 :
In [312]: a = np.random.randint(0,100,(100,100))
In [313]: %timeit bincount2D_loopy(a)
...: %timeit bincount2D_vectorized(a)
...: %timeit bincount2D_numba(a, use_parallel=False, use_prange=False)
...: %timeit bincount2D_numba(a, use_parallel=True, use_prange=False)
...: %timeit bincount2D_numba(a, use_parallel=True, use_prange=True)
10000 loops, best of 3: 115 µs per loop
10000 loops, best of 3: 36.7 µs per loop
10000 loops, best of 3: 22.6 µs per loop
10000 loops, best of 3: 22.7 µs per loop
10000 loops, best of 3: 39.9 µs per loop
Case #2 :
In [316]: a = np.random.randint(0,100,(1000,1000))
In [317]: %timeit bincount2D_loopy(a)
...: %timeit bincount2D_vectorized(a)
...: %timeit bincount2D_numba(a, use_parallel=False, use_prange=False)
...: %timeit bincount2D_numba(a, use_parallel=True, use_prange=False)
...: %timeit bincount2D_numba(a, use_parallel=True, use_prange=True)
100 loops, best of 3: 2.97 ms per loop
100 loops, best of 3: 3.54 ms per loop
1000 loops, best of 3: 1.83 ms per loop
100 loops, best of 3: 1.78 ms per loop
1000 loops, best of 3: 1.4 ms per loop
Case #3 :
In [318]: a = np.random.randint(0,1000,(1000,1000))
In [319]: %timeit bincount2D_loopy(a)
...: %timeit bincount2D_vectorized(a)
...: %timeit bincount2D_numba(a, use_parallel=False, use_prange=False)
...: %timeit bincount2D_numba(a, use_parallel=True, use_prange=False)
...: %timeit bincount2D_numba(a, use_parallel=True, use_prange=True)
100 loops, best of 3: 4.01 ms per loop
100 loops, best of 3: 4.86 ms per loop
100 loops, best of 3: 3.21 ms per loop
100 loops, best of 3: 3.18 ms per loop
100 loops, best of 3: 2.45 ms per loop
Seems like the numba
variants are performing very well. Choosing one out of the three variants would depend on the input array shape parameters and to some extent on the number of unique elements in it.