Efficiently replace elements in array based on dictionary - NumPy / Python
First of all, my apologies if this has been answered elsewhere. All I could find were questions about replacing elements of a given value, not elements of multiple values.
background
I have several thousand large np.arrays, like so:
import numpy as np

# generate dummy data
input_array = np.zeros((100,100))
input_array[0:10,0:10] = 1
input_array[20:56, 21:43] = 5
input_array[34:43, 70:89] = 8
In those arrays, I want to replace values, based on a dictionary:
mapping = {1:2, 5:3, 8:6}
approach
At this time, I am using a simple loop, combined with fancy indexing:
output_array = np.zeros_like(input_array)
for key in mapping:
    output_array[input_array==key] = mapping[key]
problem
My arrays have dimensions of 2000 by 2000 and the dictionaries have around 1000 entries, so these loops take forever.
question
Is there a function that simply takes an array and a mapping in the form of a dictionary (or similar) and outputs the changed values?
Help is greatly appreciated!
Update:
Solutions:
I tested the individual solutions in IPython, using
%%timeit -r 10 -n 10
input data
import numpy as np
np.random.seed(123)
sources = range(100)
outs = list(range(100))
np.random.shuffle(outs)
mapping = {sources[a]: outs[a] for a in range(len(sources))}
For every solution:
np.random.seed(123)
input_array = np.random.randint(0,100, (1000,1000))
divakar, method 3:
%%timeit -r 10 -n 10
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))
mapping_ar = np.zeros(k.max()+1,dtype=v.dtype) #k,v from approach #1
mapping_ar[k] = v
out = mapping_ar[input_array]
5.01 ms ± 641 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
divakar, method 2:
%%timeit -r 10 -n 10
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))
sidx = k.argsort() #k,v from approach #1
k = k[sidx]
v = v[sidx]
idx = np.searchsorted(k,input_array.ravel()).reshape(input_array.shape)
idx[idx==len(k)] = 0
mask = k[idx] == input_array
out = np.where(mask, v[idx], 0)
56.9 ms ± 609 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
divakar, method 1:
%%timeit -r 10 -n 10
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))
out = np.zeros_like(input_array)
for key,val in zip(k,v):
    out[input_array==key] = val
113 ms ± 6.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
eelco:
%%timeit -r 10 -n 10
output_array = npi.remap(input_array.flatten(), list(mapping.keys()), list(mapping.values())).reshape(input_array.shape)
143 ms ± 4.47 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
yatu:
%%timeit -r 10 -n 10
keys, choices = list(zip(*mapping.items()))
conds = np.array(keys)[:,None,None] == input_array
np.select(conds, choices)
157 ms ± 5 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
original, loopy method:
%%timeit -r 10 -n 10
output_array = np.zeros_like(input_array)
for key in mapping:
    output_array[input_array==key] = mapping[key]
187 ms ± 6.44 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
Thanks for the superquick help!
Solution 1:
Approach #1 : Loopy one with array data
One approach would be to extract the keys and values into arrays and then use a similar loop:
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))
out = np.zeros_like(input_array)
for key,val in zip(k,v):
    out[input_array==key] = val
The benefit of this over the original version is the spatial locality of the array data, which allows for efficient data fetching within the iterations.

Also, since you mentioned having several thousand large np.arrays: if the mapping dictionary stays the same, the step to get the array versions k and v would be a one-time setup cost.
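As a minimal sketch of that one-time setup (the helper name make_replacer is my own, not from the answer), the key/value extraction can be factored out and reused across all arrays:

import numpy as np

def make_replacer(mapping):
    """Precompute k and v once; return a function to apply to many arrays."""
    k = np.array(list(mapping.keys()))
    v = np.array(list(mapping.values()))

    def apply(input_array):
        out = np.zeros_like(input_array)
        for key, val in zip(k, v):
            out[input_array == key] = val
        return out

    return apply

replace = make_replacer({1: 2, 5: 3, 8: 6})
# reuse on every array without rebuilding k and v:
# out = replace(input_array)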
Approach #2 : Vectorized one with searchsorted
A vectorized approach could use np.searchsorted:
sidx = k.argsort()  # k, v from approach #1; searchsorted needs sorted keys
k = k[sidx]
v = v[sidx]

# position of each input value within the sorted keys
idx = np.searchsorted(k, input_array.ravel()).reshape(input_array.shape)
idx[idx==len(k)] = 0  # clamp out-of-range positions to a valid index

# keep mapped values only where the key actually matches; everything else becomes 0
mask = k[idx] == input_array
out = np.where(mask, v[idx], 0)
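To make the fall-through behaviour concrete, here is a small self-contained run (my own toy input) where the value 0 is absent from the mapping and therefore ends up as 0 in the output:

import numpy as np

mapping = {1: 2, 5: 3, 8: 6}
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

input_array = np.array([[0, 1], [5, 8]])

sidx = k.argsort()
k, v = k[sidx], v[sidx]
idx = np.searchsorted(k, input_array.ravel()).reshape(input_array.shape)
idx[idx == len(k)] = 0
mask = k[idx] == input_array
out = np.where(mask, v[idx], 0)
print(out)  # [[0 2]
            #  [3 6]] -- 0 is not a key, so it stays 0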
Approach #3 : Vectorized one with mapping-array for integer keys
For integer keys, a vectorized approach can use a mapping array which, when indexed by the input array, leads directly to the final output:
# k, v from approach #1
mapping_ar = np.zeros(k.max()+1, dtype=v.dtype)  # lookup table with one slot per possible key
mapping_ar[k] = v                                # scatter values at their key positions
out = mapping_ar[input_array]                    # fancy indexing performs the replacement
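A self-contained run on a toy array (my own example data) shows how the lookup table works; values not present as keys fall back to the table's zero fill:

import numpy as np

mapping = {1: 2, 5: 3, 8: 6}
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

input_array = np.array([[0, 1], [5, 8]])

mapping_ar = np.zeros(k.max() + 1, dtype=v.dtype)  # length 9, zero-filled
mapping_ar[k] = v                                  # mapping_ar[1]=2, mapping_ar[5]=3, mapping_ar[8]=6
out = mapping_ar[input_array]
print(out)  # [[0 2]
            #  [3 6]]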
Solution 2:
I think Divakar's method #3 assumes that the mapping dict covers all values (or at least the maximum value) in the target array. Otherwise, to avoid index-out-of-range errors, you have to replace the line
mapping_ar = np.zeros(k.max()+1,dtype=v.dtype)
with
mapping_ar = np.zeros(input_array.max()+1,dtype=v.dtype)
That adds considerable overhead.
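A minimal sketch of that guard (my own variant: it sizes the table by the larger of the two maxima, since sizing by input_array.max() alone would break the mapping_ar[k] = v write whenever a key exceeds the data's maximum):

import numpy as np

mapping = {1: 2, 5: 3, 8: 6}
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

input_array = np.array([[0, 1], [5, 100]])  # 100 is not covered by the mapping

# size the lookup table by both the data's max and the keys' max
size = max(int(input_array.max()), int(k.max())) + 1
mapping_ar = np.zeros(size, dtype=v.dtype)
mapping_ar[k] = v
out = mapping_ar[input_array]  # unmapped values (0 and 100) come out as 0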
Solution 3:
Given that you're using numpy arrays, I'd suggest you do the mapping with numpy too. Here's a vectorized approach using np.select:
mapping = {1:2, 5:3, 8:6}
keys, choices = list(zip(*mapping.items()))
# [(1, 5, 8), (2, 3, 6)]
# we can use broadcasting to obtain a 3x100x100
# array to use as condlist
conds = np.array(keys)[:,None,None] == input_array
# use conds as arrays of conditions and the values
# as choices
np.select(conds, choices)
array([[2, 2, 2, ..., 0, 0, 0],
[2, 2, 2, ..., 0, 0, 0],
[2, 2, 2, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
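Note (my addition) that cells matching no condition become 0, which is np.select's default fill; a different scalar can be passed via the default argument:

# unmatched cells become -1 instead of 0
out = np.select(conds, choices, default=-1)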
Solution 4:
The numpy_indexed library (disclaimer: I am its author) provides functionality to implement this operation in an efficient vectorized manner:
import numpy_indexed as npi
output_array = npi.remap(input_array.flatten(), list(mapping.keys()), list(mapping.values())).reshape(input_array.shape)
Note: I didn't test it, but it should work along these lines. Efficiency should be good for large inputs and for many items in the mapping; I imagine similar to Divakar's method 2, though not as fast as his method 3. But this solution aims more at generality: it will also work for inputs which are not positive integers, or even nd-arrays (for instance, replacing colors in an image with other colors).
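As a sketch of that generality (my own untested example, mirroring the remap call pattern shown above), non-integer keys would look like this; all input values here appear among the keys, so nothing is assumed about how missing values are handled:

import numpy as np
import numpy_indexed as npi

# float keys instead of positive integers
input_array = np.array([1.5, 2.5, 1.5])
output = npi.remap(input_array, [1.5, 2.5], [9.0, 7.0])
# expected along the lines of: [9.0, 7.0, 9.0]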