NumPy: keep only the rows whose values all fall between the 5th and 95th percentiles
I'm looking for a Pythonic way to select rows from a 2D dataset of shape (nrows, ncols): I want to keep only those rows in which every value falls between the 5th and 95th percentile of its column. Calling np.percentile(dataset, 5, axis=0) returns an array of ncols values, one per column.
In the case of 1D arrays, writing something like X[X>0] is trivial. What is the approach when you want to generalize to 2D or higher dimensions? For example, something like:
X[X>np.percentile(dataset, 5, axis=0)]
Solution 1:
If I understand correctly, in your 2D example you can use np.all() to find the rows where the criterion is satisfied in every column. You can then index with the resulting boolean mask, using the same syntax as X[X>0] (see below for an example).
I am not sure how to generalize to higher dimensions, but maybe np.take (https://numpy.org/doc/stable/reference/generated/numpy.take.html) is what you are looking for?
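For higher dimensions, one possible approach is to keep the same boolean-mask idea and flatten every axis except the first before calling all(). This is only a sketch: it assumes that "rows" means slices along axis 0 and that the percentiles are taken along axis 0 for each remaining position. The function name keep_rows_within_percentiles and its parameters lo_q and hi_q are made up for illustration.

```python
import numpy as np

def keep_rows_within_percentiles(X, lo_q=5, hi_q=95):
    """Keep slices along axis 0 whose values all lie strictly between
    the lo_q-th and hi_q-th percentiles (hypothetical helper)."""
    X = np.asarray(X)
    # Percentiles along axis 0; the results have shape X.shape[1:]
    lo = np.percentile(X, lo_q, axis=0)
    hi = np.percentile(X, hi_q, axis=0)
    # Broadcasting compares every slice X[i] against lo and hi
    inside = (X > lo) & (X < hi)
    # Collapse all trailing axes so each slice yields a single True/False
    keep = inside.reshape(X.shape[0], -1).all(axis=1)
    return X[keep]

# Usage with a 3D array: 50 "rows", each of shape (3, 4)
rng = np.random.default_rng(0)
X3 = rng.normal(size=(50, 3, 4))
filtered = keep_rows_within_percentiles(X3)
print(filtered.shape)
```

For a 2D input this reduces to the np.all(..., axis=1) approach, since the reshape leaves the array unchanged.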
2D example:
# Setup
import numpy as np
np.random.seed(100)
dataset = np.random.normal(size=(10,2))
display(dataset)
array([[-1.74976547, 0.3426804 ],
[ 1.1530358 , -0.25243604],
[ 0.98132079, 0.51421884],
[ 0.22117967, -1.07004333],
[-0.18949583, 0.25500144],
[-0.45802699, 0.43516349],
[-0.58359505, 0.81684707],
[ 0.67272081, -0.10441114],
[-0.53128038, 1.02973269],
[-0.43813562, -1.11831825]])
# Indexing
lo = np.percentile(dataset, 5, axis=0)   # 5th percentile of each column
hi = np.percentile(dataset, 95, axis=0)  # 95th percentile of each column
idx = (lo < dataset) & (hi > dataset)    # 2D boolean mask, one entry per value
dataset[np.all(idx, axis=1)]             # np.all reduces it to a 1D row mask
array([[ 0.98132079, 0.51421884],
[ 0.22117967, -1.07004333],
[-0.18949583, 0.25500144],
[-0.45802699, 0.43516349],
[-0.58359505, 0.81684707],
[ 0.67272081, -0.10441114]])
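To see why the np.all step matters, it may help to compare indexing with the element-wise 2D mask against indexing with the reduced row mask. The snippet below is a small illustration using the same seed as the example; the variable names elementwise, row_mask, flat and rows are my own.

```python
import numpy as np

np.random.seed(100)
dataset = np.random.normal(size=(10, 2))

# Passing a list of percentiles returns one row per requested percentile
lo, hi = np.percentile(dataset, [5, 95], axis=0)  # each has shape (ncols,)

elementwise = (dataset > lo) & (dataset < hi)     # shape (nrows, ncols)
row_mask = elementwise.all(axis=1)                # shape (nrows,)

flat = dataset[elementwise]  # 1D: individual values, row structure is lost
rows = dataset[row_mask]     # 2D: only complete rows are kept

print(flat.ndim, rows.ndim)
```

Indexing with a mask of the same shape as the array always returns a flat 1D result, which is why the direct 2D analogue of X[X>0] does not select rows.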