NumPy: keep only the rows whose values all lie between the 5th and 95th percentiles

I'm looking for a Pythonic way to filter a 2D dataset of shape (nrows, ncols), keeping only those rows in which every value falls between the 5th and 95th percentile of its column. np.percentile(dataset, 5, axis=0) returns an array of ncols values, one threshold per column.

For a 1D array, writing something like X[X>0] is trivial. What is the approach when generalizing to 2D or higher dimensions? For example, how would something like X[X>np.percentile(dataset, 5, axis=0)] be made to select whole rows?


Solution 1:

If I understand correctly, in your 2D example you can use np.all() to find the rows where the criterion is satisfied in every column. You can then index with that boolean result, just like X[X>0] (see below for an example).
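To make the np.all() step concrete, here is a minimal sketch on a small hypothetical array (the values are made up for illustration and are not from the question). It shows how a 2D boolean mask collapses into one boolean per row:

```python
import numpy as np

# Hypothetical 3x2 array, used only to illustrate the masking step.
X = np.array([[1.0, 5.0],
              [2.0, -1.0],
              [3.0, 4.0]])

mask = X > 0                    # 2D boolean array, same shape as X
row_ok = np.all(mask, axis=1)   # 1D boolean array, one entry per row
kept = X[row_ok]                # rows where every column passed the test
```

Indexing with the 1D row_ok selects whole rows, whereas indexing with the 2D mask directly would return a flattened 1D array of the matching elements.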

I am not sure how to generalize to higher dimensions, but maybe np.take (https://numpy.org/doc/stable/reference/generated/numpy.take.html) is what you are looking for?
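One possible way to generalize without np.take, offered as a sketch rather than a definitive answer: assume axis 0 indexes the samples and every remaining axis should be checked, then pass a tuple of axes to np.all(). The array shape, the seed, and the name other_axes below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
X = rng.normal(size=(200, 3, 4))        # hypothetical data: 200 samples of shape (3, 4)

lo = np.percentile(X, 5, axis=0)        # shape (3, 4), one threshold per position
hi = np.percentile(X, 95, axis=0)       # shape (3, 4)

mask = (X > lo) & (X < hi)              # thresholds broadcast against shape (200, 3, 4)
other_axes = tuple(range(1, X.ndim))    # (1, 2): every axis except the sample axis
keep = np.all(mask, axis=other_axes)    # shape (200,), one boolean per sample

filtered = X[keep]                      # shape (n_kept, 3, 4)
```

Because the percentiles are taken along axis 0, their shape matches the trailing dimensions of X, so the comparison broadcasts without any reshaping.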

2D example:

# Setup
import numpy as np
np.random.seed(100)
dataset = np.random.normal(size=(10,2))
display(dataset)

array([[-1.74976547,  0.3426804 ],
       [ 1.1530358 , -0.25243604],
       [ 0.98132079,  0.51421884],
       [ 0.22117967, -1.07004333],
       [-0.18949583,  0.25500144],
       [-0.45802699,  0.43516349],
       [-0.58359505,  0.81684707],
       [ 0.67272081, -0.10441114],
       [-0.53128038,  1.02973269],
       [-0.43813562, -1.11831825]])

# Indexing
lo = np.percentile(dataset, 5, axis=0)
hi = np.percentile(dataset, 95, axis=0)
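idx = (lo < dataset) & (dataset < hi) # 2D boolean mask, same shape as dataset

# np.all(..., axis=1) reduces the mask to a 1D index with one entry per row
dataset[np.all(idx, axis=1)]

array([[ 0.98132079,  0.51421884],
       [ 0.22117967, -1.07004333],
       [-0.18949583,  0.25500144],
       [-0.45802699,  0.43516349],
       [-0.58359505,  0.81684707],
       [ 0.67272081, -0.10441114]])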
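If this filter is needed in more than one place, the steps above could be wrapped in a small helper. This is only a sketch: the function name keep_rows_within_percentiles and its lower/upper parameters are hypothetical, and it relies on np.percentile accepting a list of percentiles so that both thresholds come from a single call.

```python
import numpy as np

def keep_rows_within_percentiles(X, lower=5, upper=95):
    """Return the rows of the 2D array X whose values all lie strictly
    between the per-column `lower` and `upper` percentiles."""
    # A list of percentiles yields an array of shape (2, ncols),
    # which unpacks into the lower and upper thresholds.
    lo, hi = np.percentile(X, [lower, upper], axis=0)
    inside = (X > lo) & (X < hi)          # 2D boolean mask
    return X[np.all(inside, axis=1)]      # keep rows where every column is inside

# Same setup as in the 2D example
np.random.seed(100)
dataset = np.random.normal(size=(10, 2))
filtered = keep_rows_within_percentiles(dataset)
```

The comparison is strict, matching the example above; switching to >= and <= would also keep rows that sit exactly on a threshold.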