Find unique rows in numpy.array
As of NumPy 1.13, np.unique accepts an axis keyword, so you can select unique values along any axis of an N-dimensional array. To get unique rows:
unique_rows = np.unique(original_array, axis=0)
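For example, a minimal sketch of the axis-based approach (return_index and return_counts are standard np.unique keywords, shown here just to illustrate what else you can get back):

import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# unique rows, the index of each row's first occurrence, and how often it appears
unique_rows, first_idx, counts = np.unique(a, axis=0,
                                           return_index=True,
                                           return_counts=True)
print(unique_rows)  # 3 unique rows, in lexicographically sorted order
print(first_idx)    # [1 0 4]
print(counts)       # [2 2 1]

Note that, as with np.unique in general, the rows come back sorted rather than in order of first appearance; use a[np.sort(first_idx)] if you need the original order.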
Yet another possible solution:
np.vstack({tuple(row) for row in a})
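A minimal sketch of the same set-of-tuples idea (note that a set does not preserve row order, and newer NumPy versions may warn when vstack is handed a non-sequence iterable such as a set, so wrapping it in list() is safer):

import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# hash each row as a tuple to drop duplicates; the resulting row order is arbitrary
unique_rows = np.vstack(list({tuple(row) for row in a}))
print(unique_rows.shape)  # (3, 6)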
An alternative to using structured arrays is to use a view of a void type that joins the whole row into a single item:
a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])
b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)
unique_a = a[idx]
>>> unique_a
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
EDIT: Added np.ascontiguousarray following @seberg's recommendation. This will slow the method down if the array is not already contiguous.
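As a minimal illustration of when that copy happens (the column slice here is only there to produce a non-contiguous view for demonstration):

import numpy as np

a = np.random.randint(2, size=(1000, 12))
sub = a[:, ::2]                    # every other column: not C-contiguous
print(sub.flags['C_CONTIGUOUS'])   # False -> .view() directly on sub would raise

c = np.ascontiguousarray(sub)      # makes a contiguous copy here (extra time/memory);
print(c.flags['C_CONTIGUOUS'])     # True; it is a no-op when the input is already contiguous

b = c.view(np.dtype((np.void, c.dtype.itemsize * c.shape[1])))
_, idx = np.unique(b, return_index=True)
unique_sub = sub[idx]              # unique rows of the sliced view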
EDIT: The above can be slightly sped up, perhaps at the cost of clarity, by doing:
unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])
Also, at least on my system, it performs on par with, or even better than, the lexsort method:
a = np.random.randint(2, size=(10000, 6))
%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
100 loops, best of 3: 3.17 ms per loop
%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
100 loops, best of 3: 5.93 ms per loop
a = np.random.randint(2, size=(10000, 100))
%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
10 loops, best of 3: 29.9 ms per loop
%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
10 loops, best of 3: 116 ms per loop
If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy's structured arrays.
The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn't make a copy, and is quite efficient.
As a quick example:
import numpy as np
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])
ncols = data.shape[1]
dtype = data.dtype.descr * ncols
struct = data.view(dtype)
uniq = np.unique(struct)
uniq = uniq.view(data.dtype).reshape(-1, ncols)
print(uniq)
To understand what's going on, have a look at the intermediary results.
Once we view things as a structured array, each element in the array is a row in your original array. (Basically, it's a similar data structure to a list of tuples.)
In [71]: struct
Out[71]:
array([[(1, 1, 1, 0, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(1, 1, 1, 0, 0, 0)],
       [(1, 1, 1, 1, 1, 0)]],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
In [72]: struct[0]
Out[72]:
array([(1, 1, 1, 0, 0, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
Once we run numpy.unique, we'll get a structured array back:
In [73]: np.unique(struct)
Out[73]:
array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
That we then need to view as a "normal" array (_ stores the result of the last calculation in IPython, which is why you're seeing _.view...):
In [74]: _.view(data.dtype)
Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])
And then reshape back into a 2D array (-1 is a placeholder that tells numpy to calculate the correct number of rows, given the number of columns):
In [75]: _.reshape(-1, ncols)
Out[75]:
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
Obviously, if you wanted to be more concise, you could write it as:
import numpy as np

def unique_rows(data):
    uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
    return uniq.view(data.dtype).reshape(-1, data.shape[1])

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])

print(unique_rows(data))
Which results in:
[[0 1 1 1 0 0]
[1 1 1 0 0 0]
[1 1 1 1 1 0]]
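One caveat worth noting: the view trick assumes the rows are laid out contiguously in memory, so the concise function above will raise on, e.g., a sliced view. A slightly more defensive sketch (not part of the original answer) forces a contiguous copy only when needed:

import numpy as np

def unique_rows(data):
    # np.ascontiguousarray is a no-op for arrays that are already C-contiguous,
    # and copies otherwise so that the row-wise structured view below is valid
    data = np.ascontiguousarray(data)
    uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
    return uniq.view(data.dtype).reshape(-1, data.shape[1])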
np.unique, when I run it on np.random.random(100).reshape(10,10), returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples:
array = #your numpy array of lists
new_array = [tuple(row) for row in array]
uniques = np.unique(new_array)
That is the only way I see you changing the types to do what you want, and I am not sure if the list iteration to change to tuples is okay with your "not looping through". (Note, though, that np.unique converts the list of tuples back into a 2D array and flattens it, so on its own this call still returns unique elements rather than unique rows; see the sketch below.)
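If you do go the tuple route, a small sketch that actually keeps whole rows unique (using dict.fromkeys so the first occurrence of each row is kept, in order) might look like this:

import numpy as np

a = np.random.randint(2, size=(10, 4))

# rows become hashable as tuples; dict.fromkeys drops duplicates
# while preserving the order of first appearance
unique_rows = np.array(list(dict.fromkeys(tuple(row) for row in a)))
print(unique_rows)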