Find index where elements change value numpy
Suppose I have
>>> v
array([1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 3, 4, 3, 4, 3, 4, 5, 5, 5])
Is there an efficient numpy way to find each index where the value changes? For instance, I would want some result like,
>>> index_of_changed_values(v)
[0, 5, 8, 9, 10, 11, 12, 13, 14, 15, 16]
If this is not possible with some numpy routine, what is a fast way to do it in Python? It would also be useful to be pointed to some good numpy tutorials, since I am a numpy beginner.
Solution 1:
You can get this functionality in numpy by comparing each element with its neighbor:
v[:-1] != v[1:]
array([False, False, False, False,  True, False, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True, False, False], dtype=bool)
To get the indices, use the "where" function:
np.where(v[:-1] != v[1:])[0]
array([ 4, 7, 8, 9, 10, 11, 12, 13, 14, 15])
From here you can add one to the indices and prepend a zero (for the first element) to get the same indexing scheme as in your question.
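A minimal sketch of that final step (the same recipe Solution 3 spells out; the variable name idx is mine):
import numpy as np
v = np.array([1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 3, 4, 3, 4, 3, 4, 5, 5, 5])
# Shift the comparison indices by one so they point at the first element
# of each new run, then prepend 0 (index 0 always starts a run).
idx = np.concatenate(([0], np.where(v[:-1] != v[1:])[0] + 1))
print(idx)  # [ 0  5  8  9 10 11 12 13 14 15 16]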
Solution 2:
Similar to @kith's answer, but requires less massaging of the result:
np.where(np.roll(v,1)!=v)[0]
No need to prepend 0 or add 1. Example:
>>> v=np.array([1, 1, 1, 2, 2, 3, 3, 4, 4, 4])
>>> np.where(np.roll(v,1)!=v)[0]
array([0, 3, 5, 7])
EDIT: as @Praveen mentioned, this fails when the first and last elements are equal, because np.roll wraps the array around.
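A small example of that corner case (this input is mine, just to illustrate the failure):
v = np.array([2, 1, 1, 2])
np.where(np.roll(v, 1) != v)[0]
# array([1, 3]) -- the change at index 0 is missed, because np.roll
# wraps the last element (2) around to compare against the first (2).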
Solution 3:
Almost ten years later, but I came across this one today.
@kith's answer is good, but may not be as neat as we want (it also leaves some steps implicit). That answer in its complete form would be:
v = np.array([1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 3, 4, 3, 4, 3, 4, 5, 5, 5])
np.concatenate((np.array([0]), np.where(v[:-1] != v[1:])[0] + 1), axis=0)
An alternative I prefer is:
np.where(np.diff(v, prepend=np.nan))[0]
which also returns
array([ 0, 5, 8, 9, 10, 11, 12, 13, 14, 15, 16], dtype=int64)
As I said, the idea is the same as @kith's, but:
- I replace v[:-1] != v[1:] with np.diff(). Inside np.where the array is cast to boolean; this does not change much, but seems neater.
- I removed the extra step of adding 1 and prepending 0. This is done by prepending np.nan before calling np.diff(): the first element of the diff output is then np.nan, and in Python np.nan always evaluates as True.
(Note that the prepend keyword of np.diff requires NumPy >= 1.16.)
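As a usage note (my addition, not part of the original answer): the change indices combine naturally with np.split to recover the runs themselves:
import numpy as np
v = np.array([1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 3, 4, 3, 4, 3, 4, 5, 5, 5])
idx = np.where(np.diff(v, prepend=np.nan))[0]
# Split at every change point except the leading 0, one sub-array per run.
runs = np.split(v, idx[1:])
print([r.tolist() for r in runs])
# [[1, 1, 1, 1, 1], [2, 2, 2], [3], [4], [3], [4], [3], [4], [3], [4], [5, 5, 5]]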
Solution 4:
Great question and answers!
I am working with a vector of about 1 million monotonically non-decreasing integers running from 1 to 100,000 (e.g. [1, 1, 1, 2, 3, 3, 4, ..., 100000]). For this data set, there is a noticeable performance difference between the two idioms discussed above, and between using diff() with and without the prepend kwarg:
%timeit np.where(np.diff(v, prepend=np.nan))
15.3 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.where(np.diff(v))[0] + 1
7.41 ms ± 72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.where(v[:-1] != v[1:])[0] + 1
2.85 ms ± 41.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the plain slicing comparison (v[:-1] != v[1:]) is about 5 times faster than diff() with the prepend kwarg and over twice as fast as diff() without it (on my ancient MacBook Air, anyway). For most use cases this performance difference won't matter, but I'm working with thousands of data sets like this (billions of rows total), so I need to keep performance in mind.
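For anyone who wants to reproduce timings like these, here is a minimal sketch; the synthetic vector is an assumption standing in for the data set described above:
import numpy as np
import timeit

# Hypothetical stand-in: ~1 million monotonically non-decreasing
# integers running from 1 to 100,000.
rng = np.random.default_rng(0)
v = np.sort(rng.integers(1, 100_001, size=1_000_000))

for stmt in ("np.where(np.diff(v, prepend=np.nan))[0]",
             "np.where(np.diff(v))[0] + 1",
             "np.where(v[:-1] != v[1:])[0] + 1"):
    per_loop = timeit.timeit(stmt, globals={"np": np, "v": v}, number=100) / 100
    print(f"{stmt}: {per_loop * 1e3:.2f} ms per loop")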