Find index where elements change value numpy

Suppose I have

>>> v
array([1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 3, 4, 3, 4, 3, 4, 5, 5, 5])

Is there an efficient numpy way to find each index where the value changes? For instance, I would want some result like,

>>> index_of_changed_values(v)
[0, 5, 8, 9, 10, 11, 12, 13, 14, 15, 16]

If this is not possible with some numpy routine, what is a fast way to do it in python? It would also be useful to me to be referred to some good numpy tutorials since I am a numpy beginner.


Solution 1:

You can get this functionality in numpy by comparing each element with it's neighbor;

v[:-1] != v[1:]


array([False, False, False, False,  True, False, False,  True,  True,
    True,  True,  True,  True,  True,  True,  True, False, False], dtype=bool)

to get the indices you use the "where" function

np.where(v[:-1] != v[1:])[0]

array([ 4,  7,  8,  9, 10, 11, 12, 13, 14, 15])

From here you can prepend the first element and add a one to get to the same indexing scheme you have in your question.

Solution 2:

Similar to @kith answer, but requires less massaging of the result:

np.where(np.roll(v,1)!=v)[0]

No need to prepend 0 or add 1. Example:

>>> v=np.array([1, 1, 1, 2, 2, 3, 3, 4, 4, 4])
>>> np.where(np.roll(v,1)!=v)[0]
array([0, 3, 5, 7])

EDIT: as @Praveen mentioned, this fails when the last and the first elements are equal.

Solution 3:

Almost ten years later, but I came across this one today.

@kith answer is good, but may not be as neat as we want (also taking into account the steps not explicit in the answer).

that answer in the complete form would be,

v = np.array([1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 3, 4, 3, 4, 3, 4, 5, 5, 5])
np.concatenate((np.array([0]),np.where(v[:-1] != v[1:])[0]+1),axis=0)

An alternative I like more is,

np.where(np.diff(v,prepend=np.nan))[0]

which also returns

array([ 0,  5,  8,  9, 10, 11, 12, 13, 14, 15, 16], dtype=int64)

As I said, the idea is the same as @kith's but,

  • I replace v[:-1] != v[1:] for np.diff(), then in np.where the array is casted to boolean, this does not change much but seems neater.
  • I removed the extra step of adding 1 and prepending 0. This is done by prepending np.nan before doing np.diff(). The first element of the diff output will then be np.nan, and in python np.nan always evaluates True.

Solution 4:

Great question and answers!

I am working with a vector with about 1 million monotonically non-decreasing integers running from 1 to 100,000 (e.g. [1, 1, 1, 2, 3, 3, 4, ..., 100000]). For this data set, it seems there's a noticeable performance difference between the 2 idioms discussed above and whether or not the prepend kwarg is used:

%timeit np.where(np.diff(v, prepend=np.nan))                                                                                                                             
15.3 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.where(np.diff(v))[0] + 1                                                                                                                                     
7.41 ms ± 72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.where(v[:-1] != v[1:])[0] + 1                                                                                                                                       
2.85 ms ± 41.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the fancy-indexing invocation is 5 times faster compared to using diff() with the prepend kwarg and over twice as fast as using diff with no prepend (on my ancient MacBook Air anyway). For most use cases this performance difference isn't going to matter but I'm working with thousands of data sets like this (billions of rows total) so I need to keep performance in mind.