Finding difference within grouped dataframe in python

I have this dataframe:

                           Value      ID
          Timestamp
-----------------------------------------
2018-07-03 02:19:28          45      111
2018-07-03 02:19:29          36      111
2018-07-03 02:19:30          64      111
2018-07-03 02:19:31          35      111
2018-07-03 02:19:32          22      111 
...            
2018-07-03 03:43:14          35      232 
2018-07-03 03:43:15          44      232
2018-07-03 03:43:16          64      232
2018-07-03 03:43:17          44      232
2018-07-03 03:43:18          64      232
...
2018-07-03 05:20:28          35      555
2018-07-03 05:21:28          44      555
2018-07-03 05:22:28          75      555 
2018-07-03 05:19:28          84      555
2018-07-03 05:19:28          35      555 
...

Here, each ID represents a different "subset" of the total dataset. And so ID 111 is its own time series dataset, 232 is its own time series dataset, and 555 is its own time series dataset, with many more not shown. What I want to do, using python, is for each of these data subsets find the number of peaks and valleys based on values in the "Values" column, and then append that to the original dataframe like so:

                          Value      ID       Curve_Changes
          Timestamp
------------------------------------------------------------
2018-07-03 02:19:28          45      111                  4
2018-07-03 02:19:29          36      111                  4
2018-07-03 02:19:30          64      111                  4
2018-07-03 02:19:31          35      111                  4
2018-07-03 02:19:32          22      111                  4  
...             
2018-07-03 03:43:14          35      232                  9    
2018-07-03 03:43:15          44      232                  9
2018-07-03 03:43:16          64      232                  9
2018-07-03 03:43:17          44      232                  9
2018-07-03 03:43:18          64      232                  9
...
2018-07-03 05:20:28          35      555                 12
2018-07-03 05:21:28          44      555                 12
2018-07-03 05:22:28          75      555                 12 
2018-07-03 05:19:28          84      555                 12
2018-07-03 05:19:28          35      555                 12 
...

Based on this ideal output example dataframe, this would mean that if you were to plot the time series data subset corresponding to ID 111, you would see 4 curve changes (whether a peak or valley), and if you were to plot the time series data subset corresponding to ID 232, you would see 9 curve changes (whether a peak or valley), etc.

I am trying to use this code to find the number of peaks and valleys:

slopes = df["Value"].diff().bfill()
signs = slopes > 0
changes = signs.astype(float).diff(periods=-1).fillna(0)
num_changes = changes.abs().sum()

where num_changes is that number of curve changes I want. I am able to get this to work on the dataframe as a whole, but I am confused by how I can get this to work for each individual time series data subset so as to produce the ideal output example dataframe I showed above. I am not sure how this should be organized, but I am thinking this will be a .groupby() type task, where I think I will need to "groupby" the "ID" column, but I am not sure. How can I group by dataframe by the data subsets and find the number of curve changes for each subset and match those to the orginal dataframe?


Solution 1:

Use GroupBy.transform here for apply solution per groups to new column:

def f(x):
    #for debug
    print (x)
    slopes = x.diff().bfill()
    #for debug
    print (slopes)
    signs = slopes > 0
    changes = signs.astype(float).diff(periods=-1).fillna(0)
    return changes.abs().sum()

df['Curve_Changes'] = df.groupby('ID')['Value'].transform(f)