Rolling median algorithm in C
Solution 1:
I have looked at R's src/library/stats/src/Trunmed.c
a few times as I wanted something similar too in a standalone C++ class / C subroutine. Note that this are actually two implementations in one, see src/library/stats/man/runmed.Rd
(the source of the help file) which says
\details{
Apart from the end values, the result \code{y = runmed(x, k)} simply has
\code{y[j] = median(x[(j-k2):(j+k2)])} (k = 2*k2+1), computed very
efficiently.
The two algorithms are internally entirely different:
\describe{
\item{"Turlach"}{is the Härdle-Steiger
algorithm (see Ref.) as implemented by Berwin Turlach.
A tree algorithm is used, ensuring performance \eqn{O(n \log
k)}{O(n * log(k))} where \code{n <- length(x)} which is
asymptotically optimal.}
\item{"Stuetzle"}{is the (older) Stuetzle-Friedman implementation
which makes use of median \emph{updating} when one observation
enters and one leaves the smoothing window. While this performs as
\eqn{O(n \times k)}{O(n * k)} which is slower asymptotically, it is
considerably faster for small \eqn{k} or \eqn{n}.}
}
}
It would be nice to see this re-used in a more standalone fashion. Are you volunteering? I can help with some of the R bits.
Edit 1: Besides the link to the older version of Trunmed.c above, here are current SVN copies of
-
Srunmed.c
(for the Stuetzle version) -
Trunmed.c
(for the Turlach version) -
runmed.R
for the R function calling these
Edit 2: Ryan Tibshirani has some C and Fortran code on fast median binning which may be a suitable starting point for a windowed approach.
Solution 2:
I couldn't find a modern implementation of a c++ data structure with order-statistic so ended up implementing both ideas in top coders link suggested by MAK ( Match Editorial: scroll down to FloatingMedian).
Two multisets
The first idea partitions the data into two data structures (heaps, multisets etc) with O(ln N) per insert/delete does not allow the quantile to be changed dynamically without a large cost. I.e. we can have a rolling median, or a rolling 75% but not both at the same time.
Segment tree
The second idea uses a segment tree which is O(ln N) for insert/deletions/queries but is more flexible. Best of all the "N" is the size of your data range. So if your rolling median has a window of a million items, but your data varies from 1..65536, then only 16 operations are required per movement of the rolling window of 1 million!!
The c++ code is similar to what Denis posted above ("Here's a simple algorithm for quantized data")
GNU Order Statistic Trees
Just before giving up, I found that stdlibc++ contains order statistic trees!!!
These have two critical operations:
iter = tree.find_by_order(value)
order = tree.order_of_key(value)
See libstdc++ manual policy_based_data_structures_test (search for "split and join").
I have wrapped the tree for use in a convenience header for compilers supporting c++0x/c++11 style partial typedefs:
#if !defined(GNU_ORDER_STATISTIC_SET_H)
#define GNU_ORDER_STATISTIC_SET_H
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
// A red-black tree table storing ints and their order
// statistics. Note that since the tree uses
// tree_order_statistics_node_update as its update policy, then it
// includes its methods by_order and order_of_key.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
T,
__gnu_pbds::null_type,
std::less<T>,
__gnu_pbds::rb_tree_tag,
// This policy updates nodes' metadata for order statistics.
__gnu_pbds::tree_order_statistics_node_update>;
#endif //GNU_ORDER_STATISTIC_SET_H
Solution 3:
I've done a C implementation here. A few more details are in this question: Rolling median in C - Turlach implementation.
Sample usage:
int main(int argc, char* argv[])
{
int i, v;
Mediator* m = MediatorNew(15);
for (i=0; i<30; i++) {
v = rand() & 127;
printf("Inserting %3d \n", v);
MediatorInsert(m, v);
v = MediatorMedian(m);
printf("Median = %3d.\n\n", v);
ShowTree(m);
}
}