How to find the indices of the top 10,000 elements in a symmetric matrix(12k X 12k) in R
I have a nonzero symmetric matrix 'matr' that is 12000X12000. I need to find the indices of the top 10000 elements in 'matr' in R. The code I have written takes a long time - I was wondering if there was any pointers to make it faster.
listk <- numeric(0)
for( i in 1:10000) {
idx <- which(matr == max(matr), arr.ind=T)
if( length(idx) != 0) {
listk <- rbind( listk, idx[1,])
matr[idx[1,1], idx[1,2]] <- 0
matr[idx[2,1], idx[2,2]] <- 0
}
}
Here's how you might find the indices (ij
) of the 4 largest elements in a 10-by-10 matrix m
.
## Sample data
m <- matrix(runif(100), ncol=10)
## Extract the indices of the 4 largest elements
(ij <- which(m >= sort(m, decreasing=T)[4], arr.ind=TRUE))
# row col
# [1,] 2 1
# [2,] 5 1
# [3,] 6 2
# [4,] 3 10
## Use the indices to extract the values
m[ij]
# [1] 0.9985190 0.9703268 0.9836373 0.9914510
Edit:
For large matrices, performing a partial sort will be a faster way to find the 10,000th largest element:
v <- runif(1e7)
system.time(a <- sort(v, decreasing=TRUE)[10000])
# user system elapsed
# 4.35 0.03 4.38
system.time(b <- -sort(-v, partial=10000)[10000])
# user system elapsed
# 0.60 0.09 0.69
a==b
# [1] TRUE
I like @JoshO'Brien 's answer; the use of partial sorting is great! Here's an Rcpp solution (I'm not a strong C++ programmer so probably bone-headed errors; corrections welcome... how would I template this in Rcpp, to handle different types of input vector?)
I start by including the appropriate headers and using namespaces for convenience
#include <Rcpp.h>
#include <queue>
using namespace Rcpp;
using namespace std;
Then arrange to expose my C++ function to R
// [[Rcpp::export]]
IntegerVector top_i_pq(NumericVector v, int n)
and define some variables, most importantly a priority_queue
to hold as a pair the numeric value and index. The queue is ordered so the smallest values are at the 'top', with small relying on the standard pair<> comparator.
typedef pair<double, int> Elt;
priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
vector<int> result;
Now I'll walk through the input data, adding it to the queue if either (a) I don't yet have enough values or (b) the current value is larger than the smallest value in the queue. In the latter case, I pop off the smallest value, and insert it's replacement. In this way the priority queue always contains the n_max largest elements.
for (int i = 0; i != v.size(); ++i) {
if (pq.size() < n)
pq.push(Elt(v[i], i));
else {
Elt elt = Elt(v[i], i);
if (pq.top() < elt) {
pq.pop();
pq.push(elt);
}
}
}
And finally I pop the indexes from the priority queue into the return vector, remembering to translate to 1-based R coordinates.
result.reserve(pq.size());
while (!pq.empty()) {
result.push_back(pq.top().second + 1);
pq.pop();
}
and return the result to R
return wrap(result);
This has nice memory use (the priority queue and return vector are both small relative to the original data) and is fast
> library(Rcpp); sourceCpp("top_i_pq.cpp"); z <- runif(12000 * 12000)
> system.time(top_i_pq(z, 10000))
user system elapsed
0.992 0.000 0.998
Problems with this code include:
The default comparator
greater<Elt>
works so that, in the case of a tie spanning the value of the _n_th element, the last, rather than first, duplicate is retained.NA values (and non-finite values?) may not be handled correctly; I'm not sure whether this is true or not.
The function only works for
NumericVector
input, but the logic is appropriate for any R data type for which an appropriate ordering relationship is defined.
Problems 1 and 2 can likely be dealt with by writing an appropriate comparator; maybe for 2 this is already implemented in Rcpp? I don't know how to leverage C++ language features and the Rcpp design to avoid re-implementing the function for each data type I want to support.
Here's the full code:
#include <Rcpp.h>
#include <queue>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export]]
IntegerVector top_i_pq(NumericVector v, int n)
{
typedef pair<double, int> Elt;
priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
vector<int> result;
for (int i = 0; i != v.size(); ++i) {
if (pq.size() < n)
pq.push(Elt(v[i], i));
else {
Elt elt = Elt(v[i], i);
if (pq.top() < elt) {
pq.pop();
pq.push(elt);
}
}
}
result.reserve(pq.size());
while (!pq.empty()) {
result.push_back(pq.top().second + 1);
pq.pop();
}
return wrap(result);
}
A bit late into the party, but I came up with this, which avoids the sort.
Say you want the top 10k elements from you 12k x 12k matrix. The idea is to "clip" the data to the elements corresponding to a quantile of that size.
find_n_top_elements <- function( x, n ){
#set the quantile to correspond to n top elements
quant <- n / (dim(x)[1]*dim(x)[2])
#select the cutpoint to get the quantile above quant
lvl <- quantile(x, probs=1.0-quant)
#select the elements above the cutpoint
res <- x[x>lvl[[1]]]
}
#create a 12k x 12k matrix (1,1Gb!)
n <- 12000
x <- matrix( runif(n*n), ncol=n)
system.time( res <- find_n_top_elements( x, 10e3 ) )
Resulting in
system.time( res <- find_n_top_elements( x, 10e3 ) )
user system elapsed
3.47 0.42 3.89
For comparison, just sorting x on my system takes
system.time(sort(x))
user system elapsed
30.69 0.21 31.33
Matrix in R is like a vector.
mat <- matrix(sample(1:5000, 10000, rep=T), 100, 100)
mat.od <- order(mat, decreasing = T)
mat.od.arr <- cbind(mat.od%%nrow(mat), mat.od%/%nrow(mat)+1)
mat.od.arr[,2][mat.od.arr[,1]==0] <- mat.od.arr[,2][mat.od.arr[,1]==0] - 1
mat.od.arr[,1][mat.od.arr[,1]==0] <- nrow(mat)
head(mat.od.arr)
# [,1] [,2]
# [1,] 58 5
# [2,] 59 72
# [3,] 38 22
# [4,] 23 10
# [5,] 38 14
# [6,] 90 15
mat[58, 5]
# [1] 5000
mat[59, 72]
# [1] 5000
mat[38, 22]
# [1] 4999
mat[23, 10]
# [1] 4998