Find closest value in a vector with binary search

As a silly toy example, suppose

x=4.5
w=c(1,2,4,6,7)

I wonder if there is a simple R function that finds the index of the closest match to x in w. So if foo is that function, foo(w,x) would return 3. The function match is the right idea, but seems to apply only for exact matches.

Solutions here (e.g. which.min(abs(w - x)), which(abs(w-x)==min(abs(w-x))), etc.) are all O(n) instead of log(n) (I'm assuming that w is already sorted).


Solution 1:

R>findInterval(4.5, c(1,2,4,5,6))
[1] 3

will do that with price-is-right matching (closest without going over).

Solution 2:

You can use data.table to do a binary search:

dt = data.table(w, val = w) # you'll see why val is needed in a sec
setattr(dt, "sorted", "w")  # let data.table know that w is sorted

Note that if the column w isn't already sorted, then you'll have to use setkey(dt, w) instead of setattr(.).

# binary search and "roll" to the nearest neighbour
dt[J(x), roll = "nearest"]
#     w val
#1: 4.5   4

In the final expression the val column will have the you're looking for.

# or to get the index as Josh points out
# (and then you don't need the val column):
dt[J(x), .I, roll = "nearest", by = .EACHI]
#     w .I
#1: 4.5  3

# or to get the index alone
dt[J(x), roll = "nearest", which = TRUE]
#[1] 3

Solution 3:

See match.closest() from the MALDIquant package:

> library(MALDIquant)
> match.closest(x, w)
[1] 3

Solution 4:

NearestValueSearch = function(x, w){
  ## A simple binary search algo
  ## Assume the w vector is sorted so we can use binary search
  left = 1
  right = length(w)
  while(right - left > 1){
    middle = floor((left + right) / 2)
    if(x < w[middle]){
      right = middle
    }
    else{
      left = middle
    }
  }
  if(abs(x - w[right]) < abs(x - w[left])){
    return(right)
  }
  else{
    return(left)
  }
}


x = 4.5
w = c(1,2,4,6,7)
NearestValueSearch(x, w) # return 3

Solution 5:

x = 4.5
w = c(1,2,4,6,7)

closestLoc = which(min(abs(w-x)))
closestVal = w[which(min(abs(w-x)))]

# On my phone- please pardon typos

If your vector is lengthy, try a 2-step approach:

x = 4.5
w = c(1,2,4,6,7)

sdev = sapply(w,function(v,x) abs(v-x), x = x)
closestLoc = which(min(sdev))

for maddeningly long vectors (millions of rows!, warning- this will actually be slower for data which is not very, very, very large.)

require(doMC)
registerDoMC()

closestLoc = which(min(foreach(i = w) %dopar% {
   abs(i-x)
}))

This example is just to give you a basic idea of leveraging parallel processing when you have huge data. Note, I do not recommend you use it for simple & fast functions like abs().