Optimising np.searchsorted() with Cython or an alternative

I have the following functions. I was trying to optimize np.searchsorted(), and got a good result writing my own version in Cython. But then I realized that simply wrapping the original call in a cdef cfunc() works even better, and I don't understand the mechanics of Python, Cython or NumPy well enough to know why. What would be the best and most efficient way to write this function? In my script this is one of the functions called several thousand times during some computations, so I'd like the cost per call to be as low as possible. Additionally, I don't need to duplicate np.searchsorted() entirely; I only need to produce the specific output i = np.searchsorted(time_points > t, True).

Case 1

%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
cimport openmp
cimport cython
from cpython cimport array
import array
import cython.parallel as cp
from cython.parallel import parallel, prange
import numpy as np
cimport numpy as np
import random
import pickle
import matplotlib.pyplot as pl

from timeit import default_timer as timer
cdef int ntime = 10**4
cdef float t = 2 #some variable in general 
time_points = np.arange(ntime, dtype=np.float64)
start_update = timer()
i = np.searchsorted(time_points > t, True)
print(i)
end_update = timer()
print(end_update-start_update)

Case 2

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing
@cython.cdivision(True)     # Deactivate division-by-zero checking
cdef int mysearch(np.ndarray[np.float64_t, ndim=1] time_points, float val, int nt):
    # Linear scan: return the index of the first element greater than val,
    # or nt if no such element exists (mirrors the np.searchsorted output).
    cdef int idx
    if val >= time_points[nt - 1]:
        return nt
    for idx in range(nt):
        if time_points[idx] > val:
            return idx
    return nt
print(mysearch(time_points, t, ntime))
end_update = timer()
b = end_update - start_update
print(b)

Case 3

start_update = timer()
cdef cfunc():
    i = np.searchsorted(time_points > t, True)
    print(i, np.searchsorted(time_points > t, True))
    return i
end_update = timer()
a = end_update - start_update
print(a)

The times for case 1, case 2 and case 3 are 0.00013096071779727936 s, 5.245860666036606e-05 s, and 6.770715117454529e-07 s respectively.

I am happy that case 2 works better than case 1, but why is case 3 so much faster? Is something special going on, and how can I beat those speeds?


Solution 1:

The main problem with np.searchsorted(time_points > t, True) is the sub-expression time_points > t. While np.searchsorted itself runs in O(log n) time (with n the size of time_points), evaluating time_points > t takes O(n) time and allocates a temporary boolean array, so it is clearly the bottleneck. In fact, you do not need this sub-expression at all: i = np.searchsorted(time_points, t, 'right') does the job correctly. Note that the last case does not execute anything between the two timers, as explained by @DavidW in the comments: it merely defines a Cython function, which is why its measured "time" is so tiny.
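A quick sanity check in plain NumPy (no Cython needed; the values of `ntime` and `t` mirror those in the question) confirms that the direct call agrees with the original expression:

```python
import numpy as np

ntime = 10**4
t = 2.0
time_points = np.arange(ntime, dtype=np.float64)  # sorted, as searchsorted requires

# Original: first builds an O(n) temporary boolean array, then searches it
i_slow = np.searchsorted(time_points > t, True)

# Direct O(log n) binary search on the sorted array itself
i_fast = np.searchsorted(time_points, t, 'right')

print(i_slow, i_fast)  # prints: 3 3
```

Both return the index of the first element strictly greater than t (here 3, since time_points[3] = 3.0 is the first value above 2.0), but the second form skips the O(n) comparison entirely, so it scales far better as ntime grows.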