cupy.var (variance) performance much slower than numpy.var trying to understand why

I am hoping to move my custom camera video pipeline to use video memory with a combination of numba and cupy and avoid passing data back to the host memory if at all possible. As part of doing this I need to port my sharpness detection routine to use cuda. The easiest way to do this seemed to be to use cupy as essential all I do is compute the variance of a laplace transform of each image. The trouble I am hitting is the cupy variance computation appears to be ~ 8x slower than numpy. This includes the time it takes to copy the device ndarray to the host and perform the variance computation on the cpu using numpy. I am hoping to gain a better understanding of why the variance computation ReductionKernel employed by cupy on the GPU is so much slower. I'll start by including the test I ran below.

import cupy as cp
import numpy as np
from numba import cuda
import cv2
from timeit import default_timer as timer

n_runs = 10
n_warmup = 2
n_tot = n_runs + n_warmup
sizes = (100, 500, 1000, 2000, 5000, 10000)

# NumPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        np.random.seed(0)
        x = np.random.randn(*(s,s)).astype(np.uint16)
        t_start = timer()
        _ = np.var(x)
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f'NumPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}')
    

# CuPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        np.random.seed(0)
        x = np.random.randn(*(s,s)).astype(np.uint16)
        x_nb = cuda.to_device(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        x_cp = cp.asarray(x_nb)
        _ = cp.var(x_cp)
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f'CuPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}')

The output of this script is as follows

NumPy: 100x100 0.00006 +- 0.00000
NumPy: 500x500 0.00053 +- 0.00008
NumPy: 1000x1000 0.00243 +- 0.00029
NumPy: 2000x2000 0.01073 +- 0.00136
NumPy: 5000x5000 0.07470 +- 0.00264
NumPy: 10000x10000 0.28578 +- 0.00313
...
CuPy: 100x100 0.00026 +- 0.00002
CuPy: 500x500 0.00463 +- 0.00068
CuPy: 1000x1000 0.02308 +- 0.00060
CuPy: 2000x2000 0.07876 +- 0.00172
CuPy: 5000x5000 0.50830 +- 0.02237
CuPy: 10000x10000 1.98131 +- 0.03540

I also tried using float16 and float32 instead of uint16 (as it looks like the reduction kernels used by cupy work with floats but it didn't change the disparity in any meaningful way for me).
https://github.com/cupy/cupy/blob/04bf15706474a5e79ba196e70784a147cad6e26e/cupy/_core/_routines_statistics.pyx#L542

Here are some versions associate with my working environment.

>>> numpy.__version__
'1.18.5'
>>> cupy.__version__
'9.6.0'
>>> numba.__version__
'0.53.1'
Python 3.6.9
Driver Version: 470.82.01 
CUDA Version: 11.4

Any tips about what might be causing cupy to perform so poorly would be appreciated. Also if there are any things that can be done to improve the performance I'de love to know. I tried reading about ReductionKernels to understand how to optimize them but it's over my head. https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

updated results based on feedback from @myrtlecat

with dtype np.float32
CuPy (1 axis at a time): 100x100 0.00017 +- 0.00002
CuPy (1 axis at a time): 500x500 0.00041 +- 0.00001
CuPy (1 axis at a time): 1000x1000 0.00097 +- 0.00003
CuPy (1 axis at a time): 2000x2000 0.00278 +- 0.00016
CuPy (1 axis at a time): 5000x5000 0.01381 +- 0.00041
CuPy (1 axis at a time): 10000x10000 0.04313 +- 0.00355
CuPy: 100x100 0.00013 +- 0.00000
CuPy: 500x500 0.00432 +- 0.00013
CuPy: 1000x1000 0.01713 +- 0.00070
CuPy: 2000x2000 0.06713 +- 0.00079
CuPy: 5000x5000 0.41975 +- 0.00259
CuPy: 10000x10000 1.69374 +- 0.01938
NumPy: 100x100 0.00004 +- 0.00000
NumPy: 500x500 0.00022 +- 0.00001
NumPy: 1000x1000 0.00121 +- 0.00018
NumPy: 2000x2000 0.00530 +- 0.00047
NumPy: 5000x5000 0.03432 +- 0.00179
NumPy: 10000x10000 0.14227 +- 0.00503

with dtype np.uint16
CuPy (1 axis at a time): 100x100 0.00018 +- 0.00000
CuPy (1 axis at a time): 500x500 0.00124 +- 0.00003
CuPy (1 axis at a time): 1000x1000 0.00430 +- 0.00020
CuPy (1 axis at a time): 2000x2000 0.01537 +- 0.00008
CuPy (1 axis at a time): 5000x5000 0.07413 +- 0.00330
CuPy (1 axis at a time): 10000x10000 0.29842 +- 0.01291
CuPy: 100x100 0.00016 +- 0.00000
CuPy: 500x500 0.00359 +- 0.00053
CuPy: 1000x1000 0.01952 +- 0.00058
CuPy: 2000x2000 0.07719 +- 0.00076
CuPy: 5000x5000 0.48284 +- 0.00169
CuPy: 10000x10000 1.96746 +- 0.04353
NumPy: 100x100 0.00006 +- 0.00002
NumPy: 500x500 0.00053 +- 0.00010
NumPy: 1000x1000 0.00224 +- 0.00016
NumPy: 2000x2000 0.00956 +- 0.00034
NumPy: 5000x5000 0.06818 +- 0.00210
NumPy: 10000x10000 0.27071 +- 0.00747

updated script based on feedback from @myrtlecat

import cupy as cp
import numpy as np
from timeit import default_timer as timer

n_runs = 10
n_warmup = 2
n_tot = n_runs + n_warmup
sizes = (100, 500, 1000, 2000, 5000, 10000)


dtype = np.uint16
# dtype = np.float32


def mean(x):
    while x.size > 1:
        x = x.mean(-1)
    return x


def var(x):
    return mean((x - mean(x)) ** 2)


# CuPy (1 axis at a time)
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        x_cp = cp.asarray(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        _ = var(x_cp)
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"CuPy (1 axis at a time): {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

# CuPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        x_cp = cp.asarray(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        _ = cp.var(x_cp)
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"CuPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

# NumPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        t_start = timer()
        _ = np.var(x)
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"NumPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

last updated script and results based on feedback from @myrtlecat where he said that it may be an issue with 2d arrays so I tried using reshape to flatten the arrays before calling "var"

# CuPy (flattened)
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        x_cp = cp.asarray(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        _ = var(x_cp.reshape((s * s,)))
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"CuPy (flattened): {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

the results

for uint16
CuPy (flattened): 100x100 0.00018 +- 0.00006
CuPy (flattened): 500x500 0.00107 +- 0.00002
CuPy (flattened): 1000x1000 0.00414 +- 0.00020
CuPy (flattened): 2000x2000 0.01550 +- 0.00036
CuPy (flattened): 5000x5000 0.10017 +- 0.00525
CuPy (flattened): 10000x10000 0.39470 +- 0.01606
CuPy (1 axis at a time): 100x100 0.00026 +- 0.00008
CuPy (1 axis at a time): 500x500 0.00104 +- 0.00005
CuPy (1 axis at a time): 1000x1000 0.00368 +- 0.00028
CuPy (1 axis at a time): 2000x2000 0.01364 +- 0.00055
CuPy (1 axis at a time): 5000x5000 0.07639 +- 0.00311
CuPy (1 axis at a time): 10000x10000 0.29405 +- 0.00419

for float32
CuPy (flattened): 100x100 0.00015 +- 0.00007
CuPy (flattened): 500x500 0.00043 +- 0.00001
CuPy (flattened): 1000x1000 0.00159 +- 0.00003
CuPy (flattened): 2000x2000 0.00631 +- 0.00030
CuPy (flattened): 5000x5000 0.03827 +- 0.00135
CuPy (flattened): 10000x10000 0.13173 +- 0.00579
CuPy (1 axis at a time): 100x100 0.00020 +- 0.00005
CuPy (1 axis at a time): 500x500 0.00030 +- 0.00004
CuPy (1 axis at a time): 1000x1000 0.00077 +- 0.00002
CuPy (1 axis at a time): 2000x2000 0.00215 +- 0.00002
CuPy (1 axis at a time): 5000x5000 0.01387 +- 0.00049
CuPy (1 axis at a time): 10000x10000 0.04099 +- 0.00142

So it seems like for both uint16 and float32 the results are faster using the 1 axis at a time technique suggested, rather than flattening the 2d array with reshape. Although the reshape is significantly faster than passing the 2d array.

Solution 1:

I have a partial hypothesis about the problem (not a full explanation) and a work-around. Perhaps someone can fill in the gaps. I've used a quicker-and-dirtier benchmark, for brevity's sake.

The work-around: reduce one axis at a time

Cupy is much faster when reduction is performed on one axis at a time. In stead of:

x.sum()

prefer this:

x.sum(-1).sum(-1).sum(-1)...

Note that the results of these computations may differ due to rounding error.

Here are faster mean and var functions:

def mean(x):
    while x.size > 1:
        x = x.mean(-1)
    return x

def var(x):
    return mean((x - mean(x)) ** 2)

Benchmarks

import numpy as np
import cupy as cp
from timeit import timeit

def mean(x):
    while x.size > 1:
        x = x.mean(-1)
    return x

def var(x):
    return mean((x - mean(x)) ** 2)

def benchmark(label, f):
    t = timeit(f, number=10) / 10
    print(f"{label.ljust(25)}{t:0.4f}s")

x = np.random.rand(5000, 5000).astype('f4')
x_cp = cp.array(x)

benchmark("Numpy", lambda: x.var())
benchmark("Cupy", lambda: float(x_cp.var()))
benchmark("Cupy (1 axis at a time)", lambda: float(var(x_cp)))

Yielding more than 100x speed-up:

Numpy                    0.0469s
Cupy                     0.3805s
Cupy (1 axis at a time)  0.0013s

The answers are close but not identical due to rounding:

> x.var(), float(x_cp.var()), float(var(x_cp))
(0.08334004, 0.08333975821733475, 0.08334004133939743)

Why is it faster?

CUDA-capable GPUs are generally slower than modern CPUs when operating on a single thread, but they achieve high throughput by performing many identical operations in parallel.

Reduction like sum or var can be parallelised by:

dividing the input into chunks
performing reduction on the chunks in parallel
performing a second reduction on the results for each chunk

For a sufficiently large input array, if the chunk size is chosen well, this will be close to optimal performance. (Note: for small arrays, or squeeze every last drop of performance, requires more advanced techniques like the slides OP linked to).

I think it's a bug...

I think that cupy should be applying this technique to all reductions (my reading of the cupy code is that it does do this). However, for whatever reason, it does a good job of parallelising reductions over a single axis, but not over an entire array. I doubt that this is intended behaviour, but I am not a dev on cupy and so perhaps it is. The generated CUDA code and parameters used to invoke the reduction kernels are quite buried, so I'm not sure exactly what is happening in each case.

Updated benchmarks

Result of running the updated benchmark script for uint16:

CuPy (1 axis at a time): 100x100 0.00016 +- 0.00001
CuPy (1 axis at a time): 500x500 0.00029 +- 0.00000
CuPy (1 axis at a time): 1000x1000 0.00070 +- 0.00000
CuPy (1 axis at a time): 2000x2000 0.00242 +- 0.00001
CuPy (1 axis at a time): 5000x5000 0.01410 +- 0.00001
CuPy (1 axis at a time): 10000x10000 0.05145 +- 0.00149
CuPy: 100x100 0.00016 +- 0.00000
CuPy: 500x500 0.00316 +- 0.00000
CuPy: 1000x1000 0.01250 +- 0.00001
CuPy: 2000x2000 0.07283 +- 0.00290
CuPy: 5000x5000 0.44025 +- 0.00012
CuPy: 10000x10000 1.76455 +- 0.00190
NumPy: 100x100 0.00004 +- 0.00000
NumPy: 500x500 0.00056 +- 0.00001
NumPy: 1000x1000 0.00201 +- 0.00001
NumPy: 2000x2000 0.01066 +- 0.00005
NumPy: 5000x5000 0.08828 +- 0.00007
NumPy: 10000x10000 0.35403 +- 0.00064

so about 7x speed-up over numpy, and and for float32:

CuPy (1 axis at a time): 100x100 0.00016 +- 0.00001
CuPy (1 axis at a time): 500x500 0.00018 +- 0.00001
CuPy (1 axis at a time): 1000x1000 0.00021 +- 0.00000
CuPy (1 axis at a time): 2000x2000 0.00052 +- 0.00000
CuPy (1 axis at a time): 5000x5000 0.00232 +- 0.00001
CuPy (1 axis at a time): 10000x10000 0.00837 +- 0.00002
CuPy: 100x100 0.00015 +- 0.00000
CuPy: 500x500 0.00281 +- 0.00001
CuPy: 1000x1000 0.01657 +- 0.00026
CuPy: 2000x2000 0.06557 +- 0.00003
CuPy: 5000x5000 0.37905 +- 0.00007
CuPy: 10000x10000 1.51899 +- 0.01084
NumPy: 100x100 0.00003 +- 0.00000
NumPy: 500x500 0.00032 +- 0.00000
NumPy: 1000x1000 0.00115 +- 0.00001
NumPy: 2000x2000 0.00581 +- 0.00010
NumPy: 5000x5000 0.04730 +- 0.00009
NumPy: 10000x10000 0.19188 +- 0.00024

about 40x speed-up over numpy.

Results will be platform-dependent: for comparison, my CPU is Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz, and my GPU is NVIDIA GeForce GTX 1080 Ti.