Extremely large number of dask tasks for simple computation

import numpy as np
import dask.array as dda

def index_nearest(to_find, coord_grid):
    """Return indexes of elements in coord_grid that most 
    closely match the ones in to_find.
    """
    to_find = to_find.reshape(-1,1)
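    # broadcasting against coord_grid builds an (n_points, n_grid) array of distances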
    return np.argmin(np.abs(to_find - coord_grid), axis=1)

n_points = 10000
n_grid = 800

x = np.random.uniform(0, n_grid, [n_points,])
x_grid = np.arange(n_grid)

x_da = dda.from_array(x, chunks=10)
x_grid_da = dda.from_array(x_grid, chunks=10)

index_nearest(x_da, x_grid_da)

When running the last line I get a warning that says the number of chunks has been increased by a factor of 40, and the repr looks like this:

[image: repr of the resulting dask array]

Isn't 270,000 tasks a bit too much for such a simple computation?


Solution 1:

The number of tasks depends on the number of chunks, especially for operations that require pairwise combinations of chunks from different arrays. Here the broadcasted subtraction inside index_nearest pairs every chunk of x_da (1,000 chunks of 10 elements) with every chunk of x_grid_da (80 chunks), so the chunk count of the intermediate array explodes before argmin reduces it again.
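As a rough way to see where the tasks come from, you can inspect the chunk structure and the task graph directly. The snippet below is only a sketch: it assumes the graph returned by __dask_graph__() supports len(), and exact task totals vary between dask versions.

import numpy as np
import dask.array as dda

x_da = dda.from_array(np.random.uniform(0, 800, 10_000), chunks=10)
x_grid_da = dda.from_array(np.arange(800), chunks=10)
print(x_da.numblocks, x_grid_da.numblocks)   # (1000,) (80,)

# The broadcasted subtraction creates one task per pair of chunks:
diff = x_da.reshape(-1, 1) - x_grid_da
print(diff.numblocks)                        # (1000, 80) -> 80,000 tasks for "-" alone

result = dda.argmin(abs(diff), axis=1)
print(len(result.__dask_graph__()))          # total task count, roughly the ~270,000 from the question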

Since a chunk of 10 float64 elements takes up only 80 bytes, there is usually plenty of room for bigger chunks. Increasing the chunk size to 1_000 elements raises the memory load of a single chunk to about 8 kilobytes, while the number of tasks needed just to create x_da drops from 1,000 to 10. With a chunk size of 1_000, the task count for the complete operation drops to 61.
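For example, the same computation with chunks of 1_000 elements (a sketch; the exact figure of 61 may vary slightly between dask versions):

import numpy as np
import dask.array as dda

x_da = dda.from_array(np.random.uniform(0, 800, 10_000), chunks=1_000)   # 10 chunks instead of 1,000
x_grid_da = dda.from_array(np.arange(800), chunks=1_000)                 # a single chunk

# Same computation as index_nearest in the question.
result = dda.argmin(abs(x_da.reshape(-1, 1) - x_grid_da), axis=1)
print(len(result.__dask_graph__()))   # a few dozen tasks instead of ~270,000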

A chunk size of 1_000 is still rather small; in many cases one can probably get away with chunks on the scale of 100 MB or even more, depending on the hardware and the types of computations performed.
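If you would rather not pick a number by hand, dask can also choose chunk sizes for you. The sketch below (not part of the original answer) uses chunks="auto", which targets the array.chunk-size configuration value (128 MiB by default):

import dask
import numpy as np
import dask.array as dda

# Let dask pick chunk sizes aimed at the configured target.
with dask.config.set({"array.chunk-size": "64MiB"}):
    x_da = dda.from_array(np.random.uniform(0, 800, 10_000), chunks="auto")

print(x_da.chunks)   # ~80 kB of float64 data fits comfortably in a single chunk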