Efficient way to extract data from NETCDF files

I have a number of coordinates (roughly 20000) for which I need to extract data from a number of NetCDF files each comes roughly with 30000 timesteps (future climate scenarios). Using the solution here is not efficient and the reason is the time spent at each i,j to convert "dsloc" to "dataframe" (look at the code below). ** an example NetCDF file could be download from here **

import pandas as pd
import xarray as xr
import time

#Generate some coordinates
coords_data = [{'lat': 68.04, 'lon': 15.20, 'stid':1},
    {'lat':67.96, 'lon': 14.95, 'stid': 2}]
crd= pd.DataFrame(coords_data)
lat = crd["lat"]
lon = crd["lon"]
stid=crd["stid"]

NC = xr.open_dataset(nc_file)
point_list = zip(lat,lon,stid)
start_time = time.time()
for i,j,id in point_list:
    print(i,j)
    dsloc = NC.sel(lat=i,lon=j,method='nearest')
    print("--- %s seconds ---" % (time.time() - start_time))
    DT=dsloc.to_dataframe()
    DT.insert(loc=0,column="station",value=id)
    DT.reset_index(inplace=True)
    temp=temp.append(DT,sort=True)
    print("--- %s seconds ---" % (time.time() - start_time))

which results is:

68.04 15.2
--- 0.005853414535522461 seconds ---
--- 9.02660846710205 seconds ---
67.96 14.95
--- 9.028568267822266 seconds ---
--- 16.429600715637207 seconds ---

which means each i,j takes around 9 seconds to process. Given lots of coordinates and netcdf files with large timesteps, I wonder if there a pythonic way that the code could be optimized. I could also use CDO and NCO operators but I found a similar issue using them too.


This is a perfect use case for xarray's advanced indexing using a DataArray index.

# Make the index on your coordinates DataFrame the station ID,
# then convert to a dataset.
# This results in a Dataset with two DataArrays, lat and lon, each
# of which are indexed by a single dimension, stid
crd_ix = crd.set_index('stid').to_xarray()

# now, select using the arrays, and the data will be re-oriented to have
# the data only for the desired pixels, indexed by 'stid'. The
# non-indexing coordinates lat and lon will be indexed by (stid) as well.
NC.sel(lon=crd_ix.lon, lat=crd_ix.lat, method='nearest')

Other dimensions in the data will be ignored, so if your original data has dimensions (lat, lon, z, time) your new data would have dimensions (stid, z, time).