With xarray, how to parallelize 1D operations on a multidimensional Dataset?


Problem description

I have a 4D xarray Dataset. I want to carry out a linear regression between two variables along a specific dimension (here, time) and keep the regression parameters in a 3D array (the remaining dimensions). I managed to get the results I want with the following serial code, but it is rather slow:

import numpy as np
import scipy.stats

# ds is the 4D Dataset; x and y are the names of the variables to regress,
# sl and inter are the names of the output arrays (all defined elsewhere).
# Add empty arrays to store the results of the regression.
res_shape = tuple(v for k, v in ds[x].sizes.items() if k != 'year')
res_dims = tuple(k for k, v in ds[x].sizes.items() if k != 'year')
ds[sl] = (res_dims, np.empty(res_shape, dtype='float32'))
ds[inter] = (res_dims, np.empty(res_shape, dtype='float32'))
# Iterate over the kept dimensions
for lat in ds.coords['latitude']:
    for lon in ds.coords['longitude']:
        for duration in ds.coords['duration']:
            locator = {'longitude': lon, 'latitude': lat, 'duration': duration}
            sel = ds.loc[locator]
            res = scipy.stats.linregress(sel[x], sel[y])
            ds[sl].loc[locator] = res.slope
            ds[inter].loc[locator] = res.intercept

How could I speed up and parallelize this operation?

I understand that apply_ufunc might be an option (and could be parallelized with dask), but I did not manage to get the parameters right.

The following questions are related but without an answer:

(Previous edits to this question have been moved to an answer.)

Recommended answer

It is possible to apply scipy.stats.linregress (and other non-ufuncs) to an xarray Dataset using apply_ufunc() by passing vectorize=True, like so:

import scipy.stats
import xarray as xr

# apply_ufunc returns a tuple of DataArrays, one per linregress output
res = xr.apply_ufunc(scipy.stats.linregress, ds[x], ds[y],
        input_core_dims=[['year'], ['year']],   # the regression is done along 'year'
        output_core_dims=[[], [], [], [], []],  # five scalar outputs per 1D slice
        vectorize=True)
# names for the five linregress outputs, in order
array_names = ['slope', 'intercept', 'rvalue', 'pvalue', 'stderr']
# add the data to the existing dataset
for arr_name, arr in zip(array_names, res):
    ds[arr_name] = arr

Although still serial, the apply_ufunc version is around 36x faster than the loop implementation in this specific case.

However, parallelization with dask is not yet implemented when the applied function has multiple outputs, as scipy.stats.linregress does:

NotImplementedError: multiple outputs from apply_ufunc not yet supported with dask='parallelized'
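
A possible workaround (not part of the original answer) is to wrap linregress in a small helper that stacks the statistics of interest into a single array, so that apply_ufunc has only one output and dask='parallelized' can be used. Below is a minimal sketch under these assumptions: the helper name _linregress_params and the new 'parameter' dimension are made up for illustration, the dataset is chunked along a non-core dimension, and where the output size is passed (dask_gufunc_kwargs vs. a direct output_sizes keyword) depends on the xarray version:

import numpy as np
import scipy.stats
import xarray as xr

def _linregress_params(x1d, y1d):
    # hypothetical helper: stack slope and intercept into one array
    # so that apply_ufunc sees a single output
    res = scipy.stats.linregress(x1d, y1d)
    return np.array([res.slope, res.intercept], dtype='float32')

# chunk along a dimension that is not a core dimension ('year' must stay whole)
ds = ds.chunk({'latitude': 10})

res = xr.apply_ufunc(
    _linregress_params, ds[x], ds[y],
    input_core_dims=[['year'], ['year']],
    output_core_dims=[['parameter']],   # a single output with a new core dimension
    vectorize=True,
    dask='parallelized',
    output_dtypes=['float32'],
    # recent xarray versions take the size of the new dimension here;
    # older versions expect output_sizes={'parameter': 2} directly
    dask_gufunc_kwargs={'output_sizes': {'parameter': 2}},
)
# the results stay lazy until .compute()/.load() is called or the dataset is written
ds[sl] = res.isel(parameter=0, drop=True)
ds[inter] = res.isel(parameter=1, drop=True)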

