scipy.interpolate.LinearNDInterpolator hangs indefinitely on large data sets

Question

I'm interpolating some data in Python to regrid it on a regular mesh such that I can partially integrate it. The data represents a function of a high dimension parameter space (presently 3, to be extended to at least 5) and returns a multi-valued function of observables (presently 2, to be extended to 3 and then potentially dozens).

I'm performing the interpolation via scipy.interpolate.LinearNDInterpolator for lack of any other apparent options (and because I understand griddata just calls it anyway). On a smallish data set (15,000 lines of columned data) it works okay. On larger sets (60,000+), the command appears to run indefinitely. top indicates that IPython is using 100% CPU, and the terminal is completely unresponsive, including to Ctrl-C. So far I've left it running for a few hours to no avail, and ultimately I'd like to pass several million entries.

I suspect the issue is related to this ticket but that was supposedly patched in SciPy 0.10.0, to which I upgraded yesterday.

My question is basically how do I perform multi-dimensional interpolation on large data sets? Based on what I've tried, there are a few possible places a solution could come from but I haven't had any luck finding them. (My search isn't helped by the fact that several of scipy's subdomains seem to be down...)

  • What's going wrong with LinearNDInterpolator? Or, at least, how can I find out what the issue is and try to circumvent the hanging?
  • Is there a way to reformulate the interpolation so that LinearNDInterpolator will work? Perhaps by chunking up the data prudently to regrid it in parts?
  • Are there other high-dimension interpolators that are better suited to the problem? (I note that most of SciPy's alternatives are limited to <2D parameter space.)
  • Are there other ways to get multi-dimensional data onto a regular user-defined grid? That's all I'm trying to do by interpolating...

Recommended answer

The problem is most likely that your data set is simply too large, so that computing its Delaunay triangulation does not finish in a reasonable time. Check how the run time of scipy.spatial.Delaunay scales using smaller subsets picked at random from your full data set, to estimate whether the full computation would finish before the universe ends.
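The scaling check described above could look roughly like the sketch below. The random `points` array is only a stand-in for the real data, and the names `time_delaunay`, `sizes` and `n_full` are made up for illustration. The power-law fit is a crude extrapolation: degenerate or strongly clustered data can behave much worse than uniformly random points.

```python
import time
import numpy as np
from scipy.spatial import Delaunay

def time_delaunay(points, sizes, seed=0):
    """Time scipy.spatial.Delaunay on random subsets of `points`."""
    rng = np.random.default_rng(seed)
    timings = []
    for n in sizes:
        # Pick n distinct rows at random from the full data set
        subset = points[rng.choice(len(points), size=n, replace=False)]
        start = time.perf_counter()
        Delaunay(subset)
        timings.append(time.perf_counter() - start)
    return np.array(timings)

# Stand-in for the real data (assumption): 20,000 random points in 3-D
points = np.random.default_rng(1).random((20000, 3))
sizes = np.array([1000, 2000, 4000, 8000])
timings = time_delaunay(points, sizes)

# Fit t ~ c * n**p on a log-log scale, then extrapolate to the full size
p, log_c = np.polyfit(np.log(sizes), np.log(timings), 1)
n_full = 2_000_000
estimate = np.exp(log_c) * n_full**p
print(f"exponent ~ {p:.2f}, estimated time for {n_full} points ~ {estimate:.0f} s")
```

If the fitted exponent or the extrapolated time looks hopeless, that is a strong hint to drop the triangulation-based approach rather than wait for it.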

If your original data is on a rectangular grid, such as

v[i,j,k,l] = f(x[i], y[j], z[k], u[l])

then using a triangulation-based interpolation is very inefficient. It's better to use tensor-product interpolation, i.e., interpolate each dimension successively by a 1-D interpolation method:

import numpy as np
from scipy.interpolate import interp1d

def interp3(x, y, z, v, xi, yi, zi, method='cubic'):
    """Interpolation on a 3-D grid. x, y, z, xi, yi, zi should be 1-D
    and v.shape == (len(x), len(y), len(z))"""
    q = (x, y, z)
    qi = (xi, yi, zi)
    for j in range(3):
        v = interp1d(q[j], v, axis=j, kind=method)(qi[j])
    return v

def somefunc(x, y, z):
    return x**2 + y**2 - z**2 + x*y*z

# some input data
x = np.linspace(0, 1, 5)
y = np.linspace(0, 2, 6)
z = np.linspace(0, 3, 7)
v = somefunc(x[:,None,None], y[None,:,None], z[None,None,:])

# interpolate
xi = np.linspace(0, 1, 45)
yi = np.linspace(0, 2, 46)
zi = np.linspace(0, 3, 47)
vi = interp3(x, y, z, v, xi, yi, zi)

import matplotlib.pyplot as plt
# compare the slice at zi[12]; transpose because pcolor expects C[y, x]
plt.subplot(121)
plt.pcolor(xi, yi, vi[:,:,12].T)
plt.title('interpolated')
plt.subplot(122)
plt.pcolor(xi, yi, somefunc(xi[:,None], yi[None,:], zi[12]).T)
plt.title('exact')
plt.show()
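As an alternative to chaining interp1d by hand, later SciPy releases (0.14 and up) ship scipy.interpolate.RegularGridInterpolator, which does interpolation on a rectilinear grid in any number of dimensions. This is a different routine from the one used in the answer above, and the sketch below uses its default linear method rather than cubic. It reuses the same toy function and grids.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def somefunc(x, y, z):
    return x**2 + y**2 - z**2 + x*y*z

# Same input grid as in the interp3 example
x = np.linspace(0, 1, 5)
y = np.linspace(0, 2, 6)
z = np.linspace(0, 3, 7)
v = somefunc(x[:, None, None], y[None, :, None], z[None, None, :])

# Multilinear interpolation is the default method
interp = RegularGridInterpolator((x, y, z), v)

# Build the finer target grid as an array of shape (45, 46, 47, 3)
xi = np.linspace(0, 1, 45)
yi = np.linspace(0, 2, 46)
zi = np.linspace(0, 3, 47)
Xi, Yi, Zi = np.meshgrid(xi, yi, zi, indexing="ij")
query = np.stack([Xi, Yi, Zi], axis=-1)

vi = interp(query)  # result has shape (45, 46, 47)
```

Adding a fourth or fifth parameter only means adding more axes to the tuple of grid coordinates and to `v`; no triangulation is involved.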

If your data set is scattered and too large for triangulation-based methods, then you need to switch to a different method. One family of options is interpolation methods that work with a small number of nearest neighbors at a time (this information can be retrieved quickly with a k-d tree). Inverse distance weighting is one of these, but it may be one of the worse ones; there are possibly better options (which I don't know without further research).
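A minimal sketch of the nearest-neighbor idea, using scipy.spatial.cKDTree for the lookup and plain inverse distance weighting for the blend. The function name `idw_interpolate`, the choices `k=8` and `power=2.0`, and the random test data are all assumptions for illustration, not a tuned or recommended setup.

```python
import numpy as np
from scipy.spatial import cKDTree

def idw_interpolate(points, values, query, k=8, power=2.0):
    """Inverse-distance-weighted average of the k nearest data points.

    points: (n, d), values: (n,) or (n, m), query: (q, d).
    """
    tree = cKDTree(points)
    dist, idx = tree.query(query, k=k)  # both have shape (q, k)
    # Clip distances so a query that coincides with a data point gets a
    # huge but finite weight instead of a division by zero
    weights = 1.0 / np.maximum(dist, 1e-12) ** power
    weights /= weights.sum(axis=1, keepdims=True)
    vals = values[idx]  # (q, k) or (q, k, m)
    if vals.ndim == 3:
        weights = weights[:, :, None]
    return (weights * vals).sum(axis=1)

# Scattered stand-in data: 50,000 points in 3-D with two observables
rng = np.random.default_rng(0)
points = rng.random((50000, 3))
values = np.column_stack([points.sum(axis=1), np.sin(points[:, 0])])

# Regrid onto a regular 9 x 9 x 9 mesh inside the data's bounding box
g = np.linspace(0.1, 0.9, 9)
grid = np.stack(np.meshgrid(g, g, g, indexing="ij"), axis=-1).reshape(-1, 3)
vi = idw_interpolate(points, values, grid).reshape(9, 9, 9, 2)
```

Because each query touches only k neighbors, the cost grows roughly with the number of query points times the tree lookup, so millions of input rows stay tractable. The price is a rougher interpolant than a triangulation-based one.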
