scipy.spatial.ckdtree running slowly

Problem description

I've been using spatial.cKDTree in scipy to calculate distances between points. It has always run very quickly (~1 s) on my typical data sets (finding distances from ~1000 points to an array of ~1e6 points).
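For reference, the kind of workload described above might look like the sketch below. The array sizes, the three dimensions and the random data are assumptions made for illustration; the question does not show the actual code.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real data: a large reference set and a
# small set of query points (sizes reduced so the sketch runs quickly).
data = rng.random((100_000, 3))
queries = rng.random((1_000, 3))

tree = cKDTree(data)                      # build the tree once
distances, indices = tree.query(queries)  # nearest neighbor (k=1) per query point
```

Here distances[i] is the distance from queries[i] to its nearest row in data, and indices[i] is that row's index.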

I'm running this code in Python 2.7.6 on a computer with Ubuntu 14.10. Up until this morning, I had managed most Python packages with apt-get, including scipy and numpy. I wanted up-to-date versions of a few packages, though, so I decided to remove the packages installed in /usr/lib/python2.7/ by apt-get and re-installed all of them with pip install (taking care of scipy dependencies like liblapack-dev with apt-get, as necessary). Everything installed and is importable without a problem.

>>> import scipy
>>> import cython
>>> scipy.__version__
'0.16.0'
>>> cython.__version__
'0.22.1'

Now, running spatial.cKDTree on data sets of the same size is really slow. I'm seeing run times of ~500 s rather than ~1 s, and I'm having trouble figuring out what is going on.

Any suggestions as to what I might have done, by installing with pip rather than apt-get, that would cause scipy.spatial.cKDTree to run so slowly?

Recommended answer

In 0.16.x I added options to build the cKDTree with either the median or the sliding midpoint rule, as well as an option to choose whether to recompute the bounding hyperrectangle at each node in the kd-tree. The defaults are based on experience with the performance of scipy.spatial.cKDTree and sklearn.neighbors.KDTree. In some contrived cases (data that are highly stretched along one dimension) the defaults can have a negative impact, but usually they should be faster. Experiment with building the cKDTree with balanced_tree=False and/or compact_nodes=False. Setting both to False gives you the same behavior as 0.15.x. Unfortunately, it is difficult to set defaults that make everyone happy, because the performance depends on the data.
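One way to run that experiment is sketched below. The data here is random and purely hypothetical, so the timings it prints say nothing about your data set; build_and_query is a helper name made up for this example. Substitute your own arrays to see which settings matter.

```python
import time

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.random((200_000, 3))   # hypothetical stand-in for the real points
queries = rng.random((1_000, 3))


def build_and_query(**tree_options):
    """Build a cKDTree with the given options, query it, and time both steps."""
    start = time.perf_counter()
    tree = cKDTree(data, **tree_options)
    dist, idx = tree.query(queries)
    return time.perf_counter() - start, dist, idx


# 0.16.x defaults: balanced_tree=True, compact_nodes=True
t_default, d_default, _ = build_and_query()
# Both options off, which the answer says matches the 0.15.x behavior
t_legacy, d_legacy, _ = build_and_query(balanced_tree=False, compact_nodes=False)

print("defaults:       %.3f s" % t_default)
print("0.15.x-style:   %.3f s" % t_legacy)
```

Both configurations return the same nearest-neighbor distances; only the build and query times differ.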

Also note that with balanced_tree=True we compute medians by quickselect when the kd-tree is constructed. If the data are for some reason pre-sorted, this will be very slow. In that case it will help to shuffle the rows of the input data. Alternatively, you can set balanced_tree=False to avoid the partial quicksorts.
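A minimal sketch of the shuffling workaround follows, assuming the input is a 2-D NumPy array. The pre-sorted data is fabricated for the example, and perm and orig_idx are illustrative names. Because the tree is built on a shuffled copy, the indices returned by query refer to the shuffled rows and have to be mapped back.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Hypothetical pre-sorted input: rows ordered by their first coordinate.
data = rng.random((50_000, 3))
data = data[np.argsort(data[:, 0])]
queries = rng.random((500, 3))

perm = rng.permutation(len(data))  # random row order
tree = cKDTree(data[perm])         # build the tree on a shuffled copy

dist, shuffled_idx = tree.query(queries)
orig_idx = perm[shuffled_idx]      # map back to row numbers in `data`
```

After this, data[orig_idx[i]] is the nearest neighbor of queries[i] in the original, unshuffled array.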

There is also a new option to multithread the nearest-neighbor query. Try calling query with n_jobs=-1 and see whether it helps in your case.
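A hedged sketch of the parallel query is below. As far as I know, later SciPy releases renamed the n_jobs keyword to workers (I believe from 1.6 on), so the hypothetical helper parallel_query tries the newer name first and falls back to the one mentioned above. Check the documentation for your installed version.

```python
import numpy as np
from scipy.spatial import cKDTree


def parallel_query(tree, points, k=1):
    """Nearest-neighbor query using all available cores."""
    try:
        # Keyword used by newer SciPy releases (assumption: 1.6 or later).
        return tree.query(points, k=k, workers=-1)
    except TypeError:
        # Keyword from the 0.16.x era, as described in the answer.
        return tree.query(points, k=k, n_jobs=-1)


rng = np.random.default_rng(0)
tree = cKDTree(rng.random((100_000, 3)))  # hypothetical data
queries = rng.random((1_000, 3))

dist_parallel, idx_parallel = parallel_query(tree, queries)
dist_serial, idx_serial = tree.query(queries)  # single-threaded, for comparison
```

The parallel and serial calls return the same results; threading only pays off when there are enough query points to split across cores.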

Update June 2020: SciPy 1.5.0 will use a new algorithm (introselect-based partial sort, from the C++ STL) which solves the problems reported here.
