scipy.spatial.ckdtree running slowly
Problem description
I've been using spatial.cKDTree in scipy to calculate distances between points. It has always run very quickly (~1 s) for my typical data sets (finding distances from ~1000 points to an array of ~1e6 points).
I'm running this code in Python 2.7.6 on a computer with Ubuntu 14.10. Up until this morning, I had managed most Python packages with apt-get, including scipy and numpy. I wanted up-to-date versions of a few packages though, so I decided to remove the packages installed in /usr/lib/python2.7/ by apt-get, and re-installed all packages with pip install (taking care of scipy dependencies like liblapack-dev with apt-get, as necessary). Everything installed and is importable without a problem.
>>> import scipy
>>> import cython
>>> scipy.__version__
'0.16.0'
>>> cython.__version__
'0.22.1'
Now, running spatial.cKDTree on data sets of the same size is really slow. I'm seeing run times of ~500 s rather than ~1 s. I'm having trouble figuring out what is going on.
Any suggestions as to what I might have done by installing with pip rather than apt-get that would have caused scipy.spatial.cKDTree to run so slowly?
Recommended answer
In 0.16.x I added options to build the cKDTree with median or sliding midpoint rules, as well as choosing whether to recompute the bounding hyperrectangle at each node in the kd-tree. The defaults are based on experience with the performance of scipy.spatial.cKDTree and sklearn.neighbors.KDTree. In some contrived cases (data that are highly stretched along one dimension) the defaults can have a negative impact, but usually they should be faster. Experiment with building the cKDTree with balanced_tree=False and/or compact_nodes=False. Setting both to False gives you the same behavior as 0.15.x. Unfortunately it is difficult to set defaults that make everyone happy, because the performance depends on the data.
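One way to run that experiment is to time all four combinations of the two build options on data shaped like your own. The following is a minimal sketch, not a benchmark of your case: the random arrays and the helper time_tree are hypothetical stand-ins, the array is smaller than the ~1e6 points in the question, and it assumes Python 3 with a recent NumPy.

```python
import time

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.random((100_000, 3))    # hypothetical stand-in for the large point array
queries = rng.random((1_000, 3))   # hypothetical stand-in for the ~1000 query points


def time_tree(**build_options):
    """Build a tree with the given options, run one query, return timings and distances."""
    t0 = time.perf_counter()
    tree = cKDTree(data, **build_options)
    t1 = time.perf_counter()
    dist, _ = tree.query(queries, k=1)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, dist


results = {}
for balanced in (True, False):
    for compact in (True, False):
        build_s, query_s, dist = time_tree(balanced_tree=balanced,
                                           compact_nodes=compact)
        results[(balanced, compact)] = dist
        print(f"balanced_tree={balanced!s:5} compact_nodes={compact!s:5} "
              f"build={build_s:.3f}s query={query_s:.3f}s")
```

The options only change how the tree is laid out, so the nearest-neighbor distances should agree across all four runs; only the timings differ.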
Also note that with balanced_tree=True we compute medians by quickselect when the kd-tree is constructed. If the data are pre-sorted for some reason, construction will be very slow. In that case it helps to shuffle the rows of the input data. Alternatively, you can set balanced_tree=False to avoid the partial quicksorts.
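Shuffling the rows changes the indices that query returns, so they have to be mapped back to the original row order. Here is a minimal sketch of both workarounds, assuming Python 3 and a recent NumPy; the sorted random array is a hypothetical stand-in for pre-sorted input.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((50_000, 3))
points = points[np.argsort(points[:, 0])]   # hypothetical input, pre-sorted on x
queries = rng.random((100, 3))

# Workaround 1: shuffle the rows before building the tree.
perm = rng.permutation(len(points))
shuffled = points[perm]                     # shuffled[i] is points[perm[i]]
tree = cKDTree(shuffled)
dist, idx = tree.query(queries)
orig_idx = perm[idx]                        # indices into the original array

# Workaround 2: keep the row order and use the sliding midpoint rule.
tree_midpoint = cKDTree(points, balanced_tree=False)
dist_mid, idx_mid = tree_midpoint.query(queries)
```

Either way the neighbors found are the same points; orig_idx and idx_mid both refer to rows of the original array.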
There is also a new option to multithread the nearest-neighbor query. Try calling query with n_jobs=-1 and see if it helps.
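A minimal sketch of the parallel query follows. One assumption to flag: as far as I know, later SciPy releases renamed this keyword from n_jobs to workers, so the sketch uses workers=-1 and assumes a recent SciPy; on a 0.16-era install, pass n_jobs=-1 instead. The random arrays are hypothetical stand-ins.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.random((100_000, 3))    # hypothetical stand-in for the large point array
queries = rng.random((1_000, 3))   # hypothetical stand-in for the query points

tree = cKDTree(data)

# Single-threaded query (the default).
dist_serial, idx_serial = tree.query(queries, k=1)

# Use all available CPU threads. Assumes a recent SciPy where the keyword
# is `workers`; the answer's original spelling was n_jobs=-1.
dist_parallel, idx_parallel = tree.query(queries, k=1, workers=-1)
```

Threading only affects how the query points are distributed over cores, so the results should match the single-threaded call.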
Update June 2020: SciPy 1.5.0 will use a new algorithm (introselect based partial sort, from C++ STL) which solves the problems reported here.
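To illustrate what a partial sort does at a single split of the tree: finding the median along an axis only requires partitioning the points around it, not fully sorting them. The sketch below uses NumPy's argpartition (whose documented selection algorithm is introselect) purely as a stand-in; it is not SciPy's C++ code, and the data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((1001, 3))   # hypothetical data; odd count gives a unique median
axis = 0                         # split dimension chosen for this illustration
k = len(points) // 2             # position of the median in sorted order

# Partial sort: after argpartition, the index at position k points to the
# median, smaller-or-equal values come before it and larger-or-equal after.
order = np.argpartition(points[:, axis], k)
left = points[order[:k]]
median_point = points[order[k]]
right = points[order[k + 1:]]
```

The two halves are left unsorted internally, which is all a kd-tree split needs before recursing into each side.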