数据结构与算法的二维快速k近邻搜索合适的选择 [英] Suitable choice of data structure and algorithm for fast k-Nearest Neighbor search in 2D

查看:166
本文介绍了数据结构与算法的二维快速k近邻搜索合适的选择的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据集约100000(X,Y)对重新在二维空间presenting分。对于每个的一点,我想找到它的k最近邻居。

I have a dataset of approximately 100,000 (X, Y) pairs representing points in 2D space. For each point, I want to find its k-nearest neighbors.

所以,我的问题是 - 什么样的数据结构/算法将是一个合适的选择,假设我想绝对减少整体运行时间

So, my question is - what data-structure / algorithm would be a suitable choice, assuming I want to absolutely minimise the overall running time?

我不想找code - 只是对一个合适的方法指针。我有点受的选择,似乎相应和的范围吓倒 - 四树,R树,KD树等

I'm not looking for code - just a pointer towards a suitable approach. I'm a bit daunted by the range of choices that seem relevent - quad-trees, R-trees, kd-trees, etc.

我想最好的办法是建立一个数据结构,然后运行一些K近邻寻找每一点的。然而,因为(一)我提前知道点,和(b)我知道我必须赶快寻找每一点恰好一次,也许有更好的方法吗?

I'm thinking the best approach is to build a data structure, then run some kind of k-Nearest Neighbor search for each point. However, since (a) I know the points in advance, and (b) I know I must run the search for every point exactly once, perhaps there is a better approach?

一些额外的细节:

  • 因为我希望尽量减少全程运行时间,我不在乎,如果大部分的时间都花在结构VS搜索。
  • 的(X,Y)对是相当不错US $ p $垫出来,所以我们可以假设几乎均匀分布。

推荐答案

如果k是比较小(小于20左右),你有一个大致均匀分布,创建覆盖,其中点落在范围内的网格,选择使得每个网格点的平均数是舒服于k更高(以使得位于中心的点通常会得到及其k近邻在于,一个网格点)。然后创建一个集的其他网格从第一(重叠),沿每个轴设置半断。现在,每一个点,计算该网格元素它落入(因为网格是正规,无觅处是必需的),并挑选四个一(或howevermany重叠网格,你有),有那点最接近它的中心。

If k is relatively small (<20 or so) and you have an approximately uniform distribution, create a grid that overlays the range where the points fall, chosen so that the average number of points per grid is comfortably higher than k (so that a centrally-located point will usually get its k neighbors in that one grid point). Then create a set of other grids set half-off from the first (overlapping) along each axis. Now for each point, compute which grid element it falls into (since the grids are regular, no searching is required) and pick the one of four (or howevermany overlapping grids you have) that has that point closest to its center.

在每个网格单元,该点应在一个排序的坐标(假设X)。在您选择的元素开始(发现使用二分),步行向外沿排序列表,直到找到k个项目(同样,如果k小,以最快的方式,以保持k的列表最好的命中与折半插入排序,让最糟糕的比赛脱落到底什么时候插入,插入排序通常击败一切达在现代硬件上约30个项目)。一直走,直到你最遥远的近邻是x中客场更接近你比下点从您(即不包括它们的Y偏移,所以不可能有新的点,可能会比接近第k-最接近迄今为止发现)

Within each grid element, the points should be sorted in one coordinate (let's say x). Starting at the element you chose (find it using bisection), walk outwards along the sorted list until you have found k items (again, if k is small, the fastest way to maintain a list of the k best hits is with binary insertion sort, letting the worst match fall off the end when you insert; insertion sort generally beats everything else up to about 30 items on modern hardware). Keep going until your most distant nearest neighbor is closer to you than the next points away from you in x (i.e. not counting their y-offset, so there could be no new point that could be closer than the kth-closest found so far).

如果您还没有K个点呢,还是你有k个点,但网格元素的一个或多个壁更接近你的兴趣点比最远的k个点,加上毗邻的各有关网格元素融入搜索

If you do not have k points yet, or you have k points but one or more walls of the grid element are closer to your point of interest than the farthest of the k points, add the relevant adjacent grid elements into the search.

这应该给你的东西像性能O(N * K ^ 2),以相对低的常数因子。如果k很大,那么这种策略是过于简单,你应该选择一种算法,在k个线性或对数线性,像KD树就可以了。

This should give you performance of something like O(N*k^2), with a relatively low constant factor. If k is large, then this strategy is too simplistic and you should choose an algorithm that is linear or log-linear in k, like kd-trees can be.

这篇关于数据结构与算法的二维快速k近邻搜索合适的选择的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆