Comparison of the runtime of Nearest Neighbor queries on different data structures


Problem Description


Given n points in d-dimensional space, there are several data structures, such as Kd-Trees, Quadtrees, etc., to index the points. On these data structures it is possible to implement straightforward algorithms for nearest neighbor queries around a given input point. Is there a book, paper, survey, ... that compares the theoretical (mostly expected) runtime of nearest neighbor queries on different data structures? The data I am looking at consists of fairly small point clouds, so it can all be processed in main memory. For the sake of simplicity, I assume the data to be uniformly distributed. That is, I am not interested in real-world performance, but rather in theoretical results.
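As a point of reference, the straightforward baseline that every such index structure is compared against is a brute-force linear scan, O(n·d) per query. A minimal sketch in plain Python (the point cloud below is made up purely for illustration):

```python
import math
import random

def nearest_neighbor(points, query):
    """Brute-force linear scan: examine every point, O(n * d) per query."""
    return min(points, key=lambda p: math.dist(p, query))

# Small 2-D point cloud, uniformly distributed as the question assumes.
random.seed(0)
cloud = [(random.random(), random.random()) for _ in range(1000)]
print(nearest_neighbor(cloud, (0.5, 0.5)))
```

Every index structure discussed below tries to beat this scan by pruning points it can prove are too far away.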

Solution

You leave the dimension of the points undefined, and you give only an approximation for the number of points. What does "small" mean? It is relative: what counts as small differs from person to person.

What you are searching for, of course, does not exist. Your question is pretty much this:


Question:

For a small (whatever "small" means to you) dataset, in any dimension, with data that follows a uniform distribution, what is the optimal data structure to use?

Answer:

There is no such data structure.


Wouldn't it be rather strange for that question to have an answer? An analogy, though not a perfect one, is the "Which is the optimal programming language?" question that most first-year undergrads ask. Your question is not that naive, but it is walking on the same track.


Why is there no such data structure?

Because the dimension of the dataset is variable. You might have a dataset in 2 dimensions, but you might also have one in 1000 dimensions, or, even better, a dataset in 1000 dimensions whose intrinsic dimension is much less than 1000. Think about it: could one propose a data structure that would behave equally well for the three datasets I mentioned? I doubt it.

In fact, there are data structures that behave really nicely in low dimensions (quadtrees and KD-trees, for example), while others do much better in higher dimensions (an RKD-tree forest, for instance).
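To make the low-dimensional case concrete, here is a minimal, illustrative KD-tree in plain Python (a sketch, not a production implementation): it splits on a cycling axis at build time, and at query time it descends toward the query first, then prunes the far subtree whenever the splitting plane is farther away than the best candidate found so far.

```python
import math
import random

def build_kdtree(points, depth=0):
    """Recursively split the points on axis = depth mod d at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def kdtree_nn(node, query, depth=0, best=None):
    """Exact NN search: visit the near subtree first, then the far one
    only if the splitting plane could hide a closer point."""
    if node is None:
        return best
    point = node["point"]
    if best is None or math.dist(query, point) < math.dist(query, best):
        best = point
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = kdtree_nn(near, query, depth + 1, best)
    if abs(diff) < math.dist(query, best):  # far side may hold a closer point
        best = kdtree_nn(far, query, depth + 1, best)
    return best

random.seed(1)
cloud = [(random.random(), random.random()) for _ in range(500)]
tree = build_kdtree(cloud)
print(kdtree_nn(tree, (0.5, 0.5)))
```

In 2 dimensions the pruning test cuts off most of the tree; as the dimension grows, the test almost never fires, which is exactly why these structures degrade.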

Moreover, the algorithms used for Nearest Neighbour search, and the performance one can expect from them, depend heavily on the dimension of the dataset, as well as on its size and on the nature of the queries. For example, a query point that is far from the dataset, or equidistant from many of its points, will probably result in slow search performance.

In lower dimensions, one would perform a k-Nearest Neighbour (k-NN) search. In higher dimensions, it would be wiser to perform an approximate k-NN search. In that case, we accept the following trade-off:

Speed vs. accuracy

What happens is that we achieve faster execution of the program by sacrificing the correctness of our result. In other words, our search routine will be relatively fast, but it may not return the true NN, only an approximation of the exact NN (how likely that is depends on many parameters, such as the nature of your problem and the library you are using). For example, it might not find the exact NN, but the third NN to the query point. You could also check the approximate-nn-searching wiki tag.
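As one illustration of that trade-off, assuming SciPy is available: `cKDTree.query` accepts an `eps` parameter, and with `eps > 0` the returned neighbour is only guaranteed to lie within a factor (1 + eps) of the true nearest distance, in exchange for more aggressive pruning during the search.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
cloud = rng.random((10_000, 3))   # illustrative uniform point cloud
tree = cKDTree(cloud)
query = np.array([0.5, 0.5, 0.5])

exact_dist, _ = tree.query(query)             # exact NN search
approx_dist, _ = tree.query(query, eps=0.5)   # approximate: within (1 + eps)

# The approximate answer can be no worse than (1 + eps) times the true
# NN distance -- in practice it is often the true NN anyway.
print(exact_dist, approx_dist)
```

The looser the guarantee (larger `eps`), the more subtrees the search is allowed to skip.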

Why not always search for the exact NN? Because of the curse of dimensionality, which causes the solutions that work well in lower dimensions to behave no better in high dimensions than brute force would (searching all the points in the dataset for every query).
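The effect behind this is easy to demonstrate: as the dimension grows, the distances from a query to uniformly distributed points concentrate, so the contrast between the nearest and the farthest point shrinks and distance-based pruning buys almost nothing. A small NumPy sketch (point counts and dimensions here are arbitrary):

```python
import numpy as np

def contrast(d, n=2000, seed=0):
    """Relative spread (farthest - nearest) / nearest of the distances
    from the centre of the unit cube to n uniform random points in R^d."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n, d))
    dists = np.linalg.norm(pts - 0.5, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(d, contrast(d))  # the contrast shrinks as d grows
```

Once all points look almost equally far away, an exact index has no way to rule anything out early, which is why approximate search becomes the sensible option.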


You see, my answer has already grown big, so I should stop here. Your question is too broad, but interesting, I must admit. :)


In conclusion, the optimal data structure (and algorithm) to use depends on your problem. The size of the dataset you are handling, the dimension, and the intrinsic dimension of the points play a key role. The number and the nature of the queries also play an important role.
