高维最近邻居搜索的最佳数据结构 [英] Best data structure for high dimensional nearest neighbor search

查看:133
本文介绍了高维最近邻居搜索的最佳数据结构的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我实际上是在处理高维数据(约50.000-100.000个特征),因此必须对其进行最近邻居搜索。我知道随着尺寸的增长,KD-Trees的性能会很差,而且我已经读到,一般来说,所有空间分区数据结构都倾向于对高维数据执行穷举搜索。

I'm actually working on high dimensional data (~50.000-100.000 features) and nearest neighbors search must be performed on it. I know that KD-Trees has poor performance as dimensions grows, and also I've read that in general, all space-partitioning data structures tends to perform exhaustive search with high dimensional data.

另外,还需要考虑两个重要事实(按相关性排序):

Additionally, there are two important facts to be considered (ordered by relevance):


  • 精度:必须找到最近的邻居(而不是近似值)。

  • 速度:搜索速度必须与可能。 (创建数据结构的时间并不重要)。

  • Precision: The nearest neighbors must be found (not approximations).
  • Speed: The search must be as fast as possible. (The time to create the data structure isn't really important).

因此,我需要一些有关以下方面的建议:

So, I need some advice about:


  1. 执行k-NN的数据结构。

  2. 如果使用aNN(最好是最近的)会更好


推荐答案


我可以在高维空间中执行NN搜索吗?

Can I perform NN search in high dimensional space?

否。 由于维数的诅咒,执行近邻搜索的数据结构在较低维数方面表现良好,而在高维数据中表现不佳。事实上,查询时间几乎等于蛮力查询时间,因此毫无价值。

No. Because of the curse of dimensionality, data structures that perform Nearest Neighbour search good in lower dimensions, fail to perform well in a high dimensional place. As a matter of fact, the query times becomes almost equally to the brute force one, thus it is worthless.

因此,在高维空间中,进行 最接近的邻居 (ANN)搜索。老实说,这是必须

As a result, in a high dimensional space, one should go for Approximate Nearest Neighbour (ANN) search. To be honest, it's a must.


哪种数据结构可以执行ANN?

Which data structure to perform ANN?

我建议使用LSH或许多RKD树。在我的 answer 在这里,我提到了一些不错的库,它们可以在C ++中执行ANN。但是,请注意,LSH解决了R最近邻问题,因此您指定了参数R,它实际上是半径。然后,LSH将在查询点中查找R内的NN,因此您不能真正请求 k NN。

I would suggest LSH, or a number of RKD trees. In my answer here, I mention some good libraries that perform ANN in C++. However, notice that LSH solved the R-nearest neighbour problem, thus you specify the parameter R, which is actually the radius. Then, LSH will look for NN's inside that R from the query point, thus you can't really request k NN's.

RKD树可以做到这一点,并返回您 k NN。我有一个项目,该项目构建RKD树的林并在C ++中执行ANN搜索,但它仅针对高维度。它可以处理960个维度中<^;的10 ^ 6图像的GIST数据集。 1秒,大约90%的输出是最近的邻居。名称为 kd-GeRaF 。它将在下个月使用分布式版本进行更新,但已经过测试并可以使用。它还有一个可爱的徽标。 :)

On the other hand, RKD trees can do that and return you k NN's. I have a project, that builds a forest of RKD trees and performs ANN search in C++, but it targets only high dimensions. It can handle GIST datasets of 10^6 images in 960 dimensions in < 1 sec with about 90% outputs being true nearest neighbours. The name is kd-GeRaF. It will be updated in the next month with a distributed version, but it's already tested and ready to use. It also has a cute logo. :)

我还认为您应该阅读我的答案,它表示最佳数据结构取决于数据。

I also feel that you should read my answer, which says that the optimal data structure depends on the data.

这篇关于高维最近邻居搜索的最佳数据结构的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆