How to efficiently find k-nearest neighbours in high-dimensional data?


Question

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k = 2, if that makes it easier).

My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it's only slightly faster than exhaustive search.

My next idea would be to use PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: is there some clever algorithm or data structure that solves this exactly in reasonable time?

Answer

The Wikipedia article on kd-trees has a link to the ANN library:

ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.

Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)

As far as algorithms/data structures are concerned:

The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
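ANN's approximate mode bounds the error: a returned neighbour is at most (1 + eps) times farther away than the true one. As a hedged illustration of the same idea, SciPy's cKDTree (my choice here, not something the answer names) exposes an analogous eps parameter:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 20))  # ~20 dims, where ANN reportedly works well

tree = cKDTree(X)
d_exact, i_exact = tree.query(X[:10], k=2, eps=0.0)    # exact search
d_approx, i_approx = tree.query(X[:10], k=2, eps=0.5)  # (1+eps)-approximate
```

Setting eps > 0 lets the search prune branches more aggressively, trading a bounded loss of accuracy for speed.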

I'd try it directly first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since you're quite unlikely to end up with few enough dimensions for a kd-tree to handle).
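The reduce-then-search pipeline suggested above can be sketched as follows (again assuming scikit-learn; the component count of 15 is an illustrative choice, not something the answer specifies):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 75))  # stand-in for the 75-dimensional data

# Project down to a dimensionality where kd-trees are still effective.
X_red = PCA(n_components=15).fit_transform(X)

# Exact kd-tree search in the reduced space.
nn = NearestNeighbors(n_neighbors=2, algorithm="kd_tree").fit(X_red)
dist, idx = nn.kneighbors(X_red)
```

Note the trade-off: neighbours found in the reduced space are only approximations of the true neighbours in the original 75-dimensional space, since PCA discards some variance.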
