Nearest neighbors in high-dimensional data?

Problem description

I asked a question a few days back on how to find the nearest neighbors for a given vector. My vector is now 21-dimensional and, before I proceed further, because I come from neither the machine-learning domain nor the math domain, I am beginning to ask myself some fundamental questions:

  • Is Euclidean distance a good metric for finding the nearest neighbors? If not, what are my options?
  • In addition, how does one decide the right threshold for determining the k-neighbors? Is there some analysis that can be done to figure out this value?
  • Previously, I was suggested to use kd-Trees, but the Wikipedia page clearly says that for high dimensions, the kd-Tree is almost equivalent to brute-force search. In that case, what is the best way to find nearest neighbors efficiently in a million-point dataset?

Can someone please clarify some (or all) of the above questions?

Recommended answer

I currently study such problems -- classification, nearest neighbor searching -- for music information retrieval.

You may be interested in Approximate Nearest Neighbor (ANN) algorithms. The idea is that you allow the algorithm to return sufficiently near neighbors (perhaps not the exact nearest neighbor); in doing so, you reduce complexity. You mentioned the kd-tree; that is one example. But as you said, the kd-tree works poorly in high dimensions. In fact, all current indexing techniques (based on space partitioning) degrade to linear search in sufficiently high dimensions [1][2][3].
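
To make the comparison concrete, here is the linear-search baseline that those indexes degrade to, as a minimal Python sketch; the dataset size and k are illustrative, with 21 dimensions taken from the question:

```python
# Exact nearest-neighbor search by brute force (linear scan): the baseline
# that space-partitioning indexes degrade to in high dimensions.
import numpy as np

rng = np.random.default_rng(0)
database = rng.standard_normal((100_000, 21))  # illustrative size; 21-dim as in the question
query = rng.standard_normal(21)

# Euclidean (L2) distance from the query to every point: O(n * d) per query.
dists = np.linalg.norm(database - query, axis=1)
k = 5
nearest = np.argpartition(dists, k)[:k]  # indices of the k nearest points (unordered)
print(nearest, dists[nearest])
```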

Among ANN algorithms proposed recently, perhaps the most popular is Locality-Sensitive Hashing (LSH), which maps a set of points in a high-dimensional space into a set of bins, i.e., a hash table [1][3]. But unlike traditional hashes, a locality-sensitive hash places nearby points into the same bin.

LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table.
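
As a sketch of that build/query workflow, assuming a single hash table and a random-hyperplane sign hash (the cosine-similarity variant mentioned below; the number of hyperplanes is an arbitrary choice here):

```python
# Build a hash table from LSH keys, then answer a query from one bin.
# Single-table, random-hyperplane (sign) LSH; all sizes are illustrative.
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 21, 12
planes = rng.standard_normal((n_planes, dim))  # random hyperplanes

def lsh_key(v):
    # Each hyperplane contributes one sign bit; points that are nearby
    # (by angle) tend to fall on the same side of most hyperplanes.
    return tuple((planes @ v > 0).astype(int))

# Build: hash every database point into a bin.
database = rng.standard_normal((100_000, dim))
table = defaultdict(list)
for i, v in enumerate(database):
    table[lsh_key(v)].append(i)

# Query: hash the query point, retrieve its bin, then rank the
# (small) candidate set exactly.
query = rng.standard_normal(dim)
candidates = table[lsh_key(query)]
best = min(candidates, key=lambda i: np.linalg.norm(database[i] - query), default=None)
print(len(candidates), best)
```

In practice one typically builds several such tables with independent hash functions, so that a true near neighbor collides with the query in at least one of them.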

Second, there is a rigorous theory that supports its performance. It can be shown that the query time is sublinear in the size of the database, i.e., faster than linear search. How much faster depends upon how much approximation we can tolerate.

Finally, LSH is compatible with any Lp norm for 0 < p <= 2. Therefore, to answer your first question, you can use LSH with the Euclidean distance metric, or you can use it with the Manhattan (L1) distance metric. There are also variants for Hamming distance and cosine similarity.
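
For the Euclidean case, a minimal sketch of one such p-stable hash function from [1] is h(v) = floor((a . v + b) / w); the bucket width w below is an arbitrary illustrative value:

```python
# One p-stable LSH hash function (Datar et al. [1]) for the L2 case:
# a has i.i.d. Gaussian entries (2-stable), b is uniform in [0, w).
# For the L1 (Manhattan) case, a would instead have Cauchy (1-stable) entries.
import numpy as np

rng = np.random.default_rng(0)
dim, w = 21, 4.0                # w is a tunable bucket width (illustrative value)
a = rng.standard_normal(dim)    # Gaussian => 2-stable => respects L2 distance
b = rng.uniform(0.0, w)

def h(v):
    return int(np.floor((a @ v + b) / w))

# Points that are close in L2 tend to land in the same integer bucket.
x = rng.standard_normal(dim)
print(h(x), h(x + 0.01))        # a small perturbation usually keeps the bucket
```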

A decent overview was written by Malcolm Slaney and Michael Casey for IEEE Signal Processing Magazine in 2008 [4].

LSH has been applied seemingly everywhere. You may want to give it a try.

[1] Datar, Indyk, Immorlica, Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," 2004.

[2] Weber, Schek, Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," 1998.

[3] Gionis, Indyk, Motwani, "Similarity search in high dimensions via hashing," 1999.

[4] Slaney, Casey, "Locality-sensitive hashing for finding nearest neighbors", 2008.
