How to do fuzzy search in big data


Question




I'm new to this area and I'm mostly wondering what the state of the art is and where I can read about it.

Let's assume that I just have a key/value store and some distance(key1,key2) defined somehow (not sure if it must be a metric, i.e. if the triangle inequality must always hold).

What I want is mostly a search(key) function which returns all items with keys up to a certain distance from the search key. Maybe that distance limit is configurable. Maybe it is also just a lazy iterator. Maybe there can also be a count limit, and an item (key,value) is in the returned set with some probability P, where P = 1/distance(key,search-key) or so (i.e., the perfect match would certainly be in the set, and close matches at least with high probability).
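To make the interface I have in mind concrete, here is a naive linear-scan sketch (Levenshtein distance is just an arbitrary choice for illustration; any distance function would do):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def search(store, distance, query_key, max_dist=2):
    """Lazily yield (key, value) pairs within max_dist of query_key."""
    for key, value in store.items():
        if distance(key, query_key) <= max_dist:
            yield key, value

store = {"apple": 1, "apply": 2, "orange": 3}
print(list(search(store, levenshtein, "appel")))
```

Of course this scans every key, which is exactly what doesn't scale; I'm asking what replaces that linear scan for big data.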


One example application is fingerprint matching in MusicBrainz. They use the AcoustId fingerprint and have defined this comparison function. They use the PostgreSQL GIN index and I guess (although I haven't fully understood/read the acoustid-server code) the GIN partial match algorithm, but I haven't fully understood whether that is what I asked for or how it works.


For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This is mostly to break the search space down to a smaller one. However, that has several limitations, e.g. it must still be a perfect match in the smaller space.
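For example, Soundex (a simpler cousin of the metaphone family) buckets words by a crude phonetic key; this is a simplified sketch of it (the official h/w separator rules are only approximated here):

```python
# Digit groups for the standard Soundex consonant classes.
CODES = {c: d for letters, d in [("bfpv", "1"), ("cgjkqsxz", "2"),
                                 ("dt", "3"), ("l", "4"),
                                 ("mn", "5"), ("r", "6")]
         for c in letters}

def soundex(word: str) -> str:
    word = word.lower()
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":   # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Indexing words by this key reduces the fuzzy problem to an exact lookup in the smaller space, with exactly the limitation described: a near-miss that lands in a different bucket is lost.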

But anyway, I am also searching for a more generic solution, if one exists.

Solution

There is no (fast) generic solution; each application will need a different approach.

Neither of the two examples actually does a traditional nearest-neighbor search. AcoustID (I'm the author) only looks for exact matches, but it searches a very large number of hashes in the hope that some of them will match. The phonetic search example uses metaphone to convert words to their phonetic representation and likewise only looks for exact matches.

You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching into exact search.
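The "many hashes, hope some collide" pattern can be sketched in a few lines (the item ids, hash values, and threshold below are illustrative, not taken from AcoustID's code):

```python
from collections import defaultdict

index = defaultdict(set)          # hash value -> ids of items containing it

def add_item(item_id, hashes):
    for h in hashes:
        index[h].add(item_id)

def query(hashes, min_matches=2):
    """Count exact hash collisions and keep items with enough of them."""
    votes = defaultdict(int)
    for h in hashes:
        for item_id in index[h]:
            votes[item_id] += 1
    return {i for i, v in votes.items() if v >= min_matches}

add_item("song-a", [101, 102, 103, 104])
add_item("song-b", [201, 202, 103, 204])
print(query([101, 103, 104, 205]))  # song-a shares 3 hashes, song-b only 1
```

Every lookup is an exact hash-table hit; fuzziness comes only from the voting over many such hits.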

A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see from your two examples, sometimes you can get away with an even simpler approach.
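A minimal sketch of the random-hyperplane flavour of LSH for real-valued vectors (the planes are fixed by hand here for clarity; normally they are drawn at random, and several independent tables are used to boost recall):

```python
def lsh_key(vec, planes):
    """One bit per hyperplane: which side of it the vector falls on."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) > 0)
                 for plane in planes)

planes = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0)]
a = (2.0, 1.0)      # two nearby vectors...
b = (2.1, 0.9)
c = (-1.0, -2.0)    # ...and a distant one

buckets = {}
for name, vec in [("a", a), ("b", b), ("c", c)]:
    buckets.setdefault(lsh_key(vec, planes), []).append(name)
```

Nearby vectors tend to fall on the same side of most hyperplanes, so they land in the same bucket and can be retrieved with an exact hash lookup.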

Btw, if you are looking specifically for text search, the simplest way to do it is to split your input into N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
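A trigram index along those lines might look like this (a sketch, not a tuned implementation):

```python
from collections import defaultdict

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

index = defaultdict(set)          # n-gram -> words containing it

def add_word(word):
    for g in ngrams(word):
        index[g].add(word)

def candidates(query, min_shared=1):
    """Rank indexed words by how many n-grams they share with the query."""
    counts = defaultdict(int)
    for g in ngrams(query):
        for w in index[g]:
            counts[w] += 1
    return sorted((w for w, c in counts.items() if c >= min_shared),
                  key=lambda w: -counts[w])

for w in ["search", "serpent", "peach"]:
    add_word(w)
print(candidates("serch"))  # "search" and "serpent" each share a trigram
```

The candidate set can then be re-ranked with the real distance function, so the expensive comparison only runs on a handful of items instead of the whole store.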
